CN108170671A

CN108170671A - A method for extracting the occurrence time of a news event

Info

Publication number: CN108170671A
Application number: CN201711378174.1A
Authority: CN
Inventors: 李坤宏; 林淑金; 周凡
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2017-12-19
Filing date: 2017-12-19
Publication date: 2018-06-15

Abstract

An embodiment of the invention discloses a method for extracting the occurrence time of a news event. The method includes: obtaining a news report article, extracting each word in the article, and comparing it with the title of the article to obtain the degree of correlation between the two; computing, for the words in the article, the positions at which the words are distributed; applying recognition processing to the words in the article to obtain the keywords of the article; and organizing the keywords to summarize the main content of the whole article, then using a search engine to obtain the occurrence time node of the news report article. Implementing the embodiment makes extraction of the news event occurrence time more intelligent and more accurate; it is also easier to operate and greatly reduces labor cost.

Description

A method for extracting the occurrence time of a news event

Technical field

The present invention relates to the technical fields of information retrieval, text mining, natural language processing and artificial intelligence, and more particularly to a method for extracting the occurrence time of a news event.

Background art

With the rapid development of Internet technology, informatization and digitization have brought great convenience to people's lives in every field of data application. As information grows explosively, helping users quickly find the information they want within massive data becomes particularly important. News is information that reflects events as they happen; every news event has a time attribute, and timeliness is one of the most prominent characteristics of news. Given how news is expressed, the occurrence time is an indispensable part of a news event, so how to extract the relevant time information from a news event becomes a crucial problem.

Time information extraction belongs to the research field of information extraction. After a long period of research, solutions to this problem have made some progress: research has been carried out on the expression of time information; a series of results has been obtained on timestamps; and research has been conducted on time-event relations. However, regarding time information for news events, existing research focuses mostly on the news report time and the news release time; research, both in China and abroad, on how to accurately extract the occurrence time of a news event is still scarce.

In the prior art, there is a technique that analyzes news by means of timestamps: it analyzes and derives the occurrence time of a news event from the timestamps in the news report. In practice this technique has two main defects. First, it is complicated to operate: time words can be mapped to calendar dates only after specific rules or templates have been formulated, and each piece of time information is extracted on a downward-covering basis, so the workload is heavy and the method is hard to operate. Second, for a given news report, in most cases either no relevant timestamp information can be found or a large amount of redundant timestamp information is found, so it is difficult to extract the news event occurrence time effectively.

Alternatively, time information is extracted directly from URL time formats and news headlines. This method is too simple: obtaining information from the headline alone often leads to a high error rate, while building and linking a library of URL time formats increases the workload and the cost, and it is still difficult to accurately extract the true occurrence time of the news event.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art. The invention provides a method for extracting the occurrence time of a news event that obtains the occurrence time of massive news information accurately and effectively, and overcomes the shortcomings of existing methods in accuracy, scalability and the like.

To solve the above problems, the present invention proposes a method for extracting the occurrence time of a news event, the method including:

obtaining a news report article, extracting each word in the news report article, and comparing it with the title of the news report article to obtain the degree of correlation between the two;

computing, for the words in the news report article, the positions at which the words are distributed;

applying recognition processing to the words in the news report article to obtain the keywords of the news report article;

organizing the keywords of the news report article to summarize the main content of the whole article, and using a search engine to obtain the occurrence time node of the news report article.

Preferably, the step of computing against the title of the news report article is mainly used to judge the importance of the word. The word-headline degree of correlation is represented by the number of overlaps between a word w and the words in the title of the news report article. Let the i-th word in the word set W be w_i; according to the degree of correlation between w_i and the headline of the m-th news report article d_m, a corresponding additional score is obtained. The score follows from the result of matching the word against the headline: for example, if w_i is related to 1 word in the headline, the additional score of w_i equals 1; if w_i is related to 2 words in the headline, the additional score of w_i equals 2; and so on.

Preferably, the step of computing over the words in the news report article mainly considers the position at which a word appears in the article in order to judge the word's importance. For the m-th news report article d_m, let n be the total number of words contained in d_m; according to the position at which the word w_i appears in d_m, a corresponding additional score is obtained, whose mathematical expression is as follows:

where s denotes a sentence of the news report document and s_j denotes the j-th sentence of the news report. For example, when the threshold is set to j≤3, the first three sentences of the news report are the region in which important content is distributed, and a word w_i that appears within this region obtains the corresponding additional score.

Preferably, applying recognition processing to the words in the news report article includes:

The named entities in the article, such as person names, place names, organization names, times and numbers, form the named entity set NE. If the word w_i belongs to the named entity set NE, it obtains a corresponding additional score, namely the additional score that the word w_i in document d_m obtains from named entity recognition. For the m-th news report document d_m in the document collection D, the features of each word w_i in d_m are assumed to be mutually independent, and the i-th word of a topic z in the word set is denoted z_i; p is the probability that a keyword occurs. The score SS_mi that each word obtains for this news report is a linear combination of the feature components, as shown in the following formula:

where r ∈ {title, loc, ner}, and the optimal combination of the coefficient α and the weight ω_j of each feature component is sought on the data set; p is a probability. The words are sorted by score SS_mi from high to low, and the top n words form the keyword set Q, which serves as the keywords of the news report article.
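The ranking step can be sketched in Python as follows. Because the combination formula itself is not reproduced in this text, the form used here (α·p plus (1 − α) times the weighted sum of the three feature scores) is an assumption made purely for illustration, as are the function names and the per-feature weight dictionary keyed by r.

```python
from typing import Dict, List

FEATURES = ("title", "loc", "ner")  # r ranges over these feature components


def combined_score(p: float, features: Dict[str, float],
                   weights: Dict[str, float], alpha: float) -> float:
    """Score of one word: an ASSUMED linear combination, not the patent's formula.

    p        -- probability of the word under the topic model
    features -- additional scores for r in {title, loc, ner}
    weights  -- one weight per feature component
    alpha    -- coefficient balancing the probability and the feature part
    """
    feature_part = sum(weights[r] * features[r] for r in FEATURES)
    return alpha * p + (1 - alpha) * feature_part


def top_keywords(scores: Dict[str, float], n: int) -> List[str]:
    """Sort words by score from high to low and keep the top n (keyword set Q)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]
```

In practice α and the weights would be tuned on a data set, as the text states; the sketch only shows how a tuned combination would be applied and ranked.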

Preferably, organizing the keywords of the news report article to summarize the main content of the whole article, and using a search engine to obtain the occurrence time node of the news report article, specifically includes:

Based on the keywords obtained with the LDA language model, each news item d_m has a corresponding keyword set Q_m={w₁,w₂,w₃,...,w_n}; several keywords in the set are used to summarize the main content of a news report;

Using the comprehensive news event knowledge base of a news search engine, a news search engine interface is connected and the news retrieval terms Q_m={w₁,w₂,w₃,...,w_n} are input, for example "http://Google.search_news (query=Q_m, num, start, country_code)" under the news engine interface of the search engine Google. The query statement Q_m related to the news event is submitted for retrieval; a query statement is one that represents a news event, such as the query "Trump Speech". Several news documents are returned, and a parsing tool is used to crawl, from the returned documents, the top U news documents in the ranking and their publication times, which respectively form the initial returned news list X={x₁,x₂,x₃,...,x_U} and the corresponding publication time sequence T={t₁,t₂,t₃,...t_U}, where x_i denotes the i-th news document in the initial returned news list X, and t_i denotes the publication time in T corresponding to the i-th news document x_i, 1≤i≤U;

The frequency f_i of each time node t_i in the parsed publication time sequence T is counted to find the time node with the highest frequency: if the frequency f_i of t_i is the maximum in the publication time sequence T, then t_i is the occurrence time of the news event d_m.
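The frequency-counting step can be illustrated with a minimal Python sketch. It assumes the publication times in T have already been normalized to a common granularity (here, date strings); the function name is hypothetical.

```python
from collections import Counter
from typing import List


def most_frequent_time(times: List[str]) -> str:
    """Return the time node t_i whose frequency f_i is highest in T.

    Ties are resolved in favor of the time node that appears first in T.
    """
    if not times:
        raise ValueError("publication time sequence T is empty")
    counts = Counter(times)                 # f_i for every distinct t_i
    best_time, _ = counts.most_common(1)[0]
    return best_time
```

For example, for T = ["2017-01-20", "2017-01-21", "2017-01-20"], the function returns "2017-01-20", which is then taken as the occurrence time of the event.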

In the embodiments of the present invention, through continual training, learning and iteration, the extraction of news content keywords becomes more intelligent and considerably more accurate. By connecting to a news knowledge base and using the news time information returned by a search engine, the initial occurrence time of the news event is obtained; this is simpler to implement and greatly reduces labor cost.

Description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow diagram of the method for extracting the occurrence time of a news event according to an embodiment of the present invention;

Fig. 2 is a schematic flow diagram of obtaining the occurrence time node of the news report article using a search engine in an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Fig. 1 is a schematic flow diagram of a method for extracting the occurrence time of a news event according to an embodiment of the present invention. As shown in Fig. 1, the method includes:

S1, obtaining a news report article, extracting each word in the news report article, and comparing it with the title of the news report article to obtain the degree of correlation between the two;

S2, computing, for the words in the news report article, the positions at which the words are distributed;

S3, applying recognition processing to the words in the news report article to obtain the keywords of the news report article;

S4, organizing the keywords of the news report article to summarize the main content of the whole article, and using a search engine to obtain the occurrence time node of the news report article.

The embodiment of the present invention uses Bayesian probability theory to extract keywords from the news report article. An LDA language model is established on the basis of Bayesian probability theory. The model has three layers: a topic distribution is obtained from the article, a topic is drawn from the topic distribution, and a word is drawn from the word distribution corresponding to that topic. This process is repeated until every word has been traversed, finally yielding the keywords of the article.
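The three-layer document, topic, word structure described above can be illustrated with a small Python sketch of the LDA generative process. It only shows how one word is drawn (topic first, then a word from that topic's distribution); it does not perform model training or inference, and the function name, the seed parameter and the toy distributions in the usage note are assumptions for illustration.

```python
import random
from typing import Dict, List, Tuple


def generate_words(doc_topic: Dict[str, float],
                   topic_word: Dict[str, Dict[str, float]],
                   n_words: int, seed: int = 0) -> List[Tuple[str, str]]:
    """Draw n_words (topic, word) pairs following the three-layer structure.

    doc_topic  -- topic distribution of one document
    topic_word -- word distribution of each topic
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    topics = list(doc_topic)
    pairs = []
    for _ in range(n_words):
        # Layer 2: draw a topic z from the document's topic distribution.
        z = rng.choices(topics, weights=[doc_topic[t] for t in topics])[0]
        # Layer 3: draw a word from the word distribution of topic z.
        vocab = list(topic_word[z])
        w = rng.choices(vocab, weights=[topic_word[z][v] for v in vocab])[0]
        pairs.append((z, w))
    return pairs
```

A call such as `generate_words({"politics": 0.7, "economy": 0.3}, {"politics": {"speech": 0.6, "election": 0.4}, "economy": {"tariff": 1.0}}, 5)` returns five (topic, word) pairs; a fitted model would estimate these distributions from the corpus instead of taking them as inputs.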

Specifically, the computation against the title of the news report article in S1 is mainly used to judge the importance of the word. The word-headline degree of correlation is represented by the number of overlaps between a word w and the words in the title of the news report article. Let the i-th word in the word set W be w_i; according to the degree of correlation between w_i and the headline of the m-th news report article d_m, a corresponding additional score is obtained, determined as follows.

According to the result of matching the word against the headline, the word is given the corresponding score: for example, if w_i is related to 1 word in the headline, the additional score of w_i equals 1; if w_i is related to 2 words in the headline, the additional score of w_i equals 2; and so on.
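A minimal Python sketch of this title score follows. The text does not define when a word counts as "related" to a title word, so the sketch assumes, for illustration only, that two words are related when one contains the other (case-insensitively); the function name is hypothetical.

```python
from typing import List


def title_score(word: str, title_words: List[str]) -> int:
    """Additional score of `word`: the number of title words related to it.

    ASSUMPTION: "related" means one string contains the other, ignoring case.
    """
    w = word.lower()
    if not w:
        return 0
    score = 0
    for title_word in title_words:
        t = title_word.lower()
        if t and (w in t or t in w):
            score += 1  # one more overlapping title word
    return score
```

With the title words ["Trump", "speech", "Trump's"], the word "Trump" is related to two title words and so receives a score of 2, while "economy" receives 0.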

S2 is described further：

Methods based on word position are domain-dependent: in some domains the first sentence of a paragraph contains the topic information, while in others the topic information appears in the last sentence. In news reports, sentences with high information content typically occur in the first few sentences and at the start of paragraphs, so the position at which a word appears in the article must be considered when judging the word's importance. For the m-th news report article d_m, let n be the total number of words contained in d_m; according to the position at which the word w_i appears in d_m, a corresponding additional score is obtained, whose formula is expressed as:

where s denotes a sentence of the news report document and s_l denotes the l-th sentence of the news report. For example, when the threshold is set to l≤3, the first three sentences of the news report are the region in which important content is distributed, and a word w_i that appears within this region obtains the corresponding additional score.
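The position feature can be sketched in Python as below. Since the formula is not reproduced in this text, the sketch assumes a simple binary score (1 if the word appears in one of the first l sentences, otherwise 0); the sentence splitting, the function names and the default threshold of 3 are illustrative choices rather than the patent's definition.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on Western and Chinese end punctuation."""
    parts = re.split(r"[.!?。！？]+", text)
    return [part.strip() for part in parts if part.strip()]


def position_score(word: str, text: str, threshold: int = 3) -> int:
    """ASSUMED binary score: 1 if `word` occurs in sentences s_1..s_threshold."""
    target = word.lower()
    for sentence in split_sentences(text)[:threshold]:
        tokens = re.findall(r"\w+", sentence.lower())
        if target in tokens:
            return 1
    return 0
```

For a four-sentence article, a word that first appears in the fourth sentence scores 0 with the default threshold, while a word in the third sentence scores 1.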

The recognition processing applied in S3 to the words in the news report article is named entity recognition, which identifies named entities in the article such as person names, place names, organization names, times and numbers.
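Once a named entity set NE is available, the corresponding feature reduces to a membership test. The Python sketch below assumes NE has already been produced by some named entity recognizer (not shown) and that the additional score is a fixed bonus of 1.0; both the function name and the bonus value are hypothetical.

```python
from typing import Set


def ner_score(word: str, named_entities: Set[str], bonus: float = 1.0) -> float:
    """Return the additional score if `word` belongs to the set NE, else 0."""
    return bonus if word in named_entities else 0.0
```

For NE = {"Trump", "Washington"}, the word "Trump" receives the bonus and "speech" receives 0.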

S3 is described further：

S4 is described further：

As shown in Fig. 2, using the comprehensive news event knowledge base of a news search engine, a news search engine interface is connected and the news retrieval terms Q_m={w₁,w₂,w₃,...,w_n} are input, for example "http://Google.search_news (query=Q_m, num, start, country_code)" under the news engine interface of the search engine Google. The query statement Q_m related to the news event is submitted for retrieval; a query statement is one that represents a news event, such as the query "Trump Speech". Several news documents are returned, and a parsing tool is used to crawl, from the returned documents, the top U news documents in the ranking and their publication times, which respectively form the initial returned news list X={x₁,x₂,x₃,...,x_U} and the corresponding publication time sequence T={t₁,t₂,t₃,...t_U}, where x_i denotes the i-th news document in the initial returned news list X, and t_i denotes the publication time in T corresponding to the i-th news document x_i, 1≤i≤U.
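The retrieval step can be sketched in Python as follows. The search function is passed in as a parameter because the interface quoted above is only an example in the text; `NewsResult`, `retrieve_top_u` and the assumption that results arrive already ranked are all hypothetical and do not describe any real search API.

```python
from typing import Callable, List, NamedTuple, Tuple


class NewsResult(NamedTuple):
    title: str       # a returned news document x_i (represented by its title)
    published: str   # its publication time t_i


def retrieve_top_u(keywords: List[str],
                   search_news: Callable[[str], List[NewsResult]],
                   u: int) -> Tuple[List[str], List[str]]:
    """Query with the keyword set Q_m and build the lists X and T.

    ASSUMPTION: `search_news` returns results already ordered by rank.
    """
    query = " ".join(keywords)        # e.g. "Trump Speech"
    top = search_news(query)[:u]      # keep the top U ranked documents
    news_list = [result.title for result in top]       # X = {x_1, ..., x_U}
    time_list = [result.published for result in top]   # T = {t_1, ..., t_U}
    return news_list, time_list
```

Injecting the search function keeps the sketch independent of any particular engine and makes it easy to exercise with a stub that returns canned results.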

A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include read-only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), a magnetic disk, an optical disc, and the like.

The method for extracting the occurrence time of a news event provided by the embodiments of the present invention has been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. Meanwhile, a person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for extracting the occurrence time of a news event, characterized in that the method includes:

obtaining a news report article, extracting each word in the news report article, and comparing it with the title of the news report article to obtain the degree of correlation between the two;

computing, for the words in the news report article, the positions at which the words are distributed;

applying recognition processing to the words in the news report article to obtain the keywords of the news report article;

organizing the keywords of the news report article to summarize the main content of the whole article, and using a search engine to obtain the occurrence time node of the news report article.
2. The method for extracting the occurrence time of a news event according to claim 1, characterized in that the step of computing against the title of the news report article is mainly used to judge the importance of the word, and the word-headline degree of correlation is represented by the number of overlaps between a word w and the words in the title of the news report article; the i-th word in the word set W is w_i, and according to the degree of correlation between w_i and the headline of the m-th news report article d_m, a corresponding additional score is obtained; the score follows from the result of matching the word against the headline: for example, if w_i is related to 1 word in the headline, the additional score of w_i equals 1; if w_i is related to 2 words in the headline, the additional score of w_i equals 2; and so on.
3. The method for extracting the occurrence time of a news event according to claim 1, characterized in that the step of computing over the words in the news report article mainly considers the position at which a word appears in the article in order to judge the word's importance; for the m-th news report article d_m, n is the total number of words contained in d_m, and according to the position at which the word w_i appears in d_m, a corresponding additional score is obtained, whose mathematical expression is as follows:

where s denotes a sentence of the news report document and s_j denotes the j-th sentence of the news report; for example, when the threshold is set to j≤3, the first three sentences of the news report are the region in which important content is distributed, and a word w_i that appears within this region obtains the corresponding additional score.
4. The method for extracting the occurrence time of a news event according to claim 1, characterized in that applying recognition processing to the words in the news report article includes:

if the word w_i belongs to the named entity set NE, it obtains a corresponding additional score, namely the additional score that the word w_i in document d_m obtains from named entity recognition; for the m-th news report document d_m in the document collection D, the features of each word w_i in d_m are assumed to be mutually independent, and the i-th word of a topic z in the word set is denoted z_i; p is the probability that a keyword occurs, and the score SS_mi that each word obtains for this news report is a linear combination of the feature components, as shown in the following formula:

where r ∈ {title, loc, ner}, and the optimal combination of the coefficient α and the weight ω_j of each feature component is sought on the data set; p is a probability; the words are sorted by score SS_mi from high to low, and the top n words form the keyword set Q, which serves as the keywords of the news report article.