CN113159363B

CN113159363B - Event trend prediction method based on historical news reports

Info

Publication number: CN113159363B
Application number: CN202011607205.8A
Authority: CN
Inventors: 冯翱; 宋馨宇; 张学磊; 王维宽; 张举; 蔡佳志; 赵韦程; 吴锡
Original assignee: Chengdu University of Information Technology
Current assignee: Chengdu University of Information Technology
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2022-04-19
Anticipated expiration: 2040-12-30
Also published as: CN113159363A

Abstract

The invention relates to an event trend prediction method based on historical news reports. The method first determines the field of the new event to be predicted and acquires events of the same kind from a public data set or from data collected from the web. Within each such event, news describing the same specific event is clustered into a sub-event, and the subsequent-event distribution information of each sub-event is obtained from the relevance and time order of the event content. The similarity between each event of the same kind and the new event to be predicted is calculated to obtain similar events; the similarity between the current sub-event and the sub-events within those similar events is then calculated; and the development trend of the current sub-event is predicted from the two similarities and the subsequent-event distribution information.

Description

Event trend prediction method based on historical news reports

Technical Field

The invention relates to the technical field of networks, in particular to an event trend prediction method based on historical news reports.

Background

With the development of the internet, a large amount of news is published as web text. Mining web news efficiently is an important need for economic and social development, and predicting the future trend of a news event is an important and difficult problem of great economic and social value. In existing practice, domain experts usually infer the subsequent development of an event from their own experience; because each person's background and viewpoint differ, the predictions often diverge widely and their accuracy cannot be guaranteed.

Human prediction of how an event will develop is usually based on accumulated personal knowledge and records of historical events. Prediction by an algorithmic model usually follows a similar idea: the trend of the current event is predicted from the subsequent development of similar historical events.

Existing trend prediction relies mainly on the subjective judgment of domain experts and lacks the support of systematic algorithms and models. Because there is more than one domain expert, and different experts are likely to judge differently according to their respective backgrounds, standpoints, and inclinations, a reliable and consistent opinion cannot be obtained. When experts are consulted about the subsequent development of an event, their judgment alone is taken as the criterion.

Several key problems arise. First, no two events are exactly the same, so judging whether historical events are similar involves considerable ambiguity, and events that differ greatly do not necessarily have reference value for the current event. Second, the trend of an event is inherently uncertain: different external influencing factors can lead to different subsequent developments and results, and a systematic prediction model is lacking.

Some research looks for a historical event whose field, time, place, content, and current stage of development almost match those of the current event, and judges the future development of the current event from the subsequent trend of that historical event. However, a historical event that matches the current event in all of these factors is hard to find, in which case the historical information cannot be used for judgment.

Therefore, given a large accumulation of historical events, reducing the subjectivity of event trend prediction so as to achieve higher accuracy is very important in the field of public opinion analysis.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides an event trend prediction method based on historical news reports, which comprises the following steps:

Step 1: Determine the field of the event to be predicted, and download an existing public news text data set and its label information for that field; if no public data set exists, use a web crawler to download news in the field from a specific news website;

Step 2: When the data set downloaded by the web crawler has no labeling information, label the main events by combining manual labeling with news classification/clustering;

Step 3: Similar event calculation: among the news labeled with main events, remove the key 3W information, perform similarity calculation using a set similarity threshold to find similar events, and label them as similar events after manual verification;

Step 4: Sub-event clustering: within each similar event, calculate the pairwise similarity between news items based mainly on the key 3W information, and cluster news describing the same specific event into one sub-event;

Step 5: Among the sub-events obtained in step 4, establish contextual relations between sub-events in a semi-manual labeling manner according to the relevance and time order of the event content; represent each relation by a directed edge pointing from the antecedent event to the subsequent event; and obtain the subsequent-event distribution information of each sub-event;

Step 6: For a new event to be predicted, obtain its core description news, or collect news reports related to it from open information sources, and label the news reports that lack labeling information. Specifically, extract keywords from the new event to be predicted and crawl the search results returned by the open information sources for those keywords;

Step 7: Determine the current sub-event of the new event to be predicted: apply the operation of step 4 to the new event to establish its sub-events, and find the current sub-event;

Step 8: Similar event calculation: after removing the key 3W information, calculate the similarity between the new event to be predicted and every similar event, and take the similar events whose similarity exceeds a set second threshold as candidate similar events, forming a candidate similar event library;

Step 9: Similar sub-event calculation: calculate the similarity between the current sub-event and the sub-events in the similar events, and discard sub-events whose similarity is below a third threshold;

Step 10: Combine the similarity between the new event to be predicted and the similar events obtained in step 8, the similarity between the current sub-event and the sub-events in the similar events obtained in step 9, and the subsequent-event distribution of those sub-events, to calculate the subsequent-event score of the current sub-event;

Step 11: Rank the possible subsequent events by probability in descending order, and list the top 5 as the predicted subsequent development trend of the current sub-event.

According to a preferred embodiment, the method for manual annotation comprises:

Step 21: Randomly sample a small number of items (for example, 1000) from the downloaded news, and have dedicated annotators read them and mark the main event each item relates to;

Step 22: If the event of a news item has already been mentioned in earlier news, merge the item into that event; otherwise create a new, separate seed event;

Step 23: Using the news in the labeled events as references, calculate the similarity between each unlabeled news item and the labeled news, and assign news whose similarity reaches a first threshold to the same event; when several events are sufficiently similar, take the event with the highest similarity. The first threshold is generally set to 0.75;

Step 24: Cluster the news that has not been assigned to any event using a clustering method;

Step 25: Manually select the larger categories from the clustering results for a second round of manual labeling, and add the appropriate events to the existing set;

Step 26: Stop clustering and secondary manual labeling when the remaining news falls below a set proportion or below a set number; otherwise adjust the clustering parameters and repeat steps 24 to 26.
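The threshold-based assignment described in step 23 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: news items are assumed to be already converted to vectors, an event's similarity to a news item is assumed to be the maximum cosine similarity over that event's labeled members, and the names `cosine` and `assign_to_events` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_to_events(labeled, unlabeled, threshold=0.75):
    """labeled: event id -> list of member vectors (seed events).
    unlabeled: news id -> vector.
    Returns (assignments, leftover): news whose best similarity reaches the
    first threshold joins the most similar event; the rest is left for the
    clustering of step 24."""
    assignments, leftover = {}, []
    for news_id, vec in unlabeled.items():
        best_event, best_sim = None, 0.0
        for event_id, members in labeled.items():
            if not members:
                continue
            # Assumption: compare against the closest labeled member.
            sim = max(cosine(vec, m) for m in members)
            if sim > best_sim:
                best_event, best_sim = event_id, sim
        if best_event is not None and best_sim >= threshold:
            assignments[news_id] = best_event
        else:
            leftover.append(news_id)
    return assignments, leftover

# Toy 2-dimensional vectors stand in for real news embeddings.
seeds = {"flood": [[1.0, 0.0]], "election": [[0.0, 1.0]]}
news = {"n1": [0.9, 0.1], "n2": [0.6, 0.6], "n3": [0.1, 0.95]}
assigned, remaining = assign_to_events(seeds, news)
```

In this toy run, `n2` is equally close to both seed events but below the 0.75 threshold, so it stays in the leftover pool.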

According to a preferred embodiment, step 6 specifically further includes:

Step 61: Extract key descriptors from the event description;

Step 62: Query a mainstream search engine with the keywords to obtain the URLs of related reports;

Step 63: Obtain the report content with a crawler;

Step 64: Repeat step 2 to clean the data, processing only content relevant to the current event and ignoring other noise data returned by the search engine.

The invention has the beneficial effects that:

1. Based on historical news reports, a certain amount of manual labeling information with clear judgment criteria is used as the basis for predicting subsequent events. On the one hand this avoids the bias and arbitrariness of domain experts' subjective judgment; on the other hand it makes full use of the different development trends observed in similar scenarios and gives a more comprehensive trend prediction.

2. Combined with automatic algorithms, similar-scenario information is extracted from multiple historical events, so that possible subsequent developments are found more completely and a score or probability is calculated quantitatively for each possibility, which allows stakeholders to analyze the uncertainty of future development.

Drawings

FIG. 1 is a flow chart of an event prediction method of the present invention;

FIG. 2 is a diagram of events of the same type and new events to be predicted according to the present invention; and

FIG. 3 is an example of a contextual relationship diagram between sub-events of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.

Mining web news effectively and using historical information to predict the development trend of future events has great practical value. Existing approaches generally rely on the experience of domain experts and suffer from strong subjectivity, difficulty in reaching consistent opinions, and uncertain reliability. Based on historical news reports, the method uses a certain amount of manual labeling information with clear judgment criteria, combined with automatic algorithms, to extract similar-scenario information from multiple historical events, so that possible subsequent developments are found more completely and a score or probability is calculated quantitatively for each. Compared with manual analysis, the method can better cover the various event development trends and obtain higher prediction accuracy, so that the subsequent events that may occur can be better handled.

The technical solution of the present invention is described in detail below with reference to the accompanying drawings.

As shown in FIG. 1, the event trend prediction method based on historical news reports provided by the present invention includes:

Step 1: Determine the field of the event to be predicted, and download an existing public news text data set and its label information for that field; if no public data set exists, use a web crawler to download news in the field from a specific news website. Public news data sets usually include annotation indicating whether news items belong to the same event. Here, the event to be predicted means a larger-scale event that contains all the related news and is the subject of future development prediction.

Step 2: When the data set downloaded by the web crawler has no labeling information, the main events need to be labeled by combining manual labeling with news classification/clustering. The specific labeling method is as follows:

Step 21: Randomly sample a small number of news reports (typically 1000 items) from the downloaded news, and have dedicated annotators read them and mark the main events they relate to.

Step 22: If the event of a news item has been mentioned in earlier news, the item is merged into that event; otherwise a separate seed event is created.

Step 24: Cluster the news that has not been assigned to any event using a clustering method, such as K-means clustering or hierarchical clustering.

Step 25: Manually select the larger categories from the clustering results for a second round of manual labeling, and add the appropriate events to the existing set. A larger category is one containing more than 100 items.

In step 26, falling below the set proportion means that the remaining news is no more than 10% of the total, and the set number is generally 10. Adjusting the clustering parameters means lowering the parameter K of K-means, or the similarity threshold of hierarchical clustering, so as to obtain fewer categories or merge more news into existing categories.

Fig. 2 is a schematic diagram of the same kind of events and new events to be predicted according to the present invention.

Step 3: Similar event calculation. Among the news labeled with main events, the key 3W information is removed and similarity is calculated against a set similarity threshold, typically 0.8, to find similar events, which are labeled as similar events after manual verification. A similar (same-kind) event is a large event comprising several news reports, and it may contain several sub-events with different 3W information.

The similarity is calculated as the cosine of the angle between two vectors, and its value ranges from 0 to 1:

$$\mathrm{sim}(D_1, D_2) = \frac{\vec{v}_{D_1} \cdot \vec{v}_{D_2}}{\lVert \vec{v}_{D_1} \rVert \, \lVert \vec{v}_{D_2} \rVert}$$

where $D_1$ and $D_2$ are the two documents whose similarity is to be calculated, $\vec{v}_{D_1}$ and $\vec{v}_{D_2}$ are their embedded vector representations, $\cdot$ is the vector inner product, and $\lVert \cdot \rVert$ is the modulus of a vector.

Text features are represented with word embeddings; usable models include the Skip-gram model of Word2Vec, GloVe, and the like, with a vector dimension of 200.
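As an illustration of the similarity calculation, the sketch below builds a document vector and compares two documents by cosine similarity. The text does not state how word vectors are combined into a document vector; averaging them is an assumption made here, and the toy 3-dimensional `word_vectors` table stands in for a real 200-dimensional Word2Vec or GloVe model. The function names are hypothetical.

```python
import math

def embed(tokens, word_vectors):
    """Average the vectors of known tokens (an assumed pooling choice).
    Returns None when no token is in the vocabulary."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]

def cosine_similarity(u, v):
    """Inner product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vocabulary; a real model would supply 200-dimensional vectors.
word_vectors = {
    "flood": [1.0, 0.0, 0.0],
    "rain": [0.8, 0.6, 0.0],
    "vote": [0.0, 0.0, 1.0],
}
d1 = embed(["flood", "rain"], word_vectors)
d2 = embed(["rain"], word_vectors)
d3 = embed(["vote"], word_vectors)
similar = cosine_similarity(d1, d2)    # close to 1: related content
unrelated = cosine_similarity(d1, d3)  # 0: orthogonal toy vectors
```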

The key 3W information comprises: who (people), when (time), and where (location).

Step 4: Sub-event clustering. Within each similar event, the pairwise similarity between news items is calculated based mainly on the key 3W information, and news describing the same specific event is clustered into one sub-event. For accurate trend prediction, a similar event should not contain too few news items; generally it should contain no fewer than 20.
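One way to picture sub-event clustering is the single-pass sketch below. Several choices here are assumptions rather than details from the text: the 3W fields are assumed to be already extracted as lists of strings, 3W similarity is taken as the mean Jaccard overlap of the three fields, a news item joins the closest existing cluster when that similarity reaches an illustrative threshold of 0.6, and all function names are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two collections treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def three_w_similarity(n1, n2):
    """Mean Jaccard overlap over the who / when / where fields."""
    return sum(jaccard(n1[k], n2[k]) for k in ("who", "when", "where")) / 3

def cluster_sub_events(news, threshold=0.6):
    """Greedy single-pass clustering: each item joins the most similar
    existing cluster if the similarity reaches the threshold, otherwise
    it starts a new sub-event."""
    clusters = []
    for item in news:
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = max(three_w_similarity(item, member) for member in cluster)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best.append(item)
        else:
            clusters.append([item])
    return clusters

news = [
    {"id": "a", "who": ["mayor"], "when": ["2020-07-01"], "where": ["Chengdu"]},
    {"id": "b", "who": ["mayor", "governor"], "when": ["2020-07-01"], "where": ["Chengdu"]},
    {"id": "c", "who": ["rescue team"], "when": ["2020-07-05"], "where": ["Leshan"]},
]
sub_events = [[n["id"] for n in cluster] for cluster in cluster_sub_events(news)]
```

A production system would more likely use K-means or hierarchical clustering, as the labeling steps mention; the greedy pass is used here only because it fits in a few lines.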

Step 5: Among the sub-events obtained in step 4, contextual relations between sub-events are established in a semi-manual labeling manner according to the relevance and time order of the event content. Each relation is represented by a directed edge pointing from the antecedent event to the subsequent event, which yields the subsequent-event distribution information of each sub-event. An antecedent event is one that occurs earlier in time, and a subsequent event is one that occurs later. The purpose of step 5 is to build a sub-event network within one event or one class of events, so that, given the sub-events that have already occurred in a new event, the other sub-events that may occur and their likelihood can be predicted from the information of historical events.

An example of a contextual relationship graph between sub-events is shown in fig. 3.
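The directed sub-event network can be held in a plain adjacency mapping, as in the sketch below. It assumes that annotators propose (earlier, later) pairs and that each sub-event carries an ISO date string, so that pairs violating the time order can be rejected automatically; these inputs and the function names are illustrative assumptions.

```python
def build_successor_graph(times, links):
    """times: sub-event id -> ISO date string (YYYY-MM-DD).
    links: proposed (antecedent, subsequent) pairs.
    Keeps only pairs whose antecedent is strictly earlier, and returns
    a mapping from each antecedent to its list of subsequent sub-events."""
    graph = {}
    for antecedent, subsequent in links:
        # ISO dates compare correctly as strings.
        if times[antecedent] >= times[subsequent]:
            continue
        successors = graph.setdefault(antecedent, [])
        if subsequent not in successors:
            successors.append(subsequent)
    return graph

def out_degree(graph, sub_event):
    """Number of distinct subsequent sub-events recorded for sub_event."""
    return len(graph.get(sub_event, []))

times = {
    "rain": "2020-07-01",
    "flood": "2020-07-03",
    "evacuation": "2020-07-04",
    "relief": "2020-07-06",
}
links = [
    ("rain", "flood"),
    ("flood", "evacuation"),
    ("flood", "relief"),
    ("relief", "rain"),  # violates the time order and is dropped
]
graph = build_successor_graph(times, links)
```

The successor lists are what the later scoring step reads as the subsequent-event distribution of each sub-event.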

Step 6: For the new event to be predicted, obtain its core description news, or collect news reports related to it from open information sources, and label those that lack labeling information. Concretely, keywords are extracted from the new event to be predicted, and the search results in the open information sources are crawled for those keywords. Core description news is, among the news reports associated with an event, the report that is earliest or that describes the most important sub-event; it is obtained mainly by manual means. Specifically:

Step 61: Extract key descriptors from the event description;

Step 63: Obtain the report content with a crawler;

Step 64: Repeat step 2 to clean the data, processing only content relevant to the current event and ignoring other noise data returned by the search engine. This improves the efficiency of data cleaning and shortens the labeling time.

The purpose of steps 61 to 64 is to obtain the existing news reports on the event to be predicted in the same way as for historical events and to cluster them, as a preliminary operation for establishing sub-events.
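Keyword extraction from the event description can be approximated very simply, as sketched below. The text does not name an extraction technique; plain term frequency with a small stopword list is an assumption made for illustration, and `extract_keywords` and `STOPWORDS` are hypothetical names. Querying a search engine and crawling the results are left out because they depend on the chosen information source.

```python
import re
from collections import Counter

# Illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "and", "for", "was", "were", "are", "that", "with", "from"}

def extract_keywords(description, top_k=5):
    """Return the top_k most frequent content words of the description."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    counts = Counter(t for t in tokens if len(t) > 2 and t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]

description = (
    "Heavy rain caused a flood in the city; "
    "the flood forced evacuation as rain continued."
)
keywords = extract_keywords(description, top_k=2)
query = " ".join(keywords)  # string to submit to a search engine
```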

Step 7: Determine the current sub-event of the new event to be predicted: apply the operation of step 4 to the new event to establish its sub-events, and find the current sub-event.

The current sub-event is the sub-event that occurred most recently. The purpose of this step is to determine which sub-event the subsequent development prediction is made for.

Step 8: Similar event calculation. After removing the key 3W information, calculate the similarity between the new event to be predicted and every similar event, and take the similar events whose similarity exceeds a set second threshold as candidate similar events, forming a candidate similar event library. The second threshold is typically set to 0.5.

The invention predicts the current and future development of an event based on how similar events evolved.

In step 9, the third threshold is typically set to 0.5. The purpose of this step is to reduce the number of sub-events that step 10 has to process.

Step 10: Combine the similarity between the new event to be predicted and the similar events obtained in step 8, the similarity between the current sub-event and the sub-events in the similar events obtained in step 9, and the subsequent-event distribution of those sub-events, to calculate the subsequent-event score of the current sub-event. The mathematical expression of the score is as follows:

Here, Event is the event currently being analyzed, including all of its sub-events and the corresponding news reports; Current is the latest sub-event of that event; Event_i ranges over all historical events sufficiently similar to Event; Event_j ranges over the sub-events in Event_i that are similar to Current; Subsequent is a subsequent sub-event of one or more of those sub-events; sim is the similarity between events or between sub-events; Out_degree is the number of subsequent sub-events of sub-event j in the historical event; and the parameter α is the weighting coefficient of the event: it is 1 for historical events that are not of the same kind, a higher weight can be used for events that are more similar, and 2 can be taken for simplicity.
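The score expression itself is not reproduced in this text, so the sketch below is only one plausible reading of the definitions above, not the patented formula. It assumes that each candidate subsequent sub-event accumulates α × sim(Event, Event_i) × sim(Current, Event_j) × 1/Out_degree(j) over every similar event i and similar sub-event j that lists it as a successor. The input layout, the shared successor labels, and the name `score_subsequent_events` are all hypothetical.

```python
from collections import defaultdict

def score_subsequent_events(candidates, event_threshold=0.5,
                            sub_threshold=0.5, top_n=5):
    """candidates: list of historical events, each a dict with
         "event_sim":  sim(Event, Event_i)
         "alpha":      event weight (optional, defaults to 1.0)
         "sub_events": list of {"sim": sim(Current, Event_j),
                                "successors": [label, ...]}
    Returns the top_n (label, score) pairs in descending score order."""
    scores = defaultdict(float)
    for event in candidates:
        event_sim = event["event_sim"]
        if event_sim <= event_threshold:      # step 8: must exceed threshold
            continue
        alpha = event.get("alpha", 1.0)
        for sub in event["sub_events"]:
            if sub["sim"] < sub_threshold:    # step 9: discard weak matches
                continue
            successors = sub["successors"]
            if not successors:
                continue
            share = 1.0 / len(successors)     # assumed 1 / Out_degree term
            for label in successors:
                scores[label] += alpha * event_sim * sub["sim"] * share
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

candidates = [
    {"event_sim": 0.8, "alpha": 2.0, "sub_events": [
        {"sim": 0.9, "successors": ["evacuation", "relief"]}]},
    {"event_sim": 0.6, "sub_events": [
        {"sim": 0.7, "successors": ["relief"]},
        {"sim": 0.3, "successors": ["lawsuit"]}]},
    {"event_sim": 0.4, "sub_events": [
        {"sim": 0.95, "successors": ["protest"]}]},
]
ranking = score_subsequent_events(candidates)
```

In this toy input, "relief" is supported by two historical events and therefore ranks above "evacuation", while "lawsuit" and "protest" are filtered out by the third and second thresholds respectively.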

Step 11: Rank the possible subsequent events by probability in descending order, and list the top n as the predicted subsequent development trend of the current sub-event. n can be adjusted as needed; in one embodiment of the invention, n is 5.

If the probability of occurrence of the subsequent sub-event is required to be output instead of the score, the Softmax operator can be used to convert the score of each possible subsequent sub-event into a probability.
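Converting scores to probabilities with Softmax can be sketched as follows. Subtracting the maximum score before exponentiating is a standard numerical-stability step added here; it does not change the result. The example scores are made up for illustration.

```python
import math

def softmax(scores):
    """Map a non-empty list of scores to probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores for two candidate subsequent sub-events.
probabilities = softmax([1.14, 0.72])
```

Softmax preserves the ranking of the scores, so the top-n list of step 11 is the same whether scores or probabilities are reported.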

It should be noted that the above-mentioned embodiments are exemplary, and that those skilled in the art, having benefit of the present disclosure, may devise various arrangements that are within the scope of the present disclosure and that fall within the scope of the invention. It should be understood by those skilled in the art that the present specification and figures are illustrative only and are not limiting upon the claims. The scope of the invention is defined by the claims and their equivalents.

Claims

1. A method for predicting event trends based on historical news stories, the method comprising:

Step 1: determining the field of an event to be predicted, downloading an existing public news text data set and its label information in the specified field, and, if no public data set exists, downloading news in the field from a specific news website using a web crawler;

Step 3: similar event calculation, namely, among the news labeled with main events, removing the key 3W information, performing similarity calculation using a set similarity threshold to find similar events, and labeling them as similar events after manual verification, wherein the key 3W information comprises: who (people), when (time), and where (location);

Step 6: for a new event to be predicted, acquiring core description news, or collecting news reports related to the new event to be predicted from open information sources, and labeling the news reports that lack labeling information;

Step 9: similar sub-event calculation, namely calculating the similarity between the current sub-event and the sub-events in the similar events, and discarding the sub-events whose similarity is below a third threshold;

Step 10: combining the similarity between the new event to be predicted and the similar events obtained in step 8, the similarity between the current sub-event and the sub-events in the similar events obtained in step 9, and the subsequent-event distribution of those sub-events, to calculate the subsequent-event score of the current sub-event, wherein the score is expressed mathematically as follows:

wherein Event is the event currently being analyzed, including all of its sub-events and the corresponding news reports; Current is the latest sub-event of that event; Event_i ranges over all historical events sufficiently similar to Event; Event_j ranges over the sub-events in Event_i that are most similar to Current; Subsequent is a subsequent sub-event of one or more of those sub-events; sim is the similarity between events or between sub-events; Out_degree is the number of subsequent sub-events of sub-event j in the historical event; and the parameter α is the weighting coefficient of the event;

2. The event trend prediction method of claim 1, wherein the manual labeling method comprises:

Step 21: randomly sampling a small number of news reports from the downloaded news, and having dedicated annotators read them and mark the main events they relate to;

Step 23: using the news in the labeled events as references, calculating the similarity between each unlabeled news item and the labeled news, assigning news whose similarity reaches a first threshold to the same event, and, when several events are sufficiently similar, taking the event with the highest similarity;

3. The event trend prediction method according to claim 2, wherein the step 6 further comprises:

Step 61: extracting key descriptors from the event description;

Step 63: obtaining the report content using a crawler;