CN111950199A

CN111950199A - Earthquake data structured automation method based on earthquake news event

Info

Publication number: CN111950199A
Application number: CN202010799527.0A
Authority: CN
Inventors: 俞一奇; 邱彦林; 陈尚武
Original assignee: Hangzhou Xujian Science And Technology Co ltd
Current assignee: Hangzhou Xujian Science And Technology Co ltd
Priority date: 2020-08-11
Filing date: 2020-08-11
Publication date: 2020-11-17

Abstract

The invention provides an automated method for structuring earthquake data based on earthquake news events. A web crawler crawls a large amount of news data from preset earthquake-related websites; trigger words and event elements in the collected news data are annotated using BIO tagging; the data set is randomly divided into a training data set and a test data set; a seismic event extraction model is constructed by combining a bidirectional long short-term memory network with a conditional random field; and the model is trained on the annotated training set. During training, the model is tested on the test set, and training ends once the accuracy requirement is met. The trained seismic event extraction model is then deployed in practical application: the web crawler crawls earthquake-related websites, each crawled item is analyzed in real time by the seismic event extraction model, and, if the item is of the earthquake event type, its earthquake event elements are extracted and stored in a database.

Description

Earthquake data structured automation method based on earthquake news event

Technical Field

The invention relates to the technical field of natural language processing, and in particular to an automated method for structuring earthquake data based on earthquake news events.

Background

An earthquake news event generally refers to news content about an earthquake that occurred at a particular time and place. It generally consists of a number of elements, including occurrence time, epicenter position, focal depth, magnitude, number of injured, number of dead, direct economic loss, and the like. Earthquakes occur tens of thousands of times each year around the world; in 2018 there were 542 earthquakes of magnitude 3.0 or above, and the related news reports are too numerous to count. Extracting the valuable elements from massive numbers of earthquake news reports, and integrating and structuring them, can provide necessary basic information for subsequent earthquake disaster analysis and prediction.

With the increasing openness of information on the internet and the development of natural language processing technology, it has become practical to acquire raw earthquake news through the network and then process it with a natural language model to obtain the corresponding results. Such a scheme acquires relevant earthquake information automatically and facilitates later retrieval and analysis; because manual searching and screening are not needed, labor cost is greatly reduced, and the scheme has significant big-data value.

Disclosure of Invention

In view of the above, the invention provides an automated method for structuring earthquake data based on earthquake news events. A web crawler continuously crawls news from earthquake-related websites; a trained seismic event extraction model processes the news content to judge whether it describes an earthquake event; if it does, the related elements are extracted from the content and stored in a database, providing necessary basic information for subsequent earthquake disaster analysis and prediction.

In order to achieve the purpose, the invention provides the following technical scheme:

an automated method for structuring seismic data based on seismic news events, mainly comprising the following steps:

step (1): crawling a large amount of news data from preset earthquake-related websites by using a web crawler;

step (2): annotating the trigger words and event elements in the collected news data using BIO tagging;

step (3): randomly dividing the data set into a training data set and a test data set;

step (4): constructing a seismic event extraction model, wherein the model is implemented by combining a bidirectional long short-term memory network (Bi-LSTM) with a conditional random field (CRF);

step (5): training the seismic event extraction model on the annotated training set; during training, testing the model on the test set, and ending the training if the accuracy requirement is met;

step (6): deploying the trained seismic event extraction model in practical application: crawling earthquake-related websites with the web crawler, analyzing each crawled item in real time with the seismic event extraction model, and, if the item is of the earthquake event type, extracting its earthquake event elements and storing them in a database.

Compared with the prior art, the invention has the beneficial effects that:

the method can automatically and accurately extract earthquake events and their related event elements from massive amounts of internet news data, which facilitates retrieval and analysis and provides necessary basic information for subsequent earthquake disaster analysis and prediction; because manual searching and screening are not needed, labor cost is greatly reduced, and the method has significant value for big-data application and research.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is an overall flow chart of a seismic data structuring automation method based on seismic news events provided in an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a Bi-LSTM recurrent neural network provided in an embodiment of the present invention;

as shown in FIG. 2, the Bi-LSTM consists of 2 × n units of identical structure, where n equals the length of the input data. Each unit consists of an input layer, a hidden layer, and an output layer. The output of the first unit serves as input to the second unit, and so on until the last unit completes the forward computation; the computation then proceeds from the last unit back toward the first until the first unit completes the reverse computation; the forward and reverse results for the same input are added to obtain each output;
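
The forward pass, the reverse pass, and the final addition described for FIG. 2 can be sketched as follows. This is a minimal illustration only: it assumes a toy scalar tanh cell in place of a full LSTM unit, and the function names and weight values are hypothetical.

```python
import math

def run_direction(inputs, weight_in, weight_state):
    """Run a toy scalar recurrent cell over the inputs; one output per step."""
    state = 0.0
    outputs = []
    for x in inputs:
        # Each unit receives the current input and the previous unit's output.
        state = math.tanh(weight_in * x + weight_state * state)
        outputs.append(state)
    return outputs

def bidirectional(inputs, forward_params, backward_params):
    """Add the forward and reverse results that belong to the same input."""
    forward = run_direction(inputs, *forward_params)
    # Reverse pass: process the reversed sequence, then flip the outputs back
    # so that index i again corresponds to input i.
    backward = run_direction(inputs[::-1], *backward_params)[::-1]
    return [f + b for f, b in zip(forward, backward)]

sequence = [0.5, -1.0, 0.25]
combined = bidirectional(sequence, (0.8, 0.3), (0.6, 0.2))
print(combined)
```

Each direction uses its own weights, which is why the structure contains 2 × n units for an input of length n.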

FIG. 3 is a schematic diagram of a single LSTM structure provided in an embodiment of the present invention;

as shown in FIG. 3, the unit includes 4 network layers: the activation functions of two of them are sigmoid functions, and those of the other two are hyperbolic tangent (tanh) functions. In addition, 3 gates, labeled in FIG. 3, control how information flows.

The gate is the most characteristic feature of the LSTM recurrent neural network and serves to retain information and filter noise. x^i is the input to the i-th recurrent unit; the unit also takes the cell state c^(i-1) and the activation value a^(i-1) as inputs and, after computation, outputs y^i, the cell state c^i, and the activation value a^i; c^i and a^i then serve as inputs to the (i+1)-th recurrent unit. The whole process is as follows:

[formulas omitted in source]

where W_f, W_u, and W_t are the weight coefficients of the three steps, b_f, b_u, and b_t are the corresponding bias coefficients, and the remaining symbols labeled in FIG. 3 are the intermediate variables generated during the computation;
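
For orientation, the sketch below shows one step of an LSTM unit in a common textbook formulation. It is an assumption, not the formulas of FIG. 3: it uses a forget gate, an update gate, an output gate, and a tanh candidate, each with its own hypothetical parameters, and works on scalars for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, a_prev, c_prev, params):
    """One scalar LSTM step; params maps a layer name to (w_a, w_x, b)."""
    def layer(name, squash):
        w_a, w_x, b = params[name]
        return squash(w_a * a_prev + w_x * x + b)

    forget = layer("forget", sigmoid)        # how much of c_prev to keep
    update = layer("update", sigmoid)        # how much new content to admit
    output = layer("output", sigmoid)        # how much of the state to expose
    candidate = layer("candidate", math.tanh)  # proposed new content

    c = forget * c_prev + update * candidate   # new cell state
    a = output * math.tanh(c)                  # new activation value
    return a, c

# With all parameters zero every gate equals 0.5 and the candidate equals 0.
zero_params = {name: (0.0, 0.0, 0.0)
               for name in ("forget", "update", "output", "candidate")}
a_next, c_next = lstm_step(x=1.0, a_prev=0.0, c_prev=1.0, params=zero_params)
print(a_next, c_next)
```

The returned pair corresponds to the activation value and cell state that the text passes on to the next recurrent unit.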

FIG. 4 is a schematic diagram of an example of BIO labeling provided in an embodiment of the present invention;

FIG. 5 is a schematic diagram of the overall structure of the seismic event extraction model provided in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

The overall flow chart of the seismic data structuring automation method based on the seismic news event provided in the embodiment of the invention is shown in fig. 1, and mainly comprises the following steps:

step (1): crawling relevant news from earthquake websites by using a web crawler; earthquake news source websites (such as the national earthquake administration, the emergency management department, and the provincial earthquake bureaus) are selected in advance and a corresponding XPath is set for each, so that the crawler can automatically download all news in the news list;
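
As an illustration of how a per-site XPath can select the entries of a news list, the following sketch parses an embedded, well-formed sample page with Python's standard library. The page markup, the XPath, and the function name are hypothetical; the limited XPath subset of `xml.etree.ElementTree` is assumed to be sufficient here, whereas real pages are often not well-formed and would need an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical list page of a news source (well-formed for this sketch).
LIST_PAGE = """<html><body>
<ul class="news-list">
  <li><a href="/news/1.html">M5.1 earthquake report</a></li>
  <li><a href="/news/2.html">Aftershock bulletin</a></li>
</ul>
<div class="footer"><a href="/about.html">About</a></div>
</body></html>"""

# Hypothetical XPath configured in advance for this site.
NEWS_LINK_XPATH = ".//ul[@class='news-list']/li/a"

def list_news_links(page_source, xpath):
    """Return (href, title) for every news entry matched by the XPath."""
    root = ET.fromstring(page_source)
    return [(a.get("href"), (a.text or "").strip()) for a in root.findall(xpath)]

links = list_news_links(LIST_PAGE, NEWS_LINK_XPATH)
print(links)
```

The footer link is not matched, so only the entries of the news list would be queued for download.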

step (2): annotating the trigger words and event elements in the collected news data using BIO tagging;

the trigger word is a prerequisite: event elements are extracted only when a trigger word is detected and the news is regarded as a seismic event;

the trigger word is used to judge whether the news is an earthquake event and includes the keyword "earthquake"; if a trigger word is detected, the news is regarded as an earthquake event; the event elements comprise 7 types of content: occurrence time, epicenter position, focal depth, magnitude, number of injured, number of dead, and direct economic loss; "B-event element" marks the beginning of an element, "I-event element" marks the inside of an element, and "O" marks a non-element; a labeling example is shown in FIG. 4;
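
A minimal sketch of character-level BIO tagging is given below. The label names TRIGGER and MAGNITUDE, the span format, and the sample text are assumptions made for illustration; FIG. 4 remains the authoritative example.

```python
def bio_tags(text, spans):
    """Tag each character; spans are (start, end, label) with end exclusive."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = "B-" + label           # first character of the element
        for i in range(start + 1, end):
            tags[i] = "I-" + label           # remaining characters of the element
    return tags

text = "earthquake M6.4"
spans = [(0, 10, "TRIGGER"), (11, 15, "MAGNITUDE")]
tags = bio_tags(text, spans)
for ch, tag in zip(text, tags):
    print(repr(ch), tag)
```

Characters outside every annotated span, such as the space in the sample, keep the "O" tag.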

step (3): randomly dividing the annotated news data set into a training data set and a test data set, wherein the test data set accounts for 20%;
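
The random 80/20 division can be sketched as follows; the function name and the fixed seed (used only to make the example repeatable) are assumptions.

```python
import random

def split_dataset(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and return (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train_set, test_set = split_dataset(range(100))
print(len(train_set), len(test_set))
```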

step (4): constructing a seismic event extraction model, wherein the seismic event extraction model is implemented by combining a Bi-LSTM with a CRF, and its structure is shown in FIG. 5;

(4.1) the characters of the news content are input into the seismic event extraction model; the content may have any length, denoted n; each character is first converted into a corresponding vector x_i by a word2vec module; the word2vec module is a pre-trained open-source character vector library that covers common characters such as Chinese characters, English letters, and punctuation marks, and the vector x_i of each character has dimension 100; the vector of each character of the news content is looked up, and the final output of the word2vec module is the n × 100 matrix (x₁,x₂,…,x_n), where each x_i is a vector of length 100; the purpose of this step is to digitize the news content;
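
The lookup in step (4.1) amounts to mapping each character to a 100-dimensional vector. The sketch below uses a randomly filled toy table in place of the pre-trained character vector library, and a zero vector for unknown characters; both are assumptions for illustration.

```python
import random

EMBED_DIM = 100

def build_toy_table(chars, seed=0):
    """Stand-in for a pre-trained character vector library."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for ch in chars}

def embed(text, table, unknown):
    """Return one vector per character, i.e. an n x 100 matrix as nested lists."""
    return [table.get(ch, unknown) for ch in text]

table = build_toy_table("abcdefghijklmnopqrstuvwxyz0123456789. ")
unknown_vector = [0.0] * EMBED_DIM
matrix = embed("quake 5.1", table, unknown_vector)
print(len(matrix), len(matrix[0]))
```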

(4.2) the vector x_i of each character from step (4.1) is fed in sequence into the Bi-LSTM module, and the recurrent computation yields the output vector y_i of each LSTM unit; y_i has dimension 17 (7 types of event elements plus 1 type of trigger word, each type having the two labels "B-" and "I-", plus one "O" label), and its entries are the probability values of the 17 labels; the final output of the Bi-LSTM module is the n × 17 matrix (y₁,y₂,…,y_n), where each y_i is a vector of length 17;
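
The count of 17 labels follows from (7 + 1) × 2 + 1. The sketch below builds such a label set and turns raw scores into probability values; the label names are hypothetical, and the use of a softmax is an assumption, since the text does not state how the probabilities are produced.

```python
import math

# 1 trigger type plus 7 event element types (names are illustrative).
EVENT_TYPES = ["TRIGGER", "TIME", "EPICENTER", "DEPTH",
               "MAGNITUDE", "INJURED", "DEAD", "LOSS"]
LABELS = ["O"] + [prefix + name for name in EVENT_TYPES for prefix in ("B-", "I-")]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)                      # subtract the peak for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.0] * len(LABELS))
print(len(LABELS), sum(probs))
```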

(4.3) the probability values output by each unit in step (4.2) are passed through a CRF layer to obtain the final result path; the CRF layer adds constraints that ensure the final prediction is valid (for example, "B-Label1 I-Label1" is valid and "B-Label1 I-Label2" is invalid), and the CRF layer learns these constraints automatically from the training data; the CRF is trained and predicts by computing the scores of all possible paths, the score of each possible path being denoted P_i; if there are N paths, the total score of the paths is

[formula omitted in source]

where one term [symbol omitted in source] represents the probability of the corresponding label output by the i-th LSTM unit;

and another term [symbol omitted in source] represents the transition probability from the i-th label to the (i+1)-th label, which is a parameter of the CRF layer and is learned automatically during training;

during training, the loss function is defined as follows:

[formula omitted in source]

where P_RealPath represents the score of the true path (the annotated result);

in the actual prediction, the path with the highest score is obtained as the final result, i.e.

P_predict＝max(P₁,P₂,…,P_N)；
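
The scoring in step (4.3) can be illustrated with a brute-force sketch under a common linear-chain CRF formulation. The following are assumptions, not the patent's own formulas: a path's raw score is the sum of its per-character label scores and label-to-label transition scores, P_i is the exponential of that raw score, the total is the sum of all P_i, and the loss is the negative logarithm of P_RealPath divided by the total. The numbers are invented for a 3-character, 3-label toy case.

```python
import itertools
import math

def path_score(path, emissions, transitions):
    """Raw score: per-character label scores plus transition scores."""
    score = sum(emissions[i][label] for i, label in enumerate(path))
    score += sum(transitions[a][b] for a, b in zip(path, path[1:]))
    return score

def all_paths(n_steps, n_labels):
    """Enumerate every possible label path (feasible only for toy sizes)."""
    return itertools.product(range(n_labels), repeat=n_steps)

def crf_loss(real_path, emissions, transitions):
    """Negative log of P_RealPath / (P_1 + ... + P_N), with P_i = exp(score_i)."""
    n_labels = len(transitions)
    total = sum(math.exp(path_score(p, emissions, transitions))
                for p in all_paths(len(emissions), n_labels))
    real = math.exp(path_score(real_path, emissions, transitions))
    return -math.log(real / total)

def best_path(emissions, transitions):
    """Prediction: the path with the highest score."""
    n_labels = len(transitions)
    return max(all_paths(len(emissions), n_labels),
               key=lambda p: path_score(p, emissions, transitions))

emissions = [[2.0, 0.5, 0.1], [0.3, 1.8, 0.2], [0.4, 0.3, 1.5]]
transitions = [[0.5, 0.2, -0.5], [0.1, 0.6, 0.3], [0.4, -2.0, 0.5]]
predicted = best_path(emissions, transitions)
loss = crf_loss(predicted, emissions, transitions)
print(predicted, loss)
```

Enumerating all paths grows exponentially with text length, so practical CRF layers compute the total with dynamic programming and the best path with the Viterbi algorithm; the enumeration here only makes the definitions concrete.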

step (5): training the seismic event extraction model constructed in step (4);

(5.1) inputting the training samples into the seismic event extraction model in batches;

(5.2) during training, the loss value is calculated according to the loss function defined in step (4.3), and the weights of the seismic event extraction model are continuously updated by stochastic gradient descent;

the weights of a deep learning model are a general term for its parameters; they are usually initialized randomly and are updated by learning from samples.

Gradient descent is the most basic weight update method in machine learning.
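
The update rule of stochastic gradient descent can be shown on a deliberately small problem. The sketch below fits a single weight w so that w · x matches y on four invented samples; the squared-error loss, the learning rate, and the batch size are assumptions chosen for the illustration and are unrelated to the actual model.

```python
import random

def sgd_fit(samples, learning_rate=0.05, epochs=200, batch_size=2, seed=0):
    """Fit y = w * x by mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    weight = rng.uniform(-1.0, 1.0)          # random initial weight
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)                    # visit samples in random order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Batch-averaged gradient of 0.5 * (w * x - y) ** 2 with respect to w.
            grad = sum((weight * x - y) * x for x, y in batch) / len(batch)
            weight -= learning_rate * grad   # step against the gradient
    return weight

fitted = sgd_fit([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)])
print(fitted)
```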

(5.3) after a large number of training iterations, the loss value output by the seismic event extraction model converges to a low value; from then on, after each training iteration, the seismic event extraction model is tested on the test data set, the model's predictions are compared with the manual annotations, and the accuracy is calculated (number of correct results / total number of results); if the test accuracy exceeds 97%, the whole training process ends; otherwise, the process returns to step (5.1) and training continues;
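
The stopping rule in step (5.3) can be sketched as below. The text does not say whether a "result" is a single label or a whole news item; this sketch assumes label-level counting, and the function names are hypothetical.

```python
def accuracy(predicted_paths, annotated_paths):
    """Share of labels that match the manual annotation."""
    correct = 0
    total = 0
    for predicted, annotated in zip(predicted_paths, annotated_paths):
        for p, a in zip(predicted, annotated):
            correct += (p == a)
            total += 1
    return correct / total if total else 0.0

def training_finished(predicted_paths, annotated_paths, threshold=0.97):
    """Training ends once the test accuracy exceeds the threshold."""
    return accuracy(predicted_paths, annotated_paths) > threshold

annotated = [["B-TRIGGER", "I-TRIGGER", "O", "O"]]
predicted = [["B-TRIGGER", "I-TRIGGER", "O", "B-TIME"]]
print(accuracy(predicted, annotated), training_finished(predicted, annotated))
```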

step (6): deploying the trained seismic event extraction model in practical application;

(6.1) crawling the earthquake news source websites with the web crawler, extracting the body text of each news item by using its HTML (hypertext markup language) tags, and filtering out irrelevant content such as pictures and external links;
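
One way to keep the body text while dropping pictures and links is sketched below with Python's standard `html.parser`. The choice to discard the text of every `<a>` element (treating all links as external) and the sample page are assumptions for illustration.

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect visible text, skipping scripts, styles, and link text."""
    SKIPPED = {"script", "style", "a"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Pictures carry no text data, so they vanish without special handling.
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = ('<div><p>A magnitude 5.1 earthquake struck at 08:30.</p>'
        '<img src="map.png"><a href="http://example.com">Related</a>'
        '<p>No casualties were reported.</p></div>')
body = extract_text(page)
print(body)
```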

(6.2) inputting the processed news content into the seismic event extraction model, which outputs the label path with the maximum probability;

(6.3) parsing the label path and judging whether it contains a trigger word label; if so, the event element information contained in the path is extracted and stored in a database; if not, the news content is discarded and the next news item is processed.
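
Step (6.3) can be sketched as decoding the label path into spans, checking for a trigger word, and writing the remaining elements to a database. The table layout, the label names, and the use of an in-memory SQLite database are assumptions for illustration.

```python
import sqlite3

def decode_elements(chars, tags):
    """Group characters into (label, text) spans following the B-/I- tags."""
    spans, label, buffer = [], None, []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if label:
                spans.append((label, "".join(buffer)))
            label, buffer = tag[2:], [ch]
        elif tag.startswith("I-") and label == tag[2:]:
            buffer.append(ch)
        else:
            # "O" or an inconsistent I- tag closes any open span.
            if label:
                spans.append((label, "".join(buffer)))
            label, buffer = None, []
    if label:
        spans.append((label, "".join(buffer)))
    return spans

def store_if_earthquake(connection, chars, tags):
    """Store event elements only when the path contains a trigger word."""
    spans = decode_elements(chars, tags)
    if not any(label == "TRIGGER" for label, _ in spans):
        return False                      # not an earthquake event: discard
    rows = [(label, text) for label, text in spans if label != "TRIGGER"]
    connection.executemany(
        "INSERT INTO event_elements (label, value) VALUES (?, ?)", rows)
    connection.commit()
    return True

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE event_elements (label TEXT, value TEXT)")
text = "earthquake M6.4"
tags = ["B-TRIGGER"] + ["I-TRIGGER"] * 9 + ["O", "B-MAGNITUDE"] + ["I-MAGNITUDE"] * 3
stored = store_if_earthquake(connection, text, tags)
print(stored, connection.execute("SELECT * FROM event_elements").fetchall())
```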

It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An automatic seismic data structuring method based on seismic news events is characterized by comprising the following steps:

step (1): crawling relevant news of the earthquake website by using a web crawler; selecting an earthquake news source website in advance and setting a corresponding XPath, wherein a crawler can automatically download all news in a news list;

step (4): constructing a seismic event extraction model, wherein the seismic event extraction model is implemented by combining a Bi-LSTM with a CRF;

step (6): deploying the trained seismic event extraction model in practical application.

2. The automated seismic data structuring method based on seismic news events according to claim 1, wherein the trigger word in step (1) is a prerequisite, and event elements are extracted only when a trigger word is detected and the news is regarded as a seismic event;

the trigger word is used to judge whether the news is an earthquake event and includes the keyword "earthquake"; if a trigger word is detected, the news is regarded as an earthquake event; the event elements comprise 7 types of content: occurrence time, epicenter position, focal depth, magnitude, number of injured, number of dead, and direct economic loss; "B-event element" marks the beginning of an element, "I-event element" marks the inside of an element, and "O" marks a non-element.

3. An automated seismic data structuring method based on seismic news events as claimed in claim 1, wherein the specific flow of step (4) is as follows:

(4.1) the characters of the news content are input into the seismic event extraction model; the content may have any length, denoted n; each character is first converted into a corresponding vector x_i by a word2vec module; the word2vec module is a pre-trained open-source character vector library that covers common characters such as Chinese characters, English letters, and punctuation marks, and the vector x_i of each character has dimension 100; the vector of each character of the news content is looked up, and the final output of the word2vec module is the n × 100 matrix (x₁,x₂,…,x_n), where each x_i is a vector of length 100; the purpose of this step is to digitize the news content;

(4.2) the vector x_i of each character from step (4.1) is fed in sequence into the Bi-LSTM module, and the recurrent computation yields the output vector y_i of each LSTM unit; y_i has dimension 17, and its entries are the probability values of the 17 labels; the final output of the Bi-LSTM module is the n × 17 matrix (y₁,y₂,…,y_n), where each y_i is a vector of length 17;

(4.3) the probability values output by each unit in step (4.2) are passed through a CRF layer to obtain the final result path; the CRF layer adds constraints to ensure that the final prediction is valid, and the CRF layer learns these constraints automatically from the training data; the CRF is trained and predicts by computing the scores of all possible paths, the score of each possible path being denoted P_i; if there are N paths, the total score of the paths is

[formula omitted in source]

during training, the loss function is defined as follows:

[formula omitted in source]

where P_RealPath represents the score of the true path;

P_predict＝max(P₁,P₂,…,P_N)。

4. An automated seismic data structuring method based on seismic news events as claimed in claim 3, wherein the specific flow of step (5) is as follows:

(5.3) after a large number of training iterations, the loss value output by the seismic event extraction model converges to a low value; from then on, after each training iteration, the seismic event extraction model is tested on the test data set, the model's predictions are compared with the manual annotations, and the accuracy is calculated; if the test accuracy exceeds 97%, the whole training process ends; otherwise, the process returns to step (5.1) and training continues.

5. An automated seismic data structuring method based on seismic news events according to claim 3 or 4, wherein the specific flow of step (6) is as follows:

(6.1) crawling an earthquake news source website through a web crawler, extracting the text of news by using a hypertext markup language tag, and filtering out irrelevant contents such as pictures and external links;