CN113515624B

CN113515624B - Text classification method for emergency news

Info

Publication number: CN113515624B
Application number: CN202110467773.0A
Authority: CN
Inventors: 孙锐; 谢红
Original assignee: Leshan Normal University
Current assignee: Leshan Normal University
Priority date: 2021-04-28
Filing date: 2021-04-28
Publication date: 2023-07-21
Anticipated expiration: 2041-04-28
Also published as: CN113515624A

Abstract

The invention provides a text classification method for emergency news, which belongs to the field of natural language processing and comprises the following steps: collecting news documents, completing data cleaning, and performing preprocessing operations such as word segmentation, dependency analysis and coreference resolution on the documents to obtain a news data set D; adding the news data set D to a background corpus, and learning distributed representations of words by training with Word2Vec; performing event extraction on each news document d in the news data set D and constructing an event dictionary; clustering all events in the event dictionary with the Chinese Whispers method, a non-parametric clustering method, to obtain event clusters; calculating the occurrence frequency and the inverse document frequency of each event cluster obtained by clustering to extract feature events; constructing a feature vector for each news document according to the feature events; and classifying the news documents with a support vector machine classification algorithm. The method has strong semantic representation capability and class discrimination.

Description

Text classification method for emergency news

Technical Field

The invention belongs to the field of natural language processing, and particularly relates to a text classification method for emergency news.

Background

Emergencies are natural disasters, accidents, public health events and social security events that occur suddenly, cause or may cause serious social harm, and require emergency response measures. After such an event, related news reports spread rapidly over the network, and most of them attract the attention of government departments and the public. Classifying this news by topic with text classification technology helps people understand and analyze the causes, course and subsequent effects of the event, and helps the relevant departments control, mitigate and eliminate the serious social harm caused by the emergency and make supporting decisions.

Many sub-events often accompany or derive from the occurrence and evolution of an emergency. For example, the event "Typhoon Rammasun strikes" is generally accompanied by events such as "the meteorological observatory issues an early warning", "people are injured", "communications are interrupted" and "people are evacuated", while the event "Yunnan earthquake" is generally accompanied by events such as "an earthquake occurs in Yunnan", "people die", "houses collapse" and "the civil affairs department reports". By analyzing events with salient features, news can readily be categorized by emergency topic.

In the field of natural language processing, an event generally refers to the occurrence of an action or a change of state, and consists of a trigger word and one or more arguments. An event itself contains the semantic relations among its words and has stronger semantic representation capability than the traditional bag-of-words model, and therefore discriminates better between categories. Text classification that uses events as features should therefore be simpler and more effective for emergency news.

With the widespread application of information technology, a large number of related news reports appear on the network after an emergency, and most of these news texts attract the attention of government departments and the public. As noted above, classifying this news by topic supports both the analysis of an event and the decisions of the relevant departments. The prior art mainly adopts classification models based on the bag-of-words model, that is, it represents documents by vocabulary features. This approach ignores the semantic relations among words and has weak semantic representation capability.

Therefore, the application provides a text classification method for emergency news.

Disclosure of Invention

In order to overcome the defects in the prior art, the invention provides a text classification method aiming at emergency news.

In order to achieve the above object, the present invention provides the following technical solutions:

a text classification method for emergency news comprises the following steps:

collecting news documents from the internet, completing data cleaning, and performing the preprocessing operations of word segmentation, dependency analysis and coreference resolution on each of the news documents with a natural language processing tool to obtain a news data set D;

adding the preprocessed news data set D to a background corpus, and learning distributed representations of words by training with Word2Vec;

extracting events from each news document d in the news data set D, and constructing an event dictionary;

clustering all events in the event dictionary with the Chinese Whispers method, a non-parametric clustering method, to obtain event clusters;

calculating the occurrence frequency and the inverse document frequency of each event cluster obtained by clustering to extract feature events;

constructing a feature vector for each news document according to the feature events;

and classifying the news documents with a support vector machine classification algorithm.

Preferably, the data cleaning of the news documents is accomplished using an existing natural language processing toolkit.

Preferably, the specific steps of extracting events from each news document d in the news data set D and constructing an event dictionary include:

scanning the dependency relations of types 'nsubj' and 'dobj' in the dependency analysis result of each news document d to obtain a set ea of binary dependency relations, wherein each binary relation represents an event-argument relation;

scanning the binary dependency relation set ea in turn, and merging two event-argument relations into one candidate event if their predicates are the same;

representing each of the remaining unmerged binary dependency relations in the set ea as a candidate event of its own;

obtaining the event set de of each news document from all candidate events, that is, each document consists of a number of events;

repeating the above four steps until event extraction has been completed for all documents in the news data set D, thereby obtaining the overall event set DE of the news data set D;

scanning the event set DE and constructing the event dictionary ED = {event_1, event_2, …, event_m}, wherein event_i denotes the i-th event and m denotes the dictionary size, that is, the number of event categories, and all events with the same arguments belong to the same category.

Preferably, the specific steps of clustering all events in the event dictionary with the Chinese Whispers method, a non-parametric clustering method, to obtain event clusters include:

calculating the distributed representation of each event by semantic composition, wherein subj, pred and obj denote the subject, predicate and object of the event respectively, and the composition formula combines their word vectors using a Kronecker product operation and a dot product operation;

calculating the similarity sim(event_i, event_j) between each pair of events using cosine similarity;

clustering all events of the event dictionary ED with the Chinese Whispers algorithm to obtain different event clusters;

after the clustering is completed, obtaining the event clusters EC = {ec_1, ec_2, …, ec_x}, wherein each cluster ec_i contains events with high semantic similarity and i is the cluster number.
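The similarity computation described above can be sketched in Python as follows. The sketch does not reproduce the composition formula of the method; as a stand-in it assumes element-wise multiplication of three equal-length word vectors, and the function names `compose_event` and `cosine_similarity` are hypothetical.

```python
import math


def compose_event(subj_vec, pred_vec, obj_vec):
    """Combine three word vectors into one event vector.

    Assumption: element-wise multiplication is used here only as a
    stand-in; it is not the composition formula of the method.
    """
    return [s * p * o for s, p, o in zip(subj_vec, pred_vec, obj_vec)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)
```

A missing subject or object would need some neutral vector under this assumption (for instance a vector of ones); how the method itself treats missing arguments is not stated in the text.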

Preferably, the specific steps of clustering all events of the event dictionary ED with the Chinese Whispers algorithm to obtain different event clusters include:

constructing an event graph G = (Vertex, Edge), wherein Vertex denotes the vertex set of the graph and Edge denotes the edge set of the graph; initially each event is a node that forms a cluster by itself, that is, Vertex = ED = {event_1, event_2, …, event_m} and Edge = { }, so that no edges exist in the graph;

scanning each event node event_i in turn, finding for it the unconnected event node event_j with the highest similarity, and gathering the two into one cluster; if several nodes share the highest similarity, one of them is selected at random;

repeating the scanning step until a convergence condition is met, wherein the convergence condition is set according to an event similarity threshold.

Preferably, the specific steps of calculating the occurrence frequency and the inverse document frequency of each event cluster obtained by clustering to extract feature events include:

scanning the overall event set DE of the news data set D and counting the occurrence frequency ecf of each event cluster ec_i;

scanning the event set de of each news document and calculating the inverse document frequency idf of each event cluster ec_i;

calculating ecf*idf for each event cluster ec_i, which represents the feature significance of the event cluster ec_i;

sorting the event clusters by feature significance in descending order, extracting the K clusters with the largest feature values, and constructing the feature event dictionary FED = {fed_1, fed_2, …, fed_K}, wherein fed_i is the i-th most significant event cluster, i = 1, 2, …, K.

Preferably, the specific steps of constructing a feature vector for each news document according to the feature events include:

scanning each event cluster fed_i in the feature event dictionary FED in turn, and counting the occurrence frequency edf_i of the event cluster in each news document d;

scanning each event cluster fed_i in the feature event dictionary FED in turn, and calculating the feature value of the document in each feature dimension as fd_i = ecf_i*idf_i*edf_i, that is, the product of the event cluster significance ecf_i*idf_i and the document feature edf_i of the event cluster;

after the feature event dictionary has been scanned, obtaining the document feature vector fd = [fd_1, fd_2, …, fd_K].
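As a worked illustration of the feature value, the following uses purely hypothetical numbers. It also assumes the common definition of inverse document frequency as the natural logarithm of N/df_i, where N is the number of documents and df_i is the number of documents containing the cluster; the text does not spell out which definition is used.

```latex
\[
fd_i = ecf_i \cdot idf_i \cdot edf_i,
\qquad
idf_i = \ln\frac{N}{df_i} \quad \text{(assumed definition)}
\]
\[
N = 100,\; df_i = 10,\; ecf_i = 30,\; edf_i = 2
\;\Longrightarrow\;
fd_i = 30 \cdot \ln 10 \cdot 2 \approx 138.16
\]
```

Under this assumption, a cluster that never occurs in a given document (edf_i = 0) contributes a zero in that dimension.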

The text classification method for the emergency news has the following beneficial effects:

1) The invention adopts atomic events as basic features, which have stronger semantic representation capability and category discrimination than traditional word features;

2) The invention introduces the compositional semantics of word vectors to represent atomic events and generates event clusters with a non-parametric clustering algorithm, which avoids the sparsity problem caused by events that are semantically similar but expressed differently;

3) The invention improves on the traditional TF-IDF algorithm by introducing the corpus occurrence frequency of an event, its inverse document frequency and its document occurrence frequency, so as to generate more discriminative document vectors.

Drawings

In order to more clearly illustrate the embodiments of the present invention and the design thereof, the drawings required for the embodiments will be briefly described below. The drawings in the following description are only some of the embodiments of the present invention and other drawings may be made by those skilled in the art without the exercise of inventive faculty.

Fig. 1 is a flowchart of a text classification method for emergency news according to embodiment 1 of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the drawings and the embodiments, so that those skilled in the art can better understand the technical scheme of the present invention and can implement the same. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.

Example 1

The invention provides a text classification method for emergency news. Topical documents were collected from the Sina website (including 92 documents on the topic "this year's No. 9 typhoon Rammasun strikes", 102 on "Taiwan passenger plane catches fire in a forced landing", 54 on "Hangzhou bus arson case", 117 on "Yunnan earthquake", and so on) and used as the training and testing corpus for verifying the effectiveness of the proposed method. The embodiment on this data set shows that the method is simple and classifies accurately, and that taking atomic events as basic features gives stronger class discrimination for emergency news. As shown in Fig. 1, the method specifically comprises the following steps:

S1, collecting news document data from the Sina website, cleaning the data, and then performing preprocessing operations such as word segmentation, dependency analysis and coreference resolution on each document in the news corpus with a natural language processing tool; the news document collection is denoted as the news data set D = {d_1, d_2, …, d_n}, where d_i denotes the i-th news document and n denotes the total number of news documents in the set; this example selects Stanford CoreNLP, a natural language processing toolkit released by Stanford University;

the specific steps of S1 include: completing data cleaning on the crawled topical documents, such as conversion between full-width and half-width characters and removal of redundant non-Chinese symbols such as URLs; and preprocessing each document with the existing natural language processing toolkit Stanford CoreNLP, including word segmentation, dependency analysis and coreference resolution, so as to obtain the document set D.
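The cleaning part of S1 can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the cleaning code of the embodiment: the function name `clean_news_text` is hypothetical, width conversion is assumed to go from full-width to half-width (done here via Unicode NFKC normalization), and the URL pattern is a deliberately simple approximation.

```python
import re
import unicodedata

# Simple URL approximation: a scheme or "www." followed by ASCII URL characters.
# re.ASCII keeps \w from matching Chinese characters that directly follow a URL.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[\w./?&=%#:~+-]+", re.ASCII)


def clean_news_text(raw: str) -> str:
    """Convert full-width characters to half-width, drop URLs, tidy whitespace."""
    # NFKC maps full-width letters, digits, punctuation and the ideographic
    # space to their half-width equivalents.
    text = unicodedata.normalize("NFKC", raw)
    text = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Note that NFKC also folds other compatibility characters (for example full-width punctuation), which may or may not be wanted for a given corpus.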

S2, adding the preprocessed document set D to a background corpus, such as the People's Daily corpus, and learning distributed representations of words by training with a word embedding algorithm; common word embedding algorithms include Word2Vec, GloVe and the like, and Word2Vec is selected as the word embedding algorithm in this example.

S3, extracting events from each news document d in the news data set D, wherein each event is represented as an atomic event in the form of a subject-predicate-object triple, and constructing an event dictionary, with the following specific steps:

S31, scanning the dependency relations of types 'nsubj' and 'dobj' in the dependency analysis result of each news document d to obtain a set ea of binary dependency relations, wherein each binary relation can be used to represent an event-argument relation;

S32, scanning the binary dependency relation set ea in turn, and merging two event-argument relations into one candidate event if their predicates are the same; for example, given the sentence "the meteorological observatory issues a typhoon early warning on the 6th", the event "meteorological observatory, issue, early warning" can be obtained from the two dependency relations "nsubj(issue, meteorological observatory)" and "dobj(issue, early warning)";

S33, representing each of the remaining unmerged binary dependency relations in the event-argument relation set ea as a candidate event of its own;

S34, obtaining the event set de of each news document from all candidate events, that is, each document consists of a number of events;

S35, repeating the four steps S31, S32, S33 and S34 until event extraction has been completed for all documents in the news document set D, thereby obtaining the overall event set DE of the document set D;

S36, scanning the event set DE and constructing the event dictionary ED = {event_1, event_2, …, event_m}, wherein event_i denotes the i-th event and m denotes the dictionary size, that is, the number of event categories, and all events with the same arguments belong to the same category.
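The merging logic of the event extraction steps can be sketched as below. The input format is an assumption made for illustration: each document is given as a list of `(relation, predicate, argument)` tuples, a hypothetical flattening of a dependency parser's output in which predicates are identified by their surface form. The names `extract_events`, `build_event_dictionary` and `NIL` are likewise hypothetical.

```python
from collections import defaultdict

NIL = "nil"  # placeholder for a missing subject or object


def extract_events(dependencies):
    """Merge nsubj/dobj relations sharing a predicate into (subj, pred, obj)."""
    subjects, objects = defaultdict(list), defaultdict(list)
    predicates = []  # predicates in first-seen order
    for relation, predicate, argument in dependencies:
        if relation not in ("nsubj", "dobj"):
            continue  # only subject and direct-object relations are kept
        if predicate not in predicates:
            predicates.append(predicate)
        target = subjects if relation == "nsubj" else objects
        target[predicate].append(argument)
    events = []
    for predicate in predicates:
        # A relation left without a partner still yields an event of its own.
        for subj in subjects.get(predicate) or [NIL]:
            for obj in objects.get(predicate) or [NIL]:
                events.append((subj, predicate, obj))
    return events


def build_event_dictionary(document_events):
    """Map each distinct event over all documents to an index."""
    dictionary = {}
    for events in document_events:
        for event in events:
            dictionary.setdefault(event, len(dictionary))
    return dictionary
```

In a real parse the same predicate word can occur several times in one document, so a fuller version would key predicates by token position rather than by surface form.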

S4, clustering all events in the event dictionary ED with a non-parametric clustering method; common non-parametric clustering methods include the Chinese Whispers method, DBSCAN, hierarchical clustering and the like, and the Chinese Whispers method is selected in this example, with the following implementation steps:

S41, since the distributed representation of an event depends on each argument of the event, the distributed representation of each event is calculated by semantic composition. Common composition modes include concatenation, addition, multiplication and the like, and this example adopts multiplication. In the composition formula, subj, pred and obj denote the subject, predicate and object of the event respectively, and their word vectors are combined using a Kronecker product operation and a dot product operation;

S42, calculating the similarity sim(event_i, event_j) between each pair of events using cosine similarity;

S43, clustering all events of the event dictionary ED with the Chinese Whispers algorithm to obtain different event clusters, with the following specific steps:

S431, constructing an event graph G = (Vertex, Edge), wherein initially each event is a node that forms a cluster by itself, that is, Vertex = ED = {event_1, event_2, …, event_m} and Edge = { }, so that no edges exist in the graph;

S432, scanning each event node event_i in turn, finding for it the unconnected event node event_j with the highest similarity, and gathering the two into one cluster (that is, adding an edge); if several nodes share the highest similarity, one of them is selected at random;

S433, repeating S432 until the convergence condition is met, wherein the convergence condition is set according to an event similarity threshold (the threshold selected in this example is sim(event_i, event_j) > 0.6).

S44, after the clustering is completed, obtaining the event clusters EC = {ec_1, ec_2, …, ec_x}, wherein each cluster ec_i contains events with high semantic similarity and i is the cluster number. For example, events such as "person, injured, nil", "person, seriously injured, nil" and "nil, injure, person" are clustered together.
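One possible reading of the graph procedure in S431 to S433 is sketched below. It is an interpretation under assumptions, not the embodiment's implementation: "unconnected" is taken to mean "not yet in the same cluster", convergence is taken to mean that no remaining pair across clusters exceeds the threshold, and ties are broken by taking the first candidate rather than a random one so that the result is reproducible. The function name `cluster_events` and its parameters are hypothetical.

```python
def cluster_events(n_events, similarity, threshold=0.6):
    """Greedily merge each event with its most similar event in another cluster.

    `similarity(i, j)` returns the similarity of events i and j.
    Returns a sorted list of clusters, each a list of event indices.
    """
    parent = list(range(n_events))  # every event starts in its own cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    merged = True
    while merged:  # repeat the scan until no merge happens
        merged = False
        for i in range(n_events):
            best_j, best_sim = None, threshold
            for j in range(n_events):
                if find(i) == find(j):
                    continue  # already connected
                s = similarity(i, j)
                if s > best_sim:
                    best_j, best_sim = j, s
            if best_j is not None:
                parent[find(best_j)] = find(i)  # add an edge, join clusters
                merged = True

    clusters = {}
    for i in range(n_events):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Because a merge only needs one sufficiently similar pair, this reading behaves like single-link clustering and can chain loosely related events together; the threshold controls how far that chaining goes.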

S5, calculating the occurrence frequency and the inverse document frequency of each event cluster ec_i obtained by clustering to extract feature events, with the following implementation steps:

S51, scanning the overall event set DE of the document set D and counting the occurrence frequency ecf of each event cluster ec_i;

S52, scanning the event set de of each news document and calculating the inverse document frequency idf of each event cluster ec_i;

S53, calculating ecf*idf for each event cluster ec_i, which represents the feature significance of the event cluster ec_i;

S54, sorting the event clusters by feature significance in descending order, extracting the K clusters with the largest feature values (the number of features K can be set per embodiment, and K is set to 20 in this example), and constructing the feature event dictionary FED = {fed_1, fed_2, …, fed_K}, wherein fed_i is the i-th most significant event cluster, i = 1, 2, …, K. In this embodiment, event clusters that appear in multiple news documents and occur with high frequency are extracted as feature events, such as "person, injured, nil", "Yunnan, occur, earthquake" and "aircraft, forced landing, nil".
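The ranking in S51 to S54 can be sketched as follows. Each document is assumed to be already reduced to a list of event-cluster identifiers, and the inverse document frequency is assumed to be the natural logarithm of N/df, a common definition that the text does not confirm. The function name `select_feature_clusters` is hypothetical.

```python
import math
from collections import Counter


def select_feature_clusters(docs, k=20):
    """Rank event clusters by ecf * idf and keep the k highest.

    `docs` is a list of documents, each a list of event-cluster ids.
    Returns (top_k_ids, ecf, idf).
    """
    n_docs = len(docs)
    # ecf: total number of occurrences of each cluster in the corpus
    ecf = Counter(c for doc in docs for c in doc)
    # df: number of documents in which each cluster occurs at least once
    df = Counter(c for doc in docs for c in set(doc))
    # Assumed definition: idf = ln(N / df)
    idf = {c: math.log(n_docs / df[c]) for c in ecf}
    score = {c: ecf[c] * idf[c] for c in ecf}
    # Descending by score; ties broken by cluster id for reproducibility
    ranked = sorted(score, key=lambda c: (-score[c], c))
    return ranked[:k], ecf, idf
```

With this definition a cluster that occurs in every document scores zero, so the choice of idf variant (for example a smoothed one) noticeably affects which clusters are kept.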

S6, constructing a feature vector fd for each news document d, with the following specific steps:

S61, scanning each event cluster fed_i in the feature event dictionary FED in turn, and counting the occurrence frequency edf_i of the event cluster in each news document d;

S62, scanning each event cluster fed_i in the feature event dictionary FED in turn, and calculating the feature value of the document in each feature dimension as fd_i = ecf_i*idf_i*edf_i, that is, the product of the event cluster significance ecf_i*idf_i and the document feature edf_i of the event cluster;

S63, after the feature event dictionary has been scanned, obtaining the document feature vector fd = [fd_1, fd_2, …, fd_K].
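The vector construction of S6 reduces to a few lines once the corpus statistics are available. In this sketch the document is assumed to be a list of event-cluster identifiers, and `ecf` and `idf` are assumed to be plain mappings from cluster identifier to value; the function name `document_feature_vector` is hypothetical.

```python
from collections import Counter


def document_feature_vector(doc, feature_clusters, ecf, idf):
    """Return [ecf_i * idf_i * edf_i] over the feature clusters, in order."""
    edf = Counter(doc)  # occurrences of each cluster in this document
    # A cluster absent from the document has edf = 0 and contributes 0.
    return [ecf[c] * idf[c] * edf[c] for c in feature_clusters]
```

For example, with `feature_clusters = ["c", "a"]`, `ecf = {"c": 4, "a": 3}` and `idf = {"c": 0.5, "a": 0.25}` (hypothetical values), the document `["c", "c", "a"]` is mapped to `[4.0, 0.75]`.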

S7, classifying the news documents with a support vector machine (SVM) classification algorithm. Ten-fold cross-validation was performed on the news data set of this embodiment; the conventional method that uses words as features attains an accuracy of 0.83.
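The ten-fold cross-validation protocol mentioned in S7 can be sketched independently of the classifier. The generator below only produces the index splits; training and scoring an SVM on each split (for instance with an existing library) is left out. The function name `k_fold_indices` and the fixed shuffling seed are assumptions of this sketch.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each document index appears in exactly one test fold, and the reported accuracy is typically the mean over the k folds. A stratified split that preserves topic proportions in every fold may be preferable when topics differ in size.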

For text classification of emergency news, the invention adopts atomic events as basic features, extracts salient feature events through clustering and statistical analysis of the atomic events, and represents news document vectors by these feature events. The compositional semantics of word vectors is introduced to represent atomic events, and event clusters are generated with a non-parametric clustering algorithm, which avoids the sparsity problem caused by events that are semantically similar but expressed differently. The traditional TF-IDF algorithm is improved by introducing the corpus occurrence frequency of an event, its inverse document frequency and its document occurrence frequency to construct the feature event dictionary, so as to generate more discriminative document vectors. An atomic event contains the semantic information among its words and has stronger semantic representation capability than traditional word features, which addresses the low accuracy caused by the poor class discrimination of traditional classification methods based on vocabulary features.

The above embodiments are merely preferred embodiments of the present invention, the protection scope of the present invention is not limited thereto, and any simple changes or equivalent substitutions of technical solutions that can be obviously obtained by those skilled in the art within the technical scope of the present invention disclosed in the present invention belong to the protection scope of the present invention.

Claims

1. A text classification method for emergency news, characterized by comprising the following steps:

classifying the news documents with a support vector machine classification algorithm;

the specific steps of extracting events from each news document d in the news data set D and constructing an event dictionary include:

scanning the event set DE and constructing the event dictionary ED = {event_1, event_2, …, event_m}, wherein event_i denotes the i-th event and m denotes the dictionary size, that is, the number of event categories, and all events with the same arguments belong to the same category;

the specific steps of clustering all events in the event dictionary with the Chinese Whispers method, a non-parametric clustering method, to obtain event clusters include:

calculating the similarity sim(event_i, event_j) between each pair of events using cosine similarity;

after the clustering is completed, obtaining the event clusters EC = {ec_1, ec_2, …, ec_x}, wherein each cluster ec_i contains events with high semantic similarity and i is the cluster number;

the specific steps of clustering all events of the event dictionary ED with the Chinese Whispers algorithm to obtain different event clusters include:

scanning each event node event_i in turn, finding for it the unconnected event node event_j with the highest similarity, and gathering the two into one cluster; if several nodes share the highest similarity, one of them is selected at random;

2. The text classification method for emergency news according to claim 1, wherein the data cleaning of the news documents is accomplished using an existing natural language processing toolkit.

3. The text classification method for emergency news according to claim 1, wherein the specific steps of calculating the occurrence frequency and the inverse document frequency of each event cluster obtained by clustering to extract feature events comprise:

4. The text classification method for emergency news according to claim 3, wherein the specific steps of constructing a feature vector for each news document according to the feature events comprise: