CN111597328A

CN111597328A - New event theme extraction method

Info

Publication number: CN111597328A
Application number: CN202010541567.5A
Authority: CN
Inventors: 云红艳; 贺英; 张秀华; 李正民
Original assignee: Qingdao University
Current assignee: Qingdao University
Priority date: 2020-05-27
Filing date: 2020-06-15
Publication date: 2020-08-28
Anticipated expiration: 2040-06-15
Also published as: CN111597328B

Abstract

The invention belongs to the technical field of network information and relates to a new event theme extraction method. A news event text data set is vectorized using BERT, so that the representation captures context more closely and is more accurate. An attention-based bidirectional long short-term memory (BiLSTM) network learns from the large volume of news text on the network to discover new events, making efficient and accurate use of the data. The method combines supervised and unsupervised approaches, which is more efficient than using either alone. The method is simple, extracts deep semantic information, and analyzes and mines online news text to discover new events, which helps regulatory departments and individual users keep track of new events in real time and facilitates subsequent work.

Description

New event theme extraction method

Technical field:

The invention belongs to the technical field of network information and relates to a new event theme extraction method, in particular to a method that trains a new event discovery model with a bidirectional long short-term memory (BiLSTM) network based on BERT (Bidirectional Encoder Representations from Transformers) and an attention mechanism, and extracts new event themes through topic modeling analysis with multi-feature fusion.

Background art:

With the development of the internet in the big data era, people are surrounded by a large amount of news from a wide range of sources, such as newspapers and the web. The most common carrier of news is text, and text is also the most accessible way to obtain valuable information. Because news produced by different sources takes many forms, the format and content of news texts are often disorderly, and the volume of news produced is extremely large, so detecting Chinese news events purely by hand is almost impossible. Meanwhile, the large amount of text on the network reflects how much attention people pay to an event and how much it affects them, so mining online news text helps discover hot events as early as possible.

Hot news events have mainly been discovered by manual monitoring, which incurs a high resource cost when discovering and monitoring news events on the network. With the rise of machine learning, event discovery is now generally implemented by clustering: news texts are clustered to find new events, but the accuracy of this approach is low and misidentification is common. With the rise of neural networks, which have achieved great success in many fields, the limitation of hand-crafted features has been overcome, and neural networks are also better suited to big data. CN201810696452.6 provides a domain-oriented method for generating topic sentences of Chinese texts, which builds a domain knowledge graph for a domain text data set, extracts the semantic information of a text with a deep neural network model, classifies the text by topic sentence pattern, and finally generates the topic sentence of the text. However, this method has the following disadvantages: first, it is oriented only to data sets of a specific domain and is not suitable for general data sets across domains; second, it requires building a domain knowledge graph, which has a huge resource overhead and demands a high level of professional expertise; finally, it labels and classifies text data with deep learning for a specific domain only, and performs poorly on new data from a new domain. Therefore, it is necessary to provide a new event topic extraction method that discovers new events with deep learning and extracts their topics with topic modeling.

Summary of the invention:

The invention aims to overcome the defects of the prior art by designing and providing a method that trains a new event discovery model with a bidirectional long short-term memory (BiLSTM) network based on BERT (Bidirectional Encoder Representations from Transformers) and an attention mechanism, and extracts new event themes through topic modeling analysis with multi-feature fusion. Using neural networks from deep learning to mine and process massive text data, the method achieves efficient and accurate analysis and use of the text data.

In order to achieve the above object, the process of extracting a new event topic according to the present invention comprises the following steps:

Step 1: acquiring a news event text data stream according to event keywords and constructing a news event text data set from it, wherein each record comprises the event type label of a news text and the specific text description of the event, and dividing the news event text data set into a training set Train, a validation set Val and a test set Test;

Step 2: based on a BERT representation model, vectorizing the training set Train, the validation set Val and the test set Test divided in step 1 to obtain a high-dimensional dense vector representation of the news event text data set, wherein the BERT representation model has 12 layers, a hidden size of 768 and 12 attention heads;

Step 3: taking the high-dimensional dense vector representation of the news event text data set obtained in step 2 as input, and, using the training set Train and the validation set Val, initializing the neural network parameters with Xavier initialization and updating the neural network parameters and the input feature vectors with a dropout strategy and gradient descent, to obtain a new event discovery model;

Step 4: setting a threshold for the new event discovery model; if the recognition result is greater than the threshold, judging that the event belongs to a known news event type and giving the topic of the event; if the recognition result is smaller than the threshold, judging the event to be a new event, and collecting and storing the news texts judged to be new events to obtain a new event text data set;

Step 5: removing useless information from the new event text data set obtained in step 4 while keeping the content of the news event text that describes the news event, segmenting the text with a Chinese word segmentation tool, and then building a custom dictionary to improve the segmentation precision; the useless information comprises tokens without substantive value, such as special characters and stop words;

Step 6: extracting entity features and LDA topic hot-word features from the preprocessed new event text data set obtained in step 5, splicing them at word level with the original text to form a new news text description, and weighting the entity features and the LDA topic hot-word features by increasing their word frequency; the entity features comprise person, place and organization name entity features;

Step 7: for the news text data set processed in step 6, calculating the term frequency-inverse document frequency (TF-IDF) of each word to measure its importance relative to the current topic, and assigning each word a corresponding weight according to the result;

Step 8: according to the features and weights obtained in steps 6 and 7, clustering the new event text data set obtained in step 7 into several events with the K-means algorithm and performing topic modeling analysis on the new events; combining the topic modeling result with the TF-IDF representation of the new event text set, and extracting ten keywords for each event as the topic words of the new event, thereby completing the extraction of the new event topic.
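The frequency-based weighting described in step 6 can be illustrated with a minimal sketch. It assumes the entity words and LDA hot words have already been extracted by some other component, and it models "increasing word frequency" by appending extra copies of each feature word to the token list. The function name `boost_features` and the `boost` parameter are hypothetical; the source does not say how many times a feature word is repeated.

```python
from collections import Counter

def boost_features(tokens, entities, hot_words, boost=2):
    """Splice feature words onto the tokenized text at word level.

    `boost` (an assumed knob, not from the source) is the number of extra
    copies appended for each entity word or LDA hot word, which raises
    that word's term frequency in the augmented text.
    """
    # Deduplicate while preserving order: entities first, then hot words.
    features = list(dict.fromkeys(list(entities) + list(hot_words)))
    extra = [word for word in features for _ in range(boost)]
    return list(tokens) + extra

# Toy, already-segmented news text with assumed features.
tokens = ["qingdao", "university", "releases", "new", "method"]
entities = ["qingdao", "university"]   # e.g. place / organization entities
hot_words = ["method"]                 # e.g. top word of an LDA topic

augmented = boost_features(tokens, entities, hot_words)
counts = Counter(augmented)
```

Because the downstream weighting is frequency based, the repeated feature words receive a larger term frequency than ordinary words in the same text, which is one plausible reading of the weighting the step describes.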

Step 1 of the invention specifically comprises the following steps:

Step 1.1: determining the keywords of a specific news event according to the news event text data acquisition requirement;

Step 1.2: for the determined news event keywords, building a crawler system on the Scrapy framework that obtains links to news event text data through the Baidu search engine, and acquiring the news event text data stream;

Step 1.3: normalizing the text content of the acquired news event text data stream by removing invalid content such as spaces and splicing the remaining valid content into a normalized representation that is recorded as one news text, to form a news event text set;

Step 1.4: dividing the news event text set obtained in step 1.3 into a training set Train, a validation set Val and a test set Test in a ratio of 7:2:1.
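The 7:2:1 division in step 1.4 can be sketched as below. The shuffle and the fixed seed are assumptions added for reproducibility (the source only gives the ratio), and `split_dataset` is a hypothetical helper name.

```python
import random

def split_dataset(records, ratios=(7, 2, 1), seed=42):
    """Split records into train / validation / test by the given ratio."""
    shuffled = list(records)
    # Shuffling is an assumption; the source does not say how records are ordered.
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n = len(shuffled)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # remainder goes to the test set
    return train, val, test

# Each record pairs an event type label with the news text, as in step 1.
records = [(f"label_{i % 3}", f"news text {i}") for i in range(100)]
train, val, test = split_dataset(records)
```

With 100 records this yields 70 training, 20 validation and 10 test records; when the record count is not divisible by ten, the integer-division remainder falls into the test set.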

Compared with the prior art, the invention vectorizes the news event text data set using BERT, so that the representation captures context more closely and is more accurate. An attention-based bidirectional long short-term memory (BiLSTM) network learns from the large volume of news text on the network, making efficient and accurate use of the data. The method combines supervised and unsupervised approaches, which is more efficient than using either alone. The method is simple, extracts deep semantic information, and analyzes and mines online news text to discover new events, which helps regulatory departments and individual users keep track of new events in real time and facilitates subsequent work.

Description of the drawings:

Fig. 1 is a schematic diagram of the workflow of the invention.

Fig. 2 is a diagram of the new event discovery model constructed by the invention.

Fig. 3 is a diagram of the entity feature extraction model of the invention.

Fig. 4 is a flow chart of the topic extraction process of the invention.

Detailed description of the embodiments:

The invention is further described by way of example with reference to the accompanying drawings.

Example:

The process for extracting the new event theme in this embodiment comprises the following steps:

Step 1: acquiring a news event text data stream according to event keywords and constructing a news event text data set from it, wherein each record comprises the event type label of a news text and the specific text description of the event, and dividing the news event text data set into a training set Train, a validation set Val and a test set Test, which specifically comprises the following steps:

Step 1.4: dividing the news event text set obtained in step 1.3 into a training set Train, a validation set Val and a test set Test in a ratio of 7:2:1;

Step 2: vectorizing the text of the training set Train, the validation set Val and the test set Test divided in step 1 based on a BERT representation model, and outputting a high-dimensional dense vector representation of the news event text data set, wherein the BERT representation model has 12 layers, a hidden size of 768 and 12 attention heads, and the resulting dense vector has 768 dimensions, for example: [8.3772335e-05,3.9696515e-05,3.854327e-05,0.0018235502,0.00028364992,3.3392924e-05,3.613378e-05,0.0011939545,8.937488e-06,0.00028550622,1.6984109e-06,0.014312873,4.2274103e-05,0.0057512685,0.008945758,2.318987e-05,1.9686187e-05,3.6920403e-05, … ]
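As a small consistency check on the stated encoder dimensions, the sketch below records them in a plain data class and derives the per-head width. The class `EncoderConfig` and its field names are hypothetical and are not tied to any particular BERT library; the three values are the ones given in step 2, which also match the commonly published BERT-base configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the encoder sizes stated in step 2."""
    num_layers: int = 12
    hidden_size: int = 768
    num_attention_heads: int = 12

    @property
    def head_size(self) -> int:
        # Multi-head attention splits the hidden vector evenly across heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by the head count")
        return self.hidden_size // self.num_attention_heads

config = EncoderConfig()
head_size = config.head_size   # 768 / 12
```

Each of the 12 heads therefore works on a 64-dimensional slice, and the concatenated output keeps the 768-dimensional width of the dense vector representation.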

Step 3: taking the high-dimensional dense vector representation of the news event text data set obtained in step 2 as input, and, using the training set Train and the validation set Val, initializing the neural network parameters with Xavier initialization and updating the neural network parameters and the input feature vectors with a dropout strategy and gradient descent, to obtain a new event discovery model built on a bidirectional long short-term memory (BiLSTM) network based on BERT and an attention mechanism.

Step 4: setting the threshold of the new event discovery model to 0.9; if the recognition result is greater than the threshold, judging that the event belongs to a known news event type and giving the topic of the event; if the recognition result is smaller than the threshold, judging the event to be a new event, and collecting and storing the news texts judged to be new events to obtain a new event text data set.
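The threshold rule of step 4 can be sketched as follows. The sketch assumes that the "recognition result" is the largest softmax probability over the known event types, which the source does not state explicitly; `route_event` and the example labels are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_event(logits, labels, threshold=0.9):
    """Return the known event type, or None when the text is a new event.

    Assumption: the recognition result is the top class probability.
    A result exactly equal to the threshold is treated as a new event here,
    since the source leaves that case unspecified.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] > threshold:
        return labels[best]
    return None

labels = ["earthquake", "election", "sports"]   # hypothetical known types
known = route_event([8.0, 1.0, 0.5], labels)    # confident prediction
novel = route_event([1.2, 1.0, 0.9], labels)    # no type clearly dominates
```

Texts for which `route_event` returns `None` would be collected into the new event text data set that the later steps preprocess and cluster.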

Step 5: removing useless information from the new event text data set obtained in step 4 while keeping the content of the news event text that describes the news event, segmenting the text with a Chinese word segmentation tool, and then building a custom dictionary to improve the segmentation precision; the useless information comprises tokens without substantive value, such as special characters and stop words, and removing it yields the preprocessed result.

Step 7: for the news text data set processed in step 6, calculating the term frequency-inverse document frequency (TF-IDF) of each word to measure its importance relative to the current topic, and assigning each word a corresponding weight according to the result, giving a weight vector such as: 0.11178106295272044, 0.11178106295272044, 0.11178106295272044, 0.11178106295272044, 0.11178106295272044, 0.16767159442908067 …
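One way to compute per-word weights of this kind is sketched below. It assumes the plain textbook form, term frequency times log(N / document frequency), without smoothing or normalization; the source does not say which TF-IDF variant produced the weights listed above, so the numbers will not necessarily match. The function `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))        # count each word once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1           # guard against empty documents
        weights.append({
            word: (count / length) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

docs = [
    ["flood", "city", "rescue"],
    ["flood", "river", "warning"],
    ["election", "city", "vote"],
]
weights = tf_idf(docs)
```

In the first document, "rescue" appears in only one of the three texts and so outweighs "flood", which appears in two; words shared by every document would receive a weight of zero under this variant.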

Step 8: according to the features and weights obtained in steps 6 and 7, clustering the new event text data set obtained in step 7 into several events with the K-means algorithm and performing topic modeling analysis on the new events; combining the topic modeling result with the TF-IDF representation of the new event text set, and extracting ten keywords for each event as the topic words of the new event, thereby completing the extraction of the new event topic. The K-means new event topic extraction is an iterative process with four steps: first, k objects in the news text set are selected as initial centers, each representing one cluster center; second, each data object in the sample is assigned, according to its Euclidean distance to the cluster centers, to the class of the nearest cluster center; third, the mean of all objects in each class is taken as the new cluster center of that class, and the value of the objective function is calculated; fourth, it is judged whether the cluster centers and the value of the objective function have changed: if not, the result is output, and if so, the process returns to the second step. After clustering is finished, the keywords of each event category are extracted by combining the TF-IDF representation of the new event text.
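The four-step iteration can be sketched in plain Python as below. The objective function is assumed to be the within-cluster sum of squared Euclidean distances, which is the usual K-means objective but is not named in the source; the `init`, `seed` and `max_iter` parameters and the two-dimensional toy points are also assumptions (in the method, each point would be the weighted vector of one news text).

```python
import math
import random

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Cluster points (tuples of floats) into k groups; return (labels, centers)."""
    rng = random.Random(seed)
    # Step 1: pick k objects from the data as the initial cluster centers.
    indices = init if init is not None else rng.sample(range(len(points)), k)
    centers = [tuple(points[i]) for i in indices]
    prev_objective = None
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest center (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Step 3: recompute each center as the mean of its members,
        # then evaluate the objective (assumed: sum of squared distances).
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])   # keep an empty cluster's center
        objective = sum(math.dist(p, new_centers[lab]) ** 2
                        for p, lab in zip(points, labels))
        # Step 4: stop when neither the centers nor the objective changed.
        if new_centers == centers and objective == prev_objective:
            break
        centers, prev_objective = new_centers, objective
    return labels, centers

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
# Start from one object in each visible group to keep the example deterministic.
assignments, centers = kmeans(points, k=2, init=[0, 3])
```

Once the labels are stable, the texts in each cluster can be pooled and their highest-weighted words taken as that event's topic words, in line with the keyword extraction described in this step.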

Strategies, methods or algorithms not specifically described in this example are all available in the art.

Claims

1. A new event theme extraction method is characterized by comprising the following steps:

step 5: removing useless information from the new event text data set obtained in step 4 while keeping the content of the news event text that describes the news event, segmenting the text with a Chinese word segmentation tool, and then building a custom dictionary to improve the segmentation precision; the useless information comprises tokens without substantive value, including special characters and stop words;

2. The method for extracting a new event topic according to claim 1, wherein the step 1 specifically comprises the following steps:

step 1.3: carrying out standardization operation on text contents for the obtained news event text data stream, removing invalid contents including spaces, and splicing the remaining valid contents to form a standardized representation recorded as a news text to form a news event text set;