CN113641888A - Event-related news filtering learning method based on fusion topic information enhanced PU learning - Google Patents


Info

Publication number
CN113641888A
Authority
CN
China
Prior art keywords
learning
event
news
training
related news
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110347488.5A
Other languages
Chinese (zh)
Other versions
CN113641888B (en)
Inventor
余正涛
王冠文
线岩团
张玉
黄于欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202110347488.5A priority Critical patent/CN113641888B/en
Publication of CN113641888A publication Critical patent/CN113641888A/en
Application granted granted Critical
Publication of CN113641888B publication Critical patent/CN113641888B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 16/9532 Query formulation (G06F 16/953 Querying, e.g. by the use of web search engines; G06F 16/95 Retrieval from the web; G06F 16/00 Information retrieval; database structures therefor)
    • G06F 16/35 Clustering; Classification (G06F 16/30 Information retrieval of unstructured textual data)
    • G06F 40/30 Semantic analysis (G06F 40/00 Handling natural language data)
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks (G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)


Abstract

The invention relates to a learning method for filtering event-related news based on PU learning enhanced with fused topic information. Topic information is extracted from the labeled and unlabeled event-related news data sets by unsupervised pre-training, and the extracted topic information is then incorporated into both the initial training and the subsequent iterative training of PU learning. This allows more sample information to be exploited when only a few initial event-related news samples are available, and the topic enhancement during iteration lets the classifier trained in each round obtain genuinely reliable positive and negative samples from the unlabeled data, improving the performance of the final event-related news classifier. Compared with a PU-learning baseline model, the method improves the F1 value by 1.8%, and its lead widens with fewer initial samples and more iterations. Enhancing PU learning with topic information thus effectively addresses the lack of training data in the case-related news filtering task.

Description

Event-related news filtering learning method based on fusion topic information enhanced PU learning
Technical Field
The invention relates to a learning method for filtering event-related news based on fusion topic information enhanced PU learning, belonging to the technical field of natural language processing.
Background
The event-related news filtering task can generally be regarded as a binary classification problem, and common approaches fall into two categories: keyword retrieval and machine learning. Early researchers matched news text against a set of domain-related keywords using algorithms such as KMP and Sunday. At present, machine learning is an effective solution for event-related news filtering. Some researchers make assumptions about the data distribution with statistical methods to infer event-related news categories, e.g. SVMs and decision trees. Others apply deep learning, using deep networks to extract hidden features of the text for classification. Because the scenes of event-related news are complex and changeable, a complete keyword set is hard to construct, so keyword retrieval alone cannot handle the filtering task. Meanwhile, owing to the domain-specific nature of event-related news, only small-scale event-related news data can be collected from cases that have already occurred; it is difficult to cover all case situations and scenes, a large amount of unlabeled event-related news remains hidden in historical news, and this lack of training data keeps machine-learning-based text filtering from achieving the desired effect. Therefore, how to achieve better filtering performance with only a few event-related news samples is the focus of the invention.
The invention mainly considers using topic information to enhance PU learning for event-related news classification. Therefore, building on the PU learning methods proposed by Yu et al., Liu et al., Ren et al., Li et al. and Xiao et al., the method makes full use of the topic information in news, incorporates it to enhance PU learning, and explores the classification of event-related news texts.
Disclosure of Invention
The invention provides a learning method for filtering event-related news based on PU learning enhanced with fused topic information, which makes full use of the topic information implicit in news and improves the accuracy of event-related news filtering, achieving better results on the event-related news filtering task than other baseline methods.
The technical scheme of the invention is as follows: the learning method for filtering event-related news based on PU learning enhanced with fused topic information comprises the following specific steps:
Step1, training a classifier, with an unsupervised topic model VAE added for enhancement;
Step2, predicting the unlabeled data with the trained classifier model, then ranking the predictions for the unlabeled news by probability from high to low;
Step3, after the initial training and prediction process is finished, performing the PU-learning iteration, i.e. retraining the classifier on the newly obtained training set and repeating the whole prediction and training process;
Step4, putting all samples into the classifier for training to obtain the required event-related news classification model, which then filters out the required event-related news more accurately.
As a preferred embodiment of the present invention, Step1 comprises the following specific steps:
Step1.1, extracting non-event-related news data with an improved I-DNF algorithm to obtain negative examples on the same scale as the initial event-related news, so as to train the initial classifier.
Step1.2, using a variational auto-encoder (VAE) as the topic model to extract latent features, understood here as topic features, from the word-vector space of a document; drawing on prior work and VAE principles, the invention implements this VAE structure and pre-trains it unsupervised on the entire event-related news dataset.
Step1.3, using an Embedding layer and a bidirectional long short-term memory network (BiLSTM) as the classifier structure.
As a preferable scheme of the invention, Step1.1 comprises the following specific steps:
Step1.1.1, if a text feature appears in more than 90% of the positive-example set but in no more than 10% of the unlabeled set, it is taken as a positive-example feature;
Step1.1.2, a positive-example feature set is built from these differences in feature frequency between the positive-example set and the unlabeled set;
Step1.1.3, if a sample document in the unlabeled set U contains no feature from the positive-example feature set, it is removed from U and labeled as a negative example.
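The I-DNF-style extraction of Step1.1.1 to Step1.1.3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: documents are represented as token lists, the 90%/10% thresholds follow the text, and the function name and signature are hypothetical.

```python
def idnf_negatives(positive_docs, unlabeled_docs, pos_ratio=0.9, unl_ratio=0.1):
    """Sketch of I-DNF-style negative-example extraction.

    A token counts as a positive-example feature if it occurs in more than
    pos_ratio of the positive set and in at most unl_ratio of the unlabeled
    set; unlabeled documents containing none of these features are taken
    as reliable negative examples.
    """
    def doc_freq(term, docs):
        # Fraction of documents in `docs` that contain `term`.
        return sum(term in doc for doc in docs) / len(docs)

    vocab = {token for doc in positive_docs for token in doc}
    pos_features = {t for t in vocab
                    if doc_freq(t, positive_docs) > pos_ratio
                    and doc_freq(t, unlabeled_docs) <= unl_ratio}
    negatives = [doc for doc in unlabeled_docs
                 if not pos_features & set(doc)]
    return pos_features, negatives
```

With three positive documents that all contain "trial" and ten unlabeled documents of which only one contains it, "trial" becomes the sole positive-example feature and the other nine unlabeled documents become negatives.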
As a preferred embodiment of the present invention, Step1.2 comprises:
Step1.2.1, the variational auto-encoder (VAE) is an encoder-decoder architecture: the encoder compresses the input into a latent distribution, and the decoder reconstructs the input signal by sampling from that distribution in the latent space;
Step1.2.2, in general, the VAE model assumes that the posterior of the latent distribution of the input data is approximately Gaussian, and reconstruction is then performed by a decoding network;
Step1.2.3, the decoding network Decode is implemented with a fully connected network (MLP).
As a preferable scheme of the invention, Step1.3 comprises the following specific steps:
Step1.3.1, first, the words of the input text are embedded by an Embedding network layer to obtain word-embedding vectors; in addition, the input text is passed through the VAE topic model to obtain the topic vector of the news text, yielding two kinds of encoded information;
Step1.3.2, the news topic vector is used to guide the word-embedding vectors; the resulting new matrix is the news encoding vector fused with the topic vector;
Step1.3.3, the context of the topic-fused news encoding vector is modeled by a bidirectional long short-term memory (BiLSTM) layer to obtain the news semantic representation vector.
As a preferred embodiment of the present invention, Step2 comprises the following specific steps:
Step2.1, class probabilities for the remaining unlabeled samples in the dataset are predicted with the classifier and the topic model; each prediction is the probability that the news belongs to event-related news.
Step2.2, the predictions for the unlabeled news are ranked by probability from high to low; in each prediction round, according to a given iteration step, the highest-probability data are taken as reliable event-related news samples and the lowest-probability data as reliable negative samples, removed from the unlabeled pool, and added to the training data for the subsequent iterative training.
As a preferred embodiment of the present invention, Step3 comprises the following specific steps:
Step3.1, after the initial training and prediction process is completed, the classifier is retrained on the newly obtained training set and the whole prediction and training process is repeated.
Step3.2, after each iteration, the amount of unlabeled data shrinks and the training set grows; when all unlabeled data have been predicted as reliable samples, the whole iterative process ends.
The invention has the following beneficial effects:
The invention applies the PU learning method to the event-related news filtering task, effectively solving event-related news filtering with only a small amount of manual labeling.
The invention extracts the topic information of the event-related news data by unsupervised pre-training and uses it to enhance the PU-learning training process, improving accuracy over ordinary PU learning.
An event-related news dataset was constructed and experiments were carried out with the method; the results show that the proposed method achieves better results than PU learning without topic enhancement.
Drawings
FIG. 1 is a general model diagram of the present invention;
FIG. 2 is a diagram of the PU learning training process in the present invention;
FIG. 3 is a graph of experimental results on a validation set in the present invention;
FIG. 4 is a graph of experimental results on an unlabeled dataset in accordance with the present invention;
FIG. 5 is a graph of the results of comparative experiments on different scale initial data in accordance with the present invention;
fig. 6 is a graph of results of an iteration step comparison experiment in the present invention.
Detailed Description
Example 1: as shown in figs. 1 to 5, the learning method for filtering event-related news based on PU learning enhanced with fused topic information comprises the following specific steps:
Step1, training a classifier, with an unsupervised topic model VAE added for enhancement.
Step2, predicting the unlabeled data with the trained classifier model, then ranking the predictions for the unlabeled news by probability from high to low.
Step3, after the initial training and prediction process is completed, performing the PU-learning iteration, i.e. retraining the classifier on the newly obtained training set and repeating the whole prediction and training process.
Step4, putting all samples into the classifier for training to obtain the required event-related news classification model, which then filters out the required event-related news more accurately.
The specific steps of Step1 are as follows:
Step1.1, extracting non-event-related news data with an improved I-DNF algorithm to obtain negative examples on the same scale as the initial event-related news, so as to train the initial classifier.
Step1.2, using a variational auto-encoder (VAE) as the topic model to extract latent features, understood here as topic features, from the word-vector space of a document; drawing on prior work and VAE principles, the invention implements this VAE structure and pre-trains it unsupervised on the entire event-related news dataset.
Step1.3, using an Embedding layer and a bidirectional long short-term memory network (BiLSTM) as the classifier structure.
Example 2: as shown in figs. 1 to 5, the learning method for filtering event-related news based on PU learning enhanced with fused topic information is the same as embodiment 1, wherein:
as a preferable scheme of the invention, the step Step1.1 comprises the following specific steps:
step1.1.1, a text feature appears more than 90% in the proper example set, and it appears only 10% in the unidentified set, and such feature is considered a proper example feature.
Step1.1.2, establishing a regular example feature set by the difference of the occurrence frequency of the features in the regular example set and the unidentified set.
Step1.1.3, extracting the sample document in the unidentified set U from the unidentified set U if the sample document does not contain any feature in the normal feature set, and identifying the sample document as a counter example.
As a preferred embodiment of the present invention, Step1.2 comprises:
Step1.2.1, the variational auto-encoder (VAE) is an encoder-decoder architecture. The encoder compresses the input into a latent distribution Z, and the decoder reconstructs the input signal D by sampling from the distribution of Z in the latent space:
P(D) = ∫ P(D|Z)P(Z)dZ (1)
where Z represents the latent distribution and P(D|Z) describes the probability that D is generated by Z.
Step1.2.2, in general, the VAE model assumes that the posterior of the latent distribution Z of the input data D approximately satisfies a Gaussian distribution, i.e.
log P(Z|d^(i)) = log N(z; μ^(i), δ^2(i)·I) (2)
where d^(i) represents a real sample in D, and μ^(i) and δ^2(i) are each generated from d^(i) by a neural network. Once μ^(i) and δ^2(i) are obtained, the distribution P(Z^(i)|d^(i)) corresponding to each d^(i) is available, and a sample z^(i) drawn from it is passed through the decoding network Decode to reconstruct
d̂^(i) = Decode(z^(i)) (3)
Step1.2.3, the generation of μ and δ^2 and the implementation of the decoding network Decode both use fully connected networks (MLP):
μ^(i), log δ^2(i) = MLP(d^(i)), z^(i) ∈ R^m (4)
where m represents a preset number of latent topics. After the above calculation, the distribution over the latent topics of event-related news required by the present invention can be expressed as
θ^(i) = softmax(z^(i)) (5)
To make the reconstructed data as close to the original data as possible, the final optimization goal of the VAE is to maximize the generation probability P(d^(i)) of each d^(i), while the KL divergence keeps the posterior P(Z^(i)|d^(i)) obtained from the data as close as possible to its theoretical prior N(0, I); the optimization objective is
L = E_{z~P(Z|d^(i))}[log P(d^(i)|z)] − KL(P(Z|d^(i)) ‖ N(0, I)) (6)
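The VAE optimization objective described in Step1.2 can be sketched numerically. The sketch below is a generic negative ELBO with a Bernoulli reconstruction term and a diagonal-Gaussian posterior; the function names and the choice of reconstruction likelihood are assumptions, not taken from the patent.

```python
import numpy as np

def vae_loss(d, d_hat, mu, log_var):
    """Negative ELBO for one document: Bernoulli reconstruction error
    plus KL(N(mu, diag(exp(log_var))) || N(0, I)). Minimizing this
    corresponds to maximizing the VAE objective (up to sign)."""
    eps = 1e-9  # numerical guard for log(0)
    recon = -np.sum(d * np.log(d_hat + eps) + (1 - d) * np.log(1 - d_hat + eps))
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon + kl

def reparameterize(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * epsilon, the sampling
    step that feeds the decoding network."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
```

With a perfect reconstruction and a posterior equal to the standard-normal prior (mu = 0, log_var = 0), the loss is zero; shifting mu away from zero makes only the KL term grow.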
As a preferable scheme of the invention, Step1.3 comprises the following specific steps:
Step1.3.1, first, the words of the input text are embedded by an Embedding network layer to obtain the word-embedding matrix
X ∈ R^(n×v) (7)
where n represents the news text length and v is the word-vector dimension. In addition, the input text is passed through the VAE topic model to obtain the topic vector of the news text
t ∈ R^(1×m) (8)
where m is the number of preset topics. Two kinds of encoded information are thus obtained.
Step1.3.2, the news topic vector t is used to guide the word-embedding matrix X. Because the topic vector produced by the topic model has shape 1×m, n copies of it are made and each is concatenated to the corresponding row of X; the new matrix X' is the news encoding fused with the topic vector:
X' = [X; t, …, t] ∈ R^(n×(v+m)) (9)
Step1.3.3, the context of the topic-fused news encoding is modeled by a bidirectional long short-term memory (BiLSTM) layer to obtain the news semantic representation. The specific formula is as follows:
H = BiLSTM(X'), y = σ(W·H + b) (10)
where H is the sentence vector after BiLSTM encoding, q is the hidden-layer dimension of the BiLSTM, and y represents the final probability output.
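The fusion in Step1.3.2, tiling the 1×m topic vector n times and concatenating it onto the n×v embedding matrix, can be sketched with NumPy; the function name is illustrative.

```python
import numpy as np

def fuse_topic(X, t):
    """Concatenate a topic vector t (shape (m,)) onto every row of the
    word-embedding matrix X (shape (n, v)), giving the fused matrix
    X' of shape (n, v + m)."""
    n = X.shape[0]
    T = np.tile(t, (n, 1))               # n stacked copies of the topic vector
    return np.concatenate([X, T], axis=1)
```

For a 4-word text with 8-dimensional embeddings and 3 topics, the fused encoding has shape (4, 11), with the last 3 columns identical in every row.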
As a preferred embodiment of the present invention, Step2 comprises the following specific steps:
Step2.1, class probabilities for the remaining unlabeled samples in the dataset are predicted with the classifier and the topic model; each prediction is the probability that the news belongs to event-related news.
Step2.2, the predictions for the unlabeled news are ranked by probability from high to low; in each prediction round, according to a given iteration step, the highest-probability data are taken as reliable event-related news samples and the lowest-probability data as reliable negative samples, removed from the unlabeled pool, and added to the training data for the subsequent iterative training.
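The ranking-and-selection rule of Step2.2 can be sketched as follows. This is a minimal illustration assuming the iteration step is at most half the unlabeled pool; the function name is hypothetical.

```python
def select_reliable(probs, step):
    """Rank unlabeled indices by predicted positive probability and take
    the top `step` as reliable positives and the bottom `step` as
    reliable negatives. Returns (positives, negatives, remaining),
    all as index lists into `probs`."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    pos = order[:step]                                    # highest probabilities
    neg = order[-step:]                                   # lowest probabilities
    rest = order[step:-step] if len(order) > 2 * step else []
    return pos, neg, rest
```

For probabilities [0.9, 0.1, 0.8, 0.2, 0.5] with step 2, indices 0 and 2 become reliable positives, 3 and 1 reliable negatives, and index 4 stays unlabeled.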
As a preferred embodiment of the present invention, Step3 comprises the following specific steps:
Step3.1, after the initial training and prediction process is completed, the classifier is retrained on the newly obtained training set and the whole prediction and training process is repeated.
Step3.2, after each iteration, the amount of unlabeled data shrinks and the training set grows; when all unlabeled data have been predicted as reliable samples, the whole iterative process ends.
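The PU-learning iteration of Step3 can be sketched with stand-in callbacks: `train` and `predict` are placeholders for the topic-enhanced BiLSTM classifier, and the loop structure (train, rank, move reliable samples, repeat until the unlabeled pool is empty) follows the description above. Names and the tie-handling details are assumptions.

```python
def pu_iterate(pos, neg, unlabeled, train, predict, step):
    """PU-learning loop: retrain on the growing training set, rank the
    unlabeled pool by predicted probability, move the top `step` items
    into the positives and the bottom `step` into the negatives, and
    stop when the pool is exhausted. Returns the final trained model."""
    while unlabeled:
        model = train(pos, neg)
        ranked = sorted(unlabeled, key=lambda x: predict(model, x), reverse=True)
        k = min(step, (len(ranked) + 1) // 2)  # shrink step near the end
        pos = pos + ranked[:k]                 # reliable positives
        rest = ranked[k:]
        neg = neg + rest[-k:]                  # reliable negatives
        unlabeled = rest[:-k] if len(rest) > k else []
    return train(pos, neg)
```

With a dummy classifier whose score is the sample value itself, high-valued samples migrate to the positive set and low-valued ones to the negative set over the iterations.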
An event-related news dataset was constructed for the experiments, and three types of experiments were carried out with the method. The first is a comparison against the performance of the PU classification algorithm without topics, analyzing the predictive performance of both during iterative training. The second is a comparison over initial datasets of different scales. The third is an iteration-step comparison, verifying the effectiveness of the method against the topic-free PU classification algorithm under different step sizes. The results verify the effectiveness of the method on the event-related news correlation analysis task and show the improvement that enhancing the PU-learning iterative process with topic information brings to model performance.
The choice of experimental parameters directly affects the final results. The news texts in the dataset are about 100-250 characters long; to ease experimental verification, all data were manually labeled, comprising 10000 event-related and 20000 non-event-related news items. The maximum text length is set to 200 characters; the Adam algorithm is used as the optimizer; the learning rate is set to 0.001; dropout for the single-layer Bi-LSTM is set to 0.2; the batch size is 128; the number of training epochs is 20; the number of iterative training rounds is the ratio of the total amount of unlabeled data to the number of positive and negative samples extracted per round. The evaluation metrics are accuracy (Acc.), precision (P), recall (R) and the F1 value.
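The hyperparameters listed above can be collected into a single training configuration; the dictionary and its key names are illustrative, not from the patent.

```python
# Hyperparameters reported for the experiments; key names are illustrative.
TRAIN_CONFIG = {
    "max_text_len": 200,       # characters
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "bilstm_dropout": 0.2,     # single-layer Bi-LSTM dropout
    "batch_size": 128,
    "epochs": 20,
}
```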
Across these three types of experiments, the method is compared with traditional PU learning and outperforms it at every iteration when the initial data scale and iteration step are fixed; the smaller the initial data or the larger the iteration step, the greater and more stable the performance gain.
The comparison against the topic-free PU classification algorithm mainly verifies the effectiveness of the method on the event-related news filtering problem with only a few event-related samples, and the enhancement that topic information brings to the PU-learning iteration. Two groups of experiments were set up: one evaluates, on a held-out validation set, the generalization of the classifier trained by the iterative process, with results shown in fig. 3; the other evaluates the predictions of the classifier trained at each iteration on the remaining unlabeled samples, with results shown in fig. 4. As fig. 3 shows, the upper bound of the F1 value with the dataset and classification model used here reaches 83.4% in the "supervised" setting, while "PU learning" reaches only 73.9%, a gap of 9.5%; under the same experimental setting, the F1 value of the present method reaches 75.7%, an improvement of 1.8% over "PU learning". This shows the effectiveness of the method on event-related news filtering with only a few samples and the enhancement the topic model brings to PU learning.
As fig. 4 shows, the method outperforms the traditional PU-learning scheme when predicting the unlabeled data, and the gap widens as the number of iterations increases, again demonstrating that the method is an effective improvement over the PU-learning approach.
In the comparison over initial data of different scales, the method improves performance on the held-out validation set relative to topic-free PU learning; the results are shown in fig. 5. As can be seen, when the initial labeled data scale is only 500, traditional PU learning has almost failed. This is because PU learning depends on the scale of the initial labeled data: when it is too small, the accuracy of the trained classifier is too low, the reliable positive and negative samples obtained in subsequent predictions carry too much error, and as iteration proceeds the error accumulates and amplifies until PU learning fails. As the initial data scale grows, the per-iteration error shrinks and the final training result improves, a common phenomenon in PU learning that the present method also follows. By contrast, the present method adapts better when the initial dataset is small: with an initial scale of only 750, its F1 value exceeds that of traditional PU learning by 9.4%, and the gap gradually narrows as the initial scale grows. This is because, when initial data are scarce, the unsupervised topic model brings extra information to the small amount of event-related training data, so the performance of the initial classifier does not degrade too much, which ultimately alleviates the error-accumulation phenomenon.
In the iteration-step comparison, the method again improves performance on the held-out validation set relative to topic-free PU learning; the results are shown in fig. 6. As can be seen, both traditional PU learning and the present method maintain good performance at iteration steps of 300 and 500; as the step grows further, the performance of traditional PU learning drops sharply while the present method still holds a good level. However, when the iteration step reaches 1500, both fail, because with an initial data scale of only 1000 the accuracy of the classification model trained by PU learning is limited, and even adding topic information for enhancement cannot reach the required accuracy.
Therefore, this learning method filters event-related news well, effectively solves the lack of training data in the event-related news filtering task, and improves the accuracy of the filtering results.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, it is not limited to these embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from its spirit.

Claims (6)

1. A learning method for filtering event-related news based on PU learning enhanced with fused topic information, characterized by comprising the following specific steps:
Step1, training a classifier, with an unsupervised topic model added for enhancement;
Step2, predicting the unlabeled data with the trained classifier model, then ranking the predictions for the unlabeled news by probability from high to low;
Step3, after the initial training and prediction process is finished, performing the PU-learning iteration, i.e. retraining the classifier on the newly obtained training set and repeating the whole prediction and training process;
Step4, putting all samples into the classifier for training to obtain the required event-related news classification model, which then filters out the required event-related news more accurately.
2. The learning method for event-related news filtering based on PU learning enhanced with fused topic information according to claim 1, wherein the specific steps of Step1 are as follows:
Step1.1, extracting non-event-related news data with an improved I-DNF algorithm to obtain negative examples on the same scale as the initial event-related news, so as to train the initial classifier;
Step1.2, using an Embedding layer and a bidirectional long short-term memory network BiLSTM as the classifier structure, with the unsupervised topic model VAE added for enhancement.
3. The learning method for event-related news filtering based on PU learning enhanced with fused topic information according to claim 2, wherein the specific steps of Step1.1 are as follows:
Step1.1.1, if a text feature appears in more than 90% of the positive-example set but in no more than 10% of the unlabeled set, it is taken as a positive-example feature;
Step1.1.2, a positive-example feature set is built from these differences in feature frequency between the positive-example set and the unlabeled set;
Step1.1.3, if a sample document in the unlabeled set U contains no feature from the positive-example feature set, it is removed from U and labeled as a negative example.
4. The event-related news filtering learning method based on fusion topic information enhanced PU learning according to claim 2, wherein the specific steps of Step1.2 are as follows:
step1.2.1, first embedding the words of the input text with an Embedding network layer to obtain word embedding vectors;
step1.2.2, using the news topic vector to guide the word embedding vectors;
step1.2.3, modeling the context of the topic-fused news encoding vectors with a bidirectional long short-term memory network (BiLSTM) to obtain a news semantic representation vector.
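The encoding pipeline of Step1.2 can be sketched in a few lines of NumPy; this is an assumption-laden toy, not the patent's implementation: topic guidance is reduced to concatenation, and a plain tanh RNN cell stands in for the LSTM to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(token_ids, E):
    """Embedding lookup (Step1.2.1): rows of E are word vectors."""
    return E[token_ids]                          # shape (T, d_emb)

def fuse_topic(word_vecs, topic_vec):
    """Guide each word embedding with the news topic vector
    (Step1.2.2) -- here simply by concatenating it to every step."""
    T = word_vecs.shape[0]
    return np.concatenate(
        [word_vecs, np.tile(topic_vec, (T, 1))], axis=1)

def bi_rnn(x, Wf, Wb):
    """Bidirectional recurrence over the fused vectors (Step1.2.3),
    returning a fixed-size news semantic representation vector."""
    def run(seq, W):
        h = np.zeros(W.shape[0])
        for v in seq:
            h = np.tanh(W @ np.concatenate([v, h]))
        return h
    fwd = run(x, Wf)          # left-to-right pass
    bwd = run(x[::-1], Wb)    # right-to-left pass
    return np.concatenate([fwd, bwd])
```

In the patent's actual model the topic vector would come from the VAE topic model and the recurrence from a trained BiLSTM; the sketch only shows how the three sub-steps compose.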
5. The event-related news filtering learning method based on fusion topic information enhanced PU learning according to claim 1, wherein the specific steps of Step2 are as follows:
step2.1, performing class probability prediction on the remaining unlabeled data samples in the dataset with the classifier and the topic model, the prediction result being the probability that a news item is event-related news;
and Step2.2, sorting the predictions for the unlabeled news by probability from high to low, in each prediction taking, according to a fixed iteration step size, the highest-probability data as reliable event-related news samples and the lowest-probability data as reliable negative samples, removing these samples from the unlabeled samples, and adding them to the training data for the subsequent iterative training process.
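The selection rule of Step2.2 is a simple rank-and-split; a minimal sketch, assuming `scores` are the classifier's positive-class probabilities and `step` is the iteration step size (the function name and edge-case handling are illustrative, not from the patent):

```python
def select_reliable(scores, docs, step):
    """Rank unlabeled news by predicted positive probability, take the
    top `step` as reliable positives and the bottom `step` as reliable
    negatives, and return the rest as still-unlabeled (Step2.2).
    Assumes len(docs) > 2 * step so the two slices cannot overlap."""
    order = sorted(range(len(docs)),
                   key=lambda i: scores[i], reverse=True)
    reliable_pos = [docs[i] for i in order[:step]]    # highest scores
    reliable_neg = [docs[i] for i in order[-step:]]   # lowest scores
    remaining = [docs[i] for i in order[step:-step]]
    return reliable_pos, reliable_neg, remaining
```

Both selected groups are removed from the unlabeled pool and appended to the training data, which is what lets the next iteration see a larger labeled set.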
6. The event-related news filtering learning method based on fusion topic information enhanced PU learning according to claim 1, wherein the specific steps of Step3 are as follows:
step3.1, after the initial training and prediction process is finished, retraining the classifier on the newly obtained training set and repeating the whole prediction and training process;
and Step3.2, after each iteration is finished, the amount of unlabeled data decreases and the training set grows; when all unlabeled data have been predicted as reliable samples, the whole iterative process ends.
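The overall Step3 loop can be sketched as below; `fit` and `predict` stand in for the patent's topic-enhanced BiLSTM classifier (any trainer/scorer pair works), and the loop structure is the only part taken from the claims:

```python
def pu_iterate(train_pos, train_neg, unlabeled, fit, predict, step):
    """The PU-learning iteration of Step3: each round retrains the
    classifier on the enlarged training set (Step3.1); the unlabeled
    pool shrinks until every sample has been absorbed as a reliable
    positive or negative, ending the iteration (Step3.2)."""
    train_pos, train_neg = list(train_pos), list(train_neg)
    unlabeled = list(unlabeled)
    while unlabeled:
        model = fit(train_pos, train_neg)
        scores = predict(model, unlabeled)
        order = sorted(range(len(unlabeled)),
                       key=lambda i: scores[i], reverse=True)
        # Top `step` become positives, bottom `step` negatives; the
        # slices degrade gracefully when few samples remain.
        top, tail = order[:step], order[step:]
        bottom, middle = tail[-step:], tail[:-step]
        train_pos += [unlabeled[i] for i in top]
        train_neg += [unlabeled[i] for i in bottom]
        unlabeled = [unlabeled[i] for i in middle]
    # Step4 would now train the final classifier on train_pos+train_neg.
    return train_pos, train_neg
```

With `step = 1` and a scorer that just returns each item's own value, the loop peels off the extremes of the pool round by round until nothing unlabeled remains.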
CN202110347488.5A 2021-03-31 2021-03-31 Event-related news filtering learning method based on fusion topic information enhanced PU learning Active CN113641888B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110347488.5A CN113641888B (en) 2021-03-31 2021-03-31 Event-related news filtering learning method based on fusion topic information enhanced PU learning

Publications (2)

Publication Number Publication Date
CN113641888A true CN113641888A (en) 2021-11-12
CN113641888B CN113641888B (en) 2023-08-29

Family

ID=78415731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110347488.5A Active CN113641888B (en) 2021-03-31 2021-03-31 Event-related news filtering learning method based on fusion topic information enhanced PU learning

Country Status (1)

Country Link
CN (1) CN113641888B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050114313A1 (en) * 2003-11-26 2005-05-26 Campbell Christopher S. System and method for retrieving documents or sub-documents based on examples
CN108881196A (en) * 2018-06-07 2018-11-23 中国民航大学 The semi-supervised intrusion detection method of model is generated based on depth
US20190164086A1 (en) * 2017-11-30 2019-05-30 Palo Alto Networks (Israel Analytics) Ltd. Framework for semi-supervised learning when no labeled data is given
CN110263166A (en) * 2019-06-18 2019-09-20 北京海致星图科技有限公司 Public sentiment file classification method based on deep learning
CN110852419A (en) * 2019-11-08 2020-02-28 中山大学 Action model based on deep learning and training method thereof
CN111538807A (en) * 2020-04-16 2020-08-14 上海交通大学 System and method for acquiring Web API knowledge based on Stack Overflow website
CN112434744A (en) * 2020-11-27 2021-03-02 北京奇艺世纪科技有限公司 Training method and device for multi-modal feature fusion model
US20210064989A1 (en) * 2019-08-30 2021-03-04 NEC Laboratories Europe GmbH Continual learning of artificial intelligence systems based on bi-level optimization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sunghwan Mac Kim et al.: "Social Media Relevance Filtering Using Perplexity-Based Positive-Unlabelled Learning", Proceedings of the Fourteenth International AAAI Conference on Web and Social Media, pages 370-379 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501898A (en) * 2023-06-29 2023-07-28 之江实验室 Financial text event extraction method and device suitable for few samples and biased data
CN116501898B (en) * 2023-06-29 2023-09-01 之江实验室 Financial text event extraction method and device suitable for few samples and biased data

Also Published As

Publication number Publication date
CN113641888B (en) 2023-08-29

Similar Documents

Publication Publication Date Title
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN111126386B (en) Sequence domain adaptation method based on countermeasure learning in scene text recognition
CN113220886A (en) Text classification method, text classification model training method and related equipment
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN113378563B (en) Case feature extraction method and device based on genetic variation and semi-supervision
Wan et al. A generative model for sparse hyperparameter determination
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN112287672A (en) Text intention recognition method and device, electronic equipment and storage medium
CN113609284A (en) Method and device for automatically generating text abstract fused with multivariate semantics
CN111177010B (en) Software defect severity identification method
CN114528835A (en) Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN117115581A (en) Intelligent misoperation early warning method and system based on multi-mode deep learning
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN114881173A (en) Resume classification method and device based on self-attention mechanism
CN113641888A (en) Event-related news filtering learning method based on fusion topic information enhanced PU learning
CN113886562A (en) AI resume screening method, system, equipment and storage medium
CN114757310B (en) Emotion recognition model and training method, device, equipment and readable storage medium thereof
CN116245107A (en) Electric power audit text entity identification method, device, equipment and storage medium
CN116257601A (en) Illegal word stock construction method and system based on deep learning
CN115840815A (en) Automatic abstract generation method based on pointer key information
CN112884019B (en) Image language conversion method based on fusion gate circulation network model
CN110162629B (en) Text classification method based on multi-base model framework
CN113076425A (en) Event related viewpoint sentence classification method for microblog comments
CN116680590B (en) Post portrait label extraction method and device based on work instruction analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant