CN113641888B - Event-related news filtering learning method based on fusion topic information enhanced PU learning - Google Patents


Info

Publication number
CN113641888B
Authority
CN
China
Prior art keywords
event
news
learning
related news
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110347488.5A
Other languages
Chinese (zh)
Other versions
CN113641888A (en)
Inventor
余正涛
王冠文
线岩团
张玉
黄于欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Science & Technology Park Co ltd Of Kunming University Of Science And Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202110347488.5A
Publication of CN113641888A
Application granted
Publication of CN113641888B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/953: Querying, e.g. by the use of web search engines
    • G06F 16/9532: Query formulation
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a learning method for event-related news filtering based on fused-topic-information-enhanced PU learning. The invention extracts topic information from the labeled and unlabeled event-related news data sets through unsupervised pre-training, and then adds the extracted topic information to the initial training and the subsequent iterative training of PU learning. This ensures that more sample information can be exploited when the initial event-related news samples are few, and topic enhancement is applied throughout the subsequent iterations, so that the classifier trained at each iteration can obtain genuinely reliable positive and negative samples from the unlabeled data, improving the performance of the final event-related news classifier. Compared with the PU learning baseline model, the invention improves the F1 value by 1.8%, and its lead is larger with few initial samples and many iterations. Enhancing PU learning with topic information effectively solves the lack of training data in case-related news filtering tasks.

Description

Learning Method for Event-Related News Filtering Based on Fused-Topic-Information-Enhanced PU Learning

Technical Field

The present invention relates to a learning method for event-related news filtering based on fused-topic-information-enhanced PU learning, and belongs to the technical field of natural language processing.

Background Art

The event-related news filtering task can usually be regarded as a binary classification problem, and commonly used methods fall into two categories: keyword retrieval and machine learning. Early researchers matched news texts against sets of domain-related keywords, using string-matching algorithms such as KMP and Sunday. At present, machine learning is an effective approach to event-related news filtering: researchers infer event-related news categories by making statistical assumptions about the data distribution, as with SVMs and decision trees, and some researchers use deep learning, employing deep networks to extract hidden text features for classification. Because event-related news scenarios are complex and changeable, it is difficult to construct a complete keyword set, so keyword retrieval cannot be used for this task. Moreover, owing to the domain-specific and particular nature of event-related news, only small-scale event-related news data can be collected from cases that have already occurred, which can hardly cover all case situations and scenarios, while a large amount of unlabeled event-related news lies hidden in historical news. This lack of training data makes it difficult for machine-learning-based text filtering methods to achieve the desired effect. Therefore, how to achieve good filtering performance with only a small number of event-related news samples is the focus of this invention.

The present invention mainly considers using topic information to enhance PU learning for event-related news classification. Therefore, on the basis of the PU learning methods proposed by Yu et al., Liu et al., Ren et al., Li et al., and Xiao et al., the present invention makes full use of the topic information in news, fuses it into PU learning for enhancement, and explores a method for classifying event-related news texts.

Summary of the Invention

The present invention provides a learning method for event-related news filtering based on fused-topic-information-enhanced PU learning, which makes full use of the topic information implicit in news to improve the accuracy of event-related news filtering, and which achieves better results on this task than other baseline methods.

The technical solution of the present invention is a learning method for event-related news filtering based on fused-topic-information-enhanced PU learning, whose specific steps are as follows:

Step1: Train the classifier, adding the unsupervised topic model VAE for enhancement;

Step2: Predict the unlabeled data with the trained classifier model, then sort the predictions for the unlabeled news by probability from high to low;

Step3: After the initial training and prediction process is completed, iterate the PU learning, i.e., retrain the classifier on the newly obtained training set and repeat the entire prediction and training process;

Step4: Put all the samples into the classifier for training to obtain the required event-related news classification model, which then filters out the desired event-related news more accurately.

As a preferred scheme of the present invention, the specific steps of Step1 are:

Step1.1: Use an improved version of the I-DNF algorithm to extract non-event-related news data, obtaining negative examples on the same scale as the initial event-related news.

Step1.2: Use a variational autoencoder (VAE) as the topic model, in order to extract latent features, understood in the present invention as topic features, from the word-vector space of documents. Drawing on previous work and the VAE principle, the present invention implements this VAE structure and pre-trains it in an unsupervised manner on the entire event-related news data set, which is then used in training the initial classifier.

Step1.3: Use a network consisting of an Embedding layer and a bidirectional long short-term memory network (BiLSTM) as the classifier.

As a preferred scheme of the present invention, the specific steps of Step1.1 are as follows (an illustrative sketch is given after these steps):

Step1.1.1: If a text feature appears with a frequency greater than 90% in the positive-example set while its frequency in the unlabeled set is only 10%, treat such a feature as a positive-example feature;

Step1.1.2: Using the different frequencies with which features appear in the positive-example set and the unlabeled set, build a positive-example feature set;

Step1.1.3: If a sample document in the unlabeled set U contains no feature from the positive-example feature set, extract it from U and label it as a negative example.
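
The negative-example extraction above can be sketched in a few lines of Python. This is an illustrative reading of the improved I-DNF step, not the patented implementation; the function name, the document representation (token collections), and the exact 90%/10% thresholds exposed as parameters are assumptions:

    from collections import Counter

    def extract_negative_examples(positive_docs, unlabeled_docs,
                                  pos_thresh=0.9, unl_thresh=0.1):
        n_pos, n_unl = len(positive_docs), len(unlabeled_docs)
        # Document frequency of each feature (token) in each collection.
        pos_df = Counter(f for doc in positive_docs for f in set(doc))
        unl_df = Counter(f for doc in unlabeled_docs for f in set(doc))
        # Step1.1.1-1.1.2: features frequent in P but rare in U form the
        # positive-example feature set.
        positive_features = {
            f for f, c in pos_df.items()
            if c / n_pos > pos_thresh and unl_df[f] / n_unl <= unl_thresh
        }
        # Step1.1.3: unlabeled docs with no positive feature become negatives.
        negatives = [doc for doc in unlabeled_docs
                     if positive_features.isdisjoint(doc)]
        return positive_features, negatives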

As a preferred scheme of the present invention, Step1.2 comprises the following (a model sketch is given after these steps):

Step1.2.1: The variational autoencoder (VAE) architecture is an encoder-decoder architecture: the encoder compresses the input into a latent distribution, and the decoder reconstructs the input signal by sampling from that distribution in the data's latent space;

Step1.2.2: Typically, the VAE model assumes that the posterior of the latent distribution of the input data approximately follows a Gaussian distribution, from which the input is then reconstructed through the decoding network;

Step1.2.3: The present invention implements the decoding network Decode with a fully connected network (MLP).
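
As a concrete illustration of Step1.2, the following is a minimal PyTorch sketch of such a VAE topic model, assuming a bag-of-words input; the layer sizes, the use of a softmax over the latent mean as the document's topic vector, and all names are illustrative assumptions rather than the patented architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopicVAE(nn.Module):
        """Illustrative VAE topic model: an MLP encoder maps a bag-of-words
        vector to a Gaussian latent of m topics; an MLP decoder (Decode)
        reconstructs the word distribution."""
        def __init__(self, vocab_size, n_topics=50, hidden=256):
            super().__init__()
            self.enc = nn.Linear(vocab_size, hidden)     # encoder MLP
            self.mu = nn.Linear(hidden, n_topics)        # mu
            self.logvar = nn.Linear(hidden, n_topics)    # log(delta^2)
            self.dec = nn.Linear(n_topics, vocab_size)   # Decode as an MLP

        def forward(self, bow):                          # bow: (B, vocab_size)
            h = F.relu(self.enc(bow))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
            recon = F.log_softmax(self.dec(z), dim=-1)   # reconstructed word dist.
            rec_loss = -(bow * recon).sum(-1)            # reconstruction term
            kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)  # KL to N(0, I)
            topic_vec = F.softmax(mu, dim=-1)            # 1*m topic vector per document
            return topic_vec, (rec_loss + kl).mean()     # ELBO-style loss to minimize

Unsupervised pre-training then amounts to minimizing the returned loss over the entire event-related news data set.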

As a preferred scheme of the present invention, the specific steps of Step1.3 are as follows (a classifier sketch is given after these steps):

Step1.3.1: First use the Embedding network layer to embed the input text, obtaining the word embedding vectors; in addition, pass the input text through the VAE topic model to obtain the topic vector of the news text, yielding two kinds of encoded information;

Step1.3.2: Use the news topic vector to guide the word embedding vectors; the new matrix so formed is the news encoding vector fused with the topic vector;

Step1.3.3: Pass the news encoding vector fused with topic information through a bidirectional long short-term memory network layer (BiLSTM) to model its contextual relations, obtaining the news semantic representation vector.
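
A minimal PyTorch sketch of the Step1.3 classifier follows; the dimensions and the use of the final BiLSTM time step for the output are assumptions, since the patent text only fixes the overall Embedding + topic concatenation + BiLSTM structure:

    import torch
    import torch.nn as nn

    class TopicGuidedBiLSTM(nn.Module):
        """Illustrative classifier: word embeddings, each concatenated with
        the replicated topic vector, feed a BiLSTM; a linear layer with a
        sigmoid yields the event-related probability y."""
        def __init__(self, vocab_size, emb_dim=128, n_topics=50, hidden=128):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.bilstm = nn.LSTM(emb_dim + n_topics, hidden,
                                  batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, 1)

        def forward(self, token_ids, topic_vec):
            x = self.emb(token_ids)                      # (B, n, v) word embeddings
            # Copy the 1*m topic vector n times and concatenate it to every
            # word embedding, forming the topic-fused encoding X'.
            t = topic_vec.unsqueeze(1).expand(-1, x.size(1), -1)
            x = torch.cat([x, t], dim=-1)
            h, _ = self.bilstm(x)                        # (B, n, 2q) context encoding
            # Use the last time step as the news representation (a common
            # simplification) and output the probability y.
            return torch.sigmoid(self.out(h[:, -1]))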

As a preferred scheme of the present invention, the specific steps of Step2 are as follows (a selection sketch is given after these steps):

Step2.1: Pass the remaining unlabeled data samples in the data set through the classifier and the topic model to predict class probabilities. The prediction result is the probability that a news item is event-related news.

Step2.2: Sort the predictions for the unlabeled news by probability from high to low. Each prediction round takes, according to a given iteration stride, the highest-probability data as reliable event-related news samples and the lowest-probability data as reliable negative samples; these samples are removed from the unlabeled set and added to the training data for the subsequent iterative training process.
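
In plain Python, and with the helper name and interfaces as assumptions, the reliable-sample selection of Step2.2 might look as follows:

    def select_reliable_samples(probs, unlabeled, stride):
        # Sort unlabeled indices by the classifier's event-related probability.
        order = sorted(range(len(unlabeled)), key=lambda i: probs[i], reverse=True)
        k = min(stride, len(unlabeled) // 2)
        if k == 0:
            # Too few items left to split: assign the remainder by threshold.
            pos = [unlabeled[i] for i in order if probs[i] >= 0.5]
            neg = [unlabeled[i] for i in order if probs[i] < 0.5]
            return pos, neg, []
        reliable_pos = [unlabeled[i] for i in order[:k]]    # highest probabilities
        reliable_neg = [unlabeled[i] for i in order[-k:]]   # lowest probabilities
        remaining = [unlabeled[i] for i in order[k:-k]]     # stay unlabeled
        return reliable_pos, reliable_neg, remaining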

As a preferred scheme of the present invention, the specific steps of Step3 are as follows (a loop sketch is given after these steps):

Step3.1: After completing the initial training and prediction process, retrain the classifier on the newly obtained training set and repeat the entire prediction and training process.

Step3.2: After each iteration, the amount of unlabeled data decreases while the training set grows; once all the unlabeled data have been predicted as reliable samples, the entire iterative process is complete.
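
Putting Step2 and Step3 together, the overall PU iteration can be sketched as below, reusing the select_reliable_samples helper sketched earlier; train_fn and predict_fn stand in for the topic-enhanced training and scoring routines and are assumptions rather than the patent's interfaces:

    def pu_training_loop(train_pos, train_neg, unlabeled, stride,
                         train_fn, predict_fn):
        while unlabeled:
            model = train_fn(train_pos, train_neg)          # retrain on grown set
            probs = [predict_fn(model, doc) for doc in unlabeled]
            pos, neg, unlabeled = select_reliable_samples(probs, unlabeled, stride)
            train_pos, train_neg = train_pos + pos, train_neg + neg
        return train_fn(train_pos, train_neg)               # Step4: final classifier

The loop terminates because each pass either removes 2k samples from the unlabeled pool or empties it.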

The beneficial effects of the present invention are:

The present invention applies the PU learning method to the event-related news filtering task, effectively solving the problem of filtering event-related news with only a small amount of manual labeling.

The present invention extracts the topic information of event-related news data through unsupervised pre-training and uses it to enhance the PU learning training process, improving accuracy over plain PU learning.

An event-related news data set was constructed and used in experiments with the method of the present invention; the experimental results show that the proposed method achieves better results than the PU learning method without topic enhancement.

Brief Description of the Drawings

Fig. 1 is the overall model diagram of the present invention;

Fig. 2 is the PU learning training process diagram of the present invention;

Fig. 3 shows the experimental results on the validation set;

Fig. 4 shows the experimental results on the unlabeled data set;

Fig. 5 shows the comparative results for initial data of different scales;

Fig. 6 shows the comparative results for different iteration strides.

Detailed Description of Embodiments

Embodiment 1: As shown in Figs. 1-5, a learning method for event-related news filtering based on fused-topic-information-enhanced PU learning, whose specific steps are as follows:

Step1: Train the classifier, adding the unsupervised topic model VAE for enhancement.

Step2: Predict the unlabeled data with the trained classifier model, then sort the predictions for the unlabeled news by probability from high to low.

Step3: After the initial training and prediction process is completed, iterate the PU learning, i.e., retrain the classifier on the newly obtained training set and repeat the entire prediction and training process.

Step4: Put all the samples into the classifier for training to obtain the required event-related news classification model, which then filters out the desired event-related news more accurately.

The specific steps of Step1 are:

Step1.1: Use an improved version of the I-DNF algorithm to extract non-event-related news data, obtaining negative examples on the same scale as the initial event-related news.

Step1.2: Use a variational autoencoder (VAE) as the topic model, in order to extract latent features, understood in the present invention as topic features, from the word-vector space of documents. Drawing on previous work and the VAE principle, the present invention implements this VAE structure and pre-trains it in an unsupervised manner on the entire event-related news data set, which is then used in training the initial classifier.

Step1.3: Use a network consisting of an Embedding layer and a bidirectional long short-term memory network (BiLSTM) as the classifier.

Embodiment 2: As shown in Figs. 1-5, a learning method for event-related news filtering based on fused-topic-information-enhanced PU learning; this embodiment is the same as Embodiment 1, wherein:

As a preferred scheme of the present invention, the specific steps of Step1.1 are:

Step1.1.1: If a text feature appears with a frequency greater than 90% in the positive-example set while its frequency in the unlabeled set is only 10%, treat such a feature as a positive-example feature.

Step1.1.2: Using the different frequencies with which features appear in the positive-example set and the unlabeled set, build a positive-example feature set.

Step1.1.3: If a sample document in the unlabeled set U contains no feature from the positive-example feature set, extract it from U and label it as a negative example.

As a preferred scheme of the present invention, Step1.2 comprises:

Step1.2.1: The variational autoencoder (VAE) architecture is an encoder-decoder architecture: the encoder compresses the input into a latent distribution Z, and the decoder reconstructs the input signal D by sampling from the distribution of Z in the data's latent space.

Here Z denotes the latent distribution; P(D|Z) then describes the probability of generating D from Z.

Step1.2.2: Typically, the VAE model assumes that the posterior of the latent distribution Z of the input data D approximately follows a Gaussian distribution, i.e.,

log P(Z | d^(i)) = log N(z; μ^(i), δ^{2(i)} I)   (2)

where d^(i) denotes a real sample in D, and each μ and δ^2 is generated from d^(i) by a neural network. From the resulting μ^(i) and δ^{2(i)}, the distribution P(Z^(i) | d^(i)) corresponding to each d^(i) is obtained, and the reconstruction d̂^(i) is then produced by the decoding network.

Step1.2.3: The present invention implements the generation of μ and δ^2, as well as the decoding network Decode, with fully connected networks (MLP).

Here m denotes the preset number of latent topics. After the above computation, the latent topic distribution of event-related news required by the present invention can be represented by the m-dimensional latent vector Z^(i). To make the reconstructed data as close as possible to the original data, the final optimization objective of the VAE is to maximize the generation probability P(d^(i)) of d^(i) while using the KL divergence to push the posterior P(Z^(i) | d^(i)) obtained from the data as close as possible to its theoretical variational prior N(0, I). The final expression of the optimization objective is as follows:

L = E_{P(Z^(i) | d^(i))}[log P(d^(i) | Z^(i))] − KL(P(Z^(i) | d^(i)) || N(0, I))   (3)

As a preferred scheme of the present invention, the specific steps of Step1.3 are:

Step1.3.1: First use the Embedding network layer to embed the input text, obtaining the word embedding matrix X ∈ R^{n×v}, where n denotes the news text length and v is the word-vector dimension. In addition, pass the input text through the VAE topic model to obtain the topic vector T ∈ R^{1×m} of the news text, where m is the preset number of topics. Two kinds of encoded information are thus obtained.

Step1.3.2: Use the news topic vector T to guide the word embedding matrix X. Since the topic vector obtained by the topic model is a vector of shape 1×m, it is copied n times and concatenated to the word embedding matrix X; the new matrix X′ so formed is the news encoding vector fused with the topic vector.

Step1.3.3: Pass the news encoding vector fused with topic information through the bidirectional long short-term memory network layer (BiLSTM) to model its contextual relations, obtaining the news semantic representation vector. The specific formulas are as follows:

H = BiLSTM(X′), H ∈ R^{n×2q}
y = sigmoid(W·H + b)

where H is the sentence vector encoded by the BiLSTM, q is the BiLSTM hidden-layer dimension, and y denotes the final probability output.

As a preferred scheme of the present invention, the specific steps of Step2 are:

Step2.1: Pass the remaining unlabeled data samples in the data set through the classifier and the topic model to predict class probabilities. The prediction result is the probability that a news item is event-related news.

Step2.2: Sort the predictions for the unlabeled news by probability from high to low. Each prediction round takes, according to a given iteration stride, the highest-probability data as reliable event-related news samples and the lowest-probability data as reliable negative samples; these samples are removed from the unlabeled set and added to the training data for the subsequent iterative training process.

As a preferred scheme of the present invention, the specific steps of Step3 are:

Step3.1: After completing the initial training and prediction process, retrain the classifier on the newly obtained training set and repeat the entire prediction and training process.

Step3.2: After each iteration, the amount of unlabeled data decreases while the training set grows; once all the unlabeled data have been predicted as reliable samples, the entire iterative process is complete.

The present invention constructs an event-related news data set for experiments and conducts three types of experiments with the proposed method: a comparison against the performance of the PU classification algorithm without topic enhancement, including an analysis of the prediction performance of both during iterative training; a comparison across initial data sets of different scales; and an iteration-stride comparison, which verifies the effectiveness of the proposed method against the topic-free PU classification algorithm under different strides. The experimental results confirm the effectiveness of our method for the event-related news relevance analysis task and show that using topic information to enhance the PU learning iteration improves model performance.

The choice of experimental parameters directly affects the final results. Since the news bodies in the data set are about 100 to 250 characters long, and to facilitate experimental verification, the present invention manually labeled all the data, comprising 10,000 event-related news items and 20,000 non-event-related news items. The maximum body length is set to 200 characters. The Adam algorithm is used as the optimizer; the learning rate is set to 0.001; the dropout of the single-layer Bi-LSTM is set to 0.2; the batch size is set to 128; the number of training epochs is set to 20; and the number of training iterations is the ratio of the total amount of unlabeled data to the number of positive and negative samples extracted per round. The evaluation metrics are accuracy (Acc.), precision (P), recall (R), and the F1 value.
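
For reference, the settings reported above can be gathered into a single configuration sketch; the dictionary form and key names are illustrative:

    TRAIN_CONFIG = {
        "max_text_len": 200,     # maximum news body length (characters)
        "optimizer": "Adam",
        "learning_rate": 0.001,
        "bilstm_dropout": 0.2,   # single-layer Bi-LSTM dropout
        "batch_size": 128,
        "epochs": 20,
        # iterations = total unlabeled count / samples extracted per round
    }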

Comparing the proposed method with the traditional PU learning method across these three types of experiments shows that, with the initial data scale and iteration stride fixed, the proposed method outperforms the traditional PU method at every iteration; when the initial data scale is small or the iteration stride is large, the performance gain is larger and more stable.

The comparison with the PU classification algorithm without topic enhancement mainly verifies the effectiveness of the proposed method on event-related news filtering with only a small number of event-related news samples, as well as the enhancement that topic information brings to the PU learning iteration. Two groups of experiments were set up: one uses a held-out validation set to evaluate the generalization performance of the classifiers trained during the iterative process, with results shown in Fig. 3; the other evaluates the predictions of each iteration's classifier on the remaining unlabeled samples, with results shown in Fig. 4. The analysis of Fig. 3 shows that in the fully supervised setting the F1 upper bound of the data set and classification model used by the present invention reaches 83.4%, whereas plain PU learning reaches an F1 of only 73.9%, a gap of 9.5 percentage points. Under the same experimental settings, the proposed method reaches an F1 of 75.7%, a 1.8% improvement over plain PU learning. This demonstrates the effectiveness of the proposed method for event-related news filtering with only a small number of samples and the enhancement the topic model brings to PU learning.

The analysis of Fig. 4 shows that the proposed method is consistently ahead of the traditional PU learning scheme when predicting unlabeled data, and that the gap between the two widens as the number of iterations increases, which further indicates that the proposed improvement to the PU learning method is effective.

In the comparison across initial data of different scales, the proposed method improves over the topic-free PU learning method on the held-out validation set; the results are shown in Fig. 5. The analysis of Fig. 5 shows that with an initial labeled data scale of only 500, the traditional PU learning method has almost failed. This is because PU learning depends on the scale of the initial labeled data: when the initial data scale is too small, the accuracy of the trained classifier is too low and the "reliable" positive and negative samples obtained in subsequent predictions carry too much error; as the iterative process proceeds, this error is accumulated and amplified and eventually causes PU learning to fail. As the initial data scale grows, the error of each iteration becomes smaller and the final training result becomes better, which is a common phenomenon in PU learning that the proposed method also follows. The difference is that the proposed method shows better adaptability when the initial data set is small: when the initial data scale is only 750, the F1 gap between the proposed method and traditional PU learning reaches 9.4%, and the gap gradually narrows as the initial data scale increases. This is because, when the initial data is scarce, the unsupervised topic model brings more information to the small amount of training data, so the performance of the initial classifier does not degrade too much, which ultimately alleviates this "error accumulation" phenomenon.

In the iteration-stride comparison, the proposed method improves over the topic-free PU learning method on the held-out validation set; the results are shown in Fig. 6. The analysis of Fig. 6 shows that with iteration strides of 300 and 500, both the present invention and traditional PU learning maintain a good performance level; when the stride is enlarged further, the performance of traditional PU learning begins to drop sharply, while the proposed method still remains at a good level. However, when the iteration stride reaches 1500, both the present invention and traditional PU learning fail: with an initial data scale of only 1000, the accuracy of the classification model trained by PU learning is limited, and even with topic information added for enhancement the required accuracy cannot be reached.

This proves that the learning method of the present invention can filter event-related news better, effectively solves the lack of training data in the event-related news filtering task, and improves the accuracy of event-related news filtering results.

The specific embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments; various changes can be made within the scope of knowledge possessed by those of ordinary skill in the art without departing from the gist of the present invention.

Claims (3)

1. A learning method for event-related news filtering based on fused-topic-information-enhanced PU learning, characterized in that the specific steps of the method are as follows:

Step1: Train the classifier, adding an unsupervised topic model for enhancement;

Step2: Predict the unlabeled data with the trained classifier model, then sort the predictions for the unlabeled news by probability from high to low;

the specific steps of Step2 being:

Step2.1: Pass the remaining unlabeled data samples in the data set through the classifier and the topic model to predict class probabilities; the prediction result is the probability that a news item is event-related news;

Step2.2: Sort the predictions for the unlabeled news by probability from high to low; each prediction round takes, according to a given iteration stride, the highest-probability data as reliable event-related news samples and the lowest-probability data as reliable negative samples, removes these samples from the unlabeled set, and adds them to the training data for the subsequent iterative training process;

Step3: After the initial training and prediction process is completed, iterate the PU learning, i.e., retrain the classifier on the newly obtained training set and repeat the entire prediction and training process;

Step4: Put all the samples into the classifier for training to obtain the required event-related news classification model, and filter out the desired event-related news based on the event-related news classification model;

Step1 comprising:

using the I-DNF algorithm to extract non-event-related news data, obtaining negative examples on the same scale as the initial event-related news, so as to train the initial classifier, with the unsupervised topic model VAE added for enhancement;

wherein a network consisting of an Embedding layer and a bidirectional long short-term memory network (BiLSTM) is used as the classifier;

first, the Embedding network layer embeds the input text, yielding the word embedding matrix X ∈ R^{n×v}, where n denotes the news text length and v the word-vector dimension; in addition, the input text is passed through the VAE topic model to obtain the topic vector T ∈ R^{1×m} of the news text, where m is the preset number of topics, yielding two kinds of encoded information;

the topic vector T of the news text is used to guide the word embedding matrix X; since the topic vector obtained by the topic model is a vector of shape 1×m, it is copied n times and concatenated to the word embedding matrix X, and the new matrix X′ so formed is the news encoding vector fused with the topic vector;

the news encoding vector fused with topic information is passed through the bidirectional long short-term memory network layer (BiLSTM) to model its contextual relations, obtaining the news semantic representation vector, where H is the sentence vector encoded by the BiLSTM, q is the BiLSTM hidden-layer dimension, and y denotes the final probability output.

2. The learning method for event-related news filtering based on fused-topic-information-enhanced PU learning according to claim 1, characterized in that the specific steps of obtaining negative examples on the same scale as the initial event-related news are as follows:

Step1.1.1: If a text feature appears with a frequency greater than 90% in the positive-example set while its frequency in the unlabeled set is only 10%, treat such a feature as a positive-example feature;

Step1.1.2: Using the different frequencies with which features appear in the positive-example set and the unlabeled set, build a positive-example feature set;

Step1.1.3: If a sample document in the unlabeled set U contains no feature from the positive-example feature set, extract it from U and label it as a negative example.

3. The learning method for event-related news filtering based on fused-topic-information-enhanced PU learning according to claim 1, characterized in that the specific steps of Step3 are:

Step3.1: After completing the initial training and prediction process, retrain the classifier on the newly obtained training set and repeat the entire prediction and training process;

Step3.2: After each iteration, the amount of unlabeled data decreases while the training set grows; once all the unlabeled data have been predicted as reliable samples, the iterative process ends.
CN202110347488.5A 2021-03-31 2021-03-31 Event-related news filtering learning method based on fusion topic information enhanced PU learning Active CN113641888B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110347488.5A CN113641888B (en) 2021-03-31 2021-03-31 Event-related news filtering learning method based on fusion topic information enhanced PU learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110347488.5A CN113641888B (en) 2021-03-31 2021-03-31 Event-related news filtering learning method based on fusion topic information enhanced PU learning

Publications (2)

Publication Number Publication Date
CN113641888A CN113641888A (en) 2021-11-12
CN113641888B true CN113641888B (en) 2023-08-29

Family

ID=78415731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110347488.5A Active CN113641888B (en) 2021-03-31 2021-03-31 Event-related news filtering learning method based on fusion topic information enhanced PU learning

Country Status (1)

Country Link
CN (1) CN113641888B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501898B (en) * 2023-06-29 2023-09-01 之江实验室 Financial text event extraction method and device suitable for few samples and biased data

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050114313A1 (en) * 2003-11-26 2005-05-26 Campbell Christopher S. System and method for retrieving documents or sub-documents based on examples
US11468358B2 (en) * 2017-11-30 2022-10-11 Palo Alto Networks (Israel Analytics) Ltd. Framework for semi-supervised learning when no labeled data is given
US11544558B2 (en) * 2019-08-30 2023-01-03 Nec Corporation Continual learning of artificial intelligence systems based on bi-level optimization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108881196A (en) * 2018-06-07 2018-11-23 中国民航大学 The semi-supervised intrusion detection method of model is generated based on depth
CN110263166A (en) * 2019-06-18 2019-09-20 北京海致星图科技有限公司 Public sentiment file classification method based on deep learning
CN110852419A (en) * 2019-11-08 2020-02-28 中山大学 An action model based on deep learning and its training method
CN111538807A (en) * 2020-04-16 2020-08-14 上海交通大学 System and method for acquiring knowledge of Web API based on Stack Overflow website
CN112434744A (en) * 2020-11-27 2021-03-02 北京奇艺世纪科技有限公司 Training method and device for multi-modal feature fusion model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sunghwan Mac Kim et al., "Social Media Relevance Filtering Using Perplexity-Based Positive-Unlabelled Learning," Proceedings of the Fourteenth International AAAI Conference on Web and Social Media, pp. 370-379. *

Also Published As

Publication number Publication date
CN113641888A (en) 2021-11-12


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
TR01: Transfer of patent right

Effective date of registration: 2024-11-20

Address after: 1701, 17th Floor, Building A, Kunming University of Science and Technology Science Park, Kunming, Yunnan Province 650000, China
Patentee after: SCIENCE & TECHNOLOGY PARK CO.,LTD OF KUNMING University OF SCIENCE AND TECHNOLOGY
Country or region after: China

Address before: 650093 No. 253, Xuefu Road, Wuhua District, Yunnan, Kunming
Patentee before: Kunming University of Science and Technology
Country or region before: China