CN111241281A - Text similarity-based public opinion topic tracking method

Info

Publication number: CN111241281A
Application number: CN202010031039.5A
Authority: CN
Inventors: 张涛; 张琨; 朱显坤
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2020-01-13
Filing date: 2020-01-13
Publication date: 2020-06-05

Abstract

The invention discloses a public opinion topic tracking method based on text similarity. The method builds on the doc2vec model, which evolved from word2vec and yields good vector representations of sentences, paragraphs, or documents, making it well suited to processing public opinion topics; the model, however, ignores the temporal characteristics of such topics. Compared with the prior art, the method represents sentences, paragraphs, or documents with relatively low-dimensional vectors, which reduces time complexity; it expresses semantics more accurately, which improves the accuracy of text similarity calculation; and it adds a time characteristic to the existing model to keep topics timely. Experimental tests show that the method performs well on topic tracking.

Description

A public opinion topic tracking method based on text similarity

Technical field

The invention belongs to the field of topic tracking in natural language processing, and in particular relates to research and innovation on a topic tracking method based on text similarity.

Background

Topic tracking means that, given one or more reports on a topic, incoming related reports are linked to that topic. The task can be divided into two steps: first, a set of sample reports is given and a topic model is obtained through training; then reports on the same or similar topics are identified among subsequent reports. Topic tracking can gather and organize scattered, fast-changing topics, help users discover relationships between topics, and give an overall view of the various aspects of public opinion topics and the connections between them. As the related technology has developed, the research goals and processing objects of topic tracking are no longer limited to media information streams; the technique is increasingly applied across information-related fields. The present invention tracks public opinion topics by calculating text similarity. There are currently two mainstream approaches to text similarity calculation: string-based and corpus-based.

1 String-based

String-based methods start from the degree of string matching and measure similarity by how much strings co-occur and repeat. Depending on the granularity of the calculation, they are divided into character-based and word-based methods. Similarity measures that consider only the composition of characters or words include edit distance, Hamming distance, the Dice coefficient, and cosine similarity. Methods that additionally take character order into account include Jaro-Winkler and the longest common substring. Building on these two kinds of methods, a set-based view treats a string as a set of words and computes word co-occurrence from the intersection of the sets; the main methods of this kind are N-gram and Jaccard.
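Two of the string-based measures named above can be sketched in a few lines of Python. This is a minimal illustration, not part of the invention: the function names `edit_distance` and `jaccard` are chosen here, and the empty-input convention in `jaccard` is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(tokens_a, tokens_b) -> float:
    """Jaccard similarity: size of the intersection of the two word sets
    divided by the size of their union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # assumed convention: two empty texts are identical
    return len(set_a & set_b) / len(set_a | set_b)
```

For example, `edit_distance("kitten", "sitting")` is 3, and `jaccard(["a", "b", "c"], ["b", "c", "d"])` is 0.5 because two of the four distinct words are shared.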

2 Corpus-based

Corpus-based methods calculate text similarity using information obtained from a corpus. They can be divided into bag-of-words methods and neural-network methods, both of which use the set of documents whose similarity is to be compared as the corpus.

1) Based on the bag-of-words model

The bag-of-words model is built on the distributional hypothesis: words that occur in similar contexts have similar meanings. Its basic idea is to ignore the order in which words appear in a document and to represent the document as a combination of words. Depending on how semantics are modeled, bag-of-words methods currently include the Vector Space Model (VSM), Probabilistic Latent Semantic Analysis (PLSA), Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA).
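The order-insensitive representation described above can be illustrated with a small vector-space sketch: each document becomes a vector of word counts, and two documents are compared by the cosine of the angle between their vectors. The function name `bow_cosine` and the use of raw counts (rather than a weighting such as TF-IDF) are assumptions made for this example.

```python
import math
from collections import Counter


def bow_cosine(tokens_a, tokens_b) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both documents contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Because word order is discarded, `bow_cosine(["topic", "tracking"], ["tracking", "topic"])` is 1 up to floating-point rounding, which shows both the simplicity of the model and the loss of ordering information it implies.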

2) Based on the neural network model

Using neural network models to generate word vectors for text similarity calculation has been an active research area in recent years, and many word vector models, such as Word2Vec and GloVe, have been proposed. In essence, a word vector is a low-dimensional real-valued vector trained from unlabeled, unstructured text. This representation places similar words closer together, and it also better addresses the curse of dimensionality and the lack of semantics that the bag-of-words model suffers from because it treats words as independent.

3 doc2vec algorithm

The doc2vec model was proposed by Google in 2014 as an evolution of the word2vec model. It is an unsupervised algorithm whose essence is to learn a representation of a document, yielding a vector for a sentence, paragraph, or document; it is an extension of word2vec. With the learned vectors, the similarity between sentences, paragraphs, or documents can be found by calculating distances, which can be used for clustering unlabeled text; for labeled data, supervised learning can also be applied for text classification. Compared with word2vec, training adds a paragraph id: every sentence in the training corpus has a unique id. Like an ordinary word, the paragraph id is first mapped to a vector, the paragraph vector. The paragraph vector has the same dimension as the word vectors but comes from a different vector space. In the subsequent computation, the paragraph vector and the word vectors are summed or concatenated and fed to the softmax output layer. While a sentence or document is being trained, the paragraph id stays fixed and the same paragraph vector is shared, so the semantics of the whole sentence are used every time the probability of a word is predicted.

The doc2vec framework is shown in Figure 1. Its task is, given a context, to predict the other words of that context. Each word is mapped into the vector space, and the concatenation or sum of the context word vectors is used as the feature for predicting the next word in the sentence; the objective function is given by formula (1.1). The prediction is a multi-class classification task, and the last layer of the classifier is a softmax, computed as in formula (1.2). Every word is treated as a class to be predicted, with the score computed as in formula (1.3), where $U$ and $b$ are parameters and $h$ is the concatenation or average of $w_{t-k}, \ldots, w_{t+k}$.

$$y = b + U\,h(w_{t-k}, \ldots, w_{t+k}; W) \qquad (1.3)$$
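Formulas (1.1) and (1.2) are cited in the text but do not appear in it. For reference, the corresponding expressions in the published Paragraph Vector formulation on which doc2vec is based (Le and Mikolov, 2014) can be written as below. The symbol $T$ (the number of words in the training sequence) comes from that formulation rather than from this document, and the equation numbers are assigned here on the assumption that they follow the order in which the text cites them.

```latex
% Objective: maximize the average log-probability of each word given its context
\frac{1}{T} \sum_{t=k}^{T-k} \log p\left(w_t \mid w_{t-k}, \ldots, w_{t+k}\right) \qquad (1.1)

% Softmax over the unnormalized scores y_i, one score per vocabulary word
p\left(w_t \mid w_{t-k}, \ldots, w_{t+k}\right) = \frac{e^{y_{w_t}}}{\sum_{i} e^{y_i}} \qquad (1.2)
```

Each $y_i$ in (1.2) is one component of the score vector $y$ defined in formula (1.3).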

Summary of the invention

The invention proposes a public opinion topic tracking method based on text similarity. It builds on the doc2vec model, proposed by Google in 2014 as an evolution of word2vec. The model is an unsupervised algorithm that yields good vector representations of sentences, paragraphs, or documents and is well suited to processing public opinion topics, but it ignores the temporal characteristics of such topics. The invention adds the time characteristic to the algorithm as an important feature to keep topics timely. In addition, to reduce the influence of text length on the final result, the text similarity calculation mode is selected according to the length of the text. Experimental results show that this approach performs well in tracking public opinion topics.

Step 1 Data preprocessing

1) The text data is collected with a web crawler from Sina News and People's Daily Online (人民网). The crawler fetches hot public opinion topics and the news related to each topic; the main purpose of crawling in this way is to obtain a high-quality public opinion topic corpus.

2) Chinese word segmentation is the process of dividing a continuous sequence of characters into individual words according to an understanding of Chinese. The jieba segmentation tool is used to segment the text; the result is shown in Figure 3, where the sentences have been divided into individual words.

3) Ordinary Chinese text or a sentence contains special characters such as commas, enumeration commas (、), and periods. After segmentation these special characters are retained, as shown in Figure 2, and they affect the speed and accuracy of the text similarity calculation, so they need to be filtered out. Besides these special characters, words such as 而且 ("moreover") and 不仅 ("not only") and the particles 的 and 了 have a similar effect on the calculation while contributing almost nothing to the final result, so they are also filtered out in the data preprocessing stage.

Step 2 Text similarity calculation

Because the text data is crawled from the web, a text may be very short after step 1. To reduce or eliminate the influence of such short texts on the final similarity result, text similarity is calculated in one of two ways: texts with a length of less than 150 are handled at sentence level, and all other texts at document level. The time characteristic is also included in the calculation. A time comparison is performed first: if the time difference is greater than 30 days and fewer than 100 news items have a similarity below 0.70, the similarity is considered low; if the time difference is greater than 30 days and more than 100 news items have a similarity greater than or equal to 0.70, the similarity is considered high. A final weighting step then yields the corresponding text similarity.
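The length-based choice between the two calculation modes can be sketched as follows. The threshold of 150 is the one stated above; measuring length in characters with `len`, and the names `SHORT_TEXT_LIMIT`, `similarity_level`, and `partition_by_level`, are assumptions made for this illustration.

```python
SHORT_TEXT_LIMIT = 150  # threshold from the text; the unit (characters) is assumed


def similarity_level(text: str, limit: int = SHORT_TEXT_LIMIT) -> str:
    """Return the calculation mode a text would be routed to."""
    return "sentence" if len(text) < limit else "document"


def partition_by_level(texts):
    """Group texts by calculation mode, preserving input order."""
    groups = {"sentence": [], "document": []}
    for text in texts:
        groups[similarity_level(text)].append(text)
    return groups
```

A text of exactly 150 characters is routed to the document-level mode, since only lengths strictly below the threshold count as short.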

Step 3 Topic tracking results

Step 2 yields the vector representation of each text. To present the results more clearly, the invention uses the k-means algorithm to display the text data graphically; the result is shown in Figure 7.

Compared with the prior art, the invention represents sentences, paragraphs, or documents with relatively low-dimensional vectors, which reduces time complexity; it expresses semantics more accurately, which improves the accuracy of text similarity calculation; and it adds a time characteristic to the existing model to keep topics timely. Experimental tests show that the invention performs well on topic tracking.

Description of drawings

Figure 1 is the architecture diagram of the doc2vec model of the present invention.

Figure 2 is the overall flow chart of public opinion topic tracking based on text similarity according to the present invention.

Figure 3 shows the public opinion topic corpus of the present invention.

Figure 4 shows the result after word segmentation.

Figure 5 shows the result after stop-word removal.

Figure 6 shows the result after the text similarity calculation.

Figure 7 shows the final topic tracking result of the present invention.

Detailed description

Embodiments of the invention are described below with reference to the accompanying drawings. Public opinion topic tracking based on text similarity consists mainly of the following steps:

Step 1. Text acquisition

The text data is collected with a web crawler from Sina News and People's Daily Online (人民网). The crawler mainly fetches news on public opinion topics and the news related to it; the main purpose of crawling in this way is to obtain a high-quality public opinion topic corpus.

Step 2. Chinese word segmentation

Chinese word segmentation is the process of dividing a continuous sequence of characters into individual words according to an understanding of Chinese. In the implementation, the jieba segmentation tool is used to segment the text; the result is shown in Figure 4, where the sentences have been divided into individual words.
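To make the idea of splitting a continuous character sequence into words concrete, the sketch below uses forward maximum matching against a small dictionary. This is a deliberately simpler technique than the one jieba uses and is shown only as a self-contained stand-in; the function name `forward_max_match`, the toy vocabulary, and the `max_word_len` parameter are all assumptions for this example.

```python
def forward_max_match(text, vocabulary, max_word_len=4):
    """Greedy dictionary segmentation: at each position take the longest
    dictionary word that matches, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary, hypothetical and far smaller than any real lexicon.
vocab = {"舆情", "话题", "跟踪", "方法"}
print(forward_max_match("舆情话题跟踪方法", vocab))
```

With this toy dictionary the call splits the input into the four words 舆情, 话题, 跟踪, and 方法. A production segmenter resolves ambiguities that greedy matching cannot, which is why the method above relies on a dedicated tool.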

Step 3. Remove stop words

Ordinary Chinese text or a sentence usually contains special characters such as commas, enumeration commas (、), and periods. After segmentation these special characters are retained, as shown in Figure 3, and they affect the speed and accuracy of the text similarity calculation, so they need to be filtered out. Besides these special characters, words such as 而且 ("moreover") and 不仅 ("not only") and the particles 的 and 了 have a similar effect on the calculation while contributing almost nothing to the final result, so they are also filtered out in the data preprocessing stage.
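The filtering described above can be sketched as a simple pass over the segmented tokens. The stop words listed are the four examples given in the text and the punctuation set is a small illustrative subset; a real system would load complete lists. The function name `remove_stop_words` and the sample tokens are assumptions, with the tokens written by hand in the shape a segmenter such as jieba might return.

```python
# Illustrative subsets only, not complete lists.
PUNCTUATION = {"，", "、", "。", "！", "？", "；", "："}
STOP_WORDS = {"而且", "不仅", "的", "了"}


def remove_stop_words(tokens, stop_words=STOP_WORDS, punctuation=PUNCTUATION):
    """Drop punctuation, stop words, and whitespace-only tokens."""
    return [t for t in tokens
            if t.strip() and t not in stop_words and t not in punctuation]


# Hand-written tokens standing in for segmenter output.
tokens = ["舆情", "话题", "的", "跟踪", "，", "而且", "效果", "良好", "。"]
print(remove_stop_words(tokens))
```

Only the content-bearing tokens 舆情, 话题, 跟踪, 效果, and 良好 remain after filtering.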

Step 4. Text similarity calculation

Because the text data is crawled from the web, a text may be very short after steps 1 and 2. To reduce or eliminate the influence of such short texts on the final similarity result, text similarity is calculated in one of two ways: texts with a length of less than 150 are handled at sentence level, and all other texts at document level. The time characteristic is also included in the calculation. A time comparison is performed first: if the time difference is greater than 30 days and fewer than 100 news items have a similarity below 0.70, the similarity is considered low; if the time difference is greater than 30 days and more than 100 news items have a similarity greater than or equal to 0.70, the similarity is considered high. A final weighting step then yields the corresponding text similarity.
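One possible reading of the time rule is sketched below. The three thresholds (30 days, 0.70, 100 items) are taken from the text; everything else is an assumption made for illustration: the function name `time_adjusted_label`, passing the time difference as a number of days, the `"undecided"` label for cases the rule does not cover, and checking the "high" condition before the "low" one when both hold. The final weighting step is not specified in the text and is therefore not modeled.

```python
MAX_GAP_DAYS = 30       # time-difference threshold from the text
SIM_THRESHOLD = 0.70    # similarity threshold from the text
COUNT_THRESHOLD = 100   # news-count threshold from the text


def time_adjusted_label(gap_days, similarities):
    """Label a batch of news similarity scores as 'high', 'low', or
    'undecided' under one interpretation of the time rule."""
    if gap_days <= MAX_GAP_DAYS:
        return "undecided"  # the rule only speaks about gaps over 30 days
    n_high = sum(1 for s in similarities if s >= SIM_THRESHOLD)
    n_low = len(similarities) - n_high
    if n_high > COUNT_THRESHOLD:   # assumed precedence: check 'high' first
        return "high"
    if n_low < COUNT_THRESHOLD:
        return "low"
    return "undecided"
```

Under this reading, a 45-day gap with 101 items scoring 0.9 is labeled `"high"`, while the same gap with only 50 items scoring 0.5 is labeled `"low"`.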

Step 5. Topic tracking results

Step 4 yields the vector representation of each text. To present the results more clearly, the invention uses the k-means algorithm to display the text data graphically; the result is shown in Figure 7.
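The clustering step can be illustrated with a minimal pure-Python k-means. The 2-D points below are toy stand-ins for document vectors, and the function name `kmeans`, the fixed iteration count, and the seeded random initialization are assumptions for this sketch; plotting the clusters is omitted.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: alternate between assigning each point to its nearest
    centroid and moving each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: recompute each centroid; keep it if its cluster is empty.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids


# Toy vectors forming two well-separated groups.
vectors = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(vectors, k=2)
```

On these toy vectors the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the initialization.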

Claims

1. A public opinion topic tracking method based on text similarity, characterized in that the method comprises the following steps:

Step 1 Data preprocessing

1) Text data is obtained by using a web crawler to fetch hot public opinion topics and the news related to those topics, yielding a high-quality public opinion topic corpus;

2) Chinese word segmentation is the process of dividing a continuous sequence of characters into individual words according to an understanding of Chinese; the jieba segmentation tool is used to segment the text, so that sentences are divided into individual words;

3) Ordinary Chinese text or a sentence contains special characters such as commas, enumeration commas, and periods; these special characters are retained after segmentation and affect the speed and accuracy of the text similarity calculation, so they need to be filtered out; in addition to the special characters, function words (for example "moreover" and "not only") likewise affect the calculation of text similarity while having almost no effect on the final result, so these words are filtered out in the data preprocessing stage;

Step 2 Text similarity calculation

Because the text data is content crawled from the internet, a text may be very short after step 1; text similarity is therefore calculated in one of two ways: texts with a length of less than 150 are handled at sentence level, and all other texts at document level; the time characteristic is included in the calculation, and a time comparison is performed first: if the time difference is greater than 30 days and fewer than 100 news items have a similarity below 0.70, the similarity is considered low; if the time difference is greater than 30 days and more than 100 news items have a similarity greater than or equal to 0.70, the similarity is considered high; a final weighting step yields the corresponding text similarity;

Step 3 Topic tracking results

The vector representation of each text is obtained from step 2, and the k-means algorithm is used to display the text data graphically.