CN111241281A - Text similarity-based public opinion topic tracking method - Google Patents
- Publication number
- CN111241281A
- Authority
- CN
- China
- Prior art keywords
- text
- similarity
- calculation
- words
- text similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a public opinion topic tracking method based on text similarity. The method builds on the doc2vec model, which evolved from the word2vec model. doc2vec yields good vector representations of sentences, paragraphs, or documents and is well suited to processing public opinion topics, but it ignores the temporal characteristics of those topics. Compared with the prior art, the method represents sentences, paragraphs, or documents with relatively low-dimensional vectors, which reduces time complexity; its semantic representation is relatively more accurate, which improves the accuracy of text similarity calculation; and adding a time feature to the existing model ensures the timeliness of topics. Experimental tests show that the method works well for topic tracking.
Description
Technical Field
The invention belongs to the field of topic tracking in natural language processing, and in particular relates to a topic tracking method based on text similarity.
Background
Topic tracking means that, given one or more stories on a certain topic, related stories that arrive later are linked to that topic. It proceeds in two steps: first, a set of sample reports is given and a topic model is obtained through training; then reports on the same or similar topics are identified among subsequent reports. Topic tracking collects and organizes scattered, changing topics, helps users find the relations among topics, and gives an overall view of all aspects of a public opinion topic. With the development of related technologies, the targets and objects of topic tracking research are no longer limited to media information streams and are increasingly applied in various information-related fields. The invention tracks public opinion topics by calculating text similarity. At present there are two mainstream approaches to text similarity calculation: string-based and corpus-based.
1 String-based methods
String-based methods take the degree of string matching as the measure of similarity and, depending on the granularity of calculation, can be divided into character-based and word-based methods. Algorithms that consider only the characters or words that make up the text include edit distance, Hamming distance, the Dice coefficient, and cosine similarity. Methods that additionally take character order into account include Jaro-Winkler and the longest common substring. Building on these, set-based methods treat a string as a set of words and measure word co-occurrence through set intersection; the main methods of this kind are N-gram and Jaccard.
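Three of the string-based measures named above can be illustrated with a minimal sketch. The function names and the choice of character bigrams as the set elements are assumptions made for illustration, not part of the patent.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def char_ngrams(text: str, n: int = 2) -> set:
    """Set of character n-grams (assumed granularity: bigrams by default)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Size of the intersection divided by size of the union of n-gram sets."""
    sa, sb = char_ngrams(a, n), char_ngrams(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def dice(a: str, b: str, n: int = 2) -> float:
    """Twice the intersection size divided by the sum of the set sizes."""
    sa, sb = char_ngrams(a, n), char_ngrams(b, n)
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))
```

For Chinese text the same functions operate on characters directly, since a Python string is a sequence of Unicode code points.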
2 Corpus-based methods
Corpus-based methods calculate text similarity using information obtained from a corpus. They can be further divided into methods based on the bag-of-words model and methods based on neural network models; both use the set of documents whose similarity is to be compared as the corpus.
1) Bag-of-words model
The bag-of-words model is based on the distributional hypothesis, that is, words that occur in similar contexts have similar meanings. Its basic idea is to represent a document as a combination of words without considering the order in which the words appear in the document. Depending on the level of semantics captured, the mainstream bag-of-words models include the Vector Space Model (VSM), Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA).
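The vector space model can be sketched in a few lines: each document becomes a vector of term counts, and similarity is the cosine of the angle between two such vectors. Raw counts (rather than TF-IDF weights) and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def bow_vector(tokens):
    """Represent a tokenized document as term counts; word order is discarded."""
    return Counter(tokens)


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = bow_vector(["topic", "tracking", "method"])
doc2 = bow_vector(["method", "tracking", "topic"])  # same words, different order
doc3 = bow_vector(["weather", "report"])
```

Because order is ignored, `doc1` and `doc2` are indistinguishable to this model, which is the limitation that the neural approaches described next try to address.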
2) Neural network model
Calculating text similarity from word vectors generated by a neural network model has been an active research area in recent years, and a number of word vector models such as Word2Vec and GloVe have been proposed. In essence, a word vector is a low-dimensional real-valued vector trained from unlabeled, unstructured text. This representation places similar words closer together, and it also mitigates the curse of dimensionality and the lack of semantics that the bag-of-words model suffers from because it treats words as independent.
3 doc2vec algorithm
The doc2vec algorithm was proposed by Google in 2014 as an extension of the word2vec model. It is an unsupervised algorithm that learns a representation of a document and yields vector representations of sentences, paragraphs, or documents. The similarity between sentences, paragraphs, or documents can be found by calculating the distance between the learned vectors, which can be used to cluster unlabeled text; labeled data can also be used for text classification through supervised learning. Compared with word2vec, doc2vec adds a paragraph id during training: each sentence in the training corpus has a unique id. Like an ordinary word, the paragraph id is first mapped to a vector, the paragraph vector. The paragraph vector has the same dimension as the word vectors but comes from a different vector space. In the subsequent calculation, the paragraph vector and the word vectors are summed or concatenated and used as the input to the softmax output layer. While a sentence or document is being trained, the paragraph id stays fixed and the same paragraph vector is shared, so the semantics of the whole sentence are used every time the probability of a word is predicted.
The doc2vec model framework is shown in FIG. 1; its task is to predict a word given its context. Each word is mapped into a vector space, and the context word vectors are concatenated or summed as features to predict the next word in the sentence. The objective function is given by formula (1.1). The prediction task is a multi-class classification task whose final layer is a softmax, computed as in formula (1.2). Each word is treated as a class, and its score is computed as in formula (1.3), where $U$ and $b$ are parameters and $h$ is the concatenation or average of $w_{t-k}, \ldots, w_{t+k}$.
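$$\frac{1}{T}\sum_{t=k}^{T-k}\log p(w_t \mid w_{t-k},\ldots,w_{t+k}) \tag{1.1}$$
$$p(w_t \mid w_{t-k},\ldots,w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}} \tag{1.2}$$
$$y = b + Uh(w_{t-k},\ldots,w_{t+k};W) \tag{1.3}$$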
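The score and softmax computations can be traced numerically with a small sketch. It assumes the averaging variant of $h$ (the text also allows concatenation) and uses tiny randomly initialized matrices; the sizes, the seed, and all names are illustrative assumptions, and no training is performed.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_word_probabilities(paragraph_vec, context_vecs, U, b):
    """Compute p(word | context, paragraph) for every word in the vocabulary.

    h is the average of the paragraph vector and the context word vectors,
    y = b + U h gives one score per vocabulary word, and softmax(y) turns
    the scores into probabilities.
    """
    vectors = [paragraph_vec] + context_vecs
    dim = len(paragraph_vec)
    h = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    y = [b_i + sum(u * x for u, x in zip(row, h)) for row, b_i in zip(U, b)]
    return softmax(y)


# Toy parameters (assumed sizes): 5-word vocabulary, 4-dimensional vectors.
rng = random.Random(0)
VOCAB, DIM = 5, 4
W = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]  # word vectors
D = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(2)]      # paragraph vectors
U = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]  # output weights
b = [0.0] * VOCAB                                                     # output bias

# Probability of each vocabulary word given paragraph 0 and context words 1 and 3.
probs = predict_word_probabilities(D[0], [W[1], W[3]], U, b)
# One term of the objective: log-probability of the true word (index 2 here).
log_likelihood_term = math.log(probs[2])
```

Training would adjust `W`, `D`, `U`, and `b` to maximize the average of such log-probability terms over the corpus.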
Disclosure of Invention
The invention provides a public opinion topic tracking method based on text similarity. It builds on the doc2vec model, which Google proposed in 2014 as an evolution of the word2vec model. doc2vec is an unsupervised algorithm that yields good vector representations of sentences, paragraphs, or documents and is well suited to processing public opinion topics, but it ignores the temporal characteristics of those topics.
Step 1 data preprocessing
1) Text data are obtained with a web crawler. The data sources are Sina News and People's Daily Online; the crawled content consists of trending public opinion topics and the news related to them. The main purpose of crawling is to obtain a high-quality public opinion topic corpus.
2) Chinese word segmentation is the process of splitting a continuous character sequence into individual words according to an understanding of Chinese. The jieba word segmentation tool is used to segment the text; the result after segmentation is shown in FIG. 4, where each sentence has been split into individual words.
3) Normal Chinese text or sentences contain special characters such as commas, enumeration commas, and periods. These characters remain after word segmentation (see FIG. 3) and affect the speed and precision of text similarity calculation, so they need to be filtered out. Besides these special characters, stop words have a similar effect on the calculation while contributing almost nothing to the final result, so they are also filtered out in the data preprocessing stage.
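The filtering in item 3) can be sketched as follows. The input is assumed to be a list of tokens already produced by a word segmenter such as jieba; the stop-word list is a tiny hypothetical sample, since the patent does not give one, and the function names are illustrative.

```python
import unicodedata

# Hypothetical stop-word sample for illustration only; a real list would be far longer.
STOP_WORDS = {"的", "了", "是", "在"}


def is_punctuation(token: str) -> bool:
    """True when every character in the token is in a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop whitespace-only tokens, punctuation tokens, and stop words."""
    return [
        t for t in tokens
        if t.strip() and not is_punctuation(t) and t not in stop_words
    ]


# Stand-in for segmenter output of a short sentence.
segmented = ["今天", "的", "天气", "，", "很", "好", "。"]
cleaned = clean_tokens(segmented)
```

Checking the Unicode category rather than a fixed character list covers both ASCII and full-width Chinese punctuation without enumerating them.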
Step 2 text similarity calculation
Because the text data are captured from the Internet, the data may be very short after step 1. To reduce or eliminate the influence of short texts on the final similarity result, text similarity is calculated in one of two modes: texts with a length of less than 150 use sentence-level calculation, and all other texts use document-level calculation. A time feature is added to the calculation, and time is compared first: if the time difference is greater than 30 days and the number of news items with similarity less than 0.70 is less than 100, the similarity is considered low; if the time difference is greater than 30 days and the number of news items with similarity greater than or equal to 0.70 is greater than 100, the similarity is considered high. The corresponding text similarity is then obtained through a final weighting step.
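The mode selection and the time rule above can be written down as a minimal sketch. It follows a literal reading of the stated thresholds; measuring length in characters, checking the "high" condition before the "low" one, returning "unchanged" when neither applies, and the numeric weights used in the final step are all assumptions, because the text does not specify them.

```python
from datetime import date

LENGTH_THRESHOLD = 150       # texts shorter than this use sentence-level mode
TIME_WINDOW_DAYS = 30        # time difference beyond which the rule applies
SIMILARITY_THRESHOLD = 0.70  # boundary between similar and dissimilar news
NEWS_COUNT_THRESHOLD = 100   # count boundary used by both conditions

# Hypothetical weights: the source only says a "final weighting" is applied.
ASSUMED_WEIGHTS = {"high": 1.0, "unchanged": 1.0, "low": 0.5}


def calculation_mode(text: str) -> str:
    """Choose sentence-level or document-level calculation by text length."""
    return "sentence" if len(text) < LENGTH_THRESHOLD else "document"


def time_feature(topic_date: date, news_date: date, similarities) -> str:
    """Label the similarity as 'high', 'low', or 'unchanged' using the time rule."""
    if abs((news_date - topic_date).days) <= TIME_WINDOW_DAYS:
        return "unchanged"
    n_similar = sum(1 for s in similarities if s >= SIMILARITY_THRESHOLD)
    n_dissimilar = len(similarities) - n_similar
    if n_similar > NEWS_COUNT_THRESHOLD:      # assumed to take precedence
        return "high"
    if n_dissimilar < NEWS_COUNT_THRESHOLD:
        return "low"
    return "unchanged"


def weighted_similarity(raw: float, label: str) -> float:
    """Apply the assumed weight for the time label to a raw similarity score."""
    return raw * ASSUMED_WEIGHTS[label]
```

Keeping the thresholds as named constants makes it easy to replace the assumed values once the intended weighting is known.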
Step 3 topic tracking results
Based on the vector representation of each text obtained in step 2, the invention uses the k-means algorithm to display the text data graphically so that the calculation result is easier to see; the result is shown in FIG. 7.
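The clustering behind such a display can be sketched with a small k-means implementation over text vectors. The evenly spaced initial centroids are a simplification chosen to keep the example deterministic (practical implementations usually initialize randomly), the two-dimensional points stand in for learned text vectors, and the plotting step itself is omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=20):
    """Cluster points into k groups; returns (labels, centroids)."""
    # Simplified deterministic initialization: evenly spaced input points.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so the run has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Stand-in text vectors forming two well-separated groups.
vectors = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
           [5.0, 5.1], [5.1, 5.0], [4.9, 5.2]]
labels, centroids = kmeans(vectors, k=2)
```

Each resulting label can then be mapped to a color or marker when the vectors are plotted.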
Compared with the prior art, the method represents sentences, paragraphs, or documents with relatively low-dimensional vectors, which reduces time complexity; its semantic representation is relatively more accurate, which improves the accuracy of text similarity calculation; and adding a time feature to the existing model ensures the timeliness of topics. Experimental tests show that the method works well for topic tracking.
Drawings
FIG. 1 is a diagram of the doc2vec model architecture used in the invention.
FIG. 2 is an overall flowchart of public opinion topic tracking based on text similarity according to the invention.
FIG. 3 shows the public opinion topic corpus of the invention.
FIG. 4 shows the result after word segmentation is completed.
FIG. 5 shows the result after stop-word removal is completed.
FIG. 6 shows the result after text similarity calculation is completed.
FIG. 7 shows the final topic tracking result of the invention.
Detailed Description
The embodiment of the invention is described below with reference to the drawings. Public opinion topic tracking based on text similarity is divided into the following steps.
Step 1, text acquisition
Text data are obtained with a web crawler. The data sources are Sina News and People's Daily Online; the crawled content consists mainly of news on public opinion topics and the news related to those topics. The main purpose of crawling is to obtain a high-quality public opinion topic corpus.
Step 2, Chinese word segmentation
Chinese word segmentation is the process of splitting a continuous character sequence into individual words according to an understanding of Chinese. In the implementation, the jieba word segmentation tool is used to segment the text; the result after segmentation is shown in FIG. 4, where each sentence has been split into individual words.
Step 3, stop-word removal
Normal Chinese text or sentences usually contain special characters such as commas, enumeration commas, and periods. These characters remain after word segmentation (see FIG. 3) and affect the speed and precision of text similarity calculation, so they need to be filtered out. Besides these special characters, stop words have a similar effect on the calculation while contributing almost nothing to the final result, so they are also filtered out in the data preprocessing stage.
Step 4, calculating text similarity
Because the text data are captured from the Internet, the data may be very short after steps 1 and 2. To reduce or eliminate the influence of short texts on the final similarity result, text similarity is calculated in one of two modes: texts with a length of less than 150 use sentence-level calculation, and all other texts use document-level calculation. A time feature is added to the calculation, and time is compared first: if the time difference is greater than 30 days and the number of news items with similarity less than 0.70 is less than 100, the similarity is considered low; if the time difference is greater than 30 days and the number of news items with similarity greater than or equal to 0.70 is greater than 100, the similarity is considered high. The corresponding text similarity is then obtained through a final weighting step.
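Once texts are represented as vectors, the 0.70 threshold can be applied as in the following sketch. Cosine similarity is assumed as the vector comparison, since the text speaks only of distance between learned vectors; the function names and the toy vectors are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_news(topic_vec, news_vecs, threshold=0.70):
    """Indices of news vectors whose similarity reaches the threshold, best first."""
    scored = [(cosine(topic_vec, v), i) for i, v in enumerate(news_vecs)]
    return [i for score, i in sorted(scored, reverse=True) if score >= threshold]


# Toy vectors standing in for learned sentence or document vectors.
topic = [1.0, 0.0]
news = [[0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]
matches = related_news(topic, news)
```

The length of `matches` is the kind of count that the time rule compares against 100.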
Step 5, topic tracking result
Based on the vector representation of each text obtained in step 4, the invention uses the k-means algorithm to display the text data graphically so that the calculation result is easier to see; the result is shown in FIG. 7.
Claims (1)
1. A public opinion topic tracking method based on text similarity, characterized in that the method comprises the following steps:
Step 1 data preprocessing
1) text data are obtained by crawling trending public opinion topics and the news related to those topics with a web crawler, so as to obtain a high-quality public opinion topic corpus;
2) Chinese word segmentation is the process of splitting a continuous character sequence into individual words according to an understanding of Chinese; the jieba word segmentation tool is used to segment the text, so that each sentence is split into individual words;
3) normal Chinese text or sentences contain special characters such as commas, enumeration commas, and periods; these characters remain after word segmentation and affect the speed and precision of text similarity calculation, so they need to be filtered out; besides these special characters, stop words also affect the text similarity calculation without contributing to the final result, so they are filtered out in the data preprocessing stage;
Step 2 text similarity calculation
Because the text data are captured from the Internet, the data may be very short after step 1; text similarity is therefore calculated in one of two modes: texts with a length of less than 150 use sentence-level calculation, and all other texts use document-level calculation; a time feature is added to the calculation, and time is compared first: if the time difference is greater than 30 days and the number of news items with similarity less than 0.70 is less than 100, the similarity is considered low; if the time difference is greater than 30 days and the number of news items with similarity greater than or equal to 0.70 is greater than 100, the similarity is considered high; the corresponding text similarity is then obtained through a final weighting step;
Step 3 topic tracking results
Based on the vector representation of each text obtained in step 2, the text data are displayed graphically using the k-means algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010031039.5A CN111241281A (en) | 2020-01-13 | 2020-01-13 | Text similarity-based public opinion topic tracking method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010031039.5A CN111241281A (en) | 2020-01-13 | 2020-01-13 | Text similarity-based public opinion topic tracking method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111241281A (en) | 2020-06-05 |
Family
ID=70863967
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010031039.5A Pending CN111241281A (en) | 2020-01-13 | 2020-01-13 | Text similarity-based public opinion topic tracking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111241281A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101980199A (en) * | 2010-10-28 | 2011-02-23 | Beijing Jiaotong University | Method and system for discovering network hot topics based on situation assessment |
CN107894994A (en) * | 2017-10-18 | 2018-04-10 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Method and apparatus for detecting hot topic categories |
CN108536781A (en) * | 2018-03-29 | 2018-09-14 | Wuhan University | Method and system for mining emotional focus in social networks |
CN110134787A (en) * | 2019-05-15 | 2019-08-16 | Beijing Information Science and Technology University | News topic detection method |
CN110516067A (en) * | 2019-08-23 | 2019-11-29 | Beijing Technology and Business University | Public opinion monitoring method, system and storage medium based on topic detection |
Non-Patent Citations (1)
Title |
---|
Li Feng et al.: "Research on the application of Doc2vec in policy text classification", 《软件》 (Software) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yue et al. | A survey of sentiment analysis in social media | |
Wang et al. | Application of convolutional neural network in natural language processing | |
Li et al. | Sentiment analysis of danmaku videos based on naïve bayes and sentiment dictionary | |
CN106997382B (en) | Innovative creative tag automatic labeling method and system based on big data | |
Yu et al. | An attention mechanism and multi-granularity-based Bi-LSTM model for Chinese Q&A system | |
He et al. | Applying deep matching networks to Chinese medical question answering: a study and a dataset | |
Chang et al. | Research on detection methods based on Doc2vec abnormal comments | |
Cai et al. | Intelligent question answering in restricted domains using deep learning and question pair matching | |
CN110598219A (en) | Emotion analysis method for broad-bean-net movie comment | |
Ji et al. | Survey of visual sentiment prediction for social media analysis | |
CN112000804B (en) | Microblog hot topic user group emotion tendentiousness analysis method | |
Chen et al. | Sentiment classification of tourism based on rules and LDA topic model | |
Nasim et al. | Cluster analysis of urdu tweets | |
Yu et al. | Question classification based on MAC-LSTM | |
CN113934835B (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
Zhang et al. | Chinese-English mixed text normalization | |
Jiang et al. | A hierarchical bidirectional LSTM sequence model for extractive text summarization in electric power systems | |
Al-Sultany et al. | Enriching tweets for topic modeling via linking to the wikipedia | |
CN111241281A (en) | Text similarity-based public opinion topic tracking method | |
Li et al. | Emotion analysis for the upcoming response in open-domain human-computer conversation | |
Jirasirilerd et al. | Automatic labeling for Thai news articles based on vector representation of documents | |
Zhang et al. | Sentiment analysis on Chinese health forums: a preliminary study of different language models | |
Khodaei et al. | A Transfer-Based Deep Learning Model for Persian Emotion Classification | |
Jiang et al. | Transfer learning based recurrent neural network algorithm for linguistic analysis | |
Zhao et al. | Semantic computation in geography question answering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200605 |