CN107193796A - A public opinion event detection method and device


Info

Publication number
CN107193796A
Authority
CN
China
Prior art keywords
sensitive
text
feature
detected
feature word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610197073.3A
Other languages
Chinese (zh)
Other versions
CN107193796B (en)
Inventor
蔡慧慧
刘克松
张丹
于晓明
杨建武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd
Publication of CN107193796A
Application granted
Publication of CN107193796B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a public opinion event detection method and device. The method includes: obtaining the feature word vector of the text to be detected; obtaining the vectors corresponding to all feature words, and obtaining the sensitive sense vectors; calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words; obtaining the first sensitive sense, namely the one corresponding to the maximum similarity, obtaining the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected, calculating the weighted sum of these two counts according to a first preset weight and a second preset weight, and determining that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold. By vectorizing the text to be detected, the invention achieves effective semantic constraints; at the same time, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, it can accurately detect the public opinion events that need attention.

Description

A method and device for detecting public opinion events

Technical field

The invention relates to the field of computer technology, and in particular to a public opinion event detection method and device.

Background

With the rapid development of the Internet, online public opinion is becoming a forum where ordinary people express their interests, advocate social fairness and justice, and continuously convey shared public concerns to governments at all levels. More and more people are willing to post their views and the phenomena they observe online; dissemination over the network draws in more participants, which significantly affects the mood of Internet users and social stability. Accurately detecting public opinion events with modern technology is therefore of great significance.

At present, the detection of public opinion events still relies on matching text against a set of sensitive public opinion words. However, named entity words associated with public opinion events, such as personal names, transliterated foreign names, and abbreviated organization names, reflect public opinion only when they appear in the context of the associated event. For named entities that share a name, the meaning has to be analyzed against the background of the current public opinion event, and a traditional static corpus may not contain the latest explanatory sense for such ambiguous feature words. The traditional filtering method based on public opinion feature words (sensitive words, named entities, and so on) remains an important preprocessing step because it is simple to implement and efficient to run. Faced with massive volumes of Internet text, however, and especially with fragmented, non-standard social media content, this preprocessing filter lacks effective semantic constraints: it produces false positives, is prone to misjudgments and missed detections, and cannot accurately identify the public opinion events that need attention. In big-data public opinion early-warning applications it feeds a considerable amount of noisy data into subsequent processing, so a data preprocessing mechanism with semantic understanding capability is urgently needed.

Summary of the invention

Because the traditional feature word filtering method lacks effective semantic constraints when faced with massive volumes of Internet text, it is prone to misjudgments and missed detections and cannot accurately detect the public opinion events that need attention. To address this problem, the invention proposes a public opinion event detection method and device.

In a first aspect, the invention proposes a public opinion event detection method, including:

obtaining a feature word vector of the text to be detected, where each element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;

obtaining the vectors corresponding to all feature words from a semantic knowledge base, and obtaining sensitive sense vectors from a sensitive lexicon, where the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is the current sensitive sense;

calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, where the feature word vectors corresponding to all feature words include all sensitive sense vectors;

obtaining the first sensitive sense, namely the one corresponding to the maximum similarity, obtaining the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected, calculating the weighted sum of these two counts according to a first preset weight and a second preset weight, and determining that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold.

Preferably, before obtaining the feature word vector of the text to be detected, the method includes:

constructing the semantic knowledge base from web page content.

Preferably, the web page content is stored in an XML format file.

Preferably, the web page content is Wikipedia.

Preferably, after constructing the semantic knowledge base from web page content, the method includes:

establishing a sensitive lexicon according to the semantic knowledge base and the sensitive senses of preset feature words.

In a second aspect, the invention also proposes a public opinion event detection device, including:

a feature word vector acquisition module, configured to obtain a feature word vector of the text to be detected, where each element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;

a corresponding vector acquisition module, configured to obtain the vectors corresponding to all feature words from a semantic knowledge base and to obtain sensitive sense vectors from a sensitive lexicon, where the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is the current sensitive sense;

a similarity calculation module, configured to calculate the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, where the feature word vectors corresponding to all feature words include all sensitive sense vectors;

an event detection module, configured to obtain the first sensitive sense, namely the one corresponding to the maximum similarity, to obtain the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected, to calculate the weighted sum of these two counts according to a first preset weight and a second preset weight, and to determine that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold.

Preferably, the device further includes:

a semantic knowledge base construction module, configured to construct the semantic knowledge base from web page content.

Preferably, the web page content is stored in an XML format file.

Preferably, the web page content is Wikipedia.

Preferably, the device further includes:

a sensitive lexicon establishment module, configured to establish a sensitive lexicon according to the semantic knowledge base and the sensitive senses of preset feature words.

As the above technical solution shows, the invention achieves effective semantic constraints by vectorizing the text to be detected; at the same time, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, it can accurately detect the public opinion events that need attention, greatly reducing the probability of misjudgments and missed detections.

Description of drawings

To explain the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a public opinion event detection method provided by an embodiment of the invention;

Fig. 2 is a flow chart of a public opinion event detection method provided by an embodiment of the invention;

Fig. 3 is a schematic structural diagram of a public opinion event detection device provided by an embodiment of the invention.

Detailed description

Specific embodiments of the invention are further described below with reference to the drawings. The following embodiments serve only to illustrate the technical solution of the invention more clearly and do not limit its scope of protection.

Fig. 1 shows a schematic flow chart of a public opinion event detection method provided by an embodiment of the invention, including:

S101. Obtain a feature word vector of the text to be detected, where each element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;

S102. Obtain the vectors corresponding to all feature words from the semantic knowledge base, and obtain sensitive sense vectors from the sensitive lexicon, where the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is the current sensitive sense;

S103. Calculate the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, where the feature word vectors corresponding to all feature words include all sensitive sense vectors;

S104. Obtain the first sensitive sense, namely the one corresponding to the maximum similarity, obtain the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected, calculate the weighted sum of these two counts according to a first preset weight and a second preset weight, and determine that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold.

When the feature word corresponding to an element of the feature word vector is a sensitive word, that element can be set to 0.
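The feature word vector of step S101 can be sketched as a binary presence vector. This is a minimal illustration, not the patented implementation: the function name `feature_word_vector`, the fixed `vocabulary` list, and the English tokens are assumptions introduced here; the source only states that each element indicates presence and that elements for sensitive words can be set to 0.

```python
from typing import AbstractSet, Iterable, List


def feature_word_vector(
    tokens: Iterable[str],
    vocabulary: List[str],
    sensitive_words: AbstractSet[str] = frozenset(),
) -> List[int]:
    """Return a 0/1 vector aligned with `vocabulary` (hypothetical helper).

    An element is 1 when the word occurs in `tokens`, except that
    elements for sensitive words are forced to 0, as the text allows.
    """
    present = set(tokens)
    return [
        1 if (word in present and word not in sensitive_words) else 0
        for word in vocabulary
    ]


# Illustrative data only; a real system would use segmented Chinese text.
vocab = ["apple", "price", "protest", "city"]
tokens = ["protest", "city", "city"]

plain = feature_word_vector(tokens, vocab)
masked = feature_word_vector(tokens, vocab, {"protest"})
print(plain)   # [0, 0, 1, 1]
print(masked)  # [0, 0, 0, 1]
```

Masking the sensitive word gives the "text without the sensitive word" representation that the later similarity step compares each candidate sense against.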

By vectorizing the text to be detected, this embodiment achieves effective semantic constraints; at the same time, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, it can accurately detect the public opinion events that need attention, greatly reducing the probability of misjudgments and missed detections.

As an optional solution of this embodiment, the following step precedes step S101:

S100. Construct the semantic knowledge base from web page content.

Constructing the semantic knowledge base makes it possible to annotate ambiguous public opinion sensitive words, which gives semantic support to the analysis and detection of public opinion events and provides a basis for finding the correct meaning of the sensitive words in the text to be detected. Public opinion feature words are often a direct reflection of public opinion, but the same feature word can carry different meanings in different contexts, so such ambiguous feature words often cause false positives in text filtering preprocessing. With the semantic knowledge base supplying an accurate description of each sense, the meaning a word expresses in a specific context can be identified.

The vectors corresponding to the feature words stored in the semantic knowledge base are obtained by training the deep learning tool word2vec on text that has been preprocessed by word segmentation. Each segmented word (that is, each feature word in the text to be detected) can be effectively represented by a vector of a certain dimension, as shown in the table below.

Specifically, the web page content is stored in an XML format file.

For example, the web page content is Wikipedia.

Wikipedia is one of the largest online encyclopedias. It uses the wiki mechanism of collaborative online editing and is high in quality, broad in coverage, continuously updated, and semi-structured, which makes it a high-quality corpus source for building a semantic knowledge base. For ambiguous words in Wikipedia in particular, the senses that reflect public opinion characteristics are manually annotated to support subsequent early-warning analysis. The Wikipedia corpus in XML format is taken as input: the description of each word is extracted from it, each entry is checked for whether it is an ambiguous word or a redirect and whether traditional-to-simplified Chinese conversion is needed, the summary (introduction) section is retained, and sensitive feature words are annotated.
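The extraction pass described above can be sketched with the Python standard library. This is a simplified illustration under stated assumptions: the inline `SAMPLE` mimics the page/title/redirect/revision/text layout of a wiki export but omits the XML namespace that real dumps carry, the `{{disambig` marker test for ambiguous entries is a hypothetical heuristic, and traditional-to-simplified conversion and manual sensitive-sense annotation are left out.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for an XML-format corpus (assumed layout, no namespace).
SAMPLE = """<mediawiki>
  <page>
    <title>Apple</title>
    <revision><text>{{disambig}}
Apple may refer to a fruit or a company.</text></revision>
  </page>
  <page>
    <title>Apple Inc</title>
    <redirect title="Apple Inc." />
    <revision><text>#REDIRECT [[Apple Inc.]]</text></revision>
  </page>
  <page>
    <title>Pear</title>
    <revision><text>A pear is a fruit.

== History ==
Details.</text></revision>
  </page>
</mediawiki>"""


def parse_pages(xml_text):
    """Extract title, redirect target, ambiguity flag and summary per page."""
    entries = []
    for page in ET.fromstring(xml_text).iter("page"):
        title = page.findtext("title", default="")
        text = page.findtext("revision/text", default="") or ""
        redirect = page.find("redirect")
        # Keep only the lead section: everything before the first heading.
        summary = text.split("\n==", 1)[0].strip()
        entries.append({
            "title": title,
            "redirect_to": redirect.get("title") if redirect is not None else None,
            # Assumed marker for ambiguous (disambiguation) entries.
            "ambiguous": "{{disambig" in text.lower(),
            "summary": summary,
        })
    return entries


entries = parse_pages(SAMPLE)
for entry in entries:
    print(entry["title"], entry["ambiguous"], entry["redirect_to"])
```

For a full-size dump, `ET.iterparse` would avoid loading the whole file into memory; the in-memory parse is used here only to keep the sketch short.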

With the rich semantic knowledge of Wikipedia, public opinion sensitive words can be added automatically and the range of public opinion events that can be characterized is expanded, which helps users better grasp public opinion trends and formulate countermeasures.

Further, the following step follows step S100:

S1001. Establish a sensitive lexicon according to the semantic knowledge base and the sensitive senses of preset feature words.
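One way to picture step S1001 is as a filter over the knowledge base entries. The sketch below is an assumption-laden illustration: the `SenseEntry` record, the `build_sensitive_lexicon` function, and the toy vectors are hypothetical names and data, chosen to mirror the elements the text lists for each feature word (the word, its sense, and its vector).

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class SenseEntry:
    """One sense of one feature word in the knowledge base (hypothetical)."""
    word: str
    sense: str
    vector: Tuple[float, ...]


def build_sensitive_lexicon(
    knowledge_base: List[SenseEntry],
    preset_sensitive_senses: Dict[str, Set[str]],
) -> Dict[str, List[SenseEntry]]:
    """Keep only the senses that were preset as sensitive for each word."""
    lexicon: Dict[str, List[SenseEntry]] = {}
    for entry in knowledge_base:
        if entry.sense in preset_sensitive_senses.get(entry.word, set()):
            lexicon.setdefault(entry.word, []).append(entry)
    return lexicon


# Illustrative entries; the vectors are made up, not trained embeddings.
kb = [
    SenseEntry("apple", "fruit", (0.9, 0.1)),
    SenseEntry("apple", "company", (0.1, 0.9)),
    SenseEntry("pear", "fruit", (0.8, 0.2)),
]
preset = {"apple": {"company"}}

lexicon = build_sensitive_lexicon(kb, preset)
print({word: [e.sense for e in senses] for word, senses in lexicon.items()})
```

A word has a sensitive sense exactly when it appears as a key of the resulting lexicon, which corresponds to the "whether the current feature word has a sensitive sense" element.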

When the text to be detected is processed, sensitive words can be handled sentence by sentence. Specifically, the feature words in the feature word vector of each sentence of the text to be detected are matched against the vectors corresponding to the feature words in the semantic knowledge base, and the similarity between the senses of different feature words, as well as the similarity of each sense to the text to be detected, is calculated. The higher the similarity, the closer a sense is to the word's true meaning in the text, so that sense is selected for the sensitive word. An optimization method is used to obtain the exact meaning of each ambiguous word in the text at the maximum of the objective function. The calculation formula is as follows:

$$\max f(w_i)$$

$$f(w_i) = f(w_{i+1}) + \mathrm{Sim}(w_i, w_{i+1}) + \mathrm{Sim}(w_i, doc_i)$$

s.t.

$$w_i \in \{v_1, v_2, \ldots, v_m\}$$

$$doc_i = (w_1, w_2, \ldots, w_n), \quad w_i = 0$$

Here $w_i$ denotes a feature word in the text to be detected; $f(w_i)$ denotes the semantic similarity value from word $w_i$ to the last word of the sentence; $doc_i$ is the vector representation of the text with the sensitive word removed, that is, with the element at the corresponding position set to 0; $v_1, v_2, \ldots$ are the vectors corresponding to a feature word: an unambiguous word has one vector, and an ambiguous word has several. $\mathrm{Sim}(w_i, w_{i+1})$ computes the similarity between adjacent sensitive words, and $\mathrm{Sim}(w_i, doc_i)$ computes the similarity between a sensitive word and the text. Because both words and text are represented by word vectors, cosine similarity can be used as the similarity function.
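The recurrence for $f(w_i)$ lends itself to a backward dynamic program over the candidate senses of each word. The sketch below is one possible reading, not the patent's definitive algorithm: the function names, the boundary condition (the last word contributes only its text similarity), and the assumption that each $doc_i$ is supplied as a dense vector of the same dimension as the sense vectors are all choices made here for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(candidates: List[List[Vector]], doc_vectors: List[Vector]) -> List[int]:
    """Pick one sense index per word so that the summed similarities are maximal.

    candidates[i] holds the sense vectors of word i (one for unambiguous words);
    doc_vectors[i] is the text vector with word i removed (assumed given).
    """
    n = len(candidates)
    if n == 0:
        return []
    best = [[0.0] * len(c) for c in candidates]   # best[i][k] = f with sense k
    nxt = [[-1] * len(c) for c in candidates]     # chosen sense of word i + 1
    for i in range(n - 1, -1, -1):
        for k, v in enumerate(candidates[i]):
            score = cosine(v, doc_vectors[i])
            if i + 1 < n:
                options = [
                    best[i + 1][j] + cosine(v, u)
                    for j, u in enumerate(candidates[i + 1])
                ]
                j_best = max(range(len(options)), key=options.__getitem__)
                score += options[j_best]
                nxt[i][k] = j_best
            best[i][k] = score
    k = max(range(len(candidates[0])), key=best[0].__getitem__)
    path = []
    for i in range(n):
        path.append(k)
        k = nxt[i][k]
    return path


# Word 0 is ambiguous (two senses); word 1 has a single sense.
senses = [[(1.0, 0.0), (0.0, 1.0)], [(0.0, 1.0)]]
docs = [(0.0, 1.0), (0.0, 1.0)]
print(disambiguate(senses, docs))  # [1, 0]
```

In the toy input the second sense of the first word agrees with both its neighbour and the text vector, so it is selected. The sketch assumes every word has at least one candidate vector.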

For example, when public opinion events are detected in a text to be detected, as shown in Fig. 2, word segmentation and stop-word removal can first be performed on the text. Word segmentation means dividing the sentences of the text to be detected into feature words; stop-word removal means deleting stop words, such as "meanwhile" and "in addition", from the text to be detected.

Then, word2vec is used to obtain, from the semantic knowledge base and the sensitive lexicon, the vectors of the sensitive senses in the text to be detected, in preparation for the similarity calculation between adjacent words in each sentence of the text;

Next, the sensitive sense vectors of each feature word are compared for similarity with the vectors corresponding to the other feature words and with the feature word vector of the text to be detected. The senses at which the similarity reaches its maximum are taken, which yields the sensitive sense that fits both the other words and the text to be detected and thereby determines the specific meaning of the feature word in the text;

Finally, a weighted sum is taken over the named entities and sensitive senses in the text; if it exceeds a certain threshold, the event is judged to be a public opinion event that requires an early warning. Here, "named entities" refers to the number of feature words in the text to be detected.
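The final decision step reduces to a two-term weighted sum compared against a threshold. In this sketch the weights `w_sense` and `w_feature` and the threshold value are placeholders chosen for illustration; the source only says they are preset and does not give values.

```python
def weighted_score(
    sensitive_sense_count: int,
    feature_word_count: int,
    w_sense: float,
    w_feature: float,
) -> float:
    """Weighted sum of the two counts (weights are preset, values assumed)."""
    return w_sense * sensitive_sense_count + w_feature * feature_word_count


def is_public_opinion_event(
    sensitive_sense_count: int,
    feature_word_count: int,
    w_sense: float = 0.7,      # hypothetical first preset weight
    w_feature: float = 0.3,    # hypothetical second preset weight
    threshold: float = 2.0,    # hypothetical threshold
) -> bool:
    """Flag the text when the weighted sum is strictly greater than the threshold."""
    score = weighted_score(sensitive_sense_count, feature_word_count, w_sense, w_feature)
    return score > threshold


print(is_public_opinion_event(3, 2))  # True
print(is_public_opinion_event(1, 1))  # False
```

The comparison is strict because the text says the weighted sum must be greater than the threshold; a score equal to the threshold is not flagged.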

This embodiment uses the different senses of the feature words and the annotated information of all feature words in the text to be detected to perform semantic recognition by supervised learning. It avoids the false detections that result from relying on keyword matching alone, so that public opinion events are identified accurately and early warnings are issued for those that require them.

Fig. 3 shows a schematic structural diagram of a public opinion event detection device provided by an embodiment of the invention, including:

a feature word vector acquisition module 31, configured to obtain a feature word vector of the text to be detected, where each element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;

a corresponding vector acquisition module 32, configured to obtain the vectors corresponding to all feature words from the semantic knowledge base and to obtain sensitive sense vectors from the sensitive lexicon, where the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is the current sensitive sense;

a similarity calculation module 33, configured to calculate the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, where the feature word vectors corresponding to all feature words include all sensitive sense vectors;

an event detection module 34, configured to obtain the first sensitive sense, namely the one corresponding to the maximum similarity, to obtain the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected, to calculate the weighted sum of these two counts according to a first preset weight and a second preset weight, and to determine that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold.

By vectorizing the text to be detected, this embodiment achieves effective semantic constraints; at the same time, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, it can accurately detect the public opinion events that need attention, greatly reducing the probability of misjudgments and missed detections.

As an optional solution of this embodiment, the device further includes:

a semantic knowledge base construction module, configured to construct the semantic knowledge base from web page content.

Specifically, the web page content is stored in an XML format file.

For example, the web page content is Wikipedia.

Further, the device also includes:

a sensitive lexicon establishment module, configured to establish a sensitive lexicon according to the semantic knowledge base and the sensitive senses of preset feature words.

Numerous specific details are set forth in the description of the invention. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.

Claims (10)

1. A public opinion event detection method, characterized by comprising:
obtaining a feature word vector of a text to be detected, wherein each element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;
obtaining, from a semantic knowledge base, the vectors corresponding to all feature words, and obtaining sensitive sense vectors from a sensitive lexicon, wherein the elements of the vector corresponding to a feature word comprise the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and wherein a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is the current sensitive sense;
calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, wherein the feature word vectors corresponding to all feature words include all sensitive sense vectors;
obtaining the first sensitive sense corresponding to the maximum similarity, and obtaining the number of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; calculating, according to a first preset weight and a second preset weight, a weighted sum of the number of the first sensitive sense and the number of feature words; and determining that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold.

2. The method according to claim 1, characterized in that, before obtaining the feature word vector of the text to be detected, the method comprises:
constructing the semantic knowledge base from web page content.

3. The method according to claim 2, characterized in that the web page content is stored in an XML format file.

4. The method according to claim 3, characterized in that the web page content is Wikipedia.

5. The method according to claim 4, characterized in that, after constructing the semantic knowledge base from web page content, the method comprises:
establishing the sensitive lexicon according to the semantic knowledge base and the sensitive senses of preset feature words.

6. A public opinion event detection device, characterized by comprising:
a feature word vector acquisition module, configured to obtain a feature word vector of a text to be detected, wherein each element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;
a corresponding vector acquisition module, configured to obtain, from a semantic knowledge base, the vectors corresponding to all feature words, and to obtain sensitive sense vectors from a sensitive lexicon, wherein the elements of the vector corresponding to a feature word comprise the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and wherein a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is the current sensitive sense;
a similarity calculation module, configured to calculate the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, wherein the feature word vectors corresponding to all feature words include all sensitive sense vectors;
an event detection module, configured to obtain the first sensitive sense corresponding to the maximum similarity, and to obtain the number of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; to calculate, according to a first preset weight and a second preset weight, a weighted sum of the number of the first sensitive sense and the number of feature words; and to determine that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold.

7. The device according to claim 6, characterized by further comprising:
a semantic knowledge base construction module, configured to construct the semantic knowledge base from web page content.

8. The device according to claim 7, characterized in that the web page content is stored in an XML format file.

9. The device according to claim 8, characterized in that the web page content is Wikipedia.

10. The device according to claim 9, characterized by further comprising:
a sensitive lexicon establishment module, configured to establish the sensitive lexicon according to the semantic knowledge base and the sensitive senses of preset feature words.
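
The detection flow recited in claims 1 and 6 can be illustrated with a minimal Python sketch. It is not the patentees' implementation, and several points are assumptions made only for illustration: the claims do not name a similarity measure, so cosine similarity is used; the "number of the first sensitive sense" is taken to be the number of occurrences of the feature word that carries that sense; a text whose best-matching sense is not sensitive is treated as a non-event; and the names `SenseEntry`, `binary_vector`, `cosine`, `detect`, the toy vocabulary, the weights and the threshold are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseEntry:
    """One knowledge-base record: a feature word, one of its senses, and that sense's vector."""
    word: str
    sense: str
    sensitive: bool   # True when the sense is listed in the sensitive lexicon
    vector: tuple     # binary vector over the shared feature-word vocabulary


def binary_vector(tokens, vocabulary):
    """Element i is 1 if vocabulary[i] appears in the text, else 0."""
    present = set(tokens)
    return tuple(1 if word in present else 0 for word in vocabulary)


def cosine(a, b):
    """Cosine similarity; an assumed measure, since the claims do not name one."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect(tokens, vocabulary, entries, w_sense, w_feature, threshold):
    """Return (is_event, best_entry, weighted_sum) for a tokenized text."""
    if not entries:
        return False, None, 0.0
    text_vec = binary_vector(tokens, vocabulary)
    # Pick the sense whose vector is most similar to the text vector.
    best = max(entries, key=lambda e: cosine(text_vec, e.vector))
    if not best.sensitive:
        # Assumption: a non-sensitive best match means no public opinion event.
        return False, best, 0.0
    vocab = set(vocabulary)
    n_sense = tokens.count(best.word)                 # occurrences of the word carrying the sense
    n_feature = sum(1 for t in tokens if t in vocab)  # occurrences of any feature word
    score = w_sense * n_sense + w_feature * n_feature
    return score > threshold, best, score


# Hypothetical toy data: one ambiguous feature word with two senses.
VOCABULARY = ("bomb", "explosion", "police", "movie", "box", "office")
ENTRIES = [
    SenseEntry("bomb", "explosive device", True, (1, 1, 1, 0, 0, 0)),
    SenseEntry("bomb", "box-office failure", False, (1, 0, 0, 1, 1, 1)),
]

is_event, best, score = detect(
    "bomb explosion police bomb".split(), VOCABULARY, ENTRIES,
    w_sense=0.6, w_feature=0.4, threshold=2.0,
)
```

In this toy run the text vector matches the sensitive sense exactly, so the weighted sum is computed and compared with the threshold. A text such as "the movie was a bomb at the box office" instead matches the non-sensitive sense and is rejected before any counting takes place.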
CN201610197073.3A 2016-03-14 2016-03-31 Public opinion event detection method and device Active CN107193796B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610144761 2016-03-14
CN2016101447613 2016-03-14

Publications (2)

Publication Number Publication Date
CN107193796A (en) 2017-09-22
CN107193796B (en) 2021-12-24

Family

ID=59870838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610197073.3A Active CN107193796B (en) 2016-03-14 2016-03-31 Public opinion event detection method and device

Country Status (1)

Country Link
CN (1) CN107193796B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102096680A (en) * 2009-12-15 2011-06-15 北京大学 Method and device for analyzing information validity
CN103605692A (en) * 2013-11-04 2014-02-26 北京奇虎科技有限公司 Device and method used for shielding advertisement contents in ask-and-answer community
CN103605691A (en) * 2013-11-04 2014-02-26 北京奇虎科技有限公司 Device and method used for processing issued contents in social network
CN104820629A (en) * 2015-05-14 2015-08-05 中国电子科技集团公司第五十四研究所 Intelligent system and method for emergently processing public sentiment emergency
CN104899230A (en) * 2014-03-07 2015-09-09 上海市玻森数据科技有限公司 Public opinion hotspot automatic monitoring system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HASSAN SAYYADI ET AL.: "A Graph Analytical Approach for Topic Detection", 《ACM TRANSACTIONS ON INTERNET TECHNOLOGY》 *
曹坚峰: "面向公共危机预警的网络舆情分析研究", 《中国博士学位论文全文数据库-信息科技辑》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992471A (en) * 2017-11-10 2018-05-04 北京光年无限科技有限公司 Information filtering method and device in a kind of interactive process
CN108647335A (en) * 2018-05-12 2018-10-12 苏州华必讯信息科技有限公司 Internet public opinion analysis method and apparatus
CN109214407A (en) * 2018-07-06 2019-01-15 阿里巴巴集团控股有限公司 Event detection model, method and device, computing equipment and storage medium
CN109214407B (en) * 2018-07-06 2022-04-19 创新先进技术有限公司 Event detection model, method and device, computing equipment and storage medium
CN109472018A (en) * 2018-09-26 2019-03-15 深圳壹账通智能科技有限公司 Enterprise's public sentiment monitoring method, device, computer equipment and storage medium
CN109344258B (en) * 2018-11-28 2021-11-12 中国电子科技网络信息安全有限公司 Intelligent self-adaptive sensitive data identification system and method
CN109344258A (en) * 2018-11-28 2019-02-15 中国电子科技网络信息安全有限公司 A kind of intelligent self-adaptive sensitive data identifying system and method
CN110674251A (en) * 2019-08-21 2020-01-10 杭州电子科技大学 Computer-assisted secret point annotation method based on semantic information
CN110516166A (en) * 2019-08-30 2019-11-29 北京明略软件系统有限公司 Public sentiment event-handling method, device, processing equipment and storage medium
CN110516166B (en) * 2019-08-30 2022-10-25 北京明略软件系统有限公司 Public opinion event processing method, device, processing equipment and storage medium
CN110727880A (en) * 2019-10-18 2020-01-24 西安电子科技大学 Sensitive corpus detection method based on word bank and word vector model
CN110727880B (en) * 2019-10-18 2022-06-17 西安电子科技大学 Sensitive corpus detection method based on word bank and word vector model
CN110807319A (en) * 2019-10-31 2020-02-18 北京奇艺世纪科技有限公司 Text content detection method and device, electronic equipment and storage medium
CN110807319B (en) * 2019-10-31 2023-07-25 北京奇艺世纪科技有限公司 Text content detection method, detection device, electronic equipment and storage medium
CN113505221A (en) * 2020-03-24 2021-10-15 国家计算机网络与信息安全管理中心 Enterprise false propaganda risk identification method, device and storage medium
CN113505221B (en) * 2020-03-24 2024-03-12 国家计算机网络与信息安全管理中心 Enterprise false propaganda risk identification method, equipment and storage medium
CN114091441A (en) * 2021-09-28 2022-02-25 马上消费金融股份有限公司 Text detection method and device and computer readable storage medium

Also Published As

Publication number Publication date
CN107193796B (en) 2021-12-24

Similar Documents

Publication Publication Date Title
CN107193796A (en) A kind of public sentiment event detecting method and device
CN110516067B (en) Public opinion monitoring method, system and storage medium based on topic detection
JP5936698B2 (en) Word semantic relation extraction device
US8380489B1 (en) System, methods, and data structure for quantitative assessment of symbolic associations in natural language
CN110674317B (en) A method and device for entity linking based on graph neural network
US9792277B2 (en) System and method for determining the meaning of a document with respect to a concept
WO2017118427A1 (en) Webpage training method and device, and search intention identification method and device
RU2704531C1 (en) Method and apparatus for analyzing semantic information
CN107239439A (en) Public sentiment sentiment classification method based on word2vec
Wang et al. A Neural Model for Joint Event Detection and Summarization.
US9632998B2 (en) Claim polarity identification
CN105760363B (en) Word sense disambiguation method and device for text files
CN113986864A (en) Log data processing method, device, electronic device and storage medium
CN105989047A (en) Acquisition device, acquisition method, training device and detection device
CN111859979A (en) Sarcastic text collaborative recognition method, apparatus, device, and computer-readable medium
US20220366135A1 (en) Extended open information extraction system
Gautam et al. Hindi word sense disambiguation using lesk approach on bigram and trigram words
Qiu et al. Detecting geo-relation phrases from web texts for triplet extraction of geographic knowledge: A context-enhanced method
AlShammari et al. Aspect-based Sentiment Analysis and Location Detection for Arabic Language Tweets.
CN113988085B (en) Text semantic similarity matching method and device, electronic equipment and storage medium
Lee et al. QA-It: classifying non-referential it for question answer pairs
Arfat et al. Bangla Misleading Clickbait Detection Using Ensemble Learning Approach
Li et al. Attention-based LSTM-CNNs for uncertainty identification on Chinese social media texts
KR102625347B1 (en) A method for extracting food menu nouns using parts of speech such as verbs and adjectives, a method for updating a food dictionary using the same, and a system for the same
Mamatha et al. Supervised aspect category detection of co-occurrence data using conditional random fields

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230619

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New Founder Holdings Development Co., Ltd.

Patentee after: Peking University

Patentee after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: 100871, fangzheng building, 298 Fu Cheng Road, Beijing, Haidian District

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Peking University

Patentee before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.