CN107193796A

CN107193796A - Public opinion event detection method and device

Info

Publication number: CN107193796A
Application number: CN201610197073.3A
Authority: CN
Inventors: 蔡慧慧; 刘克松; 张丹; 于晓明; 杨建武
Original assignee: Peking University; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Peking University; Beijing Founder Electronics Co Ltd
Priority date: 2016-03-14
Filing date: 2016-03-31
Publication date: 2017-09-22
Anticipated expiration: 2036-03-31
Also published as: CN107193796B

Abstract

The invention discloses a public opinion event detection method and device. The method includes: obtaining the feature word vector of a text to be detected; obtaining the vectors corresponding to all feature words and obtaining sensitive sense vectors; calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words; obtaining the first sensitive sense, which corresponds to the maximum similarity, together with the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; calculating, according to a first preset weight and a second preset weight, the weighted sum of the number of first sensitive senses and the number of feature words; and, when the weighted sum is greater than a threshold, determining that the event described in the text to be detected is a public opinion event. By vectorizing the text to be detected, the invention achieves effective semantic constraints; by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, it can accurately detect the public opinion events that need attention.

Description

A method and device for detecting public opinion events

Technical field

The invention relates to the field of computer technology, and in particular to a public opinion event detection method and device.

Background

With the rapid development of the Internet, online public opinion is becoming a forum in which ordinary people express their interests, advocate social fairness and justice, and continuously convey their shared concerns to governments at all levels. More and more people are willing to post their views and the phenomena they observe on the Internet, and dissemination over the network draws more people in, which has a major impact on the mood of Internet users and on social stability. Using modern science and technology to detect public opinion events accurately is therefore of great significance.

At present, the detection of public opinion events still relies on semantic matching against a set of sensitive words. Moreover, named entity words associated with public opinion events, such as personal names, transliterated foreign names, and abbreviated organization names, indicate public opinion only when they appear in the context of the associated event. For named entities that share a name, the meaning has to be analyzed against the background of the current public opinion event, and a traditional static corpus may not contain the latest explanatory sense of such ambiguous feature words. The traditional filtering method based on public opinion feature words (sensitive words, named entities, and so on) remains an important preprocessing step because its mechanism is simple and it executes efficiently. However, when faced with massive Internet text, especially fragmented and non-standard social media content, this filtering mechanism lacks effective semantic constraints and produces a certain number of false positives; it easily causes misjudgments and missed judgments and cannot accurately identify the public opinion events that need attention. In big-data applications for network public opinion early warning, it feeds a considerable amount of noisy data into subsequent processing, so a data preprocessing mechanism with semantic understanding capability is urgently needed.

Summary of the invention

Because traditional feature word filtering lacks effective semantic constraints when faced with massive Internet text, it easily causes misjudgments and missed judgments and cannot accurately detect the public opinion events that need attention. To address this problem, the present invention proposes a public opinion event detection method and device.

In a first aspect, the present invention proposes a public opinion event detection method, including:

obtaining the feature word vector of a text to be detected, where each element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;

obtaining the vectors corresponding to all feature words from a semantic knowledge base, and obtaining sensitive sense vectors from a sensitive lexicon, where the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is the current sensitive sense;

calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, where the feature word vectors corresponding to all feature words include all sensitive sense vectors;

obtaining the first sensitive sense, which corresponds to the maximum similarity, and obtaining the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; calculating, according to a first preset weight and a second preset weight, the weighted sum of the number of first sensitive senses and the number of feature words; and determining that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold.
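
The decision rule in the last step can be written compactly. The symbols are introduced here for illustration only and do not appear in the original text: n_s is the number of occurrences of the first sensitive sense, n_f is the number of feature words, α and β are the first and second preset weights, and θ is the threshold.

```latex
S = \alpha\, n_s + \beta\, n_f,
\qquad
\text{the text describes a public opinion event} \iff S > \theta
```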

Preferably, before obtaining the feature word vector of the text to be detected, the method includes:

constructing the semantic knowledge base from web page content.

Preferably, the web page content is stored in an XML file.

Preferably, the web page content is Wikipedia.

Preferably, after constructing the semantic knowledge base from web page content, the method includes:

establishing a sensitive lexicon according to the semantic knowledge base and the sensitive senses of preset feature words.

In a second aspect, the present invention also proposes a public opinion event detection device, including:

a feature word vector acquisition module, configured to obtain the feature word vector of a text to be detected, where each element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;

a corresponding vector acquisition module, configured to obtain the vectors corresponding to all feature words from a semantic knowledge base and to obtain sensitive sense vectors from a sensitive lexicon, where the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is the current sensitive sense;

a similarity calculation module, configured to calculate the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, where the feature word vectors corresponding to all feature words include all sensitive sense vectors;

an event detection module, configured to obtain the first sensitive sense, which corresponds to the maximum similarity, and to obtain the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; to calculate, according to a first preset weight and a second preset weight, the weighted sum of the number of first sensitive senses and the number of feature words; and to determine that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold.

Preferably, the device further includes:

a semantic knowledge base construction module, configured to construct the semantic knowledge base from web page content.

Preferably, the device further includes:

a sensitive lexicon establishment module, configured to establish a sensitive lexicon according to the semantic knowledge base and the sensitive senses of preset feature words.

As can be seen from the above technical solution, the present invention achieves effective semantic constraints by vectorizing the text to be detected; at the same time, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, it can accurately detect the public opinion events that need attention, greatly reducing the probability of misjudgments and missed judgments.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a public opinion event detection method provided by an embodiment of the present invention;

Fig. 2 is a flowchart of a public opinion event detection method provided by an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a public opinion event detection device provided by an embodiment of the present invention.

Detailed description

Specific embodiments of the invention are further described below with reference to the drawings. The following embodiments serve only to illustrate the technical solution of the present invention more clearly and do not limit its scope of protection.

Fig. 1 shows a schematic flow chart of a public opinion event detection method provided by an embodiment of the present invention, including:

S101. Obtain the feature word vector of the text to be detected, where each element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;

S102. Obtain the vectors corresponding to all feature words from the semantic knowledge base, and obtain sensitive sense vectors from the sensitive lexicon, where the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is the current sensitive sense;

S103. Calculate the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, where the feature word vectors corresponding to all feature words include all sensitive sense vectors;

S104. Obtain the first sensitive sense, which corresponds to the maximum similarity, and obtain the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; calculate, according to the first preset weight and the second preset weight, the weighted sum of the number of first sensitive senses and the number of feature words; and determine that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold.

When the feature word corresponding to an element of the feature word vector is a sensitive word, the corresponding element may be set to 0.
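
The presence vector of step S101, including the optional zeroing of sensitive words, might be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `feature_word_vector`, the fixed `vocabulary` ordering, and the sample words are assumptions made for the example.

```python
from typing import AbstractSet, Iterable, List, Sequence


def feature_word_vector(
    tokens: Iterable[str],
    vocabulary: Sequence[str],
    sensitive_words: AbstractSet[str] = frozenset(),
) -> List[int]:
    """Return a 0/1 vector aligned with `vocabulary` (a hypothetical word list).

    An element is 1 when the corresponding feature word occurs in the text,
    unless that word is sensitive, in which case the element is forced to 0.
    """
    present = set(tokens)
    return [
        1 if word in present and word not in sensitive_words else 0
        for word in vocabulary
    ]


# Illustrative data only.
vocabulary = ["protest", "city", "apple", "market"]
tokens = ["apple", "market", "city", "apple"]
print(feature_word_vector(tokens, vocabulary))             # [0, 1, 1, 1]
print(feature_word_vector(tokens, vocabulary, {"apple"}))  # [0, 1, 0, 1]
```

The second call shows the masked form of the vector, in which the position of the sensitive word is set to 0 even though the word occurs in the text.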

By vectorizing the text to be detected, this embodiment achieves effective semantic constraints; at the same time, by calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, it can accurately detect the public opinion events that need attention, greatly reducing the probability of misjudgments and missed judgments.

As an optional solution of this embodiment, the following step is performed before step S101:

S100. Construct the semantic knowledge base from web page content.

Constructing a semantic knowledge base makes it possible to annotate ambiguous sensitive words, which provides semantic support for analyzing and detecting public opinion events and a basis for finding the correct meaning of the sensitive words in the text to be detected. Public opinion feature words are often a direct reflection of public opinion, yet they can carry different meanings in different contexts, so such ambiguous feature words often introduce false positives into text filtering preprocessing. With the descriptions provided by the semantic knowledge base, the meaning a word expresses in a specific context can be identified.

The vectors corresponding to the feature words stored in the semantic knowledge base are obtained by training the deep learning tool word2vec on text that has been preprocessed by word segmentation. Each segmented word (that is, a feature word in the text to be detected) can be effectively represented by a vector of a certain dimension, as shown in a table in the original specification.
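
One way to picture a knowledge base record with the four elements listed in step S102 is sketched below. The class name `SenseEntry`, the helper `group_by_word`, and the two-dimensional toy vectors are assumptions made for illustration; real entries would carry vectors produced by word2vec training.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SenseEntry:
    feature_word: str           # the current feature word
    has_sensitive_sense: bool   # whether the word has any sensitive sense
    sense: str                  # label of the current sense
    vector: Tuple[float, ...]   # vector for this sense (toy values here)


def group_by_word(entries: List[SenseEntry]) -> Dict[str, List[SenseEntry]]:
    """Index entries by feature word; ambiguous words map to several entries."""
    index: Dict[str, List[SenseEntry]] = {}
    for entry in entries:
        index.setdefault(entry.feature_word, []).append(entry)
    return index


# Illustrative data only.
entries = [
    SenseEntry("apple", True, "company", (0.9, 0.1)),
    SenseEntry("apple", True, "fruit", (0.1, 0.9)),
    SenseEntry("market", False, "marketplace", (0.5, 0.5)),
]
index = group_by_word(entries)
print(len(index["apple"]), len(index["market"]))  # 2 1
```

Storing one entry per sense keeps an ambiguous word's alternatives side by side, which is what the later disambiguation step needs to choose among.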

Specifically, the web page content is stored in an XML file.

For example, the web page content is Wikipedia.

Wikipedia is one of the largest online encyclopedias. It uses the wiki mechanism of collaborative online editing and is high in quality, broad in coverage, continuously updated, and semi-structured, which makes it a high-quality corpus for building a semantic knowledge base. For the ambiguous words in Wikipedia in particular, the senses that reflect public opinion characteristics are annotated manually to support subsequent early-warning analysis. The Wikipedia corpus in XML format is taken as input; the description of each word is extracted from it, the word is checked to see whether it is an ambiguous word or a redirect and whether it needs traditional-to-simplified Chinese conversion, the summary introduction is retained, and sensitive feature words are annotated.
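
The extraction pass described above might look roughly like the following sketch, which uses Python's standard `xml.etree.ElementTree`. The inline `SAMPLE` is a simplified, namespace-free stand-in for a Wikipedia dump, and detecting ambiguity through a `{{disambiguation}}` marker is an assumption of this example. Traditional-to-simplified conversion and manual annotation are left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Simplified sample; real dumps declare an XML namespace and carry more fields.
SAMPLE = """<mediawiki>
  <page>
    <title>Apple</title>
    <revision><text>Apple may refer to a fruit or a company. {{disambiguation}}</text></revision>
  </page>
  <page>
    <title>Apples</title>
    <redirect title="Apple" />
    <revision><text>#REDIRECT [[Apple]]</text></revision>
  </page>
  <page>
    <title>Market</title>
    <revision><text>A market is a place where goods are traded.

== History ==
Details omitted.</text></revision>
  </page>
</mediawiki>"""


def extract_pages(xml_text: str) -> List[Dict[str, object]]:
    """Extract title, summary, and redirect/ambiguity flags for each page."""
    pages = []
    for page in ET.fromstring(xml_text).iter("page"):
        text = page.findtext("revision/text") or ""
        redirect = page.find("redirect")
        pages.append({
            "title": page.findtext("title"),
            "is_redirect": redirect is not None,
            "redirect_target": redirect.get("title") if redirect is not None else None,
            # Assumed marker for ambiguous (disambiguation) pages.
            "is_ambiguous": "{{disambiguation}}" in text.lower(),
            # Keep only the lead section: everything before the first heading.
            "summary": text.split("\n==", 1)[0].strip(),
        })
    return pages


for record in extract_pages(SAMPLE):
    print(record["title"], record["is_redirect"], record["is_ambiguous"])
```

For a full dump, a streaming parser such as `ET.iterparse` would avoid loading the whole file into memory; the one-shot parse here only keeps the example short.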

With the rich semantic knowledge of Wikipedia, sensitive words can be added automatically and the range of public opinion events that can be represented is expanded, which helps users follow public opinion trends and formulate countermeasures.

Further, the following step is performed after step S100:

S1001. Establish a sensitive lexicon according to the semantic knowledge base and the sensitive senses of preset feature words.
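
Step S1001 can be pictured as selecting, from the knowledge base, the vectors of the senses that have been preset as sensitive. The sketch below assumes a knowledge base shaped as word → sense label → vector; that layout, the function name `build_sensitive_lexicon`, and the sample data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

KnowledgeBase = Dict[str, Dict[str, List[float]]]


def build_sensitive_lexicon(
    knowledge_base: KnowledgeBase,
    preset_sensitive_senses: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], List[float]]:
    """Map (feature word, sensitive sense) to the sense vector from the knowledge base."""
    lexicon: Dict[Tuple[str, str], List[float]] = {}
    for word, senses in preset_sensitive_senses.items():
        for sense in senses:
            vector = knowledge_base.get(word, {}).get(sense)
            if vector is not None:  # skip senses the knowledge base lacks
                lexicon[(word, sense)] = vector
    return lexicon


# Illustrative data only.
knowledge_base = {
    "apple": {"company": [0.9, 0.1], "fruit": [0.1, 0.9]},
    "market": {"marketplace": [0.5, 0.5]},
}
preset = {"apple": {"company"}, "bank": {"institution"}}
print(build_sensitive_lexicon(knowledge_base, preset))
# {('apple', 'company'): [0.9, 0.1]}
```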

When the text to be detected is processed, sensitive words may be handled sentence by sentence. Specifically, the feature words in the feature word vector of each sentence of the text to be detected are matched against the vectors corresponding to the feature words in the semantic knowledge base. The similarity between the senses of different feature words and the similarity between each sense and the text to be detected are calculated; the higher the similarity, the closer the sense is to the word's true meaning in the text, and that sense is then assigned to the sensitive word. An optimization method is used to obtain the exact meaning of each ambiguous word in the text at the maximum of the objective function. The calculation is as follows:

$$
\begin{aligned}
\max\ & f(w_i) \\
& f(w_i) = f(w_{i+1}) + \mathrm{Sim}(w_i, w_{i+1}) + \mathrm{Sim}(w_i, doc_i) \\
\text{s.t.}\ & w_i \in \{v_1, v_2, \ldots, v_m\} \\
& doc_i = (w_1, w_2, \ldots, w_n),\quad w_i = 0
\end{aligned}
$$

Here, w_i denotes a feature word in the text to be detected; f(w_i) denotes the semantic similarity value from word w_i to the last word of the sentence; doc_i is the vector representation of the text after the sensitive word is removed, that is, with the element at the corresponding position set to 0; and v_1, v_2, … are the vectors corresponding to a feature word: an unambiguous word has a single vector representation, whereas an ambiguous word has several. Sim(w_i, w_{i+1}) is the function that computes the similarity between adjacent sensitive words, and Sim(w_i, doc_i) is the function that computes the similarity between a sensitive word and the text. Because both words and texts are represented by word vectors, cosine similarity can be used as the similarity function.
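
The text does not name the optimization method. Because the recursion links only adjacent words, one natural reading is a backward dynamic program (Viterbi style) over the candidate senses, sketched below with cosine similarity. Three assumptions are made for the example: f is taken to be 0 beyond the last word, a single `doc_vector` in the same space as the sense vectors stands in for every doc_i, and each word has at least one candidate vector. The names `cosine` and `disambiguate` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def disambiguate(candidates: List[List[Vector]], doc_vector: Vector) -> List[int]:
    """Pick one sense index per word so that f(w_1) is maximal.

    candidates[i] holds the sense vectors of the i-th word: one vector for an
    unambiguous word, several for an ambiguous one.
    """
    n = len(candidates)
    if n == 0:
        return []
    # score[i][k]: best value of f(w_i) when word i takes sense k.
    score = [[0.0] * len(c) for c in candidates]
    # nxt[i][k]: sense of word i+1 that achieves score[i][k].
    nxt = [[0] * len(c) for c in candidates]
    for i in range(n - 1, -1, -1):
        for k, v in enumerate(candidates[i]):
            best, arg = 0.0, 0
            if i + 1 < n:
                best, arg = max(
                    (score[i + 1][j] + cosine(v, u), j)
                    for j, u in enumerate(candidates[i + 1])
                )
            score[i][k] = best + cosine(v, doc_vector)
            nxt[i][k] = arg
    # Backtrack from the best sense of the first word.
    k = max(range(len(candidates[0])), key=lambda j: score[0][j])
    chosen = [k]
    for i in range(n - 1):
        k = nxt[i][k]
        chosen.append(k)
    return chosen


# Illustrative data: an ambiguous word followed by an unambiguous one.
candidates = [[(1.0, 0.0), (0.0, 1.0)], [(0.0, 1.0)]]
print(disambiguate(candidates, (0.0, 1.0)))  # [1, 0]
```

In the toy run, the second sense of the first word agrees with both its neighbor and the document vector, so it is the one selected. The cost is linear in the number of words and quadratic in the number of senses per word.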

For example, when public opinion events are detected from the text to be detected, as shown in Fig. 2, the text is first segmented into words and stripped of stop words. Word segmentation splits the sentences of the text to be detected into feature words, and stop word removal deletes stop words such as "同时" ("meanwhile") and "另外" ("in addition") from the text.
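
The stop word step can be sketched as a simple filter over tokens that have already been segmented. Chinese word segmentation itself requires a dedicated segmenter and is outside this example; the names `STOP_WORDS` and `remove_stop_words` and the sample tokens are assumptions.

```python
from typing import AbstractSet, Iterable, List

# Two illustrative stop words: "同时" (meanwhile) and "另外" (in addition).
STOP_WORDS = frozenset({"同时", "另外"})


def remove_stop_words(
    tokens: Iterable[str],
    stop_words: AbstractSet[str] = STOP_WORDS,
) -> List[str]:
    """Drop empty tokens and stop words, preserving the original order."""
    return [t for t in tokens if t and t not in stop_words]


# Assumed output of a segmenter: meanwhile / company / publish / in addition / notice.
segmented = ["同时", "公司", "发布", "另外", "公告"]
print(remove_stop_words(segmented))  # ['公司', '发布', '公告']
```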

Then, word2vec is used to obtain, from the semantic knowledge base and the sensitive lexicon, the vectors of the sensitive senses in the text to be detected, so that similarity can subsequently be calculated for adjacent words in the sentences of the text to be detected.

Next, the sensitive sense vectors of each feature word are compared for similarity with the vectors corresponding to the other feature words and with the feature word vector of the text to be detected. The senses at which the similarity reaches its maximum are taken, which yields sensitive senses that fit both the other words and the text to be detected and determines the specific meaning of each feature word in the text.

Finally, a weighted sum is computed over the named entities and the sensitive senses in the text; if it is greater than a certain threshold, the event is judged to be a public opinion event that requires an early warning. Here, the named entity term refers to the number of feature words in the text to be detected.
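
The final decision reduces to a weighted sum compared against a threshold. In the sketch below, the function name, the parameter names, and the numeric weights and threshold are illustrative assumptions; the text says only that both weights and the threshold are preset.

```python
def is_public_opinion_event(
    sensitive_sense_count: int,
    feature_word_count: int,
    sense_weight: float,
    feature_weight: float,
    threshold: float,
) -> bool:
    """Return True when the weighted sum strictly exceeds the threshold."""
    weighted_sum = (
        sense_weight * sensitive_sense_count
        + feature_weight * feature_word_count
    )
    return weighted_sum > threshold


# Illustrative values only.
print(is_public_opinion_event(3, 10, sense_weight=0.7, feature_weight=0.1, threshold=2.5))  # True
print(is_public_opinion_event(1, 4, sense_weight=0.7, feature_weight=0.1, threshold=2.5))   # False
```

The comparison is strict, matching the wording "greater than a threshold": a weighted sum exactly equal to the threshold does not trigger a detection.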

This embodiment performs semantic recognition by supervised learning, using the different senses of the feature words and the annotations of all feature words in the text to be detected. It avoids the false detections that arise when public opinion events are detected by keyword matching alone, so that public opinion events are identified accurately and an early warning is issued for those that require one.

Fig. 3 shows a schematic structural diagram of a public opinion event detection device provided by an embodiment of the present invention, including:

a feature word vector acquisition module 31, configured to obtain the feature word vector of the text to be detected, where each element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;

a corresponding vector acquisition module 32, configured to obtain the vectors corresponding to all feature words from the semantic knowledge base and to obtain sensitive sense vectors from the sensitive lexicon, where the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is the current sensitive sense;

a similarity calculation module 33, configured to calculate the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, where the feature word vectors corresponding to all feature words include all sensitive sense vectors;

an event detection module 34, configured to obtain the first sensitive sense, which corresponds to the maximum similarity, and to obtain the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; to calculate, according to the first preset weight and the second preset weight, the weighted sum of the number of first sensitive senses and the number of feature words; and to determine that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold.

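As an optional solution of this embodiment, the device further includes a semantic knowledge base construction module, configured to construct the semantic knowledge base from web page content.

Further, the device also includes a sensitive lexicon establishment module, configured to establish the sensitive lexicon according to the semantic knowledge base and the sensitive senses of preset feature words.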

Numerous specific details are set forth in the description of the present invention. It should be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.

Claims

1. A public opinion event detection method, characterized by comprising:

obtaining the feature word vector of a text to be detected, where each element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;

obtaining the vectors corresponding to all feature words from a semantic knowledge base, and obtaining sensitive sense vectors from a sensitive lexicon, where the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is the current sensitive sense;

calculating the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, where the feature word vectors corresponding to all feature words include all sensitive sense vectors;

obtaining the first sensitive sense, which corresponds to the maximum similarity, and obtaining the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; calculating, according to a first preset weight and a second preset weight, the weighted sum of the number of first sensitive senses and the number of feature words; and determining that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold.

2. The method according to claim 1, characterized in that, before obtaining the feature word vector of the text to be detected, the method comprises:

constructing the semantic knowledge base from web page content.

3. The method according to claim 2, wherein the web page content is stored in an XML file.

4. The method according to claim 3, wherein the web page content is Wikipedia.

5. The method according to claim 4, characterized in that, after constructing the semantic knowledge base from web page content, the method comprises:

establishing a sensitive lexicon according to the semantic knowledge base and the sensitive senses of preset feature words.

6. A public opinion event detection device, characterized by comprising:

a feature word vector acquisition module, configured to obtain the feature word vector of a text to be detected, where each element of the feature word vector indicates whether the corresponding feature word appears in the text to be detected;

a corresponding vector acquisition module, configured to obtain the vectors corresponding to all feature words from a semantic knowledge base and to obtain sensitive sense vectors from a sensitive lexicon, where the elements of the vector corresponding to a feature word include the current feature word, whether the current feature word has a sensitive sense, the current sense of the current feature word, and the feature word vector corresponding to the current feature word, and a sensitive sense vector indicates that the sense in the vector corresponding to the current feature word is the current sensitive sense;

a similarity calculation module, configured to calculate the similarity between the feature word vector of the text to be detected and the feature word vectors corresponding to all feature words, where the feature word vectors corresponding to all feature words include all sensitive sense vectors;

an event detection module, configured to obtain the first sensitive sense, which corresponds to the maximum similarity, and to obtain the number of occurrences of the first sensitive sense in the text to be detected and the number of feature words in the text to be detected; to calculate, according to a first preset weight and a second preset weight, the weighted sum of the number of first sensitive senses and the number of feature words; and to determine that the event described in the text to be detected is a public opinion event when the weighted sum is greater than a threshold.

7. The device according to claim 6, further comprising:

a semantic knowledge base construction module, configured to construct the semantic knowledge base from web page content.

8. The device according to claim 7, wherein the web page content is stored in an XML file.

9. The device according to claim 8, wherein the web page content is Wikipedia.

10. The device according to claim 9, further comprising:

a sensitive lexicon establishment module, configured to establish a sensitive lexicon according to the semantic knowledge base and the sensitive senses of preset feature words.