CN102609427A

CN102609427A - Public opinion vertical search analysis system and method

Info

Publication number: CN102609427A
Application number: CN2011103549731A
Authority: CN
Inventors: 饶国政; 贾彪; 冯志勇
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2011-11-10
Filing date: 2011-11-10
Publication date: 2012-07-25

Abstract

The invention relates to network information processing technology and discloses a public opinion vertical search and analysis system for text-based network public opinion search and analysis. The system comprises a vertical search engine crawler module, a template-based information extraction module, a text tendency analysis module based on phrase extraction, and a text tendency analysis module based on the vocabulary statistics mode. Compared with the prior art, the sentiment tendency algorithms based on the phrase pattern and the vocabulary statistics mode adopted by the invention improve accuracy by about 5 percentage points, a relatively clear improvement. In addition, the multi-threaded design improves processing efficiency, so public opinion search and analysis becomes faster and more accurate.

Description

Public opinion vertical search analysis system and method

Technical field

The invention relates to network information processing technology, and in particular to a system and method for searching and analyzing network public opinion.

Background art

The main technologies involved in the present invention include:

1. Key technologies related to network public opinion monitoring

(1) Network public opinion collection and extraction: Network public opinion forms and spreads mainly through news sites, forums/BBS, blogs, instant messaging software and similar channels. These channels are carried mainly by dynamic web pages that hold loosely structured information, which makes effective extraction of public opinion information difficult. Methods that fully automatically generate a web page information extraction Wrapper achieve extraction and integration of dynamic web page data to some extent, with a certain level of accuracy and extraction efficiency.

(2) Network public opinion topic discovery and tracking: Netizens discuss a wide range of topics covering every aspect of society. How to find hot and sensitive topics in massive amounts of information and track how they change over time has become a research focus.

(3) Network public opinion tendency analysis: Tendency analysis identifies the subjective content expressed by people communicating online, such as feelings, attitudes, opinions, positions and intentions. Analyzing the tendency of a public opinion text amounts to using a computer to derive the sentiment direction of the author from the content of the text.

(4) Multi-document automatic summarization: News items, posts, blog posts and similar pages all contain junk information. Multi-document automatic summarization filters the page content and condenses it into summary information that is convenient to query and retrieve.

2. Information extraction technology

The workflow of a vertical search engine is as follows: a spider crawls web pages; the pages are classified and information is extracted, that is, the unstructured data of each page is extracted into specific structured data; the data is stored in a database and processed further (deduplication, analysis and comparison, and so on); finally, a word-segmentation index makes the data searchable by users. The most critical step in this workflow is extracting unstructured data into structured data as required, which is also the biggest difference between vertical search engines and general search engines.

At present there are two main ways to extract structured information:

(1) Structured information extraction at the web page library level

Structured data is extracted automatically by combining page structure analysis with intelligent node analysis and transformation. This approach can extract from any normal web page, is fully automatic, and achieves high extraction accuracy. However, because it must generalize well, it is technically difficult to implement, with high up-front development cost and a long development cycle, so it is suited only to high-end applications.

(2) Template method

The template method analyzes the web page structure of each data source in advance and performs template matching for each structure. Specific regular expressions in the extraction template collect information precisely from a limited set of websites. This approach is relatively simple to implement: a template can easily be configured for the page structure of a data source, and it offers high accuracy, good real-time performance, and quick deployment. However, the maintenance burden is large when information sources are diverse or unstable, so the method is suited to relatively fixed, limited information sources.

3. Semantic-based text tendency research methods

At present there are two main semantic-based methods for studying text tendency.

(1) The first method extracts the adjectives, or the phrases that carry subjective coloring, from the text to be analyzed, judges the tendency of each extracted adjective or phrase and assigns it a tendency value, and finally sums all of these values to obtain the overall tendency of the text. That is:

1) The linguistic constraints of conjunctions that join adjectives are used to judge whether the two joined adjectives express the same sentiment, and clustering is then used to obtain two classes of adjectives expressing opposite sentiment tendencies. Turney et al. use the PMI_IR (Pointwise Mutual Information and Information Retrieval) method to estimate the similarity between a phrase and reference words representing the two sentiment poles (such as "good" and "bad"), with similarity computed by pointwise mutual information. Another class of methods for judging word tendency relies on an existing ontology knowledge base, such as WordNet for English or HowNet for Chinese, to compute the semantic distance between the word under evaluation and selected reference word pairs, and from that its tendency.

2) The word similarity and semantic relevance field calculations provided by HowNet are used to compute the correlation between the word under evaluation and pre-selected groups of commendatory and derogatory reference word pairs, giving the tendency of the word.
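For reference, the semantic orientation score in the PMI-IR approach mentioned in item 1) is commonly written as shown here. The symbols $w^{+}$ and $w^{-}$ for the positive and negative reference words, and the name $SO$, are notation introduced for this illustration and do not appear in the patent text.

$PMI(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}$

$SO(\mathrm{phrase}) = PMI(\mathrm{phrase}, w^{+}) - PMI(\mathrm{phrase}, w^{-})$

A phrase with $SO > 0$ leans toward the positive reference word and one with $SO < 0$ toward the negative reference word. In PMI-IR the probabilities are estimated from search engine hit counts rather than from a fixed corpus.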

(2) The second method builds a library of tendency semantic patterns in advance, sometimes accompanied by a tendency dictionary. The document under evaluation is matched against the semantic pattern library, and the tendency values of all matched patterns are summed to obtain the tendency of the whole document. Liu Yongdan et al. applied existing semantic analysis techniques to tendency judgment, expressing the semantic relations in a text with a simplified case grammar and semantic frames before analyzing its tendency. Zheng Yu et al. combined a tendency dictionary with semantic rule matching to filter tendentious text.

Summary of the invention

Building on the above prior art, the present invention proposes a public opinion vertical search and analysis system and method. In a Web 2.0 network environment, it implements web page crawling based on a breadth-first search strategy combined with web-topology and keyword filtering algorithms, and analysis based on the semantic tendency of text (in particular, information sentiment tendency based on the phrase pattern and the vocabulary statistics mode), so as to achieve fast and in-depth vertical search and analysis of public opinion.

The present invention proposes a public opinion vertical search and analysis system for text-based network public opinion search and analysis. The system comprises a vertical search engine crawler module, a template-based information extraction module, a text tendency analysis module based on phrase extraction, and a text tendency analysis module based on the vocabulary statistics mode, wherein:

The vertical search engine crawler module uses a crawler algorithm that combines filtering based on network topology and page content keywords with breadth-first web page crawling to selectively search for and download Internet pages related to the public opinion topic;

The template-based information extraction module extracts structured data from the web page source code and stores it in the database in the required fixed form;

The text tendency analysis module based on phrase extraction obtains structured information using the phrase extraction mode and performs tendency analysis on the structured text corpus to obtain its final tendency Sensibility(Text). The processing of this module comprises:

The sentiment tendency weights of word A and word B are denoted Sensibility(A) and Sensibility(B);

Determine whether word A and word B appear in the "degree adverb" and "negative adverb" word lists:

If neither word A nor word B appears in them, the sentiment tendency weight of the phrase is

Sensibility(A+B) = Sensibility(A) + Sensibility(B);

If word A appears in the "negative adverb" list, the head word of the phrase is word B; the sentiment weight of word B is computed as Sensibility(B), and the sentiment weight of the phrase is Sensibility(A+B) = (-1)×Sensibility(B);

Conversely, if word B appears in the "negative adverb" list, the head word of the phrase is word A, and the sentiment weight of the phrase is Sensibility(A+B) = (-1)×Sensibility(A);

If word A appears in the "degree adverb" list, the head word of the phrase is word B; with level(A) denoting the degree multiple of word A as a degree adverb, the sentiment weight of the phrase is

Sensibility(A+B) = level(A)×Sensibility(B);

Conversely, if word B appears in the "degree adverb" list, with level(B) denoting the degree multiple of word B as a degree adverb, the sentiment weight of the phrase is

Sensibility(A+B) = level(B)×Sensibility(A);
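The phrase rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the lexicon contents, the names `NEGATION_ADVERBS`, `DEGREE_ADVERBS`, `WORD_SENTIMENT`, `sensibility` and `phrase_sensibility`, and the choice to test the negation list before the degree list are all hypothetical, since the text does not specify them.

```python
# Hypothetical lexicons for illustration only; the patent does not list their contents.
NEGATION_ADVERBS = {"not", "never"}
DEGREE_ADVERBS = {"very": 2.0, "slightly": 0.5}  # word -> level(word)
WORD_SENTIMENT = {"good": 1.0, "bad": -1.0, "clean": 0.5}


def sensibility(word: str) -> float:
    """Sentiment weight of a single word; unknown words are treated as neutral."""
    return WORD_SENTIMENT.get(word, 0.0)


def phrase_sensibility(a: str, b: str) -> float:
    """Sentiment weight of the two-word phrase A+B.

    Assumption: negation is checked before degree; the source text does not
    say which rule wins when both could apply.
    """
    if a in NEGATION_ADVERBS:          # head word is B
        return -1.0 * sensibility(b)
    if b in NEGATION_ADVERBS:          # head word is A
        return -1.0 * sensibility(a)
    if a in DEGREE_ADVERBS:            # head word is B, scaled by level(A)
        return DEGREE_ADVERBS[a] * sensibility(b)
    if b in DEGREE_ADVERBS:            # head word is A, scaled by level(B)
        return DEGREE_ADVERBS[b] * sensibility(a)
    # Neither word is a negation or degree adverb: weights add.
    return sensibility(a) + sensibility(b)
```

With these sample lexicons, `phrase_sensibility("very", "bad")` scales the weight of "bad" by 2.0, and `phrase_sensibility("not", "good")` flips the sign of the weight of "good".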

Compute separately the sums of the weights of all phrases with a commendatory tendency and of all phrases with a derogatory tendency, denoted Positive(words) and Negative(words) respectively:

The sentiment weights of all phrases whose weight is less than 0 are summed to give the derogatory phrase weight

$\mathrm{Negative}(\mathrm{words}) = \sum_{i=0}^{N} \mathrm{Sensibility}(A+B)_i, \quad \text{if } \mathrm{Sensibility}(A+B)_i < 0;$

The sentiment weights of all phrases whose weight is greater than or equal to 0 are summed to give the commendatory and neutral phrase weight

$\mathrm{Positive}(\mathrm{words}) = \sum_{j=0}^{N} \mathrm{Sensibility}(A+B)_j, \quad \text{if } \mathrm{Sensibility}(A+B)_j \geq 0;$

The final tendency of the text corpus is denoted Sensibility(Text), where

$\mathrm{Sensibility}(\mathrm{Text}) = \frac{\mathrm{Negative}(\mathrm{words}) + \mathrm{Positive}(\mathrm{words})}{\mathrm{Positive}(\mathrm{words}) - \mathrm{Negative}(\mathrm{words})};$

If Sensibility(Text) < 0, the text has a derogatory tendency; if Sensibility(Text) >= 0, the text has a commendatory tendency or is neutral;
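As a worked illustration of the formulas above, suppose a text yields three phrases with weights 1.0, -0.5 and 2.0 (hypothetical values chosen for this example, not taken from the patent):

$\mathrm{Negative}(\mathrm{words}) = -0.5, \quad \mathrm{Positive}(\mathrm{words}) = 1.0 + 2.0 = 3.0$

$\mathrm{Sensibility}(\mathrm{Text}) = \frac{-0.5 + 3.0}{3.0 - (-0.5)} = \frac{2.5}{3.5} \approx 0.71 \geq 0$

so the text would be classed as commendatory or neutral. Because Negative(words) is never positive and Positive(words) is never negative, the denominator equals the sum of the absolute phrase weights, and the score therefore lies between -1 and 1 whenever it is defined. The formula is undefined when every phrase weight is zero, a case the text does not address.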

The text tendency analysis module based on the vocabulary statistics mode performs the system's information source and negative tendency analysis and obtains the sentiment tendency value of the text Text. Its processing comprises:

Read in the text Text and split it into clauses at punctuation marks, labelled S1, S2, ..., Sn;

Search S1 for all attitude words with a clear semantic tendency; the attitude words searched for may be adjectives, adverbs, nouns, verbs, idioms and so on. Compute the sentiment weight of each attitude word with the vocabulary sentiment calculation module, and add up the weights of all attitude words in S1 to obtain the clause's total attitude word weight V1;

Search S1 for degree words contained in the degree adverb dictionary; when a degree word is present, multiply the attitude weight V1 by the degree multiple level() of that adverb in the degree dictionary, giving level()×V1;

Once S1 has been processed, move to the next clause S2 of Text and repeat the preceding three steps to obtain the total attitude word weight V2 of clause S2;

After the total attitude word weight Vn of the last clause has been computed, compute the sum of the positive clause weights, Positive(Sentences), and the sum of the negative clause weights, Negative(Sentences):

$\mathrm{Negative}(\mathrm{Sentences}) = \sum_{i=0}^{N_0} V_i, \quad \text{if } V_i < 0;$

$\mathrm{Positive}(\mathrm{Sentences}) = \sum_{j=0}^{N_0} V_j, \quad \text{if } V_j \geq 0;$

Finally, the final text tendency is computed as:

$\mathrm{Sensibility}(\mathrm{Text}) = \frac{\mathrm{Negative}(\mathrm{Sentences}) + \mathrm{Positive}(\mathrm{Sentences})}{\mathrm{Positive}(\mathrm{Sentences}) - \mathrm{Negative}(\mathrm{Sentences})}.$
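The clause-by-clause procedure above can be sketched in Python as follows. This is a rough illustration under several assumptions that are not in the patent: the lexicon contents and all function names are hypothetical, words are split on whitespace (Chinese text would need a word segmenter), every degree adverb found in a clause multiplies the clause weight, and a text whose clause weights are all zero is reported as 0.0 to avoid division by zero.

```python
import re
from typing import List

# Hypothetical lexicons for illustration only.
ATTITUDE_WORDS = {"good": 1.0, "excellent": 2.0, "bad": -1.0, "terrible": -2.0}
DEGREE_ADVERBS = {"very": 2.0, "slightly": 0.5}  # word -> level(word)


def split_clauses(text: str) -> List[str]:
    """Split the text into clauses S1..Sn at ASCII or full-width punctuation."""
    parts = re.split(r"[,.;:!?，。；：！？]+", text)
    return [p.strip() for p in parts if p.strip()]


def clause_weight(clause: str) -> float:
    """Sum the attitude word weights of one clause, then apply degree multiples."""
    words = clause.lower().split()  # assumption: whitespace tokenization
    v = sum(ATTITUDE_WORDS.get(w, 0.0) for w in words)
    for w in words:
        if w in DEGREE_ADVERBS:
            v *= DEGREE_ADVERBS[w]
    return v


def text_sensibility(text: str) -> float:
    """Combine clause weights into the final tendency Sensibility(Text)."""
    weights = [clause_weight(c) for c in split_clauses(text)]
    negative = sum(v for v in weights if v < 0)
    positive = sum(v for v in weights if v >= 0)
    denominator = positive - negative
    if denominator == 0:
        return 0.0  # assumption: no attitude words at all is treated as neutral
    return (negative + positive) / denominator
```

For the sentence "The food is very good, but the service is bad." the sketch yields clause weights 2.0 and -1.0, so the score is (-1.0 + 2.0) / (2.0 + 1.0) = 1/3, which falls in the commendatory or neutral class.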

The web page crawling strategies include a depth-first search strategy, a breadth-first search strategy, and a best-first search strategy.

The web page crawling involves four attributes: the crawl URL, the crawl depth, the number of pages to crawl, and the URL filter identifier;

The crawling operation of the vertical search engine crawler module is multi-threaded and includes the following processing:

When a thread finishes downloading a page, the links contained in the downloaded page are submitted to the parsing-buffer thread pool and added to the buffer queue awaiting download; the thread pool calls the parser to parse the page, extract its URLs, and add them to the URL record. URLs on the network form a tree structure in which nodes at different levels may carry the same URL, because the same URL can be parsed out of many different pages. The crawler therefore records every parsed URL and checks each URL before recording it: if the parsed URL already exists in the record it is skipped; otherwise it is added to the unprocessed record. When the crawler searches, it first processes the initial URL; after parsing, it obtains a queue of URLs for the next level. The crawler then downloads these URLs in their default queue order, analyzes the pages, and parses out new URLs; processed URLs are placed in the processed record queue. URLs of the next level are crawled only after all pages of the current level have been crawled;
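The level-by-level crawl with duplicate checking described above can be sketched in Python as follows. This is only an illustration under assumptions: `crawl_breadth_first` and its parameters are hypothetical names, the download-and-parse step is abstracted into a caller-supplied `fetch_links` function so the sketch runs without network access, and `concurrent.futures.ThreadPoolExecutor` stands in for the patent's thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Set


def crawl_breadth_first(
    seed_urls: Iterable[str],
    fetch_links: Callable[[str], List[str]],
    max_depth: int,
    max_workers: int = 4,
) -> List[List[str]]:
    """Return the URLs processed at each depth, one level at a time."""
    seen: Set[str] = set()          # URL record used to skip duplicates
    current_level: List[str] = []
    for url in seed_urls:
        if url not in seen:
            seen.add(url)
            current_level.append(url)

    levels: List[List[str]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(max_depth + 1):
            if not current_level:
                break
            levels.append(current_level)
            # Download and parse every page of this level in parallel.
            # pool.map yields results in the order of current_level, and the
            # list() call waits until the whole level has finished.
            link_lists = list(pool.map(fetch_links, current_level))
            next_level: List[str] = []
            for links in link_lists:
                for link in links:
                    if link not in seen:   # already recorded: skip
                        seen.add(link)
                        next_level.append(link)
            current_level = next_level     # only now move one level deeper
    return levels
```

A toy link graph such as `{"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}` can be passed in as `lambda u: graph.get(u, [])` to try the function; the URL "c" is then visited once even though two pages link to it.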

In the text tendency analysis module of the vocabulary statistics mode, the negation words in clause S1 that appear in the negative adverb list are counted; when the count is odd, the attitude weight V1 of the clause is converted to -(1-t)×V1, where t is a fuzzy value that depends on the position of the negation word relative to the degree adverb: the fuzzy value is greater than zero when the negation word comes first and less than zero when the negation word comes after, and the fuzzy value is set between 0.2 and 0.4.
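The negation adjustment can be sketched as a small Python function. The name `apply_negation` is hypothetical, and two readings are assumed because the text leaves them open: an even number of negation words leaves the weight unchanged, and the stated range of 0.2 to 0.4 is taken as the magnitude of t, with the caller choosing its sign from the word order.

```python
def apply_negation(v: float, negation_count: int, t: float) -> float:
    """Apply the odd-negation rule -(1 - t) * V to a clause weight.

    v: attitude weight of the clause before negation is considered.
    negation_count: number of negation adverbs found in the clause.
    t: fuzzy value; assumed positive when the negation word precedes the
       degree adverb and negative when it follows.
    """
    if negation_count % 2 == 1:
        return -(1.0 - t) * v
    # Assumption: an even count (including zero) leaves the weight as it is.
    return v
```

Under this reading, a clause with weight 2.0, one negation word placed before the degree adverb, and t = 0.3 gets a softened negative weight of about -1.4, while t = -0.3 strengthens it to about -2.6.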

The present invention also proposes a public opinion vertical search and analysis method, comprising the following steps:

Call the vertical search engine crawler module, which uses a crawler algorithm combining filtering based on network topology and page content keywords with breadth-first web page crawling to selectively search for and download Internet pages related to the public opinion topic;

Extract structured data from the web page source code with the template-based information extraction module and store it in the database in the required fixed form;

Implement two algorithms in the text tendency analysis module, one based on the phrase extraction mode and one based on the vocabulary statistics mode, and apply each to the structured information text to obtain the text sentiment tendency weight;

Obtain structured information based on the phrase extraction mode and perform tendency analysis on the structured text corpus to obtain its final tendency Sensibility(Text):

If Sensibility(Text) < 0, the text has a derogatory tendency; if Sensibility(Text) >= 0, the text has a commendatory tendency or is neutral;

Perform text tendency analysis based on the vocabulary statistics mode, completing the system's information source and negative tendency analysis and obtaining the sentiment tendency value of the text Text:

After the total attitude word weight Vn of the last clause has been computed, compute the sum of the positive clause weights, Positive(Sentences), and the sum of the negative clause weights, Negative(Sentences).

The web page crawling involves four attributes: the crawl URL, the crawl depth, the number of pages to crawl, and the URL filter identifier.

The negation words in a clause S_n that appear in the negative adverb list are counted; when the count is odd, the attitude weight V_n of that clause is converted to -(1-t)×V_n, where t is a fuzzy value that depends on the position of the negation word relative to the degree adverb: the fuzzy value is greater than zero when the negation word comes first and less than zero when the negation word comes after, and the fuzzy value is set between 0.2 and 0.4.

Compared with the prior art, the sentiment tendency algorithms based on the phrase pattern and the vocabulary statistics mode adopted by the invention improve accuracy by about 5 percentage points, a relatively clear improvement. At the same time, the multi-threaded design improves processing efficiency, so public opinion search and analysis becomes faster and more accurate.

Brief description of the drawings

Fig. 1 is a module diagram of the vertical search and analysis system of the present invention;

Fig. 2 is a flow chart of the vertical search crawling module of the present invention;

Fig. 3 is a schematic diagram of a web page containing noise information before template processing by the template-based information extraction module in a specific embodiment of the present invention;

Fig. 4 shows the formatted information result interface after template extraction by the template-based information extraction module in a specific embodiment of the present invention;

Fig. 5 is a flow chart of the text tendency analysis module of the present invention;

Fig. 6 compares the number of texts recognized by the phrase-based and vocabulary-statistics-based text tendency analysis algorithms (sample sizes of 1000, 2000 and 3000 texts);

Fig. 7 compares the recognition accuracy of the phrase-based and vocabulary-statistics-based text tendency analysis algorithms (same sample sizes as above).

Detailed description of the embodiments

Set against the background of public opinion monitoring, the present invention mainly performs the following processing:

Step 1: call the vertical search engine crawler module, which uses a crawler algorithm combining filtering based on network topology and page content keywords with a breadth-first crawling strategy to selectively search for and download Internet pages related to the public opinion topic;

In this step the present invention uses a multi-threaded download design, and the crawling operation includes the following processing of the downloaded pages:

When a thread finishes downloading a page, the links contained in the downloaded page are submitted to the parsing-buffer thread pool and added to the buffer queue awaiting download; the thread pool calls the parser to parse the page, extract its URLs, and add them to the URL record. URLs on the network form a tree structure in which nodes at different levels may carry the same URL, because the same URL can be parsed out of many different pages. The crawler therefore records every parsed URL and checks each URL before recording it: if the parsed URL already exists in the record it is skipped; otherwise it is added to the unprocessed record. When the crawler searches, it first processes the initial URL; after parsing, it obtains a queue of URLs for the next level. The crawler then downloads these URLs in their default queue order, analyzes the pages, and parses out new URLs; processed URLs are placed in the processed record queue. URLs of the next level are crawled only after all pages of the current level have been crawled.

The web page crawling strategies involved in the present invention are: a depth-first search strategy, a breadth-first search strategy, and a best-first search strategy.

This public opinion vertical search system needs to crawl, aggregate, and extract public opinion information from websites including news sites and forums. During link analysis, the two kinds of site differ in network topology. The content of a news page is generally displayed in full on a single URL page; the number of reader comments is small, and even pages with many comments show them on the same URL page, which merely grows longer, unlike forums, which have "next page" link tags. For news sites, link analysis can therefore be defined in terms of "web page depth": every hyperlink in the source of the current page URL is a page at the next depth from the current page.

Step 2: extract structured data from the web page source code with the template-based information extraction module and store it in the database in the required fixed form;

The data fields required by the public opinion vertical search engine system of the present invention are: the URL of the article, the source of the article, the topic (title), the author, the body text, the publication time, the number of comments, and the number of reposts. The invention extracts information using templates, so determining the template format is particularly important. Once the template is determined, the crawler must obtain all of the fields listed above after template matching.

For the crawler itself, four attributes are also required: the initial crawl URL, the crawl depth, the number of pages to crawl, and the URL filter identifier. Combining these with the structured information format, the system defines the template in the data format shown in the following table.

No.   Identifier     Description
1     style          Website type: news or forum
2     authorstart    Start tag of the "author" field
3     authorend      End tag of the "author" field
4     contentstart   Start tag of the "content" field
5     contentend     End tag of the "content" field
6     source         Tag of the "source" field
7     timestart      Start tag of the "time" field
8     timeend        End tag of the "time" field
9     url            Tag of the "initial URL" field
10    ex_url         Tag of the "URL filter" field
11    count(deep)    Crawl depth or number of pages to crawl
12    keyword        Filter keywords
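As a rough sketch of how start and end tags from such a template could drive extraction, the Python code below pulls the text between each pair of markers with a regular expression. The template values, the sample markup they match, and the function names `extract_between` and `extract_record` are hypothetical; only the identifiers `authorstart`, `authorend`, `contentstart`, `contentend`, `timestart` and `timeend` come from the table.

```python
import re
from typing import Dict, Optional

# Hypothetical template for one site; real values depend on the site's markup.
TEMPLATE = {
    "authorstart": '<span class="author">',
    "authorend": "</span>",
    "contentstart": '<div class="content">',
    "contentend": "</div>",
    "timestart": '<span class="time">',
    "timeend": "</span>",
}


def extract_between(html: str, start: str, end: str) -> Optional[str]:
    """Return the text between the first start marker and the next end marker."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def extract_record(html: str, template: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Build one structured record from a page using the template's tag pairs."""
    return {
        field: extract_between(html, template[field + "start"], template[field + "end"])
        for field in ("author", "content", "time")
    }
```

A field whose markers are not found comes back as `None`, which gives a simple signal that the template no longer matches the page structure and needs maintenance.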

第三步，通过文本倾向性分析模块实现两种算法：基于短语抽取模式和基于词汇统计模式，并分别对结构化信息文本进行倾向性分析，得到文本情感权值； The third step is to implement two algorithms through the text tendency analysis module: based on the phrase extraction mode and based on the vocabulary statistics mode, and analyze the tendency of the structured information text respectively to obtain the text sentiment weight;

基于短语抽取模式的文本Text倾向性的具体算法，短语用A+B表示： The specific algorithm of the text text tendency based on the phrase extraction mode, the phrase is represented by A+B:

a) Determine whether word A and word B appear in the "degree adverb" and "negative adverb" tables. If neither does, compute their sentiment tendency weights through the vocabulary emotion calculation module, denoted Sensibility(A) and Sensibility(B);

The sentiment weight of the phrase is then

Sensibility(A+B) = Sensibility(A) + Sensibility(B);

b) If word A appears in the "negative adverb" table, the head word of the phrase is word B. The sentiment weight of word B is computed as Sensibility(B), and the sentiment weight of the phrase is

Sensibility(A+B) = (-1) × Sensibility(B);

Conversely, if word B appears in the "negative adverb" table, the head word of the phrase is word A, and the sentiment weight of the phrase is

Sensibility(A+B) = (-1) × Sensibility(A);

c) If word A appears in the "degree adverb" table, the head word of the phrase is word B. With level(A) denoting the degree multiple of word A as a degree adverb, the sentiment weight of the phrase is

Sensibility(A+B) = level(A) × Sensibility(B);

Conversely, with level(B) denoting the degree multiple of word B as a degree adverb, the sentiment weight of the phrase is

Sensibility(A+B) = level(B) × Sensibility(A);

d) Compute the sums of the phrase weights with commendatory and with derogatory tendency separately, denoted Positive(words) and Negative(words) respectively:

1) Sum the sentiment weights of all phrases whose weight is less than 0 to obtain the derogatory phrase weight:

$\mathrm{Negative}(\mathrm{words}) = \sum_{i=0}^{N} \mathrm{Sensibility}(A+B)_i, \quad \text{if } \mathrm{Sensibility}(A+B)_i < 0;$

2) Sum the sentiment weights of all phrases whose weight is greater than or equal to 0 to obtain the commendatory and neutral phrase weight:

$\mathrm{Positive}(\mathrm{words}) = \sum_{i=0}^{N} \mathrm{Sensibility}(A+B)_i, \quad \text{if } \mathrm{Sensibility}(A+B)_i \ge 0;$

e) The final tendency degree of the text corpus is denoted Sensibility(Text), where

$\mathrm{Sensibility}(\mathrm{Text}) = \frac{\mathrm{Negative}(\mathrm{words}) + \mathrm{Positive}(\mathrm{words})}{\mathrm{Positive}(\mathrm{words}) - \mathrm{Negative}(\mathrm{words})}$

If Sensibility(Text) < 0, the text has a derogatory tendency; if Sensibility(Text) >= 0, the text has a commendatory tendency or is neutral.
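The phrase-mode rules a) to e) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the three lexicons and their values are hypothetical, `word_sentiment` merely stands in for the vocabulary emotion calculation module, and the final ratio follows the Sensibility(Text) formula recited in claim 1. Returning 0 when the denominator is zero is an assumption added for the sketch.

```python
# Hypothetical lexicons; the patent does not list their contents.
NEGATION_ADVERBS = {"not", "never"}
DEGREE_ADVERBS = {"very": 2.0, "slightly": 0.5}   # word -> degree multiple level()
WORD_SENTIMENT = {"good": 1.0, "bad": -1.0, "clean": 0.5, "slow": -0.5}


def word_sentiment(word):
    """Stand-in for the vocabulary emotion calculation module."""
    return WORD_SENTIMENT.get(word, 0.0)


def phrase_sentiment(a, b):
    """Sensibility(A+B) for a two-word phrase, following rules a) to c)."""
    if a in NEGATION_ADVERBS:            # rule b): A negates head word B
        return -1.0 * word_sentiment(b)
    if b in NEGATION_ADVERBS:            # rule b), converse case
        return -1.0 * word_sentiment(a)
    if a in DEGREE_ADVERBS:              # rule c): A scales head word B
        return DEGREE_ADVERBS[a] * word_sentiment(b)
    if b in DEGREE_ADVERBS:              # rule c), converse case
        return DEGREE_ADVERBS[b] * word_sentiment(a)
    return word_sentiment(a) + word_sentiment(b)   # rule a)


def text_sentiment(phrases):
    """Sensibility(Text) from a list of (A, B) phrase pairs, rules d) and e)."""
    weights = [phrase_sentiment(a, b) for a, b in phrases]
    negative = sum(w for w in weights if w < 0)
    positive = sum(w for w in weights if w >= 0)
    denominator = positive - negative
    if denominator == 0:
        return 0.0   # assumption: a text with no sentiment weight is neutral
    return (negative + positive) / denominator
```

With these sample lexicons, the pair ("not", "good") scores -1.0 and ("very", "bad") scores -2.0; a text made of ("very", "good") and ("not", "good") gives (-1 + 2) / (2 + 1), a mildly commendatory result.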

Note: in linguistics, a corpus is a large collection of texts. The texts in it are usually curated and carry a defined format and markup; here the term refers specifically to a digitized corpus stored on a computer.

Step 4: finally, the search crawler algorithm, the information extraction module, and the text tendency analysis module described above are combined with the public opinion website to be analyzed, completing the system's information source and negative tendency analysis functions. That is, the text tendency analysis processing module of the vocabulary statistics mode is implemented:

In the vocabulary statistics mode, the specific algorithm for calculating the sentiment tendency value of a text Text is as follows:

a) Read in the text Text and divide it into clauses at punctuation marks, labeled S1, S2, ..., Sn;

b) Search S1 for all attitude words with a clear semantic tendency. The parts of speech of the attitude words searched include adjectives, adverbs, nouns, verbs, and idioms. Use the vocabulary emotion calculation module to calculate the sentiment weight of each attitude word, and add up the weights of all attitude words in S1 to obtain the sum V1 of all attitude-word weights of the clause;

c) Search S1 for the number of negative words contained in the negative adverb table. When the number of negative words is odd, the attitude weight V1 of clause S1 is converted to -(1-t)×V1, where t is a fuzzy value that depends on the position of the negative word relative to the degree adverb: when the negative word comes first the fuzzy value is greater than zero, when it comes after the fuzzy value is less than zero, and the fuzzy value may be set between 0.2 and 0.4. When the number of negative words is even, this processing is not needed.

d) Search S1 for the degree words contained in the degree adverb dictionary. When degree words are present, multiply the attitude weight V1 by the degree multiple level() of the degree adverb in the degree dictionary, i.e. level()×V1;

e) Once S1 has been calculated, move to the next clause S2 of Text and repeat steps b), c), and d) to obtain the sum V2 of all attitude-word weights of that clause;

f) After the sum Vn of all attitude-word weights of the last clause has been calculated, compute the sum of the positive Vi weights, Positive(Sentences), and the sum of the negative Vi weights, Negative(Sentences).

g) Finally, according to the normalization principle, the final text tendency degree is:

$\mathrm{Sensibility}(\mathrm{Text}) = \frac{\mathrm{Negative}(\mathrm{Sentences}) + \mathrm{Positive}(\mathrm{Sentences})}{\mathrm{Positive}(\mathrm{Sentences}) - \mathrm{Negative}(\mathrm{Sentences})}$
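Steps a) to g) of the vocabulary statistics mode can be sketched in Python as below. Several points are assumptions made only for this illustration: tokens are split on whitespace (Chinese text would need a word segmenter), the lexicons and `FUZZY_MAGNITUDE = 0.3` are hypothetical sample values within the stated 0.2 to 0.4 range, t is taken as 0 when a clause has a negative word but no degree adverb, several degree words are applied as a product, and a zero denominator is treated as a neutral text.

```python
import re

# Hypothetical lexicons; the patent does not list their contents.
ATTITUDE_WORDS = {"good": 1.0, "bad": -1.0, "happy": 1.0, "terrible": -2.0}
NEGATION_WORDS = {"not", "never"}
DEGREE_LEVELS = {"very": 2.0, "slightly": 0.5}
FUZZY_MAGNITUDE = 0.3   # sample magnitude of t, within the 0.2 to 0.4 range


def clause_weight(clause):
    """Attitude weight Vi of one clause, following steps b) to d)."""
    tokens = clause.lower().split()
    v = sum(ATTITUDE_WORDS.get(tok, 0.0) for tok in tokens)          # step b)
    neg_pos = [i for i, tok in enumerate(tokens) if tok in NEGATION_WORDS]
    deg_pos = [i for i, tok in enumerate(tokens) if tok in DEGREE_LEVELS]
    if len(neg_pos) % 2 == 1:                                        # step c)
        if not deg_pos:
            t = 0.0                  # assumption: no degree adverb, no fuzziness
        elif neg_pos[0] < deg_pos[0]:
            t = FUZZY_MAGNITUDE      # negative word before the degree adverb
        else:
            t = -FUZZY_MAGNITUDE     # negative word after the degree adverb
        v = -(1 - t) * v
    for i in deg_pos:                                                # step d)
        v *= DEGREE_LEVELS[tokens[i]]
    return v


def text_tendency(text):
    """Sensibility(Text): split into clauses, then steps e) to g)."""
    clauses = [c for c in re.split(r"[.!?;,。！？；，]+", text) if c.strip()]
    weights = [clause_weight(c) for c in clauses]
    negative = sum(w for w in weights if w < 0)
    positive = sum(w for w in weights if w >= 0)
    denominator = positive - negative
    if denominator == 0:
        return 0.0   # assumption: no attitude words means a neutral text
    return (negative + positive) / denominator
```

Because Negative(Sentences) is never positive and Positive(Sentences) is never negative, the ratio stays within [-1, 1], which is presumably what the normalization principle refers to.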

The following are specific embodiments of the present invention, which further illustrate its technical solution:

1. The preprocessing of the present invention is divided into two parts: the web crawler and the templating of the crawled information.

The specific crawling process:

a) Access the URL database, read the entry URLs, and build an in-memory access queue

b) Find an idle HTTP download module, assign it a URL, and start the download task

c) The HTTP download module accesses the Internet, retrieves the page content, and puts it into the result queue

d) Save the content to the web page database in preparation for subsequent indexing and other operations

e) The link analysis module extracts the new links in the page and stores them in the URL database to await download

f) Repeat the above process until all downloads are complete.
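The loop in steps a) to f) can be sketched as a breadth-first crawl. The Python sketch below is single-threaded and runs against an in-memory stand-in for the web so that it executes offline; the `fetch` callback, the `FAKE_WEB` dictionary, the `href` regular expression, and the `max_pages` limit are all assumptions of this illustration, and a real system would use concurrent HTTP download modules and persistent URL and page databases.

```python
import re
from collections import deque

LINK_PATTERN = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    queue = deque(seed_urls)        # step a) in-memory access queue
    seen = set(seed_urls)           # URLs already recorded
    pages = {}                      # stands in for the web page database
    while queue and len(pages) < max_pages:
        url = queue.popleft()       # step b) hand the next URL to a downloader
        html = fetch(url)           # step c) download the page
        if html is None:
            continue
        pages[url] = html           # step d) save the page
        for link in LINK_PATTERN.findall(html):   # step e) link analysis
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages                    # step f) done when the queue is empty


# Hypothetical three-page site used instead of real network access.
FAKE_WEB = {
    "http://example.test/": '<a href="http://example.test/a">A</a> '
                            '<a href="http://example.test/b">B</a>',
    "http://example.test/a": '<a href="http://example.test/b">B</a>',
    "http://example.test/b": '<a href="http://example.test/">home</a>',
}

result = crawl(["http://example.test/"], FAKE_WEB.get)
```

The first-in, first-out queue is what makes the traversal breadth-first: every page of the current level is dequeued before any page discovered on it.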

Each thread filters URLs and page content and picks out the required URLs. It analyzes the page source code, parses out new URLs and puts them into the record list, and, following the defined data structure, extracts the various pieces of information from the page source code and stores them in the database.

After the template file is submitted, the information extraction program reads, in the background, the string corresponding to each tag in the template file. It then applies template matching and filtering to the page source code downloaded by the crawler to obtain the required formatted data, such as "content", "time", "title", "source", "number of comments", and "number of reposts".
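One simple way to realize this kind of template matching is to cut the text between each start tag and end tag. The Python sketch below assumes plain substring search and a template held as a dictionary keyed by the identifiers from the template table; the function names, the sample markup, and the sample page are hypothetical.

```python
def extract_between(html, start_tag, end_tag):
    """Return the text between the first start_tag and the next end_tag."""
    start = html.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)
    end = html.find(end_tag, start)
    if end == -1:
        return None
    return html[start:end].strip()


def extract_record(html, template):
    """Apply the start/end tag pairs of a template to one page."""
    return {
        "author": extract_between(html, template["authorstart"], template["authorend"]),
        "content": extract_between(html, template["contentstart"], template["contentend"]),
        "time": extract_between(html, template["timestart"], template["timeend"]),
    }


# Hypothetical template and page; neither comes from the patent.
template = {
    "authorstart": '<span class="author">', "authorend": "</span>",
    "contentstart": '<div class="content">', "contentend": "</div>",
    "timestart": '<span class="time">', "timeend": "</span>",
}
page = ('<html><body><span class="author">Zhang San</span>'
        '<span class="time">2011-11-10 08:00</span>'
        '<div class="content"> Example post body. </div></body></html>')

record = extract_record(page, template)
```

A missing tag yields `None` for that field rather than an exception, so one malformed page does not stop a batch.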

2. Comparison of text tendency analysis based on the phrase mode and the vocabulary statistics mode

The text tendency analysis algorithm based on the vocabulary statistics mode closes the gaps of the phrase-mode algorithm in how sentiment features are extracted from text: it replaces fixed-rule extraction with a statistical approach and extracts text sentiment features well. Using the sentence as the unit of sentiment analysis is closer to the semantics of the text than using phrase pairs. The effect of negative adverbs is analyzed and implemented better at the sentence level, and the concept of a fuzzy value is introduced, making the analysis results more accurate. The influence of punctuation on sentence semantics is also analyzed and summarized in the algorithm.

The vocabulary-statistics-mode text tendency analysis algorithm was tested on the same experimental data as the phrase-mode algorithm in the previous section. The results are as follows:

Table 2. Sample sizes are 1000, 2000, and 3000 texts.

| Number of files | 1000 | 2000 | 3000 |
|-----------------|------|------|------|
| Texts labeled negative | 1000 | 2000 | 3000 |
| Texts detected as negative | 833 | 1685 | 2443 |
| Accuracy | 83.300% | 84.250% | 81.433% |
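The accuracy row is the number of texts detected as negative divided by the number labeled negative. A short Python check of that arithmetic, using only the figures in Table 2 (the variable names are illustrative):

```python
# Labeled-negative sample size -> texts detected as negative (from Table 2).
results = {1000: 833, 2000: 1685, 3000: 2443}

# Accuracy in percent, rounded to three decimals as in the table.
accuracy = {n: round(100 * hits / n, 3) for n, hits in results.items()}
```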

As Figures 5 and 6 show, on the sentiment corpus the text tendency analysis algorithm based on the vocabulary statistics mode is about 5 percentage points more accurate than the algorithm based on the phrase mode. By comparison, the improvement from the algorithm is fairly evident.

Claims

1. A public opinion vertical search analysis system, applied to text-based network public opinion search and analysis, characterized in that the system comprises a vertical search engine crawler module, a template-based information extraction module, a text tendency analysis module based on phrase extraction, and a text tendency analysis module based on the vocabulary statistics mode, wherein:

The vertical search engine crawler module uses crawler algorithms to selectively search for and download Internet pages related to public opinion topics, through filtering techniques based on network topology and web page content keywords and through breadth-first web page crawling;

The template-based information extraction module extracts structured data from web page source code and stores it in the database in the required fixed form;

The text tendency analysis module based on phrase extraction obtains structured information based on the phrase extraction mode and performs tendency analysis on the text corpus of the structured information to obtain the final tendency degree Sensibility(Text) of the text corpus; the processing of this module includes:

The emotional tendency weights of vocabulary A and vocabulary B are denoted as Sensibility(A) or Sensibility(B);

Determine whether vocabulary A and vocabulary B exist in the "degree adverb" and "negative adverb" vocabulary:

If neither vocabulary A nor vocabulary B exists, the sentiment weight of the phrase is

Sensibility(A+B)=Sensibility(A)+Sensibility(B);

If vocabulary A exists in the "negative adverb" vocabulary, the head word of the phrase is vocabulary B, the sentiment weight of vocabulary B is calculated as Sensibility(B), and the sentiment weight of the phrase is Sensibility(A+B)=(-1)×Sensibility(B);

Conversely, if vocabulary B exists in the "negative adverb" vocabulary, the head word of the phrase is vocabulary A, and the sentiment weight of the phrase is Sensibility(A+B)=(-1)×Sensibility(A);

If vocabulary A exists in the vocabulary of "degree adverbs", then the head word of the phrase is vocabulary B, and level (A) is used to represent the degree multiple of vocabulary A as a degree adverb, and the emotional weight of the phrase

Sensibility(A+B)=level(A)×Sensibility(B);

Conversely, use level(B) to represent the degree multiple of vocabulary B as a degree adverb, and the emotional weight of the phrase

Sensibility(A+B)=level(B)×Sensibility(A);

Calculate the sums of the phrase weights with commendatory and with derogatory tendency separately, denoted Positive(words) and Negative(words) respectively:

Sum the sentiment weights of all phrases whose weight is less than 0 to obtain the derogatory phrase weight

$\mathrm{Negative}(\mathrm{words}) = \sum_{i=0}^{N} \mathrm{Sensibility}(A+B)_i, \quad \text{if } \mathrm{Sensibility}(A+B)_i < 0;$

Sum the sentiment weights of all phrases whose weight is greater than or equal to 0 to obtain the commendatory and neutral phrase weight

$\mathrm{Positive}(\mathrm{words}) = \sum_{i=0}^{N} \mathrm{Sensibility}(A+B)_i, \quad \text{if } \mathrm{Sensibility}(A+B)_i \ge 0;$

The final tendency degree of the text corpus is denoted Sensibility(Text), where

$\mathrm{Sensibility}(\mathrm{Text}) = \frac{\mathrm{Negative}(\mathrm{words}) + \mathrm{Positive}(\mathrm{words})}{\mathrm{Positive}(\mathrm{words}) - \mathrm{Negative}(\mathrm{words})};$

If Sensibility(Text)<0, the text has a derogatory tendency; if Sensibility(Text)>=0, the text has a commendatory tendency or is neutral;

The text tendency analysis module based on the vocabulary statistics mode completes the information source and negative tendency analysis of the system, and obtains the emotional tendency value of the text Text. The specific processing of this module includes:

Read in the text Text, divide it into clauses at punctuation marks, and label them S1, S2, ..., Sn;

Search S1 for all attitude words with a clear semantic tendency, where the parts of speech of the attitude words searched include adjectives, adverbs, nouns, verbs, and idioms; use the vocabulary emotion calculation module to calculate the sentiment weight of each attitude word, and add up the weights of all attitude words in S1 to obtain the sum V1 of all attitude-word weights of the clause;

Search S1 for the number of degree words included in the degree adverb dictionary. When degree words are included, multiply the attitude weight V1 by the degree multiple level() of the degree adverb in the degree dictionary, that is, level()×V1;

After the calculation for S1 is completed, search the next clause S2 of Text and repeat the previous three steps to calculate the sum V2 of all attitude-word weights of clause S2;

Once the sum Vn of all attitude-word weights of the last clause has been calculated, calculate the sum of the positive Vi weights, Positive(Sentences), and the sum of the negative Vi weights, Negative(Sentences), respectively

$\mathrm{Negative}(\mathrm{Sentences}) = \sum_{i=0}^{N_0} V_i, \quad \text{if } V_i < 0;$

$\mathrm{Positive}(\mathrm{Sentences}) = \sum_{j=0}^{N_0} V_j, \quad \text{if } V_j \ge 0;$

Finally, the final text tendency degree is calculated as:

$\mathrm{Sensibility}(\mathrm{Text}) = \frac{\mathrm{Negative}(\mathrm{Sentences}) + \mathrm{Positive}(\mathrm{Sentences})}{\mathrm{Positive}(\mathrm{Sentences}) - \mathrm{Negative}(\mathrm{Sentences})}.$

2. The public opinion vertical search and analysis system according to claim 1, wherein said web page crawling strategy includes a depth-first search strategy, a breadth-first search strategy, and a best-first search strategy.

3. The public opinion vertical search analysis system as claimed in claim 1, characterized in that said web page crawling involves four attributes: the URL to crawl, the crawl depth, the number of pages to crawl, and the URL filter flag.

4. The public opinion vertical search analysis system as claimed in claim 1, characterized in that the crawling operation of said vertical search engine crawler module is multithreaded and comprises the following processing:

When a thread finishes downloading a page, it submits the downloaded page to the parsing buffer thread pool, where the link URLs contained in the page are extracted and added to the buffer queue awaiting download; the thread pool calls the parser to parse the page, extract the URLs, and add the parsed URLs to the URL record. URLs form a tree structure in the network, and the URLs of nodes at different levels of the tree may be identical, because the same URL can be parsed from many other pages. The crawler therefore records the parsed URLs and checks each URL before recording it: if the parsed URL already exists in the record it is skipped, otherwise it is added to the unprocessed record. When the crawler searches, it first processes the initial URL; after parsing by the parser it obtains a new level of URL queue, then downloads these URLs in their default order in the queue, analyzes them and parses out new URLs, and puts the processed URLs into the processed record queue. The URLs of the next level are crawled only after all pages of the current level have been crawled.

5. The public opinion vertical search analysis system as claimed in claim 1, characterized in that, in the text tendency analysis module of said vocabulary statistics mode, clause S1 is searched for the number of negative words contained in the negative adverb table; when the number of negative words is odd, the attitude weight V1 of clause S1 is converted to -(1-t)×V1, where t is a fuzzy value related to the position of the negative word relative to the degree adverb: the fuzzy value is greater than zero when the negative word comes first and less than zero when the negative word comes after, and the fuzzy value is set between 0.2 and 0.4.

6. A method for public opinion vertical search analysis, characterized in that the method comprises the following steps:

Call the vertical search engine crawler module, which uses crawler algorithms to selectively search for and download Internet pages related to public opinion topics, through filtering techniques based on network topology and web page content keywords and through breadth-first web page crawling;

Extract the structured data from the source code information of the webpage through the information extraction module based on the template, and store it in the database in the required fixed form;

Two algorithms are implemented through the text tendency analysis module: based on the phrase extraction mode and based on the vocabulary statistics mode, and the tendency analysis is performed on the structured information text respectively to obtain the weight value of the text's emotional tendency;

Based on the phrase extraction mode, the structured information is obtained, and the orientation analysis of the structured information text corpus is carried out respectively, and the final orientation degree Sensibility (Text) of the text corpus is obtained:

Sensibility(A+B)=Sensibility(A)+Sensibility(B);

Sensibility(A+B)=level(A)×Sensibility(B);

Sensibility(A+B)=level(B)×Sensibility(A);

$\mathrm{Negative}(\mathrm{words}) = \sum_{i=0}^{N} \mathrm{Sensibility}(A+B)_i, \quad \text{if } \mathrm{Sensibility}(A+B)_i < 0;$

$\mathrm{Positive}(\mathrm{words}) = \sum_{i=0}^{N} \mathrm{Sensibility}(A+B)_i, \quad \text{if } \mathrm{Sensibility}(A+B)_i \ge 0;$

The final tendency degree of the text corpus is denoted Sensibility(Text), where

$\mathrm{Sensibility}(\mathrm{Text}) = \frac{\mathrm{Negative}(\mathrm{words}) + \mathrm{Positive}(\mathrm{words})}{\mathrm{Positive}(\mathrm{words}) - \mathrm{Negative}(\mathrm{words})};$

Based on the text tendency analysis of the vocabulary statistics mode, the system's information source and negative tendency analysis is completed, and the sentiment tendency value of the text is obtained:

$\mathrm{Negative}(\mathrm{Sentences}) = \sum_{i=0}^{N_0} V_i, \quad \text{if } V_i < 0;$

$\mathrm{Positive}(\mathrm{Sentences}) = \sum_{j=0}^{N_0} V_j, \quad \text{if } V_j \ge 0;$

Finally, the final text tendency degree is calculated as:

$\mathrm{Sensibility}(\mathrm{Text}) = \frac{\mathrm{Negative}(\mathrm{Sentences}) + \mathrm{Positive}(\mathrm{Sentences})}{\mathrm{Positive}(\mathrm{Sentences}) - \mathrm{Negative}(\mathrm{Sentences})}.$

7. The public opinion vertical search analysis method as claimed in claim 6, wherein said web page crawling strategy comprises a depth-first search strategy, a breadth-first search strategy, and a best-first search strategy.

8. The public opinion vertical search analysis method as claimed in claim 6, characterized in that said web page crawling involves four attributes: the URL to crawl, the crawl depth, the number of pages to crawl, and the URL filter flag.

9. The public opinion vertical search analysis method as claimed in claim 6, characterized in that the crawling operation of said vertical search engine crawler module is multithreaded and comprises the following processing:

10. The public opinion vertical search analysis method as claimed in claim 6, characterized in that a clause S_n is searched for the number of negative words contained in the negative adverb table; when the number of negative words is odd, the attitude weight V_n of clause S_n is converted to -(1-t)×V_n, where t is a fuzzy value related to the position of the negative word relative to the degree adverb: the fuzzy value is greater than zero when the negative word comes first and less than zero when the negative word comes after, and the fuzzy value is set between 0.2 and 0.4.