CN104537097A - Microblog public opinion monitoring system - Google Patents

Microblog public opinion monitoring system

Info

Publication number
CN104537097A
Authority
CN
China
Prior art keywords
public opinion
twitter
hot
module
phrase
Prior art date
Application number
CN 201510009995
Other languages
Chinese (zh)
Other versions
CN104537097B (en
Inventor
张鹏
Original Assignee
成都布林特信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 成都布林特信息技术有限公司
Priority to CN201510009995.2A
Publication of CN104537097A
Application granted
Publication of CN104537097B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2785 Semantic analysis

Abstract

The invention discloses a microblog public opinion monitoring system comprising a public opinion heat acquisition module, an intelligent crawler module, an extraction and preprocessing module, a feature phrase filtering module, a public opinion analysis module, a sentiment orientation analysis module, and a user interaction module. Using distributed cloud computing, the system obtains microblog public opinion hot spots through several monitoring algorithms and then comprehensively judges, classifies, and assesses the hot spots obtained, so that hot topics in microblog public opinion are monitored efficiently and accurately.

Description

Microblog public opinion monitoring system

Technical Field

[0001] The present invention relates to the field of Internet information processing technology, and in particular to a microblog public opinion monitoring system.

Background

[0002] With the rapid worldwide development of the Internet, online media has come to be recognized as the "fourth medium" after newspapers, radio, and television, and the Internet has become one of the main carriers reflecting public opinion.

[0003] Online public opinion consists of the influential and tendentious sentiments, attitudes, opinions, remarks, or views that the public holds on certain hot-spot or focal issues in real life and that spread via the Internet. It is mainly expressed and reinforced through posts, comments, and follow-up replies on BBS forums, news sites, blogs, and the like. Because the Internet is virtual, concealed, divergent, pervasive, and informal, more and more Internet users are willing to use it to express views and spread ideas.

[0004] With the rapid development of Internet technology, a new generation of media represented by microblogs has broken the control and monopoly of information. People freely express their attitudes and opinions online and no longer accept information unconditionally as easily as in the past; instead, the interests and demands of different social strata are voiced and different viewpoints collide head-on. For the relevant government departments, understanding microblog public opinion promptly and accurately, and strengthening its timely monitoring and effective guidance, has become a major difficulty in managing microblog public opinion. It is therefore necessary to build a microblog public opinion monitoring system that covers microblog data sources. In the new microblog communication environment, such a system can support deeper research into methods for judging microblog public opinion hot spots and into the impact of new media, enriching and improving research on microblog public opinion.

[0005] Many organizations have proposed different solutions for monitoring microblog public opinion. However, the technical problem to be solved by those skilled in the art is how to improve the efficiency and accuracy of identifying microblog public opinion information, because to date there has been no reasonably efficient and accurate online public opinion monitoring system for microblog media data.

Summary of the Invention

[0006] In view of the shortcomings of the background art described above, the present invention proposes a public opinion monitoring system for microblog media that achieves relatively high accuracy. The object of the invention is achieved by the following technical measures.

[0007] The present invention proposes a microblog public opinion monitoring system comprising: a public opinion heat acquisition module 1, an intelligent crawler module 2, an extraction and preprocessing module 3, a feature phrase filtering module 4, a public opinion analysis module 5, a sentiment orientation analysis module 6, and a user interaction module 7, wherein:

[0008] the public opinion heat acquisition module 1 screens, according to the public opinion heat weight of each microblog, the microblog pages that require public opinion analysis;

[0009] the intelligent crawler module 2 crawls microblog data within a specified time period from the specified microblog pages, analyzes the crawled data against predefined events, and filters out microblog data unrelated to the public opinion being monitored;

[0010] the extraction and preprocessing module 3 extracts and preprocesses the information in the microblog data obtained by the intelligent crawler module 2;

[0011] the feature phrase filtering module 4 filters the feature phrases in the microblog data processed by the extraction and preprocessing module 3;

[0012] the public opinion analysis module 5 discovers microblog public opinion hot spots on the basis of the microblog data processed by the feature phrase filtering module 4;

[0013] the sentiment orientation analysis module 6 performs sentiment orientation analysis on the microblog public opinion hot spots discovered;

[0014] the user interaction module 7 displays the microblog public opinion analysis results as charts or reports and provides user interaction functions.

[0015] Preferably, the public opinion heat acquisition module 1 computes the public opinion heat weight P of the microblog; if P is greater than a preset threshold TP, the microblog is used as a data source and basis for public opinion analysis. Specifically:

[0016] Let the number of views (clicks) of the microblog be K1, the number of comments K2, the number of replies K3, the number of "support" clicks K4, the number of "oppose" clicks K5, the number of reposts K6, and the number of favorites K7, and let β1 to β4 be preset, adjustable coefficients; then

[0017] $P = \left(\lg (K_1)^{3/4} + 0.03\right)\beta_1 + \left(\lg\left((K_2)^{2/3} + (K_3)^{2/3}\right) + 0.02\right)\beta_2 + \left(\lg\left((K_4)^{1/2} + (K_5)^{1/2}\right) + 0.01\right)\beta_3 + \left(\lg\left((K_6)^{1/3} + (K_7)^{1/3}\right) + 0.005\right)\beta_4$;

[0018] where β1 to β4 may be set to: β1 = 0.4; β2 = 0.2; β3 = 0.1; β4 = 0.1.
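The heat weight can be sketched in Python as follows. This is only an illustration under stated assumptions: lg is read as the base-10 logarithm, each fractional exponent is read as applying to the count inside the logarithm, and the function names `heat_weight` and `passes_threshold` are hypothetical. The text does not say how zero counts are handled, so the sketch simply rejects inputs that would make a logarithm undefined.

```python
import math

# Coefficients beta1..beta4 as suggested in paragraph [0018].
DEFAULT_BETAS = (0.4, 0.2, 0.1, 0.1)


def heat_weight(k1, k2, k3, k4, k5, k6, k7, betas=DEFAULT_BETAS):
    """Return the heat weight P of one microblog.

    k1..k7: views, comments, replies, supports, opposes, reposts, favorites.
    Assumption: lg is log base 10 and the exponents apply to the counts.
    """
    if min(k1, k2, k3, k4, k5, k6, k7) < 0:
        raise ValueError("counts must be non-negative")
    b1, b2, b3, b4 = betas
    inner = (
        k1 ** 0.75,
        k2 ** (2 / 3) + k3 ** (2 / 3),
        k4 ** 0.5 + k5 ** 0.5,
        k6 ** (1 / 3) + k7 ** (1 / 3),
    )
    if any(value <= 0 for value in inner):
        # The source does not define this case; rejecting it is an assumption.
        raise ValueError("each logarithm argument must be positive")
    return (
        (math.log10(inner[0]) + 0.03) * b1
        + (math.log10(inner[1]) + 0.02) * b2
        + (math.log10(inner[2]) + 0.01) * b3
        + (math.log10(inner[3]) + 0.005) * b4
    )


def passes_threshold(p, tp):
    """A microblog is kept for analysis only when P is greater than TP."""
    return p > tp
```

Because every term grows with its counts, a microblog with more views, comments, or reposts receives a larger P, and the threshold TP then decides whether it enters the analysis.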

[0019] Preferably, the intelligent crawler module 2 performs the following steps:

[0020] Step 2-1: analyze the microblog page against the events predefined in the system, filter out the links unrelated to the predefined events being monitored, retain the links related to those events, and place the retained links in the queue of URLs waiting to be crawled;

[0021] Step 2-2: according to a predefined search strategy, select from the URL queue the URL of the next page to crawl and repeat step 2-1; stop the crawling process once the stop condition preset in the system is met.
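Steps 2-1 and 2-2 describe a focused crawl loop, which might look like the sketch below. The names `focused_crawl`, `fetch_links`, and `is_relevant` are hypothetical, a first-in-first-out queue stands in for the unspecified search strategy, and a page limit stands in for the unspecified stop condition. Page fetching is passed in as a function so the sketch makes no network calls.

```python
from collections import deque


def focused_crawl(seed_urls, fetch_links, is_relevant, max_pages=100):
    """Crawl from seed_urls, following only links judged relevant.

    fetch_links(url) returns (link_url, link_text) pairs found on the page.
    is_relevant(link_text) decides whether a link relates to a predefined event.
    """
    queue = deque(seed_urls)      # URLs waiting to be crawled
    seen = set(seed_urls)         # avoids queueing a URL twice
    crawled = []
    while queue and len(crawled) < max_pages:   # stop condition (assumed)
        url = queue.popleft()     # FIFO stands in for the search strategy
        crawled.append(url)
        for link_url, link_text in fetch_links(url):
            # Step 2-1: drop unrelated links, queue the related ones.
            if link_url not in seen and is_relevant(link_text):
                seen.add(link_url)
                queue.append(link_url)
    return crawled
```

A different search strategy, such as best-first ordering by a relevance score, would replace the `deque` with a priority queue without changing the rest of the loop.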

[0022] Preferably, the extraction and preprocessing module 3 performs the following steps:

[0023] First, extract the information in the microblog body text that is useful for public opinion analysis, restructure the body text, and group together the microblog data that is representative of a topic;

[0024] Second, perform word segmentation, stop-word filtering, named entity recognition, syntactic parsing, part-of-speech tagging, sentiment recognition, and feature word extraction on the microblog data; then perform feature phrase extraction.
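A small part of this pipeline can be illustrated in Python. The sketch covers only tokenization, stop-word filtering, and candidate phrase extraction; the stop-word list, the function names, and the use of adjacent word pairs as candidate feature phrases are all assumptions made for illustration. Chinese text would need a dedicated word segmenter, which this regular-expression split does not provide.

```python
import re

# Illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and"}


def tokenize(text):
    """Lowercase and split on non-word characters (stand-in for segmentation)."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in stop_words]


def candidate_phrases(tokens, n=2):
    """Return each run of n adjacent tokens as a candidate feature phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For example, `candidate_phrases(remove_stop_words(tokenize("The flood in the city is severe")))` yields the phrases "flood city" and "city severe".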

[0025] Preferably, the feature phrase filtering module 4 performs the following steps:

[0026] Step 4-1: de-duplicate the feature phrases, including: recording the repeated feature phrases that appear in the microblog text together with their occurrence counts, and filtering out repeated feature phrases whose occurrence frequency is below the repetition threshold and repeated feature phrases whose length is below the repetition threshold;
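Step 4-1 amounts to counting phrases and applying two cut-offs, as in this sketch. The function name and the two separate parameters `min_count` and `min_length` are assumptions; the text uses the single term "repetition threshold" for both cut-offs and gives no values.

```python
from collections import Counter


def filter_repeated_phrases(phrases, min_count=2, min_length=2):
    """Count each phrase and keep those frequent and long enough.

    Returns a dict mapping each surviving phrase to its occurrence count.
    """
    counts = Counter(phrases)
    return {
        phrase: count
        for phrase, count in counts.items()
        if count >= min_count and len(phrase) >= min_length
    }
```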

[0027] Step 4-2: group the feature phrases, including: computing the similarity value between each feature phrase and the other feature phrases, and placing feature phrases whose similarity value is above the similarity threshold in the same group; if the similarity value between a feature phrase and every other feature phrase is 0, that feature phrase is filtered out. Specifically, either of the following two steps may be used to compute the similarity value Sims(X, Y) of two feature phrases X and Y before grouping:

[0028] Step 4-2-1:

[0029] First, let sum(XY) be the number of sentences in which feature phrases X and Y both appear, sum(X) the number of sentences in which X appears but Y does not, and sum(Y) the number of sentences in which Y appears but X does not. The similarity value Sims(X, Y) of feature phrases X and Y is then computed as follows:

[0030] $\mathrm{Sims}(X, Y) = \dfrac{\log_2 \mathrm{sum}(XY)}{\log_2 \mathrm{sum}(X)} + \dfrac{\log_2 \mathrm{sum}(XY)}{\log_2 \mathrm{sum}(Y)}$;

[0031] Second, if Sims(X, Y) exceeds the threshold TD1, feature phrase Y is placed in the group to which feature phrase X belongs;
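The co-occurrence similarity of step 4-2-1 can be sketched as below, reading the logarithm in the formula as base 2. Several points are assumptions: a phrase "appears" in a sentence when it is a substring of it, the function name is hypothetical, and the function returns 0.0 whenever a logarithm would be undefined or a denominator would be zero, since the text does not cover those cases.

```python
import math


def cooccurrence_similarity(sentences, x, y):
    """Sims(X, Y) from sentence counts, following step 4-2-1."""
    both = sum(1 for s in sentences if x in s and y in s)        # sum(XY)
    only_x = sum(1 for s in sentences if x in s and y not in s)  # sum(X)
    only_y = sum(1 for s in sentences if y in s and x not in s)  # sum(Y)
    # log2(0) is undefined and log2(1) = 0 would divide by zero.
    if both < 1 or only_x < 2 or only_y < 2:
        return 0.0
    log_both = math.log2(both)
    return log_both / math.log2(only_x) + log_both / math.log2(only_y)
```

With four sentences containing both phrases, two containing only X, and four containing only Y, the value is 2/1 + 2/2 = 3.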

[0032] Step 4-2-2:

[0033] First, let the numbers of characters in the two feature phrases X and Y be m and n respectively, let k be the smaller of m and n, and let X_i and Y_i denote the sub-phrases formed by the first i characters of X and Y, where i = 1, 2, …, k. Define |X_i − Y_i| as the number of characters in the longest common string of the sub-phrases X_i and Y_i. The similarity value Sims(X, Y) of feature phrases X and Y is then computed as follows:

[0034] $\mathrm{Sims}(X, Y) = \left( |X_1 - Y_1|^3 + |X_2 - Y_2|^3 + \dots + |X_k - Y_k|^3 \right)^{1/3}$;

[0035] Second, if Sims(X, Y) exceeds the threshold TD2, feature phrase Y is placed in the group to which feature phrase X belongs;
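A sketch of the prefix-based similarity of step 4-2-2 follows. It assumes that "longest common string" means the longest common contiguous substring; the text could also be read as a longest common subsequence, which would need a different helper. Both function names are hypothetical.

```python
def longest_common_substring_length(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def prefix_similarity(x, y):
    """Sims(X, Y): cube root of the summed cubes over prefixes 1..k."""
    k = min(len(x), len(y))
    total = sum(
        longest_common_substring_length(x[:i], y[:i]) ** 3
        for i in range(1, k + 1)
    )
    return total ** (1 / 3)
```

For "abc" and "abd" the prefix terms are 1, 2, and 2, so the value is the cube root of 1 + 8 + 8 = 17.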

[0036] Step 4-3: apply entropy filtering to the feature phrases, including: computing the entropy of each feature phrase, and filtering out feature phrases whose entropy is below a preset lower threshold as well as feature phrases whose entropy is above a preset upper threshold.
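The text does not define how the entropy of a feature phrase is computed. The sketch below assumes Shannon entropy, in bits, over the distribution of a phrase's occurrences across documents; that definition, the function names, and the inclusive bounds are all assumptions made for illustration.

```python
import math


def phrase_entropy(counts_per_document):
    """Shannon entropy (bits) of a phrase's occurrences across documents."""
    total = sum(counts_per_document)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts_per_document:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def entropy_filter(phrase_counts, lower, upper):
    """Keep phrases whose entropy lies between the two thresholds."""
    return [
        phrase
        for phrase, counts in phrase_counts.items()
        if lower <= phrase_entropy(counts) <= upper
    ]
```

Under this reading, a phrase confined to one document has entropy 0 and is dropped by the lower threshold, while a phrase spread evenly over many documents has high entropy and is dropped by the upper threshold.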

[0037] Preferably, the public opinion analysis module 5 analyzes and discovers microblog public opinion hot spots, comprising the following steps:

[0038] First, multiple microblog hot-spot discovery sub-modules, run in parallel as MapReduce distributed computations, are used to obtain microblog public opinion hot spots. The hot-spot discovery sub-modules include:

[0039] 1) a Single-Pass microblog hot-spot discovery sub-module 5.1, which uses the single-pass algorithm;

[0040] 2) a KNN microblog hot-spot discovery sub-module 5.2, which uses the k-nearest-neighbor (KNN) classification algorithm;

[0041] 3) an SVM microblog hot-spot discovery sub-module 5.3, which uses the support vector machine (SVM) algorithm;

[0042] 4) a K-means microblog hot-spot discovery sub-module 5.4, which uses the K-means clustering algorithm; and

[0043] 5) an SOM microblog hot-spot discovery sub-module 5.5, which uses the self-organizing map (SOM) neural network clustering algorithm;
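Of the five algorithms listed, single-pass clustering is the simplest to illustrate. The sketch below is a generic version, not the patent's implementation: it assumes whitespace-separated tokens, bag-of-words vectors, cosine similarity against a running cluster centroid, and a hypothetical threshold of 0.5.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def single_pass_cluster(documents, threshold=0.5):
    """Assign each document to its most similar cluster, or start a new one.

    Returns a list of clusters, each a list of document indices.
    """
    centroids = []   # summed term counts per cluster
    clusters = []
    for index, text in enumerate(documents):
        vector = Counter(text.split())
        scores = [cosine(vector, centroid) for centroid in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            clusters[best].append(index)
            centroids[best].update(vector)   # fold the document into the centroid
        else:
            clusters.append([index])
            centroids.append(Counter(vector))
    return clusters
```

Each document is examined exactly once, which is why this family of methods suits streaming microblog data; the largest clusters are then natural hot-spot candidates.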

[0044] Second, all the microblog public opinion hot spots obtained by the individual hot-spot discovery sub-modules are pooled and classified as follows:

[0045] if a microblog public opinion hot spot was obtained from three or more of the above hot-spot discovery sub-modules, it is labeled a high-level microblog public opinion hot spot;

[0046] if a microblog public opinion hot spot was obtained from two of the above hot-spot discovery sub-modules, it is labeled an intermediate microblog public opinion hot spot;

[0047] if a microblog public opinion hot spot was obtained from only one of the above hot-spot discovery sub-modules, it is labeled a primary microblog public opinion hot spot;

[0048] Finally, the high-level, intermediate, and primary microblog public opinion hot spots are sent, in that order, to the sentiment orientation analysis module 6.
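The pooling and grading rule can be sketched as a simple vote count. The sketch assumes that hot spots reported by different sub-modules can be matched by an identical topic label, which the text does not specify; the function names and the sub-module keys in the usage note are hypothetical.

```python
from collections import defaultdict


def grade_hot_spots(results_by_submodule):
    """Grade each hot spot by how many sub-modules reported it."""
    sources = defaultdict(set)
    for submodule, topics in results_by_submodule.items():
        for topic in topics:
            sources[topic].add(submodule)
    graded = {"high": [], "intermediate": [], "primary": []}
    for topic in sorted(sources):          # sorted for a stable order
        votes = len(sources[topic])
        if votes >= 3:
            graded["high"].append(topic)
        elif votes == 2:
            graded["intermediate"].append(topic)
        else:
            graded["primary"].append(topic)
    return graded


def ordered_for_sentiment_analysis(graded):
    """High-level hot spots first, then intermediate, then primary."""
    return graded["high"] + graded["intermediate"] + graded["primary"]
```

For instance, a topic reported by the Single-Pass, K-means, and SOM sub-modules is graded high-level and is passed to sentiment analysis before a topic reported by the SVM sub-module alone.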

[0049] Preferably, the sentiment orientation analysis module 6 performs text sentiment orientation analysis on microblogs, comprising the following steps:

[0050] Step 6-1: manually select a number of common Chinese and English adjectives, nouns, and verbs that carry a sentiment orientation as the initial seed set; in the initial seed set, the number of adjectives may be 50 and the number of nouns and verbs may be 100;

[0051] Step 6-2: resolve every pronoun in the microblog text that has a referential relationship back to the noun it originally refers to, so that objects are not missed or misjudged during analysis;

[0052] Step 6-3: taking each microblog sentence as a unit, use part-of-speech (POS) tagging and semantic role labeling (SRL) to analyze the constituents of each sentence and extract the subjective words in each sentence;

[0053] Step 6-4: feed in the subjective words of each sentence in turn and automatically label their sentiment orientation according to the seed set; for a subjective word that cannot be labeled automatically, its sentiment orientation is judged manually and the word is then added to the seed set.
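Step 6-4 describes a seed lexicon that grows whenever automatic labeling fails. A minimal sketch, assuming the seed set is a dictionary from word to orientation label and that the manual judgment is supplied through a hypothetical callback `ask_human`:

```python
def label_subjective_words(words, seed_set, ask_human):
    """Label each word from the seed set, asking a human for unknown words.

    seed_set maps word -> orientation and is extended in place.
    ask_human(word) returns the orientation chosen by a human annotator.
    """
    labels = {}
    for word in words:
        if word not in seed_set:
            # Cannot be labeled automatically: judge manually, then
            # add the word to the seed set for future use.
            seed_set[word] = ask_human(word)
        labels[word] = seed_set[word]
    return labels
```

Because the seed set is updated in place, a word judged manually once is labeled automatically on every later occurrence.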

[0054] Preferably, the user interaction module 7 provides user interaction functions, and the charts or reports it can produce include: a microblog public opinion heat ranking report, a microblog public opinion early-warning information distribution report, a microblog public opinion geographic distribution report, a microblog public opinion sentiment analysis report, a microblog public opinion status statistics report, and a microblog public opinion trend analysis chart.

[0055] In the prior art, the main data sources for online public opinion are generally websites or forums, and monitoring systems dedicated to microblog public opinion data are relatively few; even those dedicated systems often suffer from low accuracy or efficiency for various reasons. The present invention proposes a monitoring system dedicated to public opinion data from microblog data sources.

[0056] Compared with the prior art, the present invention has the following advantages:

[0057] First, the microblog public opinion monitoring system of the invention targets microblog network resources. The collected microblog data passes through the data processing steps of public opinion heat acquisition, intelligent crawling, extraction and preprocessing, feature phrase filtering, public opinion analysis, and sentiment orientation analysis, which effectively improves the efficiency of filtering public opinion data from microblog data sources;

[0058] Second, by means of distributed cloud computing, the system can mine and analyze data collected at large scale, can obtain microblog public opinion hot spots using multiple microblog public opinion monitoring algorithm modules, and can comprehensively judge and classify those hot spots. This enables the discovery and tracking of hot topics in microblog public opinion, social network analysis of microblogs, and visual presentation of the analysis results. It gives Party and government organs, large enterprises, and other organizations automated, systematic, and scientific information support for promptly detecting sensitive microblog information, keeping track of microblog public opinion hot spots, following microblog public opinion trends, and responding to microblog public opinion crises. It thereby improves the accuracy of the system's judgments and provides a more truthful and accurate basis for the subsequent processing of microblog public opinion information.

Brief Description of the Drawings

[0059] The technical solution of the invention is further described below with reference to the drawing. The drawing serves only to illustrate a preferred embodiment and is not to be considered a limitation of the invention.

[0060] Fig. 1 shows the functional structure of a microblog public opinion monitoring system according to an embodiment of the invention.

Detailed Description

[0061] Various other advantages and benefits will become clear to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The description is only an overview of the technical solution of the invention, given so that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages of the invention are more readily apparent.

[0062] Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.

[0063] The invention claims a microblog public opinion monitoring system comprising: a public opinion heat acquisition module, an intelligent crawler module, an extraction and preprocessing module, a feature phrase filtering module, a public opinion analysis module, a sentiment orientation analysis module, and a user interaction module. The public opinion analysis module uses distributed cloud computing and multiple microblog public opinion monitoring algorithm sub-modules to obtain microblog public opinion hot spots, and comprehensively judges, classifies, and evaluates the hot spots obtained, so that hot topics in microblog public opinion are monitored relatively efficiently and accurately.

[0064] Fig. 1 is a functional structure diagram of a microblog public opinion monitoring system according to an embodiment of the invention.

[0065] As shown in Fig. 1, the microblog public opinion monitoring system comprises seven modules: a public opinion heat acquisition module 1, an intelligent crawler module 2, an extraction and preprocessing module 3, a feature phrase filtering module 4, a public opinion analysis module 5, a sentiment orientation analysis module 6, and a user interaction module 7, wherein:

[0066] the public opinion heat acquisition module 1 screens, according to the public opinion heat weight of each microblog, the microblog pages that require public opinion analysis;

[0067] the intelligent crawler module 2 crawls microblog data within a specified time period from the specified microblog pages, analyzes the crawled data against predefined events, and filters out microblog data unrelated to the public opinion being monitored;

[0068] the extraction and preprocessing module 3 extracts and preprocesses the information in the microblog data obtained by the intelligent crawler module 2;

[0069] the feature phrase filtering module 4 filters the feature phrases in the microblog data processed by the extraction and preprocessing module 3;

[0070] the public opinion analysis module 5 discovers microblog public opinion hot spots on the basis of the microblog data processed by the feature phrase filtering module 4;

[0071] the sentiment orientation analysis module 6 performs sentiment orientation analysis on the microblog public opinion hot spots discovered;

[0072] the user interaction module 7 displays the microblog public opinion analysis results as charts or reports and provides user interaction functions.

[0073] Specifically, the public opinion heat acquisition module 1 computes the public opinion heat weight P of the microblog; if P is greater than a preset threshold TP, the microblog is used as a data source and basis for public opinion analysis. Specifically:

[0074] Let the number of views (clicks) of the microblog be K1, the number of comments K2, the number of replies K3, the number of "support" clicks K4, the number of "oppose" clicks K5, the number of reposts K6, and the number of favorites K7, and let β1 to β4 be preset, adjustable coefficients; then

[0075] $P = \left(\lg (K_1)^{3/4} + 0.03\right)\beta_1 + \left(\lg\left((K_2)^{2/3} + (K_3)^{2/3}\right) + 0.02\right)\beta_2 + \left(\lg\left((K_4)^{1/2} + (K_5)^{1/2}\right) + 0.01\right)\beta_3 + \left(\lg\left((K_6)^{1/3} + (K_7)^{1/3}\right) + 0.005\right)\beta_4$;

[0076] Preferably, the coefficients β1 to β4 may be set to: β1 = 0.4; β2 = 0.2; β3 = 0.1; β4 = 0.1.

[0077] Specifically, the intelligent crawler module 2 performs the following steps:

[0078] Step 2-1: analyze the microblog page against the events predefined in the system, filter out the links unrelated to the predefined events being monitored, retain the links related to those events, and place the retained links in the queue of URLs waiting to be crawled;

[0079] Step 2-2: according to a predefined search strategy, select from the URL queue the URL of the next page to crawl and repeat step 2-1; stop the crawling process once the stop condition preset in the system is met.

[0080] Specifically, the extraction and preprocessing module 3 performs the following steps:

[0081] First, extract the information in the microblog body text that is useful for public opinion analysis, restructure the body text, and group together the microblog data that is representative of a topic;

[0082] Second, perform word segmentation, stop-word filtering, named entity recognition, syntactic parsing, part-of-speech tagging, sentiment recognition, and feature word extraction on the microblog data; then perform feature phrase extraction.

[0083] 具体地,所述特征短语过滤模块4执行以下步骤: [0083] Specifically, the characteristic phrases filtering module 4 performs the following steps:

[0084] 步骤4-1,对特征短语进行去重,包括:记录微博的文本中出现的重复性特征短语以及其出现的次数,过滤掉出现频率低于重复阈值的重复性特征短语和长度低于重复阈值的重复性特征短语; [0084] Step 4-1, for the de-emphasis characteristic phrases, comprising: a recording of the text Twitter repetitive character appearing phrases and their occurrence, filtered and repetitive character phrase length less than the frequency threshold value out repeated occurrence below the threshold repeated repetitive character phrase;

[0085] 步骤4-2,对特征短语进行分组,包括:计算每个特征短语与其他特征短语之间的相似度值,将相似度值高于相似度阈值的特征短语分入相同的组;如果一个特征短语与所有其他特征短语之间的相似度值都为0,则将该特征短语过滤掉;具体地,可以选择以下两个步骤之一来计算所述两个特征短语Χ、γ的相似度值Sims (X,Y),然后进行特征短语分组: [0085] Step 4-2, the characteristic phrase group, comprising: calculating a similarity between each feature value and other features phrase phrase, wherein the similarity values ​​above phrase similarity threshold classified into the same group; If a characteristic phrases and the similarity value between the other characteristic phrases are all 0, then filtering out the characteristic phrases; specifically, one of the following two steps calculates the two characteristic phrases Χ, γ of similarity value Sims (X, Y), then the phrase group wherein:

[0086]步骤 4-2-1: [0086] Step 4-2-1:

[0087] 首先,假设同时出现特征短语Χ、Υ的句子的数量为sum(XY);仅出现特征短语X,不出现特征短语Y的句子的数量为SUm(X);仅出现特征短语Y,不出现特征短语X的句子的数量为sum(Y);此时,特征短语X、Y的相似度值Sims (X,Y)计算公式如下: [0087] First, assume that occur simultaneously wherein the phrase [chi], the number Υ sentences as SUM (the XY); the number of feature phrase X appears only in sentences characteristic phrases Y does not appear as SUM (X); characterized phrase Y appears only, wherein the number of the phrase does not appear as X sentences SUM (Y); in this case, the signature phrase of X, Y similarity Sims values ​​(X, Y) is calculated as follows:

[0088] Sims (X, Y) = 1g2 (sum (XY)) /1g2 (sum (X)) +1g2 (sum (XY)) /1g2 (sum (Y)); [0088] Sims (X, Y) = 1g2 (sum (XY)) / 1g2 (sum (X)) + 1g2 (sum (XY)) / 1g2 (sum (Y));

[0089] 其次,如果Sims (X,Y)(阈值TDl,则将特征短语Y分入特征短语X所在的组; [0089] Next, if Sims (X, Y) (TDl threshold value, then the signature phrase of the phrase group Y wherein X divided into resides;

[0090] Step 4-2-2:

[0091] First, let the numbers of characters in the two feature phrases X and Y be m and n respectively, and let k be the smaller of m and n. Let $X_i$ and $Y_i$ denote the sub-phrases formed by the first i characters of X and Y, where i = 1, 2, …, k, and define $|X_i - Y_i|$ as the number of characters in the longest common string of the sub-phrases $X_i$ and $Y_i$. The similarity value Sims(X, Y) of feature phrases X and Y is then computed as follows:

[0092] $\mathrm{Sims}(X, Y) = \left(|X_1 - Y_1|^3 + |X_2 - Y_2|^3 + \dots + |X_k - Y_k|^3\right)^{1/3}$;

[0093] Next, if Sims(X, Y) is above the threshold TD2, feature phrase Y is placed into the group containing feature phrase X;
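
The prefix-based similarity of step 4-2-2 can be sketched as follows. The sketch assumes that "longest common string" means the longest common contiguous substring; a longest common subsequence would be an equally plausible reading, and the source does not settle it. The function names are hypothetical.

```python
def longest_common_substring_len(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def prefix_similarity(x, y):
    """Sims(X, Y): cube root of the summed cubes of |X_i - Y_i|, i = 1..k."""
    k = min(len(x), len(y))
    total = sum(
        longest_common_substring_len(x[:i], y[:i]) ** 3
        for i in range(1, k + 1)
    )
    return total ** (1.0 / 3.0)


if __name__ == "__main__":
    print(prefix_similarity("abc", "abd"))
```

For "abc" and "abd" the prefix overlaps are 1, 2, and 2 characters, so the sum of cubes is 17. Recomputing the overlap for every prefix is quadratic per prefix, which is acceptable for short feature phrases.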

[0094] Step 4-3: apply entropy filtering to the feature phrases. Compute the entropy of each feature phrase, and filter out feature phrases whose entropy is below a preset lower threshold as well as feature phrases whose entropy is above a preset upper threshold.
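
The entropy filter of step 4-3 can be illustrated as below. The source does not define how the entropy of a feature phrase is computed, so this sketch assumes the Shannon entropy of the distribution of a phrase's occurrences across microblog posts. The names `shannon_entropy`, `entropy_filter`, `lower`, and `upper` are hypothetical.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total) for c in counts if c > 0
    )


def entropy_filter(phrase_counts, lower, upper):
    """Keep phrases whose entropy lies within [lower, upper].

    phrase_counts maps a phrase to its per-post occurrence counts
    (an assumed representation, not taken from the source).
    """
    return [
        phrase
        for phrase, counts in phrase_counts.items()
        if lower <= shannon_entropy(counts) <= upper
    ]


if __name__ == "__main__":
    data = {"a": [4], "b": [1, 1], "c": [1, 1, 1, 1]}
    print(entropy_filter(data, lower=0.5, upper=1.5))
```

Under this reading, a phrase concentrated in a single post has entropy 0 and is dropped by the lower bound, while a phrase spread evenly over many posts is dropped by the upper bound.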

[0095] Specifically, the public opinion analysis module 5 analyzes microblog data and discovers microblog public opinion hot topics. It works as follows:

[0096] The invention adopts distributed cloud computing, so it can mine and analyze microblog data collected at large scale. It obtains microblog public opinion hot topics from several public opinion monitoring algorithm modules and classifies those hot topics by a combined judgment, thereby discovering and tracking hot microblog public opinion topics, performing social network analysis of microblogs, and presenting the analysis results visually. This gives Party and government organs, large enterprises, and other units and organizations automated, systematic, and scientific information support for detecting sensitive microblog information promptly, keeping track of public opinion hot topics, grasping public opinion trends, and responding to public opinion crises. It effectively improves the accuracy of the monitoring system's judgments and provides a more truthful and accurate basis for the subsequent processing of microblog public opinion information. Specifically:

[0097] The collected microblog data and the analysis results are stored in a distributed storage layer, which is implemented on HDFS;

[0098] In the distributed computing layer, the MapReduce parallel computing method is used to parallelize the computation;

[0099] Optimizing HDFS file storage and transfer and optimizing MapReduce parallel computation optimizes the monitoring of massive volumes of microblog public opinion and achieves stable, efficient big-data storage, which in turn optimizes query processing over massive microblog public opinion data, with good scalability, reliability, and security. The system is based on a cloud platform, responds quickly, and supports analysis and mining services over massive microblog data.

[0100] The public opinion analysis module 5 analyzes and discovers microblog public opinion hot topics in the following steps:

[0101] First, several microblog hot-topic discovery sub-modules obtain microblog public opinion hot topics through parallel distributed computation. The sub-modules include:

[0102] 1) Single-Pass microblog hot-topic discovery sub-module 5.1, which uses a MapReduce-based single-pass algorithm;

[0103] 2) KNN microblog hot-topic discovery sub-module 5.2, which uses a MapReduce-based K-nearest-neighbor (KNN) classification algorithm;

[0104] 3) SVM microblog hot-topic discovery sub-module 5.3, which uses a MapReduce-based support vector machine (SVM) algorithm;

[0105] 4) K-means microblog hot-topic discovery sub-module 5.4, which uses a MapReduce-based K-means clustering algorithm; and

[0106] 5) SOM microblog hot-topic discovery sub-module 5.5, which uses a MapReduce-based self-organizing map (SOM) neural network clustering algorithm;

[0107] Next, all microblog public opinion hot topics obtained by the individual hot-topic discovery sub-modules are pooled and classified as follows:

[0108] If an obtained microblog public opinion hot topic comes from three or more of the above hot-topic discovery sub-modules, its category is marked as advanced microblog public opinion hot topic;

[0109] If an obtained microblog public opinion hot topic comes from two of the above hot-topic discovery sub-modules, its category is marked as intermediate microblog public opinion hot topic;

[0110] If an obtained microblog public opinion hot topic comes from only one of the above hot-topic discovery sub-modules, its category is marked as primary microblog public opinion hot topic;

[0111] Finally, the advanced, intermediate, and primary microblog public opinion hot topics are sent, in that order, to the sentiment tendency analysis module 6.
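
The pooling and grading described in paragraphs [0107] to [0111] can be sketched as follows. The sketch assumes that hot topics reported by different sub-modules can be matched by an identical key; how topics from different algorithms are actually aligned is not specified in the source. The function name `grade_hot_topics` and the module keys are hypothetical.

```python
from collections import defaultdict


def grade_hot_topics(results_by_module):
    """Grade each topic by how many sub-modules reported it."""
    sources = defaultdict(set)
    for module, topics in results_by_module.items():
        for topic in topics:
            sources[topic].add(module)

    graded = {"advanced": [], "intermediate": [], "primary": []}
    for topic, modules in sources.items():
        if len(modules) >= 3:
            graded["advanced"].append(topic)      # three or more sub-modules
        elif len(modules) == 2:
            graded["intermediate"].append(topic)  # exactly two sub-modules
        else:
            graded["primary"].append(topic)       # a single sub-module
    return graded


def dispatch_order(graded):
    """Topics in the order they are sent on: advanced, intermediate, primary."""
    return graded["advanced"] + graded["intermediate"] + graded["primary"]


if __name__ == "__main__":
    results = {
        "single_pass": ["t1", "t2"],
        "knn": ["t1", "t2"],
        "svm": ["t1"],
        "kmeans": ["t3"],
        "som": [],
    }
    print(grade_hot_topics(results))
```

Collecting the reporting sub-modules in a set means a topic reported twice by the same sub-module is still counted once.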

[0112] The algorithms used by the hot-topic discovery sub-modules 5.1 to 5.5 are all general-purpose algorithms well known in the art, so the improvement of the invention does not lie in the algorithms themselves. Existing microblog public opinion monitoring systems usually use only one of these hot-topic discovery algorithms; no system has yet been found that uses several of them simultaneously and grades the results of those algorithms. Moreover, although the system of the invention uses several hot-topic discovery algorithms, it adopts a cloud-based distributed architecture, so it does not incur prohibitive overhead, and the combination of approaches greatly improves the accuracy of the microblog public opinion monitoring system, achieving a good technical effect.

[0113] Specifically, the sentiment tendency analysis module 6 performs sentiment tendency analysis on microblog text, comprising the following steps:

[0114] Step 6-1: manually select a number of common Chinese and English adjectives, nouns, and verbs that carry sentiment tendency as the initial seed set; preferably, in the initial seed set the number of adjectives may be 50 and the number of nouns and verbs may be 100;

[0115] Step 6-2: restore every pronoun in the microblog text that has an anaphoric relation to its original nominal referent, to prevent objects from being missed or misjudged during analysis;

[0116] Step 6-3: taking the microblog sentence as the unit, analyze the constituents of each sentence with part-of-speech (POS) tagging and semantic role labeling (SRL), and extract the subjective words in each sentence;

[0117] Step 6-4: input the subjective words of each sentence in turn and automatically label their sentiment tendency according to the seed set; for subjective words that cannot be labeled automatically, a human judges their sentiment tendency, after which the subjective word is added to the seed set.
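
The seed-set labeling loop of step 6-4 can be sketched as below. This is a simplified illustration under stated assumptions: automatic labeling is reduced to an exact lookup in the seed set, whereas the source does not say how a word is matched against the seeds, and the manual judgment is modeled by a hypothetical callback `ask_human`.

```python
def label_subjective_words(words, seed_set, ask_human):
    """Label words from the seed set, growing it with manual judgments.

    seed_set maps a word to its sentiment tendency, for example
    "positive" or "negative"; it is updated in place.
    """
    labels = {}
    for word in words:
        if word not in seed_set:
            # No automatic label available: fall back to a manual
            # judgment and add the word to the seed set.
            seed_set[word] = ask_human(word)
        labels[word] = seed_set[word]
    return labels


if __name__ == "__main__":
    seeds = {"good": "positive", "bad": "negative"}
    result = label_subjective_words(
        ["good", "awful", "awful"], seeds, ask_human=lambda w: "negative"
    )
    print(result, seeds)
```

Because each manually judged word is added to the seed set, the second occurrence of the same word is labeled automatically, which is the bootstrapping effect the step describes.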

[0118] Specifically, the charts or reports that the user interaction module 7 can produce for the user include: a microblog public opinion popularity ranking report, a microblog public opinion early-warning information distribution report, a microblog public opinion geographic information distribution report, a microblog public opinion sentiment analysis report, a microblog public opinion status statistics report, and a microblog public opinion trend analysis chart.

[0119] The embodiments of the system and its constituent modules described in this specification are merely illustrative; some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiments of the invention. A person of ordinary skill in the art can understand and implement them without creative effort.

[0120] The above is merely a preferred specific embodiment of the invention, but the scope of protection of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims (8)

1. A microblog public opinion monitoring system, comprising: a public opinion popularity acquisition module (1), an intelligent crawler module (2), an extraction and preprocessing module (3), a feature phrase filtering module (4), a public opinion analysis module (5), a sentiment tendency analysis module (6), and a user interaction module (7), wherein: the public opinion popularity acquisition module (1) selects, according to the public opinion popularity weight of a microblog, the microblog pages on which public opinion analysis is to be performed; the intelligent crawler module (2) crawls the microblog data of specified microblog pages within a specified time, analyzes the crawled microblog data according to predefined events, and filters out microblog data unrelated to the public opinion to be monitored; the extraction and preprocessing module (3) extracts and preprocesses the information in the microblog data obtained by the intelligent crawler module (2); the feature phrase filtering module (4) filters the feature phrases in the microblog data processed by the extraction and preprocessing module (3); the public opinion analysis module (5) discovers microblog public opinion hot topics on the basis of the microblog data processed by the feature phrase filtering module (4); the sentiment tendency analysis module (6) performs sentiment tendency analysis on the discovered microblog public opinion hot topics; and the user interaction module (7) displays the microblog public opinion analysis results as charts or reports and provides user interaction.
2. The microblog public opinion monitoring system according to claim 1, wherein: the public opinion popularity acquisition module (1) computes the public opinion popularity weight P of the microblog, and if P is greater than a preset threshold TP, the microblog is used as a data source and analysis basis for public opinion analysis; specifically: let the microblog's number of views (clicks) be K1, number of comments K2, number of replies K3, number of support clicks K4, number of oppose clicks K5, number of reposts K6, and number of favorites K7, and let β1 to β4 be preset, adjustable coefficients; then $P = \left(\lg(K_1)^{3/4} + 0.03\right)\beta_1 + \left(\lg\left(K_2^{2/3} + K_3^{2/3}\right) + 0.02\right)\beta_2 + \left(\lg\left(K_4^{1/2} + K_5^{1/2}\right) + 0.01\right)\beta_3 + \left(\lg\left(K_6^{1/3} + K_7^{1/3}\right) + 0.005\right)\beta_4$; where β1 to β4 may be set to: β1 = 0.4; β2 = 0.2; β3 = 0.1; β4 = 0.1.
3. The microblog public opinion monitoring system according to claim 2, wherein the intelligent crawler module (2) performs the following steps: Step 2-1: analyze the microblog page against the system's predefined events, filter out the links unrelated to the predefined events to be monitored, retain the links related to the predefined events, and place them in the queue of URLs waiting to be fetched; Step 2-2: according to a predefined search strategy, select from the URL queue the URL of the page to be fetched under that strategy, and repeat step 2-1; the crawl stops once the system's preset stop condition is met.
4. The microblog public opinion monitoring system according to claim 3, wherein the extraction and preprocessing module (3) performs the following steps: first, extract the information in the microblog body text that is useful for microblog public opinion analysis, reconstruct the body text, and gather together the microblog data that is representative of a topic; next, perform word segmentation, stop-word filtering, named entity recognition, syntactic parsing, part-of-speech tagging, sentiment recognition, and feature word extraction on the microblog data; then extract feature phrases.
5. The microblog public opinion monitoring system according to claim 4, wherein the feature phrase filtering module (4) performs the following steps: Step 4-1: de-duplicate the feature phrases, including: recording the repeated feature phrases that appear in the microblog text together with their occurrence counts, and filtering out repeated feature phrases whose frequency of occurrence is below the repetition threshold as well as repeated feature phrases whose length is below the repetition threshold; Step 4-2: group the feature phrases, including: computing the similarity value between each feature phrase and every other feature phrase, and placing feature phrases whose similarity value is above the similarity threshold into the same group; if the similarity value between a feature phrase and all other feature phrases is 0, filtering that feature phrase out; specifically, either of the following two steps may be chosen to compute the similarity value Sims(X, Y) of two feature phrases X and Y, after which the feature phrases are grouped: Step 4-2-1: first, let sum(XY) be the number of sentences in which feature phrases X and Y both appear, sum(X) the number of sentences in which only X appears and Y does not, and sum(Y) the number of sentences in which only Y appears and X does not; the similarity value is then $\mathrm{Sims}(X, Y) = \dfrac{\log_2 \mathrm{sum}(XY)}{\log_2 \mathrm{sum}(X)} + \dfrac{\log_2 \mathrm{sum}(XY)}{\log_2 \mathrm{sum}(Y)}$; next, if Sims(X, Y) is above the threshold TD1, feature phrase Y is placed into the group containing feature phrase X; Step 4-2-2: first, let the numbers of characters in the two feature phrases X and Y be m and n respectively, and let k be the smaller of m and n; let $X_i$ and $Y_i$ denote the sub-phrases formed by the first i characters of X and Y, where i = 1, 2, …, k, and define $|X_i - Y_i|$ as the number of characters in the longest common string of the sub-phrases $X_i$ and $Y_i$; the similarity value is then $\mathrm{Sims}(X, Y) = \left(|X_1 - Y_1|^3 + |X_2 - Y_2|^3 + \dots + |X_k - Y_k|^3\right)^{1/3}$; next, if Sims(X, Y) is above the threshold TD2, feature phrase Y is placed into the group containing feature phrase X; Step 4-3: apply entropy filtering to the feature phrases, including: computing the entropy of each feature phrase, and filtering out feature phrases whose entropy is below a preset lower threshold as well as feature phrases whose entropy is above a preset upper threshold.
6. The microblog public opinion monitoring system according to claim 5, wherein the public opinion analysis module (5) analyzes and discovers microblog public opinion hot topics in the following steps: first, several microblog hot-topic discovery sub-modules obtain microblog public opinion hot topics through parallel MapReduce distributed computation, the sub-modules including: 1) a Single-Pass microblog hot-topic discovery sub-module (5.1), which uses the single-pass algorithm; 2) a KNN microblog hot-topic discovery sub-module (5.2), which uses the K-nearest-neighbor (KNN) classification algorithm; 3) an SVM microblog hot-topic discovery sub-module (5.3), which uses the support vector machine (SVM) algorithm; 4) a K-means microblog hot-topic discovery sub-module (5.4), which uses the K-means clustering algorithm; and 5) an SOM microblog hot-topic discovery sub-module (5.5), which uses the self-organizing map (SOM) neural network clustering algorithm; next, all microblog public opinion hot topics obtained by the individual hot-topic discovery sub-modules are pooled and classified as follows: if an obtained hot topic comes from three or more of the above hot-topic discovery sub-modules, its category is marked as advanced microblog public opinion hot topic; if it comes from two of the above sub-modules, its category is marked as intermediate microblog public opinion hot topic; if it comes from only one of the above sub-modules, its category is marked as primary microblog public opinion hot topic; finally, the advanced, intermediate, and primary microblog public opinion hot topics are sent, in that order, to the sentiment tendency analysis module (6).
7. The microblog public opinion monitoring system according to claim 6, wherein the sentiment tendency analysis module (6) performs sentiment tendency analysis on microblog text, comprising the following steps: Step 6-1: manually select a number of common Chinese and English adjectives, nouns, and verbs that carry sentiment tendency as the initial seed set, wherein in the initial seed set the number of adjectives may be 50 and the number of nouns and verbs may be 100; Step 6-2: restore every pronoun in the microblog text that has an anaphoric relation to its original nominal referent, to prevent objects from being missed or misjudged during analysis; Step 6-3: taking the microblog sentence as the unit, analyze the constituents of each sentence with part-of-speech (POS) tagging and semantic role labeling (SRL), and extract the subjective words in each sentence; Step 6-4: input the subjective words of each sentence in turn and automatically label their sentiment tendency according to the seed set; for subjective words that cannot be labeled automatically, a human judges their sentiment tendency, after which the subjective word is added to the seed set.
8. The microblog public opinion monitoring system according to claim 7, wherein the user interaction module (7) provides user interaction, and the charts or reports it can produce include: a microblog public opinion popularity ranking report, a microblog public opinion early-warning information distribution report, a microblog public opinion geographic information distribution report, a microblog public opinion sentiment analysis report, a microblog public opinion status statistics report, and a microblog public opinion trend analysis chart.
CN201510009995.2A 2015-01-09 2015-01-09 Microblogging public opinion monitoring system CN104537097B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510009995.2A CN104537097B (en) 2015-01-09 2015-01-09 Microblogging public opinion monitoring system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510009995.2A CN104537097B (en) 2015-01-09 2015-01-09 Microblogging public opinion monitoring system

Publications (2)

Publication Number Publication Date
CN104537097A true CN104537097A (en) 2015-04-22
CN104537097B CN104537097B (en) 2017-08-11

Family

ID=52852625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510009995.2A CN104537097B (en) 2015-01-09 2015-01-09 Microblogging public opinion monitoring system

Country Status (1)

Country Link
CN (1) CN104537097B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809108A (en) * 2015-05-20 2015-07-29 成都布林特信息技术有限公司 Information monitoring and analyzing system
CN104915386A (en) * 2015-05-25 2015-09-16 中国科学院自动化研究所 Short text clustering method based on deep semantic feature learning
CN105491117A (en) * 2015-11-26 2016-04-13 北京航空航天大学 Flow chart data processing system and method for real time data analysis
CN105630970A (en) * 2015-12-24 2016-06-01 哈尔滨工业大学 Social media data processing system and method
CN106230809A (en) * 2016-07-27 2016-12-14 南京快页数码科技有限公司 Mobile Internet public opinion monitoring method and system based on URL
WO2016206395A1 (en) * 2015-06-25 2016-12-29 中兴通讯股份有限公司 Weekly report information processing method and device
CN106339463A (en) * 2016-08-26 2017-01-18 中国传媒大学 Network public opinion early-warning system based on logistic model and early-warning method thereof
CN106598944A (en) * 2016-11-25 2017-04-26 中国民航大学 Civil aviation security public opinion emotion analysis method
CN106778895A (en) * 2016-12-29 2017-05-31 西安工程大学 Kernel k-means method based on local density and single-pass
CN106777040A (en) * 2016-12-09 2017-05-31 厦门大学 Cross-media microblog public opinion analysis method based on emotion polarity perception algorithm
WO2018184518A1 (en) * 2017-04-07 2018-10-11 平安科技(深圳)有限公司 Microblog data processing method and device, computer device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090248434A1 (en) * 2008-03-31 2009-10-01 Datanetics Ltd. Analyzing transactional data
CN102708096A (en) * 2012-05-29 2012-10-03 代松 Network intelligence public sentiment monitoring system based on semantics and work method thereof
CN103092950A (en) * 2013-01-15 2013-05-08 重庆邮电大学 Online public opinion geographical location real time monitoring system and method
CN103544294A (en) * 2013-10-30 2014-01-29 北京京东尚科信息技术有限公司 Keyword popularity automatic control method
CN103559176A (en) * 2012-10-29 2014-02-05 中国人民解放军国防科学技术大学 Microblog emotional evolution analysis method and system

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809108A (en) * 2015-05-20 2015-07-29 成都布林特信息技术有限公司 Information monitoring and analyzing system
CN104915386B (en) * 2015-05-25 2018-04-27 中国科学院自动化研究所 Short text clustering method based on deep semantic feature learning
CN104915386A (en) * 2015-05-25 2015-09-16 中国科学院自动化研究所 Short text clustering method based on deep semantic feature learning
WO2016206395A1 (en) * 2015-06-25 2016-12-29 中兴通讯股份有限公司 Weekly report information processing method and device
CN105491117A (en) * 2015-11-26 2016-04-13 北京航空航天大学 Flow chart data processing system and method for real time data analysis
CN105491117B (en) * 2015-11-26 2018-12-21 北京航空航天大学 Streaming diagram data processing system and method towards real-time data analysis
CN105630970A (en) * 2015-12-24 2016-06-01 哈尔滨工业大学 Social media data processing system and method
CN106230809A (en) * 2016-07-27 2016-12-14 南京快页数码科技有限公司 Mobile Internet public opinion monitoring method and system based on URL
CN106339463A (en) * 2016-08-26 2017-01-18 中国传媒大学 Network public opinion early-warning system based on logistic model and early-warning method thereof
CN106598944A (en) * 2016-11-25 2017-04-26 中国民航大学 Civil aviation security public opinion emotion analysis method
CN106598944B (en) * 2016-11-25 2019-03-19 中国民航大学 Civil aviation security public opinion sentiment analysis method
CN106777040A (en) * 2016-12-09 2017-05-31 厦门大学 Cross-media microblog public opinion analysis method based on emotion polarity perception algorithm
CN106778895A (en) * 2016-12-29 2017-05-31 西安工程大学 Kernel k-means method based on local density and single-pass
WO2018184518A1 (en) * 2017-04-07 2018-10-11 平安科技(深圳)有限公司 Microblog data processing method and device, computer device and storage medium

Also Published As

Publication number Publication date
CN104537097B (en) 2017-08-11

Legal Events

Date Code Title Description
C10 Entry into substantive examination
GR01