CN104199947A - Important person speech supervision and incidence relation excavating method - Google Patents

Important person speech supervision and incidence relation excavating method

Info

Publication number
CN104199947A
Authority
CN
China
Prior art keywords
personnel
incidence relation
speech
data
supervision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410459905.5A
Other languages
Chinese (zh)
Inventor
范莹
于治楼
梁华勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Group Co Ltd
Original Assignee
Inspur Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-09-11
Filing date
2014-09-11
Publication date
2014-12-10
Application filed by Inspur Group Co Ltd
Priority to CN201410459905.5A
Publication of CN104199947A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a method for monitoring the speech of key persons and mining their association relationships. The method comprises the following steps: (1) building a Hadoop big data platform; (2) collecting and parsing microblog data; (3) cleaning the data and matching persons; (4) analyzing speech tendencies and association relationships; (5) presenting the data visually. Compared with the prior art, the method is reasonably designed and easy to use. Built on a big data platform, the system applies distributed storage and processing technology to collect netizens' registration information and browsing information on microblogs. Through information matching and association relationship mining, it analyzes the speech tendencies and association relationships of the given persons of interest, presents the mined data visually, and keeps tracking them as microblogs are refreshed.

Description

A method for monitoring the speech of key persons and mining their association relationships
Technical field
The present invention relates to the technical field of public opinion monitoring and association relationships, and specifically to a method, based on cloud computing and big data, for monitoring the speech of key persons and mining their association relationships.
Background art
Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. Hadoop has been used as a tool for classifying Internet content by search keyword.
NameNode is software that usually runs on a separate machine in an HDFS instance. It manages the file system namespace and controls access by external clients. The NameNode decides whether and how files are mapped to replicated blocks on the DataNodes.
DataNode is also software that usually runs on a separate machine in an HDFS instance. DataNodes are typically organized in racks, and each rack connects all of its systems through a switch.
ZooKeeper, which originated as a subproject of Hadoop, is a reliable coordination system for large-scale distributed systems. The functions it provides include configuration maintenance, naming services, distributed synchronization, and group services.
HBase is a distributed, column-oriented, open-source database. Unlike a typical relational database, HBase is suited to storing unstructured data. Another difference is that HBase uses a column-based rather than a row-based model.
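The column-based model mentioned above can be pictured with a small in-memory sketch in Python. This is only an illustration of the idea of addressing a value by row key, column family, and column qualifier. It is not the HBase API, and the class name `ColumnFamilyTable` and its methods are hypothetical.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy in-memory table addressed by (row key, column family, qualifier).

    Hypothetical illustration only; it does not use or mimic the HBase client API.
    """

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        # Each row stores only the cells that were actually written (sparse).
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][(family, qualifier)] = value

    def get(self, row_key, family, qualifier, default=None):
        # dict.get on the outer mapping avoids creating empty rows on reads.
        return self.rows.get(row_key, {}).get((family, qualifier), default)
```

Because a row holds only the cells that were written, records with different attributes can share one table, which is one reason such a model suits loosely structured data.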
A microblog is a platform for sharing, disseminating, and obtaining information based on user relationships, with an emphasis on timeliness and spontaneity; microblogs are well suited to expressing a user's thoughts and latest activity from moment to moment. In recent years the number of microblog users and the volume of posts have grown explosively. Microblogs have become an independent and relatively free channel of expression for netizens in China, an open platform regardless of wealth or status, and the data volume has reached big-data scale. By monitoring microblog content, the thinking, speech tendencies, and association relationships of persons of interest can be tracked more truthfully and in real time. Meanwhile, the maturing of the distributed storage, computation, NoSQL databases, data query and processing tools, and data mining algorithms provided by the Hadoop ecosystem offers a technical platform for mining microblog big data. At present, there is no sound method based on cloud computing and big data for monitoring the speech of key persons and mining their association relationships.
Summary of the invention
The technical task of the present invention is to provide a method for monitoring the speech of key persons and mining their association relationships.
The technical task of the present invention is achieved in the following manner. The steps of the method are as follows:
1) Build the Hadoop big data platform: set up a Hadoop cluster consisting of 11 nodes;
2) Microblog data collection and parsing: the web crawler uses a customized (secondarily developed) version of Nutch to implement topic-focused crawling; taking information related to the given persons of interest as the topic, it crawls microblog data from the Internet, performs word segmentation and parsing according to a custom dictionary, and stores the predefined feature attribute values in a database to form structured data;
3) Data cleaning and person matching: preprocess the structured data, use Euclidean distance to compute similarity against the provided feature vectors of the persons of interest, and select the netizen information whose similarity exceeds a threshold as the objects of analysis;
4) Speech tendency and association relationship analysis: according to the custom dictionary, use techniques such as semantic analysis and word frequency statistics to analyze the speech tendencies of the persons of interest; based on the interaction information collected from microblogs, use an association relationship algorithm to mine the relationship networks of the persons of interest, and keep tracking according to microblog updates;
5) Data visualization: present the speech tendencies and association relationships of the persons of interest visually.
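The similarity computation in step 3) can be illustrated with a minimal Python sketch. It assumes each record has already been reduced to a numeric feature vector. The function names, the 1 / (1 + d) mapping from distance to similarity, and the sample threshold are illustrative assumptions; the source only states that Euclidean distance is used and that records above a threshold are kept.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    # Assumption: map a distance d to a similarity in (0, 1] via 1 / (1 + d),
    # so identical vectors score 1.0 and distant vectors approach 0.
    return 1.0 / (1.0 + euclidean_distance(a, b))


def select_candidates(reference, records, threshold):
    """Return the ids of records whose similarity to the reference exceeds the threshold."""
    return [
        record_id
        for record_id, vector in records.items()
        if similarity(reference, vector) > threshold
    ]


# Hypothetical sample data: record id -> feature vector.
sample_records = {"a": [0.0, 0.0], "b": [3.0, 4.0]}
selected = select_candidates([0.0, 0.0], sample_records, threshold=0.5)
```

Other mappings from distance to similarity are possible; the choice mainly affects how the threshold is tuned.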
In step 1), the 11 nodes comprise 1 NameNode node, 1 SecondaryNameNode node, 1 ZooKeeper node, and 8 DataNode/TaskTracker nodes.
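The node layout of step 1) can be written down as a small Python structure and checked for consistency. The role counts come from the text; the hostnames `node01` to `node11` and the helper names are hypothetical.

```python
# Role counts as stated for step 1).
CLUSTER_ROLES = {
    "NameNode": 1,
    "SecondaryNameNode": 1,
    "ZooKeeper": 1,
    "DataNode/TaskTracker": 8,
}


def total_nodes(roles):
    """Total number of machines implied by the role counts."""
    return sum(roles.values())


def assign_hostnames(roles, prefix="node"):
    """Assign hypothetical hostnames (node01, node02, ...) to roles in order."""
    assignment = {}
    index = 1
    for role, count in roles.items():
        for _ in range(count):
            assignment[f"{prefix}{index:02d}"] = role
            index += 1
    return assignment


layout = assign_hostnames(CLUSTER_ROLES)
```

A check such as `total_nodes(CLUSTER_ROLES) == 11` is a cheap way to catch a planning mistake before any machine is provisioned.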
The database in step 2) is HBase.
In step 3), the data preprocessing includes formulating missing-value filling rules and difference calculation rules.
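The source does not say which filling rules are used. As one possible sketch in Python, the example below fills a missing numeric field with the mean of the observed values and a missing text field with a default label. The field names, the mean-based rule, and the `"unknown"` default are all assumptions made for illustration.

```python
def fill_missing(records, numeric_fields, default_text="unknown"):
    """Return copies of the records with None values filled.

    Assumed rules: numeric fields take the mean of the observed values
    (0.0 if none were observed); every other field takes default_text.
    Only values that are present and equal to None are treated as missing.
    """
    means = {}
    for field in numeric_fields:
        observed = [r[field] for r in records if r.get(field) is not None]
        means[field] = sum(observed) / len(observed) if observed else 0.0

    filled = []
    for record in records:
        row = dict(record)  # copy so the input is left untouched
        for field, value in row.items():
            if value is None:
                row[field] = means[field] if field in numeric_fields else default_text
        filled.append(row)
    return filled


# Hypothetical sample records.
raw = [
    {"followers": 10, "city": "Jinan"},
    {"followers": None, "city": None},
    {"followers": 30, "city": "Beijing"},
]
cleaned = fill_missing(raw, numeric_fields={"followers"})
```

Filling before the distance computation matters because a single missing coordinate would otherwise make a feature vector unusable.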
Compared with the prior art, the method of the present invention for monitoring the speech of key persons and mining their association relationships is reasonably designed and easy to use. Built on a big data platform, the system applies distributed storage and processing technology to collect netizens' registration information and browsing information on microblogs. Through information matching and association relationship mining, it analyzes the speech tendencies and association relationships of the given persons of interest, presents the mined data visually, and keeps tracking them as microblogs are refreshed.
Brief description of the drawings
Figure 1 is a schematic flowchart of the method for monitoring the speech of key persons and mining their association relationships.
Detailed description of the embodiments
Embodiment 1:
The steps of this method for monitoring the speech of key persons and mining their association relationships are as follows:
1) Build the Hadoop big data platform: set up a Hadoop cluster consisting of 11 nodes;
2) Microblog data collection and parsing: the web crawler uses a customized (secondarily developed) version of Nutch to implement topic-focused crawling; taking information related to the given persons of interest as the topic, it crawls microblog data from the Internet, performs word segmentation and parsing according to a custom dictionary, and stores the predefined feature attribute values in a database to form structured data;
3) Data cleaning and person matching: preprocess the structured data by formulating missing-value filling rules and difference calculation rules, use Euclidean distance to compute similarity against the provided feature vectors of the persons of interest, and select the netizen information whose similarity exceeds a threshold as the objects of analysis;
4) Speech tendency and association relationship analysis: according to the custom dictionary, use techniques such as semantic analysis and word frequency statistics to analyze the speech tendencies of the persons of interest; based on the interaction information collected from microblogs, use an association relationship algorithm to mine the relationship networks of the persons of interest, and keep tracking according to microblog updates;
5) Data visualization: present the speech tendencies and association relationships of the persons of interest visually.
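The word frequency part of step 4) can be sketched in Python as follows. The sketch assumes the text has already been segmented into tokens and that the custom dictionary maps each term to a numeric weight; the dictionary contents, the weighting scheme, and the function names are hypothetical, and the semantic analysis mentioned in the step is not covered.

```python
from collections import Counter


def term_frequencies(tokens, dictionary):
    """Count only the tokens that appear in the custom dictionary."""
    return Counter(token for token in tokens if token in dictionary)


def tendency_score(tokens, dictionary):
    """Weighted average of dictionary weights over the matched tokens.

    Assumption: each dictionary entry carries a weight (for example +1 or -1).
    Returns 0.0 when no token matches the dictionary.
    """
    counts = term_frequencies(tokens, dictionary)
    matched = sum(counts.values())
    if matched == 0:
        return 0.0
    return sum(dictionary[term] * n for term, n in counts.items()) / matched


# Hypothetical dictionary and whitespace-tokenized sample text.
sample_dictionary = {"good": 1, "bad": -1}
sample_tokens = "good good bad weather".split()
score = tendency_score(sample_tokens, sample_dictionary)
```

Chinese text would need a word segmentation step before this function is applied, which is the role the custom dictionary plays in step 2).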
Embodiment 2:
The steps of this method for monitoring the speech of key persons and mining their association relationships are as follows:
1) Build the Hadoop big data platform: set up a Hadoop cluster consisting of 11 nodes, comprising 1 NameNode node, 1 SecondaryNameNode node, 1 ZooKeeper node, and 8 DataNode/TaskTracker nodes;
2) Microblog data collection and parsing: the web crawler uses a customized (secondarily developed) version of Nutch to implement topic-focused crawling; taking information related to the given persons of interest as the topic, it crawls microblog data from the Internet, performs word segmentation and parsing according to a custom dictionary, and stores the predefined feature attribute values in an HBase database to form structured data;
3) Data cleaning and person matching: preprocess the structured data by formulating missing-value filling rules and difference calculation rules, use Euclidean distance to compute similarity against the provided feature vectors of the persons of interest, and select the netizen information whose similarity exceeds a threshold as the objects of analysis;
4) Speech tendency and association relationship analysis: according to the custom dictionary, use techniques such as semantic analysis and word frequency statistics to analyze the speech tendencies of the persons of interest; based on the interaction information collected from microblogs, use an association relationship algorithm to mine the relationship networks of the persons of interest, and keep tracking according to microblog updates;
5) Data visualization: present the speech tendencies and association relationships of the persons of interest visually.
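The source does not name the association relationship algorithm used in step 4). As a stand-in, the Python sketch below simply counts interactions between pairs of accounts and treats the counts as edge weights of an undirected graph. The account ids, the counting rule, and the function names are assumptions for illustration only.

```python
from collections import defaultdict


def build_relation_graph(interactions):
    """Aggregate (account, account) interaction pairs into weighted undirected edges."""
    weights = defaultdict(int)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions
        edge = tuple(sorted((a, b)))  # same key for (a, b) and (b, a)
        weights[edge] += 1
    return dict(weights)


def strongest_contacts(graph, account, top_n=3):
    """Return up to top_n (neighbor, weight) pairs, heaviest edges first."""
    neighbors = []
    for (a, b), weight in graph.items():
        if account == a:
            neighbors.append((b, weight))
        elif account == b:
            neighbors.append((a, weight))
    # Sort by descending weight, then by name for a stable order.
    neighbors.sort(key=lambda item: (-item[1], item[0]))
    return neighbors[:top_n]


# Hypothetical interaction log: each pair is one reply, repost, or mention.
sample_interactions = [("u1", "u2"), ("u2", "u1"), ("u1", "u3"), ("u1", "u1")]
graph = build_relation_graph(sample_interactions)
contacts = strongest_contacts(graph, "u1")
```

Rebuilding the graph whenever new interaction pairs arrive is one simple way to realize the continued tracking that the step describes.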
From the above embodiments, those skilled in the art can readily implement the present invention. It should be understood, however, that the present invention is not limited to the embodiments described above. On the basis of the disclosed embodiments, those skilled in the art may combine different technical features in any way to achieve different technical solutions.

Claims (4)

1. A method for monitoring the speech of key persons and mining their association relationships, characterized in that the steps of the method are as follows:
1) Build the Hadoop big data platform: set up a Hadoop cluster consisting of 11 nodes;
2) Microblog data collection and parsing: the web crawler uses a customized (secondarily developed) version of Nutch to implement topic-focused crawling; taking information related to the given persons of interest as the topic, it crawls microblog data from the Internet, performs word segmentation and parsing according to a custom dictionary, and stores the predefined feature attribute values in a database to form structured data;
3) Data cleaning and person matching: preprocess the structured data, use Euclidean distance to compute similarity against the provided feature vectors of the persons of interest, and select the netizen information whose similarity exceeds a threshold as the objects of analysis;
4) Speech tendency and association relationship analysis: according to the custom dictionary, use techniques such as semantic analysis and word frequency statistics to analyze the speech tendencies of the persons of interest; based on the interaction information collected from microblogs, use an association relationship algorithm to mine the relationship networks of the persons of interest, and keep tracking according to microblog updates;
5) Data visualization: present the speech tendencies and association relationships of the persons of interest visually.
2. The method for monitoring the speech of key persons and mining their association relationships according to claim 1, characterized in that, in step 1), the 11 nodes comprise 1 NameNode node, 1 SecondaryNameNode node, 1 ZooKeeper node, and 8 DataNode/TaskTracker nodes.
3. The method for monitoring the speech of key persons and mining their association relationships according to claim 1, characterized in that the database in step 2) is HBase.
4. The method for monitoring the speech of key persons and mining their association relationships according to claim 1, characterized in that, in step 3), the data preprocessing includes formulating missing-value filling rules and difference calculation rules.
CN201410459905.5A 2014-09-11 2014-09-11 Important person speech supervision and incidence relation excavating method Pending CN104199947A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410459905.5A CN104199947A (en) 2014-09-11 2014-09-11 Important person speech supervision and incidence relation excavating method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410459905.5A CN104199947A (en) 2014-09-11 2014-09-11 Important person speech supervision and incidence relation excavating method

Publications (1)

Publication Number Publication Date
CN104199947A true CN104199947A (en) 2014-12-10

Family

ID=52085240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410459905.5A Pending CN104199947A (en) 2014-09-11 2014-09-11 Important person speech supervision and incidence relation excavating method

Country Status (1)

Country Link
CN (1) CN104199947A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598631A (en) * 2015-02-05 2015-05-06 北京航空航天大学 Distributed data processing platform
CN104915438A (en) * 2015-06-25 2015-09-16 西安交通大学 Method for acquiring PCU association data in specific topic microblogs
CN105718590A (en) * 2016-01-27 2016-06-29 福州大学 Multi-tenant oriented SaaS public opinion monitoring system and method
CN110555149A (en) * 2019-09-05 2019-12-10 深圳前海微众银行股份有限公司 Method, device and equipment for processing speech data and readable storage medium
CN113609403A (en) * 2021-06-21 2021-11-05 河南工学院 Internet public opinion information acquisition method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544196A (en) * 2012-07-16 2014-01-29 闫忠华 BigBase high-throughput big data online analysis software and hardware all-in-one machine
CN103617169A (en) * 2013-10-23 2014-03-05 杭州电子科技大学 Microblog hot topic extracting method based on Hadoop
CN103729420A (en) * 2013-12-20 2014-04-16 潘大庆 Microblog hotspot tracking system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
唐继禹: "《云环境下基于个性化模型的探索式搜索技术研究与实现》", 《中国优秀硕士学位论文全文数据库(CNKI)》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598631A (en) * 2015-02-05 2015-05-06 北京航空航天大学 Distributed data processing platform
CN104598631B (en) * 2015-02-05 2017-11-14 北京航空航天大学 Distributed data processing platform
CN104915438A (en) * 2015-06-25 2015-09-16 西安交通大学 Method for acquiring PCU association data in specific topic microblogs
CN104915438B (en) * 2015-06-25 2019-02-05 西安交通大学 A method of obtaining PCU associated data in specific topics microblogging
CN105718590A (en) * 2016-01-27 2016-06-29 福州大学 Multi-tenant oriented SaaS public opinion monitoring system and method
CN110555149A (en) * 2019-09-05 2019-12-10 深圳前海微众银行股份有限公司 Method, device and equipment for processing speech data and readable storage medium
CN113609403A (en) * 2021-06-21 2021-11-05 河南工学院 Internet public opinion information acquisition method
CN113609403B (en) * 2021-06-21 2024-03-26 河南工学院 Internet public opinion information acquisition method

Similar Documents

Publication Publication Date Title
EP3819792A2 (en) Method, apparatus, device, and storage medium for intention recommendation
Abrol et al. Tweethood: Agglomerative clustering on fuzzy k-closest friends with variable depth for location mining
TWI501097B (en) System and method of analyzing text stream message
Gad et al. ThemeDelta: Dynamic segmentations over temporal topic models
CN104281607A (en) Microblog hot topic analyzing method
CN104199947A (en) Important person speech supervision and incidence relation excavating method
Lee Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams
CN103699611B (en) Microblog flow information extracting method based on dynamic digest technology
Psomakelis et al. Big IoT and social networking data for smart cities: Algorithmic improvements on Big Data Analysis in the context of RADICAL city applications
CN110533212A (en) Urban waterlogging public sentiment monitoring and pre-alarming method based on big data
CN108108459A (en) Multi-source fusion and the associated dynamic data cleaning method of loop and electronic equipment
CN104408083A (en) Socialized media analyzing system
CN105678590A (en) topN recommendation method for social network based on cloud model
Chen et al. D-map+ interactive visual analysis and exploration of ego-centric and event-centric information diffusion patterns in social media
Demirbaga HTwitt: a hadoop-based platform for analysis and visualization of streaming Twitter data
Rani et al. A survey of tools for social network analysis
Junaidi et al. Analysis of Community Response to Disasters through Twitter Social Media
CN107239509A (en) Towards single Topics Crawling method and system of short text
Aslam et al. Opinion mining using live Twitter data
CN104035969A (en) Method and system for building feature word banks in social network
Zhang et al. Rumor detection with hierarchical representation on bipartite ad hoc event trees
Leung et al. Knowledge discovery from big social key-value data
CN108830735B (en) Online interpersonal relationship analysis method and system
US10511556B2 (en) Bursty detection for message streams
Kim et al. Construction of disaster knowledge graphs to enhance disaster resilience

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141210
