CN104199947A

CN104199947A - Method for monitoring the speech of key persons and mining their association relationships

Info

Publication number: CN104199947A
Application number: CN201410459905.5A
Authority: CN
Inventors: 范莹; 于治楼; 梁华勇
Original assignee: Inspur Group Co Ltd
Current assignee: Inspur Group Co Ltd
Priority date: 2014-09-11
Filing date: 2014-09-11
Publication date: 2014-12-10

Abstract

The invention discloses a method for monitoring the speech of key persons and mining their association relationships. The method includes the following steps: (1) building a Hadoop big data platform; (2) collecting and parsing microblog data; (3) cleaning the data and matching persons; (4) analyzing speech tendencies and association relationships; (5) presenting the data visually. Compared with the prior art, the method is reasonably designed and convenient to use. Built on a big data platform, the system applies distributed storage and processing technology to collect the registration information and browsing information of netizens on the microblog, analyzes the speech tendencies and association relationships of the given persons of interest through information matching and association relationship mining, presents the mined data visually, and keeps tracking as the microblog is updated.

Description

A method for monitoring the speech of key persons and mining their association relationships

Technical field

The present invention relates to the technical field of public opinion monitoring and association relationships, and specifically to a method, based on cloud computing big data, for monitoring the speech of key persons and mining their association relationships.

Background art

Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. Hadoop is a tool for classifying content on the Internet by search keyword.

The NameNode is software that usually runs on a separate machine in an HDFS instance. It manages the file system namespace and controls access by external clients. The NameNode decides whether to map files to replicated blocks on the DataNodes.

A DataNode is likewise software that usually runs on a separate machine in an HDFS instance. DataNodes are typically organized in racks, and a rack connects all of its systems through a switch.

ZooKeeper is an official sub-project of Hadoop. It is a reliable coordination system for large-scale distributed systems, and the functions it provides include configuration maintenance, naming service, distributed synchronization, and group services.

HBase is a distributed, column-oriented, open-source database. Unlike a general relational database, HBase is suited to storing unstructured data. Another difference is that HBase uses a column-based rather than a row-based schema.

A microblog is a platform for sharing, spreading, and obtaining information based on user relationships, with an emphasis on timeliness and spontaneity; microblog posts readily express a person's thoughts and latest activities at any moment. In recent years the number of microblogs and the volume of posted information have exploded. The microblog has become a channel through which domestic netizens can speak independently and relatively freely, an open platform for rich and poor alike, and its data volume has reached the big data scale. By monitoring microblog content, the thinking, speech tendencies, and association relationships of persons of interest can be tracked more truthfully and in real time. Meanwhile, the maturing distributed storage, computation, NoSQL databases, data query and processing tools, and data mining algorithms provided by the Hadoop ecosystem supply a technical platform for mining microblog big data. At present, there is no sound method, based on cloud computing big data, for monitoring the speech of key persons and mining their association relationships.

Summary of the invention

The technical task of the present invention is to provide a method for monitoring the speech of key persons and mining their association relationships.

The technical task of the present invention is achieved in the following manner. The steps of the method are as follows:

1) Building the Hadoop big data platform: a Hadoop cluster consisting of 11 nodes is set up;

2) Microblog data collection and parsing: the web crawler is a customized build of nutch that implements topic-focused crawling; taking information related to the given persons of interest as the topic, it crawls microblog data on the Internet, performs word segmentation and parsing according to a user-defined dictionary, and stores the predefined feature attribute values in a database to form structured data;

3) Data cleaning and person matching: the structured data is preprocessed, the Euclidean distance is used to compute similarity against the provided feature vectors of the persons of interest, and the information of netizens whose similarity exceeds a threshold is selected as the object of analysis;

4) Speech tendency and association relationship analysis: according to the user-defined dictionary, techniques such as semantic analysis and word frequency statistics are used to analyze the speech tendencies of the persons of interest; according to the interaction information between persons collected from the microblog, an association relationship algorithm is used to mine the relationship network of the persons of interest, and tracking continues as the microblog is updated;

5) Data visualization: the speech tendencies and association relationships of the persons of interest are presented visually.
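As an illustration of step 4), the following minimal Python sketch scores speech tendency by word frequency against a weighted lexicon and builds a relationship network by counting interactions. The text does not specify the semantic analysis or the association relationship algorithm, so the lexicon weights, the interaction-counting rule, and the names `tendency_score` and `relation_network` are assumptions made for illustration only.

```python
from collections import Counter


def tendency_score(tokens, lexicon):
    """Average lexicon weight per token; words outside the lexicon count as 0.

    `lexicon` maps a word to a signed weight (an assumed stand-in for the
    user-defined dictionary mentioned in the text).
    """
    if not tokens:
        return 0.0
    freq = Counter(tokens)  # word frequency statistics
    total = sum(lexicon.get(word, 0.0) * n for word, n in freq.items())
    return total / len(tokens)


def relation_network(interactions, min_count=1):
    """Count undirected interactions between users and keep frequent edges.

    `interactions` is an iterable of (user_a, user_b) pairs, for example
    reposts, comments, or mentions. Simple pair counting is an assumed
    substitute for the unspecified association relationship algorithm.
    """
    counts = Counter(
        tuple(sorted(pair)) for pair in interactions if pair[0] != pair[1]
    )
    return {edge: n for edge, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    lexicon = {"good": 1.0, "bad": -1.0}
    print(tendency_score(["good", "bad", "good", "the"], lexicon))
    print(relation_network([("a", "b"), ("b", "a"), ("a", "c")], min_count=2))
```

Re-running both functions on newly crawled posts would correspond to the continued tracking described in step 4).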

In step 1), the 11 nodes comprise 1 NameNode node, 1 SecondaryNameNode node, 1 zookeeper node, and 8 DataNode/Tasktracker nodes.
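The node layout above can be written down as a small inventory and checked against the stated total of 11. This is only a sketch: the role counts come from the text, while the hostname scheme (`node01` ... `node11`) and the helper `build_inventory` are hypothetical.

```python
# Role counts are taken from the text; the hostnames are hypothetical.
CLUSTER_ROLES = {
    "NameNode": 1,
    "SecondaryNameNode": 1,
    "zookeeper": 1,
    "DataNode/Tasktracker": 8,
}


def build_inventory(roles, prefix="node"):
    """Assign sequential placeholder hostnames to every node of every role."""
    inventory = []
    index = 1
    for role, count in roles.items():
        for _ in range(count):
            inventory.append((f"{prefix}{index:02d}", role))
            index += 1
    return inventory


if __name__ == "__main__":
    for host, role in build_inventory(CLUSTER_ROLES):
        print(host, role)
```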

The database in step 2) is hbase.

In step 3), the data preprocessing comprises formulating missing-value filling rules and difference calculation rules.
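The text does not state which filling or difference rules are used. As one possible reading, the sketch below fills a missing numeric value with the mean of its column and defines a per-attribute difference (absolute difference for numbers, 0 or 1 for other values). Both rules and the names `fill_missing` and `attribute_difference` are assumptions, not the rules of the invention.

```python
from statistics import mean


def fill_missing(records, default=0.0):
    """Replace None in each numeric column with that column's mean.

    If a column has no observed value at all, `default` is used instead.
    """
    if not records:
        return []
    n_cols = len(records[0])
    filled_cols = []
    for c in range(n_cols):
        present = [row[c] for row in records if row[c] is not None]
        fill = mean(present) if present else default
        filled_cols.append([fill if row[c] is None else row[c] for row in records])
    # Transpose the filled columns back into rows.
    return [list(row) for row in zip(*filled_cols)]


def attribute_difference(a, b):
    """Assumed difference rule: |a - b| for numbers, 0.0/1.0 for other values."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0.0 if a == b else 1.0


if __name__ == "__main__":
    print(fill_missing([[1.0, None], [3.0, 4.0], [None, 8.0]]))
    print(attribute_difference(3, 5), attribute_difference("Beijing", "Jinan"))
```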

Compared with the prior art, the method of the present invention for monitoring the speech of key persons and mining their association relationships is reasonably designed and easy to use. Built on a big data platform, the system applies distributed storage and processing technology to collect the registration information and browsing information of netizens on the microblog, analyzes the speech tendencies and association relationships of the given persons of interest through information matching and association relationship mining, presents the mined data visually, and keeps tracking as the microblog is updated.

Description of the drawings

Figure 1 is a schematic flow chart of the method for monitoring the speech of key persons and mining their association relationships.

Detailed description

Embodiment 1:

The steps of this method for monitoring the speech of key persons and mining their association relationships are as follows:

3) Data cleaning and person matching: the structured data is preprocessed by formulating missing-value filling rules and difference calculation rules; the Euclidean distance is used to compute similarity against the provided feature vectors of the persons of interest, and the information of netizens whose similarity exceeds a threshold is selected as the object of analysis;
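The matching in step 3) can be sketched as follows. The text names the Euclidean distance and a similarity threshold but not how distance is turned into similarity, so the mapping `1 / (1 + d)`, the threshold value, and the function names are assumptions chosen for illustration.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    """Assumed mapping from distance to similarity in (0, 1]; 1.0 means identical."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


def match_netizens(netizens, target, threshold):
    """Return the names of netizens whose similarity to `target` exceeds `threshold`.

    `netizens` maps a netizen name to an already preprocessed feature vector.
    """
    return [
        name for name, vector in netizens.items()
        if similarity(vector, target) > threshold
    ]


if __name__ == "__main__":
    netizens = {"a": [1.0, 2.0], "b": [4.0, 6.0]}
    print(match_netizens(netizens, target=[1.0, 2.0], threshold=0.5))
```

Because the Euclidean distance is sensitive to scale, feature values would normally be brought to comparable ranges during preprocessing before this comparison is made.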

Embodiment 2:

1) Building the Hadoop big data platform: a Hadoop cluster consisting of 11 nodes is set up, comprising 1 NameNode node, 1 SecondaryNameNode node, 1 zookeeper node, and 8 DataNode/Tasktracker nodes.

2) Microblog data collection and parsing: the web crawler is a customized build of nutch that implements topic-focused crawling; taking information related to the given persons of interest as the topic, it crawls microblog data on the Internet, performs word segmentation and parsing according to a user-defined dictionary, and stores the predefined feature attribute values in the hbase database to form structured data;
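The dictionary-based segmentation in step 2) is not specified further. One common approach, shown here purely as an assumed example, is forward maximum matching against the user-defined dictionary, followed by counting dictionary terms to obtain feature attribute values. The functions `segment` and `extract_features` are hypothetical, and neither crawling with nutch nor writing to hbase is shown.

```python
from collections import Counter


def segment(text, dictionary):
    """Forward maximum matching: at each position take the longest dictionary word.

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    max_len = max(1, max((len(word) for word in dictionary), default=1))
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def extract_features(text, dictionary):
    """Count how often each dictionary term occurs in the segmented text."""
    counts = Counter(tok for tok in segment(text, dictionary) if tok in dictionary)
    return dict(counts)


if __name__ == "__main__":
    words = {"今天", "济南", "下雨"}
    print(segment("今天济南下雨", words))
    print(extract_features("今天济南下雨济南", words))
```

The resulting term counts are one possible form of the structured record that step 2) stores per post.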

With the embodiments above, a person skilled in the art can readily implement the present invention. It should be understood, however, that the present invention is not limited to the embodiments described above. On the basis of the disclosed embodiments, a person skilled in the art may combine different technical features in any way to arrive at different technical solutions.

Claims

1. A method for monitoring the speech of key persons and mining their association relationships, characterized in that the steps of the method are as follows:

2. The method for monitoring the speech of key persons and mining their association relationships according to claim 1, characterized in that, in step 1), the 11 nodes comprise 1 NameNode node, 1 SecondaryNameNode node, 1 zookeeper node, and 8 DataNode/Tasktracker nodes.

3. The method for monitoring the speech of key persons and mining their association relationships according to claim 1, characterized in that the database in step 2) is hbase.

4. The method for monitoring the speech of key persons and mining their association relationships according to claim 1, characterized in that, in step 3), the data preprocessing comprises formulating missing-value filling rules and difference calculation rules.