CN110019090A

CN110019090A - Social network big data acquisition system based on the crowdsourcing concept

Info

Publication number: CN110019090A
Application number: CN201711239174.3A
Authority: CN
Inventors: 祁建明; 周峻松; 徐继峰; 陈墩金
Original assignee: Guangzhou Ming - Collar Gene Technology Co Ltd
Current assignee: Guangzhou Ming - Collar Gene Technology Co Ltd
Priority date: 2017-12-01
Filing date: 2017-12-01
Publication date: 2019-07-16

Abstract

The invention discloses a social network big data acquisition system based on the crowdsourcing concept. The system comprises a server module, a client module, a storage subsystem module, and a crawler subsystem module. The server module is the control core of the system and handles the operations performed on a task before it is issued, as well as result verification. The client module is deployed on distributed machine nodes, communicates with the server over sockets, receives commands from the server module, and invokes the focused crawler program. The storage subsystem module uses HDFS; the actual data acquisition is performed by the focused crawler program, which uses an HttpClient object to simulate browser operations. The crawler subsystem module is a focused crawler deployed on the distributed machine nodes and implemented by simulating browser operations with HttpClient. The scheme introduces the crowdsourcing concept and stores result data in the Hadoop Distributed File System, improving data acquisition speed and information retrieval efficiency.

Description

Social network big data acquisition system based on the crowdsourcing concept

Technical field

The invention belongs to the field of big data acquisition technology and relates to a social network big data acquisition system based on the crowdsourcing concept.

Background art

The rise of the Internet has broken the traditional mode of social interaction. Simple, fast, distance-free social experiences have driven the rapid development of social networks, and social network information is growing explosively. This information reflects users' online behavior, and studying it enables applications such as public opinion monitoring, online marketing, and stock market forecasting.

The value of social network information lies largely in its timeliness, so obtaining target information quickly, accurately, and efficiently is very important. However, social networks belong to the Deep Web: they carry large volumes of strongly topical information, and traditional search engines cannot index these Deep Web pages. The information can be accessed only through the query interfaces a site provides or by logging in to the site, which makes social network information harder to obtain.

At present, most acquisition methods used in China obtain platform access rights through social platform accounts, log in by simulation, and collect target information in a directed way from a configured initial task set. However, to protect data and reduce server load, social platforms usually adopt anti-crawler measures such as closing accounts and blocking login IPs. As account registration becomes increasingly strict, obtaining social accounts and keeping them valid has also become one of the bottlenecks limiting large-scale acquisition of social network data.

Summary of the invention

The object of the present invention is to provide a social network big data acquisition system based on the crowdsourcing concept. Because traditional search engines cannot directly index social network platform information with keyword search, the invention introduces crowdsourcing: data acquisition tasks are issued to different machine nodes and executed by focused crawlers, achieving directed acquisition of web page information. This addresses the problems of obtaining computing resources and of obtaining and maintaining a large number of social accounts, and it improves data acquisition speed. In addition, result data are stored in the Hadoop Distributed File System, which stores the data effectively while improving information retrieval efficiency.

To solve the above technical problems, the present invention adopts the following technical scheme: a social network big data acquisition system based on the crowdsourcing concept, comprising a server module, a client module, a storage subsystem module, and a crawler subsystem module. The server module is the control core of the system and handles the operations performed on a task before it is issued, as well as result verification. The client module is deployed on distributed machine nodes, communicates with the server over sockets, receives commands from the server module, and invokes the focused crawler program. The storage subsystem module uses HDFS; the actual data acquisition is performed by the focused crawler program, which uses an HttpClient object to simulate browser operations. The crawler subsystem module is a focused crawler deployed on the distributed machine nodes and implemented by simulating browser operations with HttpClient.

Further, the system has exactly one server module, which comprises a control submodule, a task scheduling submodule, a receiving submodule, a verification submodule, and a MySQL database.

Further, the control submodule manages communication between the server module and the client modules and receives task requirements submitted by third parties.

Further, the MySQL database stores all task-related data and generates tasks on command from the task scheduling submodule.

Further, the task scheduling submodule receives task requests from the crawler program and returns a task file; the crawler program then performs the actual data acquisition.

Further, the receiving submodule receives the data acquisition results.

Further, the verification submodule verifies the results and updates the task state in the MySQL database.

Further, the client module is implemented with the MFC framework. By installing the client, a machine node can choose to join a project in the system and contribute computing resources to that project using the node's own computing power.

Compared with the prior art, the present invention has the following beneficial effects:

Because traditional search engines cannot directly index social network platform information with keyword search, the scheme introduces the crowdsourcing concept: data acquisition tasks are issued to different machine nodes and executed by focused crawlers, achieving directed acquisition of web page information and improving data acquisition speed. In addition, result data are stored in the Hadoop Distributed File System, which stores the data effectively while improving information retrieval efficiency.

Brief description of the drawings

Fig. 1 is the overall architecture diagram of the social network big data acquisition system based on the crowdsourcing concept.

Fig. 2 is a diagram of the call relationships among the functional modules of the social network big data acquisition system based on the crowdsourcing concept at runtime.

Fig. 3 is the task state transition diagram of the social network big data acquisition system based on the crowdsourcing concept during execution.

Fig. 4 is the flowchart of the focused crawler subsystem module of the social network big data acquisition system based on the crowdsourcing concept.

Detailed description of the embodiments

The present invention is described below in further detail and in full with reference to the drawings and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.

Referring to Fig. 1, the social network big data acquisition system based on the crowdsourcing concept comprises a server module, a client module, a storage subsystem module, and a crawler subsystem module. The server module is the control core of the system and handles the operations performed on a task before it is issued, as well as result verification. The client module is deployed on distributed machine nodes, communicates with the server over sockets, receives commands from the server module, and invokes the focused crawler program. The storage subsystem module uses HDFS; the actual data acquisition is performed by the focused crawler program, which uses an HttpClient object to simulate browser operations. The crawler subsystem module is a focused crawler deployed on the distributed machine nodes and implemented by simulating browser operations with HttpClient.

Acquiring social network data at scale requires solving three problems: (1) obtaining permission to acquire data from the social platform; (2) acquiring data quickly; (3) storing data effectively.

The system is based on the crowdsourcing concept. It uses an independently developed C/S (client/server) software system as the basic framework for distributed acquisition and the Hadoop computing platform as the data processing framework of the acquisition model. The raw data obtained by the crawler application are unstructured and semi-structured web page information. Extracting from this information yields result data in a specified format, which are processed by the MapReduce computing component and then stored in an HBase database.

The system architecture comprises four modules: server, client, storage subsystem, and crawler subsystem. The server module is the control core of the system. It comprises five parts (a control submodule, a task scheduling submodule, a receiving submodule, a verification submodule, and a MySQL database) and handles the operations performed on a task before it is issued, as well as result verification. The system has only one server. The client module is implemented with the MFC framework and deployed on distributed machine nodes. By installing the client, a node can choose to join a project in the system; the client communicates with the server over sockets, receives server commands, invokes the crawler program, and contributes computing resources to the project using the node's own computing power. The storage subsystem module uses HDFS; the actual data acquisition is performed by the crawler program, which uses an HttpClient object to simulate browser operations. The crawler subsystem module implements the application-layer invocation of the focused crawler by simulating browser operations with HttpClient.

Referring to Fig. 2, after the client starts, it invokes the crawler program, which requests a task from the server and executes the data acquisition task. The control module of the server manages communication between the server and the clients and receives task requirements submitted by third parties. The MySQL database stores all task-related data and generates tasks on command from the task scheduling module. The task scheduling module receives task requests from the crawler program and returns a task file, and the crawler program performs the actual data acquisition. After the data acquisition result is returned to the receiving module of the server, the verification module verifies it and updates the task state in the MySQL database. If verification succeeds and the task was proposed by a third party, the data acquisition result is sent to that third party; otherwise it is transferred to HDFS, and MapReduce is used to complete storage of the result data. In particular:

The control module issues commands to the clients and receives their feedback. It also receives the task requirement parameters submitted by third parties, converts them into the corresponding task type according to a configuration file, and stores them in the MySQL database, where they await scheduling by the task scheduling module.

The core function of the task scheduling module is to ensure that client crawler programs obtain valid tasks in an orderly manner. The task scheduling module, the receiving module, and the verification module jointly maintain the taskinfo table in the MySQL database. The taskinfo table has 14 fields, including the index value taskhash, the task type type, the task state state, the task creation time createtime, the task issuance time requestindex, and the task timeout mactimelength. The taskinfo table holds the index list of all tasks but not the task content. Task content is stored in secondary task tables: each kind of small task corresponds to one secondary task table, and each task file consists of several small tasks. When the task scheduling module creates a task, it uses taskhash to identify the task uniquely and sets the task's priority according to createtime. It also uses taskhash as the foreign key of the secondary task table and adds the task content corresponding to that taskhash to the secondary task table. The task scheduling module decides which tasks to issue according to the value of state, which takes one of seven values: created, dealing, dealed, analysing, failed, success, and stop. The transitions between state values are shown in Fig. 3. On receiving a task request from a crawler program, the task scheduling module selects from the taskinfo table the taskhash of the highest-priority task whose state is created or failed, queries the secondary task table with an SQL statement, generates the final task file, and returns it to the crawler program in JSON format.
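The task selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it uses Python's built-in sqlite3 in place of MySQL, keeps only four of the 14 taskinfo fields, and assumes that an earlier createtime means a higher priority. The secondary table name weibo_subtask, its url column, and the function names are hypothetical.

```python
import json
import sqlite3


def create_schema(conn):
    """Create a reduced taskinfo table and one secondary task table."""
    conn.executescript("""
        CREATE TABLE taskinfo (
            taskhash   TEXT PRIMARY KEY,
            type       TEXT NOT NULL,
            state      TEXT NOT NULL,
            createtime INTEGER NOT NULL
        );
        -- Hypothetical secondary task table; taskhash is its foreign key.
        CREATE TABLE weibo_subtask (
            taskhash TEXT NOT NULL REFERENCES taskinfo(taskhash),
            url      TEXT NOT NULL
        );
    """)


def next_task_file(conn):
    """Return the next task file as a JSON string, or None if nothing is pending."""
    row = conn.execute(
        "SELECT taskhash, type FROM taskinfo "
        "WHERE state IN ('created', 'failed') "
        "ORDER BY createtime ASC LIMIT 1"  # assumption: oldest task first
    ).fetchone()
    if row is None:
        return None
    taskhash, task_type = row
    # Collect the small tasks that make up this task file.
    urls = [r[0] for r in conn.execute(
        "SELECT url FROM weibo_subtask WHERE taskhash = ? ORDER BY rowid",
        (taskhash,),
    )]
    # Mark the task as issued so that it is not handed out twice.
    conn.execute("UPDATE taskinfo SET state = 'dealing' WHERE taskhash = ?",
                 (taskhash,))
    conn.commit()
    return json.dumps({"taskhash": taskhash, "type": task_type, "subtasks": urls})
```

Keeping the index in taskinfo and the content in a secondary table means the scheduler can pick a task by scanning a small table and only then read the task content.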

For a task that needs to be executed only once, the task ends when state becomes success. For a task that must be executed repeatedly, such as periodically crawling the registration information of social network users to keep that information up to date, state returns to created if it has not been changed to stop within a certain time after becoming success, and the task again waits for a node to request it.
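The task life cycle can be summarized as a small state machine. The sketch below is inferred from the text of this description only (Fig. 3 is not reproduced here), so the exact set of transitions is an assumption, and the names ALLOWED and transition are hypothetical.

```python
# Allowed transitions between the seven task states, as inferred from the text.
ALLOWED = {
    "created":   {"dealing"},             # task issued to a crawler
    "failed":    {"dealing"},             # failed task re-issued
    "dealing":   {"dealed", "failed"},    # non-empty / empty upload received
    "dealed":    {"analysing"},           # handed to the verification module
    "analysing": {"success", "failed"},   # result format matches or not
    "success":   {"created", "stop"},     # recurring task re-queued, or stopped
    "stop":      set(),                   # terminal state
}


def transition(current: str, new: str) -> str:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```

A one-off task simply stays in success, while a recurring task moves from success back to created unless it has been set to stop.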

The receiving module is implemented in PHP (Hypertext Preprocessor). It calls the standard PHP function move_uploaded_file to receive the file uploaded via POST and stores it at a specified location on disk. If the compressed package is empty, state is changed to failed; otherwise it is changed to dealed. A non-empty package is passed to the verification module, and state is changed to analysing. The compressed package consists of files containing JSON strings. The verification module uses the C functions json_object_object_get, json_object_get_string, and json_object_array_length to extract and analyze the uploaded JSON data. If the data match the defined result format, verification succeeds and state is changed to success; otherwise state is changed to failed. When the task is issued again, state is changed to dealing. Files that pass verification are sent to HDFS for further processing, which completes data storage.
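The format check performed by the verification module might look like the following sketch. It is written in Python rather than the C used in the source, and the result format is an assumption: a non-empty JSON array of objects that each carry the keys listed in the hypothetical REQUIRED_FIELDS.

```python
import json

# Hypothetical result format: every record must carry these keys.
REQUIRED_FIELDS = ("id", "text")


def verify_result(payload: str, required=REQUIRED_FIELDS) -> str:
    """Return the task state implied by an uploaded JSON payload."""
    try:
        records = json.loads(payload)
    except json.JSONDecodeError:
        return "failed"              # not valid JSON at all
    if not isinstance(records, list) or not records:
        return "failed"              # assumption: an empty result is a failure
    for record in records:
        if not isinstance(record, dict):
            return "failed"
        if any(key not in record for key in required):
            return "failed"          # a required field is missing
    return "success"
```

The return value maps directly onto the state written back to the taskinfo table after verification.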

The focused crawler application is the core module that performs data acquisition. Given the characteristics of social network information, the crawler program has four functions: simulated login, task request, task execution, and data upload. It differs from a traditional crawler in adding simulated login, and it achieves directed acquisition of information by constructing the URL of the page that holds the target data. Sina Weibo is used below as an example to explain how the crawler works.

As shown in Fig. 4, the simulated login function reads a configuration file containing the account information and imitates the process of logging in to Sina Weibo through a web page, in order to obtain the valid authentication information needed to access Sina Weibo pages, namely the cookie information to be stored locally. The program sends the encrypted user name (username) and password (password) to the Sina server, which extracts the strings from the URL parameters passed to it and decrypts them to recover the original user name and password. Encrypting username and password is the key step of the simulated login. The username is Base64-encoded to obtain the encoded user name. The password is handled in a more complex way. First, the HttpClient object accesses the Sina server to obtain two parameters: the server time (servertime) and a randomly generated string (nonce). Next, an RSA public key (key) is created from the pubkey and rsakv values provided by the Sina server. Then servertime, nonce, and password are concatenated in that order into a new string, message; message is RSA-encrypted with key, and the ciphertext is converted to hexadecimal to give the encrypted password. The encrypted username and password are passed to the Sina server together as header information of the request URL. After the Sina server verifies them, it returns a login success message, and HttpClient saves the valid Cookies value. After a successful simulated login, the program can request a data acquisition task from the server; otherwise the current task is terminated.
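The credential encoding steps can be illustrated with the following sketch, which uses only the Python standard library. Several details are assumptions that the description does not state: the tab and newline separators used when joining servertime, nonce, and password, the public exponent 65537, and PKCS#1 v1.5 padding. The function names are hypothetical, and the real login protocol of any particular site may differ.

```python
import base64
import os


def encode_username(username: str) -> str:
    """Base64-encode the user name."""
    return base64.b64encode(username.encode("utf-8")).decode("ascii")


def build_message(servertime: int, nonce: str, password: str) -> bytes:
    """Join servertime, nonce and password in that order.

    The tab and newline separators are an assumption.
    """
    return f"{servertime}\t{nonce}\n{password}".encode("utf-8")


def rsa_encrypt_hex(message: bytes, modulus_hex: str, exponent: int = 65537) -> str:
    """RSA-encrypt message with a public key and return the result as hex.

    modulus_hex plays the role of the pubkey value; the default exponent and
    the PKCS#1 v1.5 padding scheme are assumptions.
    """
    n = int(modulus_hex, 16)
    k = (n.bit_length() + 7) // 8            # key length in bytes
    if len(message) > k - 11:
        raise ValueError("message too long for this key")
    # Padded block: 0x00 0x02 <non-zero random bytes> 0x00 <message>
    pad_len = k - len(message) - 3
    padding = bytes(b % 255 + 1 for b in os.urandom(pad_len))
    block = b"\x00\x02" + padding + b"\x00" + message
    cipher = pow(int.from_bytes(block, "big"), exponent, n)
    return cipher.to_bytes(k, "big").hex()
```

Because servertime and nonce change on every login attempt, and the padding is random, the encrypted password differs each time even for the same account.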

The task request module requests a data acquisition task from the task scheduling module of the server. Using the HttpGet method of HttpClient, the crawler program requests a specified PHP page and obtains the taskhash of a task. It then requests a second PHP page, passing the taskhash and the machine's MAC address as URL parameters. When the server-side PHP page receives the request, the server program verifies the taskhash and the MAC address. If they are valid, it queries the secondary task table with an SQL statement, combines the query results into a task file, and returns the file to the crawler program as a JSON string.
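The second request of this two-step exchange can be sketched as below. The page URL, the parameter names taskhash and mac, and the shape of the returned task file are assumptions chosen for illustration; only URL construction and response parsing are shown, so the sketch runs without a server.

```python
import json
from urllib.parse import urlencode


def format_mac(node: int) -> str:
    """Format a 48-bit hardware address as colon-separated hex."""
    return ":".join(f"{(node >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


def build_task_url(page_url: str, taskhash: str, mac: str) -> str:
    """Build the GET URL that carries taskhash and the MAC address."""
    return f"{page_url}?{urlencode({'taskhash': taskhash, 'mac': mac})}"


def parse_task_file(body: str, expected_taskhash: str) -> dict:
    """Parse the JSON task file and check that it answers our request."""
    task = json.loads(body)
    if task.get("taskhash") != expected_taskhash:
        raise ValueError("task file does not match the requested taskhash")
    return task
```

On a real node, uuid.getnode() from the standard library is one way to obtain the hardware address as an integer to pass to format_mac.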

The task execution module parses the task file and performs the actual data acquisition. In the crawler program, the acquisition functions for different data types are encapsulated in different classes so that the program can call them. When a data acquisition task is executed, the class that corresponds to the type of the target data is selected, and the corresponding member function is executed to obtain the data.
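One common way to realize this kind of type-based dispatch is a registry that maps each data type to a collector class. The sketch below assumes a task file whose small tasks each carry a type and a target; the class names, the COLLECTORS registry, and the placeholder collect methods (which return a stub record instead of fetching a page) are all hypothetical.

```python
class ProfileCollector:
    """Placeholder collector for user profile data."""

    def collect(self, target: str) -> dict:
        return {"type": "profile", "target": target}


class PostCollector:
    """Placeholder collector for post content."""

    def collect(self, target: str) -> dict:
        return {"type": "post", "target": target}


# Registry from data type to the class that knows how to acquire it.
COLLECTORS = {"profile": ProfileCollector, "post": PostCollector}


def run_task(task: dict) -> list:
    """Execute every small task in a parsed task file."""
    results = []
    for sub in task["subtasks"]:
        collector_cls = COLLECTORS.get(sub["type"])
        if collector_cls is None:
            raise ValueError(f"unsupported data type: {sub['type']}")
        results.append(collector_cls().collect(sub["target"]))
    return results
```

Adding support for a new data type then only requires a new class and one registry entry; the task execution loop stays unchanged.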

The page source of the microblog information obtained by directed acquisition is matched against regular expressions, and the results are converted to JSON data and stored in a specified file until all tasks for obtaining microblog content are complete. The result data are then compressed and uploaded to the receiving module of the server, which ends the current task.
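The extraction and packaging step might be sketched as follows. The HTML markup matched by POST_PATTERN is invented for illustration and does not reflect the markup of any real site; the ZIP container and the file name result.json are likewise assumptions, since the description says only that the result is compressed.

```python
import io
import json
import re
import zipfile

# Hypothetical markup: <div class="post" data-id="123">text</div>
POST_PATTERN = re.compile(r'<div class="post" data-id="(\d+)">(.*?)</div>', re.S)


def extract_posts(html: str) -> list:
    """Pull post ids and text out of page source with a regular expression."""
    return [{"id": post_id, "text": text.strip()}
            for post_id, text in POST_PATTERN.findall(html)]


def pack_results(records: list, name: str = "result.json") -> bytes:
    """Serialize records to JSON and wrap them in an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, json.dumps(records, ensure_ascii=False))
    return buffer.getvalue()
```

Passing ensure_ascii=False keeps non-ASCII text such as Chinese post content readable in the stored JSON file.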

To make the program easy to run, the acquisition model packages the MapReduce program as a JAR and runs it under the control of a shell script. This simplifies the data processing steps and, by processing the data centrally, improves the efficiency of the MapReduce program.

The above is only a preferred embodiment of the present invention and is not intended to limit it. Those skilled in the art may make various modifications and changes to the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A social network big data acquisition system based on the crowdsourcing concept, characterized in that the system comprises a server module, a client module, a storage subsystem module, and a crawler subsystem module; wherein the server module is the control core of the system and handles the operations performed on a task before it is issued, as well as result verification; the client module is deployed on distributed machine nodes, communicates with the server over sockets, receives commands from the server module, and invokes the focused crawler program; the storage subsystem module uses HDFS, and the actual data acquisition is performed by the focused crawler program, which uses an HttpClient object to simulate browser operations; and the crawler subsystem module is a focused crawler deployed on the distributed machine nodes and implemented by simulating browser operations with HttpClient.

2. The social network big data acquisition system based on the crowdsourcing concept according to claim 1, characterized in that the system has exactly one server module, which comprises a control submodule, a task scheduling submodule, a receiving submodule, a verification submodule, and a MySQL database.

3. The social network big data acquisition system based on the crowdsourcing concept according to claim 2, characterized in that the control submodule manages communication between the server module and the client modules and receives task requirements submitted by third parties.

4. The social network big data acquisition system based on the crowdsourcing concept according to claim 2, characterized in that the MySQL database stores all task-related data and generates tasks on command from the task scheduling submodule.

5. The social network big data acquisition system based on the crowdsourcing concept according to claim 2, characterized in that the task scheduling submodule receives task requests from the crawler program and returns a task file, and the crawler program performs the actual data acquisition.

6. The social network big data acquisition system based on the crowdsourcing concept according to claim 2, characterized in that the receiving submodule receives the data acquisition results.

7. The social network big data acquisition system based on the crowdsourcing concept according to claim 2, characterized in that the verification submodule verifies the results and updates the task state in the MySQL database.

8. The social network big data acquisition system based on the crowdsourcing concept according to claim 1, characterized in that the client module is implemented with the MFC framework, and a machine node, by installing the client, can choose to join a project in the system and contribute computing resources to that project using the node's own computing power.