CN110019090A - Social networks big data acquisition system based on crowdsourcing thought - Google Patents

Social networks big data acquisition system based on crowdsourcing thought

Info

Publication number
CN110019090A
Authority
CN
China
Prior art keywords
data acquisition
task
module
social networks
big data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711239174.3A
Other languages
Chinese (zh)
Inventor
祁建明
周峻松
徐继峰
陈墩金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Ming - Collar Gene Technology Co Ltd
Original Assignee
Guangzhou Ming - Collar Gene Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Ming - Collar Gene Technology Co Ltd filed Critical Guangzhou Ming - Collar Gene Technology Co Ltd
Priority to CN201711239174.3A priority Critical patent/CN110019090A/en
Publication of CN110019090A publication Critical patent/CN110019090A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/48 Indexing scheme relating to G06F9/48
    • G06F2209/484 Precedence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a social network big data acquisition system based on the idea of crowdsourcing. The system comprises a server module, a client module, a storage subsystem module and a crawler subsystem module. The server module is the control core of the system and handles the operations that precede task issuance as well as result verification. The client module is deployed on distributed machine nodes, communicates with the server over a socket, receives commands from the server module and invokes the focused (topic) crawler program. The storage subsystem module uses HDFS; the actual data acquisition is carried out by the focused crawler program, which uses an HttpClient object to simulate browser operations. The crawler subsystem module is a focused crawler deployed on the distributed machine nodes and implemented by simulating browser operations with HttpClient. By introducing crowdsourcing and storing the result data in the Hadoop Distributed File System, the scheme improves data acquisition speed and information retrieval efficiency.

Description

Social networks big data acquisition system based on crowdsourcing thought
Technical field
The invention belongs to the field of big data acquisition technology and relates to a social network big data acquisition system based on the idea of crowdsourcing.
Background
The rise of the Internet has broken traditional modes of social interaction. A simple, fast social experience free of distance has driven the rapid development of social networks, and social network information is growing explosively. This information reflects users' online behavior, and studying it enables applications such as public opinion monitoring, online marketing and stock market forecasting.
Much of the value of social network information lies in its timeliness, so obtaining target information quickly, accurately and efficiently is very important. However, social networks are proprietary networks within the Deep Web, with large volumes of strongly topical information. Traditional search engines cannot index these Deep Web pages; the information can be accessed only through the query interfaces a website provides or by logging in to the website, which makes social network information harder to obtain.
At present, most acquisition methods used in China obtain platform access rights through social platform accounts, log in by simulation, and perform directed acquisition of target information from an initial task set. However, to protect their data and reduce server load, social platforms usually adopt anti-crawler measures such as closing accounts and blocking login IP addresses. Moreover, as the application process for social accounts becomes increasingly strict, obtaining accounts and keeping them valid has become one of the bottlenecks limiting large-scale acquisition of social network data.
Summary of the invention
The object of the invention is to provide a social network big data acquisition system based on the idea of crowdsourcing. Because traditional search engines cannot directly index social network platform information with keyword search techniques, the invention introduces crowdsourcing: data acquisition tasks are issued to different machine nodes and executed by focused crawlers, achieving directed acquisition of web page information. This addresses the problems of obtaining computing resources and social accounts and of maintaining a large number of social accounts, and it improves data acquisition speed. In addition, the result data are stored in the Hadoop Distributed File System, which stores the data effectively while improving information retrieval efficiency.
To solve the above technical problems, the invention adopts the following technical solution: a social network big data acquisition system based on the idea of crowdsourcing, comprising a server module, a client module, a storage subsystem module and a crawler subsystem module. The server module is the control core of the system and handles the operations that precede task issuance as well as result verification. The client module is deployed on distributed machine nodes, communicates with the server over a socket, receives commands from the server module and invokes the focused crawler program. The storage subsystem module uses HDFS; the actual data acquisition is carried out by the focused crawler program, which uses an HttpClient object to simulate browser operations. The crawler subsystem module is a focused crawler deployed on the distributed machine nodes and implemented by simulating browser operations with HttpClient.
Further, the system has exactly one server module, which comprises a control submodule, a task scheduling submodule, a receiving submodule, a verification submodule and a MySQL database.
Further, the control submodule manages communication between the server module and the client modules and receives task demands submitted by third parties.
Further, the MySQL database stores all task-related data and generates tasks on command from the task scheduling submodule.
Further, the task scheduling submodule receives task requests from the crawler program and returns a task file; the crawler program then performs the actual data acquisition.
Further, the receiving submodule receives the data acquisition results.
Further, the verification submodule verifies results and updates the task state in the MySQL database.
Further, the client module is implemented with the MFC framework. By installing the client, a machine node can choose to join a project in the system and contribute its own computing capacity to that project.
Compared with the prior art, the present invention has the following beneficial effects:
Because traditional search engines cannot directly index social network platform information with keyword search techniques, the scheme introduces crowdsourcing: data acquisition tasks are issued to different machine nodes and executed by focused crawlers, achieving directed acquisition of web page information and improving data acquisition speed. In addition, the result data are stored in the Hadoop Distributed File System, which stores the data effectively while improving information retrieval efficiency.
Brief description of the drawings
Fig. 1 is the overall framework diagram of the social network big data acquisition system based on crowdsourcing.
Fig. 2 is the runtime call relationship diagram of the functional modules of the social network big data acquisition system based on crowdsourcing.
Fig. 3 is the task state transition diagram of the social network big data acquisition system based on crowdsourcing during execution.
Fig. 4 is the flow chart of the focused crawler subsystem module of the social network big data acquisition system based on crowdsourcing.
Detailed description of the embodiments
The invention is described below in further detail and more completely with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
Referring to Fig. 1, the social network big data acquisition system based on crowdsourcing comprises a server module, a client module, a storage subsystem module and a crawler subsystem module. The server module is the control core of the system and handles the operations that precede task issuance as well as result verification. The client module is deployed on distributed machine nodes, communicates with the server over a socket, receives commands from the server module and invokes the focused crawler program. The storage subsystem module uses HDFS; the actual data acquisition is carried out by the focused crawler program, which uses an HttpClient object to simulate browser operations. The crawler subsystem module is a focused crawler deployed on the distributed machine nodes and implemented by simulating browser operations with HttpClient.
Large-scale acquisition of social network data must solve three problems: (1) obtaining data access permission on the social platform; (2) acquiring data quickly; (3) storing data effectively.
The system is based on crowdsourcing. It uses an independently developed C/S (client/server) software framework as the basis for distributed acquisition and the Hadoop computing platform as the data processing framework of the acquisition model. The raw data obtained by the crawler application are unstructured and semi-structured web page information. After this information is extracted, result data in a specified format are obtained, processed by the MapReduce computing component and stored in an HBase database.
The system framework contains four modules: server, client, storage subsystem and crawler subsystem. The server module is the control core of the system. It consists of five parts (control submodule, task scheduling submodule, receiving submodule, verification submodule and MySQL database), handles the operations that precede task issuance as well as result verification, and exists only once in the system. The client module is implemented with the MFC framework and deployed on distributed machine nodes. By installing the client, a node can choose to join a project in the system; the client communicates with the server over a socket, receives server commands, invokes the crawler program and contributes the node's own computing capacity to the project. The storage subsystem module uses HDFS; the actual data acquisition is carried out by the crawler program, which uses an HttpClient object to simulate browser operations. The crawler subsystem module invokes the focused crawler application, which simulates browser operations with HttpClient.
Referring to Fig. 2, after the client starts, it invokes the crawler program, which requests tasks from the server and executes data acquisition tasks. The control module of the server manages communication between the server and the clients and receives task demands submitted by third parties. The MySQL database stores all task-related data and generates tasks on command from the task scheduling center. The task scheduling center receives task requests from the crawler program and returns a task file, and the crawler program performs the actual data acquisition. After the acquisition result is returned to the server's receiving module, the verification module verifies it and updates the task state in the MySQL database. After successful verification, if the task was proposed by a third party, the acquisition result is sent to that third party; otherwise it is transferred to HDFS and stored using MapReduce. In detail:
The control module issues commands to the clients and receives their feedback. It also receives the demand parameters of tasks submitted by third parties, translates them into the corresponding task types according to a configuration file, and stores them in the MySQL database to await scheduling by the task scheduling module.
The core function of the task scheduling module is to ensure that client crawler programs obtain valid tasks in an orderly manner. The task scheduling module, receiving module and verification module jointly maintain the taskinfo table in the MySQL database. The taskinfo table has 14 fields, including the index value taskhash, task type type, task state state, task creation time createtime, task issuance time requestindex and task timeout mactimelength. The taskinfo table holds the index entry of each task but not the task content. Task content is stored in secondary task tables: each kind of small task corresponds to one secondary task table, and each task file is made up of several small tasks. When the task scheduling module creates a task, it uses taskhash to identify the task uniquely and sets the task's priority according to createtime; it also uses taskhash as the foreign key of the secondary task table and adds the task content corresponding to that taskhash to the secondary task table. The task scheduling module decides which task to issue according to state. state takes one of seven values: created, dealing, dealed, analysing, failed, success and stop; the transitions between them are shown in Fig. 3. After receiving a task request from a crawler program, the task scheduling module selects from the taskinfo table the taskhash of the highest-priority task whose state is created or failed, queries the secondary task table with SQL statements, generates the final task file and returns it to the crawler program in JSON format.
For a task that needs to be executed only once, the task ends after state changes to success. For a task that must be executed repeatedly, such as periodically crawling the registration information of social network users to keep that information up to date, state returns to created a certain time after it changed to success, unless it has been changed to stop, and the task again waits for a node's request.
The receiving module is implemented in PHP (PHP: Hypertext Preprocessor). It calls the standard PHP function move_uploaded_file to receive the file uploaded via POST and stores it at a designated location on disk. If the compressed package is empty, state is changed to failed; otherwise it is changed to dealed. A non-empty package then enters the verification module, and state is changed to analysing. The compressed package consists of files containing JSON strings. The verification module extracts and analyses the uploaded JSON data with the C-language functions json_object_object_get, json_object_get_string and json_object_array_length (functions of the json-c library). If the data match the defined result format, verification succeeds and state is changed to success; otherwise state is changed to failed. When the task is issued again, state is changed to dealing. Files that pass verification are sent to HDFS for further processing, which completes data storage.
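The state changes described in the text can be collected into a small state machine. The sketch below is written in Python rather than the PHP and C of the description, and it encodes only the transitions the text names; Fig. 3 may define others. The created-to-dealing edge on first issuance and the representation of the "defined result format" as a set of required JSON keys are assumptions.

```python
import json

# Transitions named in the description; Fig. 3 may contain additional ones.
ALLOWED = {
    "created": {"dealing"},             # assumption: first issuance also sets dealing
    "dealing": {"dealed", "failed"},
    "dealed": {"analysing"},
    "analysing": {"success", "failed"},
    "failed": {"dealing"},              # the task is issued again
    "success": {"created", "stop"},     # recurring tasks are re-created unless stopped
    "stop": set(),
}


def transition(state, new_state):
    """Return new_state if the move is allowed, otherwise raise ValueError."""
    if new_state not in ALLOWED[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state


def process_upload(state, archive_files, expected_keys):
    """Walk one uploaded result archive through reception and verification.

    archive_files is the list of JSON strings found in the compressed package;
    expected_keys is a hypothetical stand-in for the defined result format.
    """
    if not archive_files:
        return transition(state, "failed")
    state = transition(state, "dealed")
    state = transition(state, "analysing")
    for text in archive_files:
        try:
            record = json.loads(text)
        except json.JSONDecodeError:
            return transition(state, "failed")
        if not isinstance(record, dict) or set(record) != set(expected_keys):
            return transition(state, "failed")
    return transition(state, "success")
```

Routing every change through transition makes an unexpected state change fail loudly instead of silently corrupting the taskinfo table.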
The focused crawler application is the core module for data acquisition. Given the characteristics of social network information, the crawler program provides four functions: simulated login, task request, task execution and data upload. It differs from a traditional crawler in adding simulated login and in obtaining information in a directed way by constructing the URL of the page that holds the target data. Sina Weibo is used here as an example to explain how the crawler works.
As shown in Fig. 4, the simulated login function works as follows. The program reads a configuration file containing account information and simulates the process of logging in to Sina Weibo through the web page in order to obtain the valid authentication information required to access Sina Weibo pages, namely the cookie information that must be stored locally. The program sends the encrypted user name (username) and password (password) to the Sina server, which extracts the strings from the URL parameters passed to it and decrypts them to recover the original user name and password. Encrypting username and password is the key step of the simulated login. The username is Base64-encoded (an encoding rather than true encryption) to obtain the form in which it is transmitted. The password is more involved. First, the HttpClient object accesses the Sina server to obtain two parameters: the server time (servertime) and a randomly generated string (nonce). Next, the pubkey and rsakv values provided by the Sina server are used to create the RSA public key (key). servertime, nonce and password are concatenated in that order into a new string, message; message is RSA-encrypted with key and the result is converted to hexadecimal, giving the encrypted password. The encrypted username and password are passed together to the Sina server as header information of the URL request. After the Sina server verifies them, it returns a login-success message and HttpClient saves the valid cookie values. After a successful simulated login, the program can request data acquisition tasks from the server; otherwise the current task ends.
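The encoding steps described above can be illustrated with a short sketch. It is a teaching example under several assumptions, not the real login protocol: the RSA step is textbook RSA without padding, the modulus n and exponent e are passed in directly rather than derived from pubkey and rsakv, the three values are joined without separators, and the key at the bottom is a toy key built from two Mersenne primes. None of this is secure, and a real server may expect a different padding scheme or message layout.

```python
import base64


def encode_username(username):
    """Base64-encode the user name, as the description states."""
    return base64.b64encode(username.encode("utf-8")).decode("ascii")


def encrypt_password(password, servertime, nonce, n, e):
    """Encrypt servertime + nonce + password with textbook RSA and return hex.

    n and e are a hypothetical public modulus and exponent. No padding is
    applied, so this function only illustrates the order of operations.
    """
    message = f"{servertime}{nonce}{password}".encode("utf-8")
    m = int.from_bytes(message, "big")
    if m >= n:
        raise ValueError("message is too long for this modulus")
    return format(pow(m, e, n), "x")


# Toy key for demonstration only: the product of the Mersenne primes
# 2**89 - 1 and 2**107 - 1, with the common public exponent 65537.
P, Q = 2**89 - 1, 2**107 - 1
N, E = P * Q, 65537
```

For example, `encrypt_password("secret", 1512086400, "A1B2C3", N, E)` returns a hexadecimal string that only the holder of the matching private exponent can turn back into the concatenated message.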
The task request module requests data acquisition tasks from the server's scheduling center. The crawler program uses the httpGet method of httpClient to request a specified PHP page and obtain the taskhash of a task. It then requests another PHP page with the taskhash and the machine's MAC address as URL parameters. When the server-side PHP page receives the request, the server program verifies the taskhash and the MAC address; if they are valid, it queries the secondary task table with SQL statements, assembles the query results into a task file and returns it to the crawler program as a JSON string.
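The second request in this exchange can be sketched on the client side as follows. The page URL, the parameter names taskhash and mac, and the MAC format check are all assumptions made for illustration; the description does not give them. No network call is made, so the sketch only builds the request URL and parses a returned task file.

```python
import json
import re
from urllib.parse import urlencode

# Colon-separated MAC address such as 00:1A:2B:3C:4D:5E (assumed format).
MAC_PATTERN = re.compile(r"^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$")


def build_task_request(page_url, taskhash, mac):
    """Build the URL that carries taskhash and the MAC address as parameters."""
    if not MAC_PATTERN.match(mac):
        raise ValueError("malformed MAC address")
    return page_url + "?" + urlencode({"taskhash": taskhash, "mac": mac})


def parse_task_file(body):
    """Parse the JSON string returned by the server into a task dictionary."""
    task = json.loads(body)
    if not isinstance(task, dict) or "taskhash" not in task:
        raise ValueError("task file lacks a taskhash")
    return task
```

Sending the MAC address with the taskhash lets the server tie an issued task to one machine node, which is presumably why both values are verified before the task file is assembled.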
The task execution module parses the task file and performs the actual data acquisition. In the crawler program, the acquisition functions for different data types are encapsulated in different classes so that the program can call them. When executing a data acquisition task, the program selects the class that matches the target data type and calls the corresponding member function to obtain the data.
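Dispatching on the data type can be sketched with a registry that maps type names to collector classes. The class names, the type names and the task file layout are hypothetical, and the collectors are stubs that return a placeholder record instead of fetching a page.

```python
class ProfileCollector:
    """Hypothetical collector for user registration information."""

    def collect(self, target):
        # A real collector would fetch and parse the target page here.
        return {"kind": "profile", "target": target}


class MicroblogCollector:
    """Hypothetical collector for microblog content."""

    def collect(self, target):
        return {"kind": "microblog", "target": target}


# Registry that maps the task type in the task file to a collector class.
COLLECTORS = {"profile": ProfileCollector, "microblog": MicroblogCollector}


def execute_task(task_file):
    """Run every small task in the task file with the matching collector."""
    task_type = task_file["type"]
    if task_type not in COLLECTORS:
        raise ValueError(f"no collector for task type {task_type!r}")
    collector = COLLECTORS[task_type]()
    return [collector.collect(target) for target in task_file["subtasks"]]
```

With a registry of this kind, supporting a new data type means adding one class and one dictionary entry, without touching the dispatch logic.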
The source code of the microblog pages obtained in this directed way is matched against regular expressions; the results are converted to JSON data and stored in a specified file until the task of obtaining microblog content is fully complete. The result data are then compressed and uploaded to the server's receiving module, and the current task ends.
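A minimal sketch of the extraction step is shown below. The HTML layout (a div with class post and a data-id attribute) and the output field names are invented for illustration; real microblog pages use different markup, and the description does not give the actual regular expressions.

```python
import json
import re

# Hypothetical page layout: <div class="post" data-id="123">text</div>
POST_PATTERN = re.compile(
    r'<div class="post" data-id="(?P<id>\d+)">(?P<text>.*?)</div>',
    re.DOTALL,
)


def extract_posts(html):
    """Return one record per post found in the page source."""
    return [
        {"id": match.group("id"), "text": match.group("text").strip()}
        for match in POST_PATTERN.finditer(html)
    ]


def to_json(posts):
    """Serialize the records; ensure_ascii=False keeps Chinese text readable."""
    return json.dumps(posts, ensure_ascii=False)
```

Regular expressions are brittle against markup changes, so the pattern would need to be kept in step with the target site.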
For ease of execution, the acquisition model packages the MapReduce program as a Jar package and controls its execution with a Shell script. This simplifies the data processing steps and, by processing the data centrally, improves the efficiency of the MapReduce program.
The above is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes to the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (8)

1. A social network big data acquisition system based on the idea of crowdsourcing, characterized in that the system comprises a server module, a client module, a storage subsystem module and a crawler subsystem module; wherein the server module is the control core of the system and handles the operations that precede task issuance as well as result verification; the client module is deployed on distributed machine nodes, communicates with the server over a socket, receives commands from the server module and invokes the focused crawler program; the storage subsystem module uses HDFS, and the actual data acquisition is carried out by the focused crawler program, which uses an HttpClient object to simulate browser operations; the crawler subsystem module is a focused crawler deployed on the distributed machine nodes and implemented by simulating browser operations with HttpClient.
2. The social network big data acquisition system based on the idea of crowdsourcing according to claim 1, characterized in that the system has exactly one server module, which comprises a control submodule, a task scheduling submodule, a receiving submodule, a verification submodule and a MySQL database.
3. The social network big data acquisition system based on the idea of crowdsourcing according to claim 2, characterized in that the control submodule manages communication between the server module and the client modules and receives task demands submitted by third parties.
4. The social network big data acquisition system based on the idea of crowdsourcing according to claim 2, characterized in that the MySQL database stores all task-related data and generates tasks on command from the task scheduling submodule.
5. The social network big data acquisition system based on the idea of crowdsourcing according to claim 2, characterized in that the task scheduling submodule receives task requests from the crawler program and returns a task file, and the crawler program performs the actual data acquisition.
6. The social network big data acquisition system based on the idea of crowdsourcing according to claim 2, characterized in that the receiving submodule receives the data acquisition results.
7. The social network big data acquisition system based on the idea of crowdsourcing according to claim 2, characterized in that the verification submodule verifies results and updates the task state in the MySQL database.
8. The social network big data acquisition system based on the idea of crowdsourcing according to claim 1, characterized in that the client module is implemented with the MFC framework, and by installing the client a machine node can choose to join a project in the system and contribute its own computing capacity to that project.
CN201711239174.3A 2017-12-01 2017-12-01 Social networks big data acquisition system based on crowdsourcing thought Pending CN110019090A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711239174.3A CN110019090A (en) 2017-12-01 2017-12-01 Social networks big data acquisition system based on crowdsourcing thought

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711239174.3A CN110019090A (en) 2017-12-01 2017-12-01 Social networks big data acquisition system based on crowdsourcing thought

Publications (1)

Publication Number Publication Date
CN110019090A true CN110019090A (en) 2019-07-16

Family

ID=67186526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711239174.3A Pending CN110019090A (en) 2017-12-01 2017-12-01 Social networks big data acquisition system based on crowdsourcing thought

Country Status (1)

Country Link
CN (1) CN110019090A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110636116A (en) * 2019-08-29 2019-12-31 武汉烽火众智数字技术有限责任公司 Multidimensional data acquisition system and method
CN114461930A (en) * 2022-04-13 2022-05-10 四川大学 Social network data acquisition method and device and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110636116A (en) * 2019-08-29 2019-12-31 武汉烽火众智数字技术有限责任公司 Multidimensional data acquisition system and method
CN110636116B (en) * 2019-08-29 2022-05-10 武汉烽火众智数字技术有限责任公司 Multidimensional data acquisition system and method
CN114461930A (en) * 2022-04-13 2022-05-10 四川大学 Social network data acquisition method and device and storage medium

Similar Documents

Publication Publication Date Title
CN108306877B (en) NODE JS-based user identity information verification method and device and storage medium
EP3726411B1 (en) Data desensitising method, server, terminal, and computer-readable storage medium
Liu et al. A task scheduling algorithm based on classification mining in fog computing environment
JP6494609B2 (en) Method and apparatus for generating a customized software development kit (SDK)
Ciavotta et al. A microservice-based middleware for the digital factory
US20210165686A1 (en) Task processing method, system, device, and storage medium
JP6494608B2 (en) Method and apparatus for code virtualization and remote process call generation
Wei et al. Application scheduling in mobile cloud computing with load balancing
EP2590113B1 (en) On demand multi-objective network optimization
CN109284430A (en) Visualization subject web page content based on distributed structure/architecture crawls system and method
CN113377805B (en) Data query method and device, electronic equipment and computer readable storage medium
CN110162559B (en) Block chain processing method based on universal JSON synchronous and asynchronous data API (application program interface) interface call
Amoretti et al. DEUS: a discrete event universal simulator
CN107453900B (en) Cloud analysis parameter setting management system and method for realizing parameter setting
CN105302885B (en) full-text data extraction method and device
Kornienko et al. The Single Page Application architecture when developing secure Web services
CN110019090A (en) Social networks big data acquisition system based on crowdsourcing thought
CN102255969B (en) Representational-state-transfer-based web service security model
Ren et al. Joint optimization of VNF placement and flow scheduling in mobile core network
CN118337786A (en) Service container scheduling method and system based on Kubernetes under cloud edge cooperation
CN105184559B (en) A kind of payment system and method
Fujdiak et al. IP traffic generator using container virtualization technology
CN114785526B (en) Multi-user multi-batch weight distribution calculation and storage processing system based on block chain
CN117453922A (en) Rapid construction and storage system for electric power threat information knowledge graph
Xia et al. Distributed web crawling: A framework for crawling of micro-blog data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190716

WD01 Invention patent application deemed withdrawn after publication