CN104375826A - High-availability microblog collecting platform and method - Google Patents

High-availability microblog collecting platform and method

Info

Publication number
CN104375826A
Authority
CN
China
Prior art keywords
data
vest
microblogging
module
acquisition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410535111.2A
Other languages
Chinese (zh)
Inventor
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongsou Network Technology Co ltd
Original Assignee
Beijing Zhongsou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Network Technology Co ltd
Priority to CN201410535111.2A
Publication of CN104375826A

Abstract

The invention provides a high-availability microblog collection platform and method. The platform comprises a collection system, a management backend and a data processing system. The management backend, the collection system and the data processing system are connected in sequence. The collection system comprises an instruction interaction module, a processing module, a data sending module, a ZDP calling module and an OpenAPI calling module. The method comprises the steps of (1) starting the management backend; (2) executing the instructions of the management backend and collecting data; (3) performing distributed sockpuppet login and e-mail sending. With this platform and method, authentication is more efficient, the program automatically controls how and how often each API is called, and large-scale data extraction is achieved. The management backend allows manual intervention on the extracted data, so the data is managed efficiently. A user only needs to enter sockpuppet accounts, bloggers and applications for data to be extracted, so the degree of automation is high.

Description

A high-availability microblog collection platform and method
Technical field
The invention relates to microblog collection systems, and in particular to a high-availability microblog collection platform and its method.
Background
As a new form of network application, microblogging has developed rapidly in recent years. As the microblog user base grows, the acquisition of microblog data plays a vital role in the field of microblog search.
There are currently many ways to extract microblog web pages, which fall mainly into two classes: data acquisition based on parsing microblog pages, and data acquisition based on the microblog API.
Data acquisition based on microblog page parsing: this method is mainly implemented with a web crawler. The program saves web page content as text in a local storage system according to a template, until crawling is complete or stops after a set condition is met.
Data acquisition based on the microblog API: this method mainly obtains data through the interfaces provided by the microblog open platform, and then parses the obtained data according to the required format.
The traditional method based on page parsing requires templates to be written by hand, and if a template changes the maintenance cost is high. The extracted data also mixes several types together and is not clean, so a further program must be written to separate them, which is inefficient.
For the method based on the microblog API, the first problem to solve is user authentication. The four major microblog sites each authenticate differently, which hinders large-scale data extraction.
Summary of the invention
To address the deficiencies of the prior art, the invention proposes a high-availability microblog collection platform and method. It provides a mechanism for automatically authorizing microblog users and normalizes the authentication methods of the four major microblog sites. To overcome the defects of page-parsing-based data acquisition, it adopts the microblog-API-based method: program logic controls how and how often each API is called, and the returned JSON objects are parsed, achieving efficient data acquisition.
The object of the invention is achieved by the following technical solution:
A high-availability microblog collection platform, the improvement being that the platform comprises a collection system, a management backend and a data processing system;
the management backend, the collection system and the data processing system are connected in sequence;
the collection system comprises an instruction interaction module, a processing module, a data sending module, a ZDP calling module and an OpenAPI calling module.
Preferably, the platform comprises a distributed login module, which uses Gearman to perform distributed verification of sockpuppet accounts across multiple machines.
Preferably, the platform comprises a mail sending module for sending log statistics to the relevant mail group.
A further object of the invention is to provide a high-availability microblog collection method, the improvement being that the method comprises:
(1) starting the management backend;
(2) executing the management backend's instructions and collecting data;
(3) distributed sockpuppet login and mail sending.
Preferably, step (1) comprises:
(1.1) adding, deleting, modifying and querying blogger data, application data and sockpuppet data, respectively;
(1.2) associating sockpuppet accounts with applications;
(1.3) sockpuppet login authentication;
(1.4) the management backend assembling the above operations into instructions and sending them to the collection backend.
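The patent does not define the instruction format used between the management backend and the collection backend. As a minimal sketch of step (1.4), assuming one JSON line per operation, the following Python shows how an operation might be assembled and decoded. The `Instruction` class, the entity and action names, and the payload fields are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical vocabularies: the patent names the data types and operations
# but does not define an instruction format.
ENTITIES = {"blogger", "application", "sockpuppet"}
ACTIONS = {"add", "delete", "modify", "query", "associate", "login"}


@dataclass
class Instruction:
    entity: str    # which kind of record the operation targets
    action: str    # what the management backend wants done
    payload: dict  # operation-specific fields (illustrative only)


def encode_instruction(ins: Instruction) -> str:
    """Serialize one backend operation as a single JSON line."""
    if ins.entity not in ENTITIES:
        raise ValueError(f"unknown entity: {ins.entity}")
    if ins.action not in ACTIONS:
        raise ValueError(f"unknown action: {ins.action}")
    return json.dumps(asdict(ins), sort_keys=True, ensure_ascii=False)


def decode_instruction(line: str) -> Instruction:
    """Rebuild the instruction on the collection-backend side."""
    return Instruction(**json.loads(line))
```

Validating the entity and action before sending keeps malformed operations out of the collection backend's queue; a real deployment would use whatever wire format the two systems already share.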
Preferably, step (2) comprises:
(2.1) the collection backend receiving the blogger, sockpuppet and application operations in the form of instructions, and writing the results of those operations to local data files;
(2.2) using sockpuppet accounts to follow bloggers, and updating the sockpuppet file with the resulting data;
(2.3) the authorization flow reading the local sockpuppet file and calling the login module to perform login authentication for each sockpuppet;
(2.4) starting the flow that obtains microblog posts, blogger information and topics, and forming a URL to be downloaded;
(2.5) submitting the URL to the downloader as a download task and waiting for the result;
(2.6) reading the returned data and storing it in the corresponding class objects by type;
(2.7) sending the parsed blogger information, posts and topic data to data processing.
Further, step (2.3) comprises writing the authentication parameters to the local sockpuppet file as well, for use by the authorized application when calling the microblog open API.
Further, step (2.4) comprises: obtaining a sockpuppet to be used for downloading posts, blogger information and topics; checking the sockpuppet's scheduling cycle; assembling the request parameters as required by the microblog open API, the parameters being read from the sockpuppet file; and appending the request parameters to the API address to form a URL to be downloaded.
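The URL assembly in step (2.4) can be illustrated with a short Python sketch. It assumes the sockpuppet file has already been read into a dictionary holding an `access_token` field; the endpoint address, the parameter names and the function name are hypothetical, since the patent does not list the actual API parameters.

```python
from urllib.parse import urlencode


def build_download_url(api_base: str, sockpuppet: dict, extra_params: dict) -> str:
    """Append request parameters to an API address to form a download URL.

    `sockpuppet` stands in for the record read from the local sockpuppet file;
    the `access_token` key is an assumed field name.
    """
    params = {"access_token": sockpuppet["access_token"], **extra_params}
    # Sorting the parameters makes the resulting URL deterministic.
    return f"{api_base}?{urlencode(sorted(params.items()))}"
```

For example, calling `build_download_url("https://api.example.com/statuses/user_timeline.json", {"access_token": "T"}, {"uid": "42", "count": 20})` yields a URL whose query string is `access_token=T&count=20&uid=42`.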
Further, step (2.6) comprises: the returned result is in JSON format; the JSON data is loaded into a JSON container, data is read from the container field by field, and the data is stored in the corresponding class objects by type.
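As a rough illustration of step (2.6), the sketch below loads a JSON result, reads it field by field, and stores each record in a class object chosen by type. The JSON layout (`users`, `statuses`) and the field names are assumptions made for the example; each real microblog API defines its own response schema.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BloggerInfo:
    uid: str
    name: str


@dataclass
class Post:
    post_id: str
    uid: str
    text: str


def parse_download_result(raw: str) -> Tuple[List[BloggerInfo], List[Post]]:
    """Load the JSON text into a container and sort records into typed objects."""
    container = json.loads(raw)  # the "JSON container" is a plain dict here
    bloggers = [
        BloggerInfo(uid=str(u["id"]), name=u["name"])
        for u in container.get("users", [])
    ]
    posts = [
        Post(post_id=str(p["id"]), uid=str(p["uid"]), text=p["text"])
        for p in container.get("statuses", [])
    ]
    return bloggers, posts
```

A result that lacks one of the assumed keys simply produces an empty list for that type, so a response carrying only blogger information or only posts is handled by the same function.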
Preferably, the distributed sockpuppet login of step (3) comprises logging in from multiple machines, with login tasks distributed by Gearman.
Preferably, the mail sending of step (3) comprises:
(3.1) compiling statistics from the collection system's data collection logs;
(3.2) calculating the number of downloads, failed downloads and successful parses for each of the four major microblog sites;
(3.3) generating a data collection volume report for the collection system;
(3.4) starting the mail sending program and sending the report to the person in charge.
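Steps (3.1) to (3.3) amount to counting log events per site and rendering the counts as a report. The Python sketch below assumes the log has already been reduced to `(site, event)` pairs; the event names `download`, `download_failed` and `parse_ok` are hypothetical, as the patent does not describe the log format.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def summarize(records: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count events per site from (site, event) log records."""
    stats: Dict[str, Counter] = defaultdict(Counter)
    for site, event in records:
        stats[site][event] += 1
    return stats


def render_report(stats: Dict[str, Counter]) -> str:
    """Render one tab-separated line per site, sorted by site name."""
    lines = ["site\tdownloads\tdownload_failures\tparse_successes"]
    for site in sorted(stats):
        c = stats[site]  # a Counter returns 0 for events never seen
        lines.append(
            f"{site}\t{c['download']}\t{c['download_failed']}\t{c['parse_ok']}"
        )
    return "\n".join(lines)
```

The rendered text could then be used as the body of the report mail sent in step (3.4).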
Compared with the prior art, the beneficial effects of the invention are:
(1) Although the four major microblog sites authenticate differently, the program normalizes these differences, making authentication more efficient.
(2) The program automatically controls which API methods are called and how many times, enabling large-scale data extraction.
(3) The management backend allows manual intervention on the extracted data, enabling efficient data management.
(4) The user only needs to enter sockpuppet accounts, bloggers and applications for data to be extracted, so the degree of automation is high.
Brief description of the drawings
Fig. 1 is a structural diagram of the high-availability microblog collection platform provided by the invention.
Fig. 2 is a diagram of the distributed login of users (sockpuppet accounts) provided by the invention.
Detailed description
Specific embodiments of the invention are described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, the high-availability microblog collection platform of the invention comprises a collection system and, interacting with it, a management backend and a data processing system. By interaction flow, the collection system can be divided into an instruction interaction module, a processing module, a data sending module, a ZDP calling module and an OpenAPI calling module.
1. The three systems are as follows:
a. ManagerServer (management backend): the management backend service system. The collection system must process commands from the management backend.
b. IndexServer (data processing system): the collected data must be sent to the data processing system for further processing.
c. MicCollector (collection system): the collection system.
2. The collection system comprises:
a. DataRecv (data receiving module) and DataSendToDB (data sending module): the instruction interaction module, which exchanges information with the management backend.
b. DataProcess (data processing module): the processing module, which performs data collection and executes backend commands.
c. DataSend (data sending module): sends data to the data processing system.
d. ZDP (downloader): the ZDP calling module.
e. OpenAPIModule (open API calling module): the OpenAPI calling module.
f. Gearman (task distribution system): distributes work to other machines. It dispatches each job to a machine better suited to it, runs jobs concurrently, and load-balances among multiple calls.
3. Other auxiliary modules:
a. DistributeLogin (distributed login module): uses Gearman to perform distributed verification of sockpuppet accounts across multiple machines.
b. MailReporter (mail sending module): sends log statistics to the relevant mail group.
The modules of the collection system of the high-availability microblog collection platform are, specifically:
1) Instruction interaction module: handles the information exchanged with the management backend. The management backend is mainly used to manage the data in the collection system manually, for example to follow, delete, modify or blacklist bloggers. The DataRecv (management backend data receiving) class receives the instructions sent by the management backend and stores them, in a specified format, in the local queue m_pJobQueue (task queue). The DataSendToDB (result sending) class returns to the management backend the results of the instructions it sent to the collection system.
2) Processing module: the DataProcess (data processing) class mainly comprises two parts, data collection and execution of management backend instructions, and mainly includes the following flows:
a. Automatic authorization: a microblog user's authorization has a validity period. Sina users must log in regularly to obtain authorization; for Sohu and NetEase, authorization is obtained once and needs no regular refresh. Once obtained, the authorization information is bound to the user.
b. Authorization refresh: authorization for Tencent users is a special case. Only one login is needed; after authorization is obtained, it is renewed by periodically calling the refresh interface.
c. Task processing flow: the main work of this thread is to take tasks from the task queue, parse the protocol, and parse and process each job contained in the task. After all jobs have been processed, the results are returned to the backend (over the socket connection on which the task was received).
d. Following and unfollowing bloggers, downloading blogger information, and downloading posts: for each of these downloads, a parameter list is assembled as required by the corresponding interface of the microblog open platform, and the assembled URL (data download link) is then submitted to the ZDP downloader in the specified format.
e. Download result processing flow: after a task is submitted to ZDP, ZDP returns the download result in JSON format. The JSON data is loaded into a JSON container, data is read from the container field by field, and the data is stored in the corresponding class objects by type (blogger information, posts).
f. DataSend (sending data to data processing) data sending flow: the downloaded data (posts, blogger information, topics) is assembled in the format required by the protocol and sent to data processing.
g. ZDP (downloader) download: throughout the collection system, all data downloads use the ZDP downloader.
h. OpenAPIModule (microblog open interface calling) module: manages the calls to all microblog APIs (open interfaces).
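The task processing flow described above (take a task from the queue, parse it, process each job, return the results) can be sketched in Python as follows. This is an illustration under assumptions, not the patent's implementation: the task is taken to be a JSON document with `task_id` and `jobs` fields, `queue.Queue` stands in for m_pJobQueue, and the returned list stands in for the replies that would be written back over the socket.

```python
import json
import queue
from typing import Callable, Dict, List


def process_task(raw_task: str, handlers: Dict[str, Callable[[dict], object]]) -> dict:
    """Parse one task and run every job in it, collecting per-job results."""
    task = json.loads(raw_task)
    results = []
    for job in task["jobs"]:
        handler = handlers.get(job["action"])
        if handler is None:
            results.append({"action": job["action"], "ok": False,
                            "error": "unknown action"})
        else:
            results.append({"action": job["action"], "ok": True,
                            "result": handler(job.get("args", {}))})
    return {"task_id": task["task_id"], "results": results}


def drain(job_queue: "queue.Queue[str]",
          handlers: Dict[str, Callable[[dict], object]]) -> List[dict]:
    """Process every task currently queued and return the replies in order."""
    replies = []
    while True:
        try:
            raw = job_queue.get_nowait()
        except queue.Empty:
            break
        replies.append(process_task(raw, handlers))
    return replies
```

Replying only after all jobs in a task have been processed matches the description in flow c; a long-running worker would block on `job_queue.get()` instead of draining and returning.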
3) Distributed login module and mail sending module:
a. Sina users must log in again every day to renew authorization. Under the current login-interval limit, as long as logins from each IP are at least 5 minutes apart, the IP is not blocked; with a user login cycle of one day, this has run fairly stably over many days. Given this limit, when there are many users a single machine cannot log all of them in once within the authorization validity period, so multi-machine login is used. Login tasks are distributed by Gearman (task distribution system), as shown in Fig. 2. Tencent users need to log in only once, when the user is entered into the system; afterwards the authorization is renewed by calling the refresh interface. The authorization obtained after logging in to Sohu and NetEase never expires and needs no refresh.
b. Mail sending module: statistics are compiled on the data collected by the collection system each day. The number of downloads, failed downloads and successful parses is calculated for each of the four major microblog sites, giving an objective picture of the collection system's running state. A report is produced and sent to the person in charge.
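The per-site authorization rules described in item a can be expressed as a small decision function. The sketch below is a hedged illustration: the site identifiers and the `AuthAction` enum are hypothetical, and the one-day default period for Tencent refreshes is an assumption, since the text only says the refresh is periodic.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class AuthAction(Enum):
    NONE = "none"        # authorization still valid, nothing to do
    LOGIN = "login"      # perform a full login
    REFRESH = "refresh"  # call the refresh interface


def next_auth_action(site: str, last_auth: Optional[datetime], now: datetime,
                     period: timedelta = timedelta(days=1)) -> AuthAction:
    """Decide what a sockpuppet account needs, given when it last authorized."""
    if site not in {"sina", "tencent", "sohu", "netease"}:
        raise ValueError(f"unknown site: {site}")
    if last_auth is None:
        return AuthAction.LOGIN   # every site needs at least one login
    if site in {"sohu", "netease"}:
        return AuthAction.NONE    # authorization never expires
    if now - last_auth < period:
        return AuthAction.NONE    # still inside the validity period
    # Sina requires a fresh login; Tencent only needs a refresh call.
    return AuthAction.LOGIN if site == "sina" else AuthAction.REFRESH
```

Keeping the rules in one function is one way to normalize the four sites' differing authentication behavior behind a single interface.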
Example
1) Start the management backend:
The management backend mainly manages three types of data: bloggers, applications and sockpuppet accounts.
(1) blogger data, application data and sockpuppet data are each added, deleted, modified and queried;
(2) sockpuppet accounts are associated with applications;
(3) sockpuppet login authentication is performed to make sure the sockpuppet accounts are usable;
(4) the management backend assembles the above operations into instructions and sends them to the collection backend.
2) Execute the management backend's instructions and collect data:
(1) the collection backend receives the blogger, sockpuppet and application operations in the form of instructions, and writes the results of those operations to the corresponding local data files;
(2) the follow flow is started: a sockpuppet follows a batch of bloggers, and the sockpuppet file is updated with the resulting data;
(3) the authorization flow reads the local sockpuppet file, calls the login module to perform login authentication for each sockpuppet, and writes the authentication parameters to the local sockpuppet file as well; only an authorized application can call the microblog open API;
(4) once authorization is complete, the flow that obtains microblog posts, blogger information and topics is started: a sockpuppet to be used for the download is obtained, its scheduling cycle is checked, the request parameters are assembled as required by the microblog open API (the parameters are read from the sockpuppet file), and the request parameters are appended to the API address to form a URL to be downloaded;
(5) the URL is submitted to the downloader as a download task, and the result is awaited;
(6) the returned result is in JSON format; the JSON data is loaded into a JSON container, data is read from the container field by field, and the data is stored in the corresponding class objects by type (blogger information, posts, topics);
(7) the parsed blogger information, posts and topic data are sent to data processing.
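Step (4) above mentions checking a sockpuppet's scheduling cycle before using it for a download, without saying how. One plausible reading, sketched here purely as an assumption, is that each account may be used at most once per cycle and the scheduler picks the account that has been idle longest. The `Sockpuppet` class and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sockpuppet:
    name: str
    cycle_seconds: float                # minimum gap between two uses (assumed meaning)
    last_used: float = float("-inf")    # never used yet


def pick_sockpuppet(pool: List[Sockpuppet], now: float) -> Optional[Sockpuppet]:
    """Return the longest-idle account whose cycle has elapsed, or None."""
    ready = [s for s in pool if now - s.last_used >= s.cycle_seconds]
    if not ready:
        return None
    chosen = min(ready, key=lambda s: s.last_used)
    chosen.last_used = now  # mark the account as used for this cycle
    return chosen
```

Spreading requests across accounts in this way is one means by which a program could control how often each API is called; the actual scheduling policy in the patent's system may differ.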
3) Distributed sockpuppet login and mail sending
a. To access a user's microblog account and read or write content, an application on any of the four major microblog sites must first obtain authorization from the user (sockpuppet) in person. This raises the problem of user (sockpuppet) login authentication. Under the current login-interval limit, as long as logins from each IP are at least 5 minutes apart, the IP is not blocked, and every user (sockpuppet) must log in once within a fixed cycle. Given this limit, when there are many users a single machine cannot log all of them in once within the authorization validity period, so multi-machine login is used. Login tasks are distributed by Gearman (task distribution system), as shown in Fig. 2.
b. Mail sending module:
(1) statistics are compiled from the collection system's data collection logs;
(2) the number of downloads, failed downloads and successful parses is calculated for each of the four major microblog sites;
(3) a data collection volume report for the collection system is produced;
(4) the mail sending program is started and the report is sent to the person in charge.
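The login-interval constraint in item a explains why one machine is not enough. With logins from one IP spaced at least 5 minutes apart and a one-day cycle, a machine can complete roughly 1440 / 5 = 288 logins per cycle. The Python sketch below works through that arithmetic and splits accounts across machines. It assumes one IP per machine, and it uses plain round-robin assignment as a stand-in for Gearman's job distribution; it does not use the Gearman API.

```python
import math
from typing import List


def logins_per_machine(cycle_minutes: int = 24 * 60,
                       min_interval_minutes: int = 5) -> int:
    """Approximate number of logins one IP can perform in one cycle."""
    return cycle_minutes // min_interval_minutes


def machines_needed(n_accounts: int, cycle_minutes: int = 24 * 60,
                    min_interval_minutes: int = 5) -> int:
    """Smallest number of machines that can log every account in once per cycle."""
    capacity = logins_per_machine(cycle_minutes, min_interval_minutes)
    return math.ceil(n_accounts / capacity)


def assign_round_robin(accounts: List[str], n_machines: int) -> List[List[str]]:
    """Deal accounts out to machines in turn (a stand-in for Gearman dispatch)."""
    buckets: List[List[str]] = [[] for _ in range(n_machines)]
    for i, account in enumerate(accounts):
        buckets[i % n_machines].append(account)
    return buckets
```

Under these assumptions, 1,000 sockpuppet accounts would need 4 machines, since 3 machines cover only 864 logins per day.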
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the invention. Those of ordinary skill in the art may still modify the specific embodiments of the invention or make equivalent substitutions with reference to the above embodiments; any such modification or equivalent substitution that does not depart from the spirit and scope of the invention falls within the scope of the pending claims of the invention.

Claims (11)

1. A high-availability microblog collection platform, characterized in that the platform comprises a collection system, a management backend and a data processing system;
the management backend, the collection system and the data processing system are connected in sequence;
the collection system comprises an instruction interaction module, a processing module, a data sending module, a ZDP calling module and an OpenAPI calling module.
2. The high-availability microblog collection platform of claim 1, characterized in that the platform comprises a distributed login module, which uses Gearman to perform distributed verification of sockpuppet accounts across multiple machines.
3. The high-availability microblog collection platform of claim 1, characterized in that the platform comprises a mail sending module for sending log statistics to the relevant mail group.
4. A high-availability microblog collection method, characterized in that the method comprises:
(1) starting the management backend;
(2) executing the management backend's instructions and collecting data;
(3) distributed sockpuppet login and mail sending.
5. The high-availability microblog collection method of claim 4, characterized in that step (1) comprises:
(1.1) adding, deleting, modifying and querying blogger data, application data and sockpuppet data, respectively;
(1.2) associating sockpuppet accounts with applications;
(1.3) sockpuppet login authentication;
(1.4) the management backend assembling the above operations into instructions and sending them to the collection backend.
6. The high-availability microblog collection method of claim 4, characterized in that step (2) comprises:
(2.1) the collection backend receiving the blogger, sockpuppet and application operations in the form of instructions, and writing the results of those operations to local data files;
(2.2) using sockpuppet accounts to follow bloggers, and updating the sockpuppet file with the resulting data;
(2.3) the authorization flow reading the local sockpuppet file and calling the login module to perform login authentication for each sockpuppet;
(2.4) starting the flow that obtains microblog posts, blogger information and topics, and forming a URL to be downloaded;
(2.5) submitting the URL to the downloader as a download task and waiting for the result;
(2.6) reading the returned data and storing it in the corresponding class objects by type;
(2.7) sending the parsed blogger information, posts and topic data to data processing.
7. The high-availability microblog collection method of claim 6, characterized in that step (2.3) comprises writing the authentication parameters to the local sockpuppet file as well, for use by the authorized application when calling the microblog open API.
8. The high-availability microblog collection method of claim 6, characterized in that step (2.4) comprises: obtaining a sockpuppet to be used for downloading posts, blogger information and topics; checking the sockpuppet's scheduling cycle; assembling the request parameters as required by the microblog open API, the parameters being read from the sockpuppet file; and appending the request parameters to the API address to form a URL to be downloaded.
9. The high-availability microblog collection method of claim 6, characterized in that step (2.6) comprises: the returned result is in JSON format; the JSON data is loaded into a JSON container, data is read from the container field by field, and the data is stored in the corresponding class objects by type.
10. The high-availability microblog collection method of claim 4, characterized in that the distributed sockpuppet login of step (3) comprises logging in from multiple machines, with login tasks distributed by Gearman.
11. The high-availability microblog collection method of claim 4, characterized in that the mail sending of step (3) comprises:
(3.1) compiling statistics from the collection system's data collection logs;
(3.2) calculating the number of downloads, failed downloads and successful parses for each of the four major microblog sites;
(3.3) generating a data collection volume report for the collection system;
(3.4) starting the mail sending program and sending the report to the person in charge.
CN201410535111.2A 2014-10-11 2014-10-11 High-availability microblog collecting platform and method Pending CN104375826A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410535111.2A CN104375826A (en) 2014-10-11 2014-10-11 High-availability microblog collecting platform and method

Publications (1)

Publication Number Publication Date
CN104375826A true CN104375826A (en) 2015-02-25

Family

ID=52554769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410535111.2A Pending CN104375826A (en) 2014-10-11 2014-10-11 High-availability microblog collecting platform and method

Country Status (1)

Country Link
CN (1) CN104375826A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130080911A1 (en) * 2011-09-27 2013-03-28 Avaya Inc. Personalizing web applications according to social network user profiles
CN102622443A (en) * 2012-03-13 2012-08-01 北京邮电大学 Customized screening system and method for microblog
CN103399968A (en) * 2013-07-16 2013-11-20 中国科学院计算技术研究所 Microblog information acquisition method and microblog information acquisition system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391130A (en) * 2017-07-07 2017-11-24 千寻位置网络有限公司 API is managed automatically and SDK, document automatic creation method
CN107391130B (en) * 2017-07-07 2020-05-05 千寻位置网络有限公司 API automatic management and SDK, document automatic generation method
CN110545528A (en) * 2019-09-19 2019-12-06 白浩 Social method, device and storage medium fusing multiple identities
CN110545528B (en) * 2019-09-19 2021-12-10 白浩 Social method, device and storage medium fusing multiple identities

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150225
