CN109933701A

CN109933701A - A microblog data acquisition method based on multi-strategy fusion

Info

Publication number: CN109933701A
Application number: CN201910175559.0A
Authority: CN
Inventors: 王文贤; 陈兴蜀; 王海舟; 严丹; 王培名; 唐瑞
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2019-03-08
Filing date: 2019-03-08
Publication date: 2019-06-25
Anticipated expiration: 2039-03-08
Also published as: CN109933701B

Abstract

The invention discloses a microblog data acquisition method based on multi-strategy fusion. A simulated login is performed first to obtain the Cookie of each successful login; the obtained Cookies are saved in a Cookie queue and the initial tasks are obtained. A multi-account load-balancing strategy is then used to crawl user follow lists and user profiles; user IDs are extracted to generate the queue of follow relationships and user information to be crawled, from which further follow lists and user profiles are crawled, while the queue of user microblogs to be crawled is generated at the same time. A visitor Cookie is constructed, microblog content is crawled with acceleration from an IP proxy pool, and the information is stored in a database. Microblog IDs are extracted to generate the queue of comment information to be crawled; microblog comments are then crawled and stored in the database. By means of an adaptive algorithm, the invention finds the number of concurrent requests suited to the current network environment and Cookie queue length, striking a balance between acquisition speed and account safety. It also implements a highly available proxy IP module to accelerate data acquisition, providing basic data support for online public opinion analysis.

Description

A microblog data acquisition method based on multi-strategy fusion

Technical field

The present invention relates to the technical field of network data collection, and specifically to a microblog data acquisition method based on multi-strategy fusion.

Background art

The spread and development of the Internet have driven the rapid growth of social networks. As one of the most popular social network applications, microblogging has developed rapidly in recent years thanks to its large user base, frequent status updates and fast information propagation, and it has become one of China's main communication media. According to the 42nd Statistical Report on Internet Development in China, as of June 2018 microblogging ranked third among social applications with a user usage rate of 42.1%, an increase of 1.2% over December 2017, with fan interaction among the aspects further strengthened. Social networks have a huge user base, fast information propagation, rich content and wide coverage, and are therefore of great significance for online public opinion analysis.

Current microblog data collection methods generally acquire data either through the microblog application programming interface (Application Programming Interface, API) or through simulated login. Collection through the microblog API is subject to the microblog system's API authorization and call-count limits, so the amount of data collected per day is small. Collection through simulated login breaks through the limits of the microblog API, but faster collection requires multiple accounts combined with a load-balancing strategy, large-scale collection carries the risk of account bans, and keeping accounts alive is difficult. Existing microblog data collection methods usually rely on a single acquisition strategy, which makes the volume of collected data unstable and the efficiency low.

Summary of the invention

In view of the above problems, the purpose of the present invention is to provide a microblog data acquisition method based on multi-strategy fusion that collects microblog data stably and efficiently and provides basic data support for online public opinion analysis. The technical solution is as follows:

A microblog data acquisition method based on multi-strategy fusion, comprising the following steps:

Step 1: perform a simulated login and obtain the Cookie of the successful login;

Step 2: save the Cookie obtained in step 1 in the Cookie queue;

Step 3: obtain the initial tasks: choose several microblog users with many fans as the initial nodes to crawl;

Step 4: crawl user follow lists and user profiles using the multi-account load-balancing strategy: access the initial set of microblog users, construct the user profile URL from the user ID, and collect user profiles and user relationships;

Step 5: extract user IDs, generate the queue of follow relationships and user information to be crawled, and go to step 4; at the same time, generate the queue of user microblogs to be crawled;

Step 6: construct the visitor Cookie: access the microblog homepage in the not-logged-in state; the crawler uses the Cookie that the microblog system generates to indicate visitor identity to collect the relevant content of the microblog platform;

Step 7: the collection program accesses the queue of user microblogs to be crawled, constructs the microblog URL from the microblog ID, and accesses the constructed microblog URL using the visitor Cookie and the proxy IP pool to collect the microblog content;

Step 8: after the collection program downloads the page at the microblog URL, it parses the page, extracts the microblog ID, and generates the queue of comment information to be crawled;

Step 9: the collection program accesses the queue of comment information to be crawled, downloads the microblog comment pages using the visitor Cookie and the proxy IP pool, parses the pages to obtain the microblog comment information, and stores the information in the database.

Further, the simulated login means that a program simulates a user logging in to the server in order to obtain the Cookie of the logged-in account. Its steps are as follows:

Step 1) pre-login request: the program base64-encodes the user name and constructs the pre-login request address;

Step 2) obtain the nonce and servertime used for encryption: a GET request is sent to obtain the variables nonce, servertime, pubkey and rsakv; nonce and servertime are used to encrypt the login password, while pubkey and rsakv are fixed values that are written directly into the program;

Step 3) encrypt the login password with RSA2: using the nonce and rsakv obtained in step 2), together with the microblog public key rsakt, the user password is encrypted with the RSA2 algorithm to obtain the encrypted password;

Step 4) obtain the server credentials: the key parameters are sent by the POST method; after the request completes, the server returns a response message containing two parts, retcode and arrURL;

Step 5) obtain the Cookie of the successful login: arrURL is accessed by the GET method and the server returns the personal information of the current user; the Cookie returned by this request is a valid Cookie and is used for data collection.

Further, the simulated login is repeated periodically at an interval of less than 24 hours, and the Cookie that is about to expire is replaced with the newest Cookie.

Further, in the multi-account load-balancing strategy, the Cookies of multiple accounts are obtained through simulated login and saved in a queue. When making a request, the crawler takes a Cookie from the head of the queue and assigns it an initial TTL value. Each time the crawler makes a request carrying that Cookie, the TTL value of the Cookie is decremented by 1; when the TTL value reaches 0, the Cookie is moved to the tail of the queue, and the crawler then takes the Cookie at the head of the Cookie queue for its page requests.

Further, after requesting a page, the crawler sleeps for a random period of time to protect the safety of the account.

Further, an adaptive concurrent acquisition strategy is used in step 4: taking the current network environment and the Cookie queue length into account, it finds, for the web crawler based on simulated login, a concurrent thread count threshold at which data can be crawled stably and quickly. The strategy comprises two stages, fast increase and slow adjustment:

In the fast-increase stage the number of request threads is increased exponentially. After the thread count is increased, the program judges within one time window whether the accounts in use are in a normal state. If normal, it continues to multiply the thread count and rotates Cookies according to the load-balancing strategy. If abnormal, it removes the abnormal Cookie, adds the Cookie of a new account at the tail of the queue so that the Cookie queue length stays equal to its initial value, sets the thread count of the next time window to half the current thread count, and enters the slow-adjustment stage;

Here N_{t+1} denotes the number of collection threads in the next time window and N_t the number of collection threads in the current time window; state indicates whether the Cookie state during data collection in the current time window is normal, with 1 meaning normal and 0 meaning abnormal;

In the slow-adjustment stage the number of request threads is increased linearly. After the thread count is increased, the program judges within one time window whether the accounts in use are in a normal state. If normal, it continues to increase the thread count linearly and rotates Cookies according to the load-balancing strategy. If abnormal, it removes the abnormal Cookie, adds a new Cookie at the tail of the queue so that the Cookie queue length stays equal to its initial value, and decreases the current thread count linearly. When the slow-adjustment stage ends, the current thread count is the optimal thread count for continuously and stably collecting microblog data under the current network environment and Cookie queue length;

Here N_{t+1} denotes the number of collection threads in the next time window and N_t the number of collection threads in the current time window; state indicates whether the Cookie state during data collection in the current time window is normal, with 1 meaning normal and 0 meaning abnormal.

Further, the visitor Cookie is constructed as follows:

Step a) obtain the three parameters tid, c and w

Analysing the browser header shows how tid is obtained. The parameters fp and cb must be constructed first. The fp parameter consists of browser-related information, including the parameters os, browser, fonts, plugins and screenInfo; the cb parameter is a fixed value, "gen_callback". Once fp and cb have been constructed, the parameter tid can be obtained. At the same time, the server returns two parameters, new_tid and confidence, where the value of new_tid is true or false. When new_tid is true, w is 3; when new_tid is false, w is 2. The value of parameter c is the same as the value of confidence;

Step b) obtain the Cookie of the not-logged-in state

First, a new Cookie is constructed from the tid obtained in step a); the Cookie contains one key-value pair, {"tid": tid + "__" + c}. The request is then completed by the GET method, and the value of the msg field in the returned content is checked to judge whether the not-logged-in Cookie was obtained successfully: if the value of msg is succ, the Cookie was obtained successfully and the visitor Cookie can be read from the header of the response.

Further, the IP proxy pool comprises a proxy IP collector, a proxy IP checker and a proxy IP scheduler. The proxy IP collector periodically collects proxy IPs from public proxy IP sources on the network, including the proxy IP address, port and supported protocols. The proxy IP checker periodically verifies the collected proxy IP resources. The proxy IP scheduler supplies qualifying proxy IPs to the crawler.

Further, the IP proxy pool is implemented as follows:

Step A) filter transparent IPs

When verifying a newly stored proxy IP, the proxy IP checker accesses the HTTP request and response service at https://httpbin.org/ip, which returns the IP from which the HTTP request was made. If that IP is identical to the real IP of the crawler server, the proxy IP is discarded; if the returned content differs from the real IP of the server, the proxy IP is given an initial score. If an error such as a closed proxy port occurs, the proxy IP is considered unavailable and is deleted directly;

Step B) verify against the microblog site itself

The microblog homepage is accessed through the proxy IP. If the page returned by the microblog site contains the string "微博-随时随地发现新鲜事" ("Weibo: discover new things anytime, anywhere"), the proxy IP can be used for microblog data collection. If yzm_input appears in the response page, the proxy IP is deleted directly. If the request times out, the score of the proxy IP is decremented by 1. If a port-closed error occurs, the proxy IP is deleted directly. For a proxy IP that passes verification, its score, most recent verification time and response speed in the proxy IP pool are updated and serve as the criteria by which the scheduler screens proxy IPs from the pool;

Step C) proxy IP scheduling

According to preset values for three attributes of the proxy IPs in the pool (score, response time and most recent verification time), the proxy IP scheduler selects the proxy IPs that meet the specified requirements from the pool and sorts them, forming a circular linked list. Each time the crawler requests a microblog page, the proxy IP scheduler assigns it the proxy IP at the head node of the list. When a response is obtained successfully, that proxy IP is moved to the tail of the list; if the request fails, the proxy IP is deleted from the list;

After the IP proxy pool is connected, all HTTP requests are managed by a downloader middleware. For the microblog user data collection module, which requires higher access permissions, the downloader middleware takes a Cookie from the head of the Cookie queue and carries it when collecting data. For the microblog content collection module, which requires lower access permissions, the downloader middleware obtains a proxy IP from the proxy IP scheduler and collects data through that proxy IP while carrying the constructed visitor Cookie.

The beneficial effects of the present invention are:

(1) The present invention proposes optimized methods tailored to the respective characteristics of crawling microblogs through simulated login and crawling microblogs with a constructed visitor Cookie.

(2) The IP proxy pool proposed by the present invention can also be used for data collection from other social networks, news sites, forums or blogs, and prevents the collection program from being interrupted by IP access restrictions.

(3) The multi-strategy fusion microblog data collection method designed by the present invention can collect microblog data stably and efficiently.

Brief description of the drawings

Fig. 1 is the architecture diagram of the microblog data acquisition system of the present invention.

Fig. 2 is the flowchart of microblog data collection based on multi-strategy fusion of the present invention.

Fig. 3 is the flowchart of simulated login to the microblog system of the present invention.

Fig. 4 is the proxy IP crawling and verification flow of the present invention.

Fig. 5 is the performance comparison chart for user follow-relationship collection of the present invention.

Fig. 6 is the performance comparison chart for user information collection of the present invention.

Fig. 7 is the performance comparison chart for microblog information collection of the present invention.

Fig. 8 is the performance comparison chart for comment information collection of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings and specific embodiments.

The content that can be collected from microblogs includes microblog personal information, microblog user relationships, trending-topic content, all microblogs under a trending topic, all microblogs of a user, and all comments and reposts of a microblog. Because user profiles, microblog content and microblog comments are of great significance in public opinion analysis, the subsequent experiments of the present invention select user profiles, user relationships, user microblogs and their comments as the collection targets.

Based on the multi-strategy fusion acquisition method, the present invention proposes a microblog data acquisition system whose architecture is shown in Fig. 1. The system uses a breadth-first strategy: seed nodes are first selected manually according to their number of fans, the follow lists of these initial users are collected, and then the follow list of every user followed by the current user is obtained in turn, expanding outward layer by layer. At the same time, the user information and all microblogs of each user, together with their comments, are collected. The detailed collection flow is shown in Fig. 2.
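The layer-by-layer expansion described above can be sketched as an ordinary breadth-first traversal. This is a minimal illustration, not the patented implementation: `fetch_follow_list` is a hypothetical callback standing in for the crawler's follow-list request, and the `seen` set and `max_depth` limit are assumptions added so that the traversal terminates.

```python
from collections import deque


def breadth_first_users(seed_ids, fetch_follow_list, max_depth=2):
    """Yield (user_id, depth) pairs in breadth-first order from the seeds.

    fetch_follow_list is a hypothetical callable that returns the IDs
    followed by a given user; max_depth is an illustrative stop condition.
    """
    seen = set(seed_ids)  # assumption: skip users already queued
    frontier = deque((uid, 0) for uid in seed_ids)
    while frontier:
        user_id, depth = frontier.popleft()
        yield user_id, depth
        if depth >= max_depth:
            continue  # do not expand beyond the last layer
        for followee in fetch_follow_list(user_id):
            if followee not in seen:
                seen.add(followee)
                frontier.append((followee, depth + 1))
```

In a real crawler each yielded user ID would also be pushed onto the user-information and user-microblog queues.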

The embodiments of the present invention are described below with reference to a concrete case.

Step 1: perform a simulated login and obtain the Cookie of the successful login;

Simulated login is the process in which a program simulates a user logging in to the server in order to obtain the Cookie of the logged-in account. The simulated login flow for the microblog system is shown in Fig. 3.

1. Pre-login request

The program base64-encodes the user name and then constructs the pre-login request address in the following form: http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=MTg3MDgxMDMwMzM%3D&rsakt=mod&checkpin=1&client=ssologin.js(v1.4.18)&_=1526959231, where the su value MTg3MDgxMDMwMzM%3D is the base64-encoded user name (note: the base64-encoded user name is "MTg3MDgxMDMwMzM="; in the request URL the "=" is replaced with "%3D"), and the value of _, 1526959231, is the current timestamp.
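The construction of the pre-login address can be sketched as follows. This is an illustrative sketch using only the Python standard library; the function names are hypothetical, the parameter values are the ones quoted in the text, and the callback name is written as `sinaSSOController.preloginCallBack` on the assumption that the spacing and capitalisation in the translated text are artifacts.

```python
import base64
import time
from urllib.parse import urlencode

PRELOGIN_ENDPOINT = "http://login.sina.com.cn/sso/prelogin.php"


def encode_username(username: str) -> str:
    """Base64-encode the user name, as described for the su parameter."""
    return base64.b64encode(username.encode("utf-8")).decode("ascii")


def build_prelogin_url(username: str, timestamp=None) -> str:
    """Build the pre-login URL; urlencode turns '=' padding into '%3D'."""
    params = {
        "entry": "weibo",
        "callback": "sinaSSOController.preloginCallBack",
        "su": encode_username(username),
        "rsakt": "mod",
        "checkpin": "1",
        "client": "ssologin.js(v1.4.18)",
        # Current timestamp unless the caller supplies one explicitly.
        "_": str(int(time.time()) if timestamp is None else timestamp),
    }
    return PRELOGIN_ENDPOINT + "?" + urlencode(params)
```

Note that `urlencode` also percent-encodes the parentheses in the client value, which servers decode to the same string.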

2. Obtain the nonce and servertime used for encryption

Sending a GET request returns variables such as nonce, servertime, pubkey and rsakv. nonce and servertime are used in the next step to encrypt the login password; pubkey and rsakv are fixed values and can be written directly into the program.

3. Encrypt the login password with RSA2

The microblog site encrypts the login password with the RSA2 algorithm. Using the nonce and rsakv obtained in the previous step, together with the microblog public key rsakt, the user password is encrypted with the RSA2 algorithm to obtain the encrypted password.

4. Obtain the server credentials

The address http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.19) is requested by the POST method. The key parameters that need to be sent are as follows:

entry=weibo // login source

savestate=7 // whether to save the password

useticket=1 // whether to log in with a user credential

su=MTg3MDgxMDMwMzM= // base64-encoded user name

servertime=1526959231 // server timestamp obtained in the pre-login stage

nonce=ES6HQ1 // server random code obtained in pre-login

sp=password // the encrypted password

After the request completes, the server returns a response message containing two parts, retcode and arrURL. The value of arrURL is the URL needed for verification in the next step.

5. Obtain the Cookie of the successful login

arrURL is accessed by the GET method and the server returns the personal information of the current user. The Cookie returned by this request is a valid Cookie and can be used for data collection. In addition, the present invention found that a microblog Cookie expires after 24 hours, so to support efficient, long-term and stable data collection, the simulated login should be repeated periodically at an interval of less than 24 hours, replacing the Cookie that is about to expire with the newest Cookie.

Step 2: save the Cookie obtained in step 1 in the Cookie queue;

To collect a large amount of microblog data quickly, the Cookies of multiple successful logins must be obtained and saved in the Cookie queue, so that step 4 can collect microblog data using the multi-account load-balancing strategy.

Step 3: obtain the initial tasks;

Several microblog users with many fans are chosen as the initial nodes to crawl. The follow behaviour of microblog users is the manifestation of the microblog topology, and the number of fans of a microblog user indirectly reflects the user's influence and the quality of the account. Choosing several microblog users with many fans effectively prevents the collected user nodes from forming a ring and prevents a large number of zombie users from being collected.

Step 4: crawl user follow lists and user profiles using the multi-account load-balancing strategy;

Access the initial set of microblog users, construct the user profile URL from the user ID, and collect user profiles and user relationships.

In the logged-in state, the microblog system limits the number of requests that a single account may make within a given period; if the request rate of the current account exceeds the limit set by the microblog server, the account is marked as abnormal by the microblog anti-crawler system. To speed up microblog data collection, this embodiment uses 10 accounts and collects data under a defined access strategy.

First, the Cookies of 10 accounts are obtained through simulated login and saved in a queue. When making a request, the crawler takes a Cookie from the head of the queue and assigns it an initial TTL of 100. Each time the crawler makes a request carrying that Cookie, the Cookie's TTL is decremented by 1; when the TTL reaches 0, the Cookie is moved to the tail of the queue and the crawler takes the Cookie now at the head of the Cookie queue for its page requests. To imitate human operation realistically, the crawler sleeps for a random period after requesting a page, which protects the safety of the account. To improve collection efficiency, the present invention issues concurrent requests from multiple threads. Too many threads increase the number of accesses within the same time window and therefore the risk of an account ban, so for a web crawler based on simulated login it is necessary to find, given the current network environment and Cookie queue length, a concurrent request threshold at which data can be crawled stably and quickly. Drawing on the idea of the TCP congestion-avoidance algorithm, the present invention uses an adaptive concurrent acquisition strategy to find the concurrent thread threshold at which data can be collected stably and efficiently. The strategy comprises two stages, fast increase and slow adjustment.
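The TTL-based Cookie rotation described above can be sketched with a double-ended queue. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, Cookies are represented as opaque values, and a lock is added (an assumption, not stated in the text) so that several crawler threads can share the queue.

```python
import threading
from collections import deque


class CookieQueue:
    """Round-robin Cookie queue in which the head Cookie serves `initial_ttl`
    requests before being moved to the tail (hypothetical helper class)."""

    def __init__(self, cookies, initial_ttl=100):
        self._queue = deque(cookies)
        self._initial_ttl = initial_ttl
        self._ttl = initial_ttl  # remaining uses of the current head Cookie
        self._lock = threading.Lock()

    def acquire(self):
        """Return the head Cookie and decrement its TTL; rotate at zero."""
        with self._lock:
            if not self._queue:
                raise LookupError("Cookie queue is empty")
            cookie = self._queue[0]
            self._ttl -= 1
            if self._ttl == 0:
                self._queue.rotate(-1)  # head goes to the tail
                self._ttl = self._initial_ttl
            return cookie

    def replace(self, abnormal, fresh):
        """Drop an abnormal Cookie and append a fresh one at the tail,
        keeping the queue length constant."""
        with self._lock:
            was_head = bool(self._queue) and self._queue[0] == abnormal
            self._queue.remove(abnormal)
            self._queue.append(fresh)
            if was_head:
                self._ttl = self._initial_ttl

    def __len__(self):
        return len(self._queue)
```

The random sleep after each page request is left to the caller, for example `time.sleep(random.uniform(low, high))` with bounds chosen for the deployment.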

1. Fast-increase stage

In this stage the number of request threads is increased exponentially. After the thread count is increased, the program judges within one time window whether the accounts in use are in a normal state. If normal, it continues to multiply the thread count and rotates Cookies according to the load-balancing scheme above. If abnormal, it removes the abnormal Cookie, adds the Cookie of a new account at the tail of the queue so that the Cookie queue length stays equal to its initial value, sets the thread count of the next time window to half the current thread count, and enters the slow-adjustment stage.

N_{t+1} denotes the number of collection threads in the next time window and N_t the number of collection threads in the current time window; state indicates whether the Cookie state during data collection in the current time window is normal.

2. Slow-adjustment stage

In this stage the number of request threads is increased linearly. After the thread count is increased, the program judges within one time window whether the accounts in use are in a normal state. If normal, it continues to increase the thread count linearly and rotates Cookies according to the load-balancing scheme above. If abnormal, it removes the abnormal Cookie, adds a new Cookie at the tail of the queue so that the Cookie queue length stays equal to its initial value, and decreases the current thread count linearly. When the slow-adjustment stage finally ends, the current thread count is the optimal thread count for continuously and stably collecting microblog data under that network environment and Cookie queue length.
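The two-stage adjustment can be sketched as a small state-transition function. The update formulas themselves are not reproduced in this text, so the specifics below are assumptions: a doubling factor in the fast-increase stage, a step of one thread in the slow-adjustment stage, and a floor of one thread. Only the halving on an abnormal state is stated explicitly. The names are hypothetical.

```python
from dataclasses import dataclass

FAST = "fast_increase"
SLOW = "slow_adjust"


@dataclass
class ConcurrencyState:
    threads: int = 1
    phase: str = FAST


def next_state(current, healthy, linear_step=1):
    """Compute the thread count and phase for the next time window.

    healthy corresponds to state = 1 (normal) or 0 (abnormal) in the text.
    The factor 2 and linear_step = 1 are illustrative assumptions.
    """
    if current.phase == FAST:
        if healthy:
            return ConcurrencyState(current.threads * 2, FAST)
        # Abnormal account: halve the thread count and switch stages.
        return ConcurrencyState(max(1, current.threads // 2), SLOW)
    delta = linear_step if healthy else -linear_step
    return ConcurrencyState(max(1, current.threads + delta), SLOW)
```

The text does not state the condition under which the slow-adjustment stage ends, so no stopping rule is included; a caller would also swap out the abnormal Cookie whenever `healthy` is false.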

Step 6: construct the visitor Cookie;

When the microblog homepage is accessed in the not-logged-in state, the microblog system generates a Cookie for the current visitor to indicate the visitor's identity, and the crawler can use this Cookie to collect the relevant content of the microblog platform. The visitor Cookie is constructed as follows:

1. Obtain the three parameters tid, c and w

Analysing the browser header shows how tid is obtained. The parameters fp and cb must be constructed first. fp consists of browser-related information, including parameters such as os, browser, fonts, plugins and screenInfo; this information can be forged. A valid fp parameter looks like this:

{"os":"1","browser":"Chrome57,0,2110,104","fonts":"undefined","screenInfo":"1436*752*24","plugins":"Portable Document Format::internal-pdf-viewer::Chrome PDF Plugin|::mhjfbmdgcfjbbpaeojofohoefgiehjai::Chrome PDF Viewer|::internal-nacl-plugin::Native Client|Enables Widevine licenses for playback of HTML audio/video content.(version:1.4.8.1008)::widevinecdmadapter.dll::Widevine Content Decryption Module"}

The cb parameter is a fixed value, "gen_callback". Once fp and cb have been constructed, the parameter tid can be obtained by requesting https://passport.weibo.com/visitor/genvisitor with the POST method. At the same time, the server returns two parameters, new_tid and confidence, where the value of new_tid is true or false. When new_tid is true, w is 3; when new_tid is false, w is 2. The value of parameter c is the same as the value of confidence.

2. Obtain the Cookie of the not-logged-in state

First, a new Cookie is constructed from the tid obtained in the first step; the Cookie contains one key-value pair, {"tid": tid + "__" + c}. Then https://passport.weibo.com/visitor/visitor?a=incarnate&t=tid&w=w&c=c&gc=&cb=cross_domain&from=weibo is requested with the GET method. After the request completes, the value of the msg field in the returned content is checked to judge whether the not-logged-in Cookie was obtained successfully: if the value of msg is succ, the Cookie was obtained successfully and the visitor Cookie can be read from the header of the response.
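The parameter handling in the two steps above can be sketched as follows. This is a minimal illustration that only builds the request data and performs no network access; the function names are hypothetical, and sending fp as a JSON string in the POST form is an assumption. The format of the server responses is not specified in the text, so parsing them is left out.

```python
import json
from urllib.parse import urlencode

GENVISITOR_URL = "https://passport.weibo.com/visitor/genvisitor"
INCARNATE_URL = "https://passport.weibo.com/visitor/visitor"


def build_genvisitor_form(fp: dict) -> dict:
    """Form fields for the POST that yields tid, new_tid and confidence."""
    return {"cb": "gen_callback", "fp": json.dumps(fp)}


def derive_w(new_tid: bool) -> int:
    """w is 3 when new_tid is true and 2 when it is false."""
    return 3 if new_tid else 2


def build_seed_cookie(tid: str, c) -> dict:
    """The single key-value pair {"tid": tid + "__" + c}."""
    return {"tid": f"{tid}__{c}"}


def build_incarnate_url(tid: str, new_tid: bool, confidence) -> str:
    """URL of the GET request whose response carries the visitor Cookie."""
    params = {
        "a": "incarnate",
        "t": tid,
        "w": derive_w(new_tid),
        "c": confidence,  # c takes the value of confidence
        "gc": "",
        "cb": "cross_domain",
        "from": "weibo",
    }
    return INCARNATE_URL + "?" + urlencode(params)
```

A caller would POST the form to `GENVISITOR_URL`, feed the returned values into `build_incarnate_url`, and accept the Cookie from the response header only when msg equals succ.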

Step 7: crawl microblog content with acceleration from the IP proxy pool and store the information in the database;

The collection program accesses the queue of user microblogs to be crawled, constructs the microblog URL from the microblog ID, and accesses the constructed microblog URL using the visitor Cookie and the proxy IP pool to collect the microblog content.

In the not-logged-in state, the microblog system restricts crawler behaviour mainly by IP. To accelerate data collection, the present invention designs and implements an IP proxy pool. The IP proxy pool consists of three parts: a proxy IP collector, a proxy IP checker and a proxy IP scheduler. The proxy IP collector periodically collects proxy IPs from public proxy IP sources on the network, including the proxy IP address, port and supported protocols. The proxy IP checker periodically verifies the collected proxy IP resources. The proxy IP scheduler supplies qualifying proxy IPs to the crawler. The implementation flow of the IP proxy pool is shown in Fig. 4.

1. Filter transparent IPs

When verifying a newly stored proxy IP, the proxy IP checker accesses an HTTP request and response service (the URL is https://httpbin.org/ip; this is a free verification service on the Internet, and a private HTTP request and response service can also be set up on an intranet as needed). The service returns the IP from which the HTTP request was made. If that IP is identical to the real IP of the crawler server, the proxy IP is discarded. If the returned content differs from the real IP of the server, the proxy IP is given an initial score, which the present invention sets to 5. If a timeout occurs, the present invention gives the proxy IP an initial score of 4. If an error such as a closed proxy port occurs, the proxy IP is considered unavailable and is deleted directly.
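The initial scoring rule for a newly stored proxy can be sketched as a pure function. This is a minimal sketch: the function names are hypothetical, the HTTP request through the proxy is left to the caller, and the response body is assumed to be the JSON object with an `origin` field that httpbin.org/ip returns. Treating `origin` as a possibly comma-separated list is a defensive assumption.

```python
import json

INITIAL_SCORE = 5   # proxy hides the server's real IP
TIMEOUT_SCORE = 4   # check timed out


def reported_ips(httpbin_body: str):
    """Extract the IP or IPs that the echo service saw for the request."""
    origin = json.loads(httpbin_body)["origin"]
    return [ip.strip() for ip in origin.split(",")]


def score_new_proxy(server_ip, httpbin_body=None, timed_out=False,
                    port_closed=False):
    """Return the initial score, or None if the proxy should be discarded."""
    if port_closed:
        return None  # unusable proxy: delete
    if timed_out:
        return TIMEOUT_SCORE
    if httpbin_body is None:
        raise ValueError("a response body is required when no error occurred")
    if server_ip in reported_ips(httpbin_body):
        return None  # transparent proxy: the real IP leaks through
    return INITIAL_SCORE
```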

2. Verify against the microblog site itself

A proxy IP that passes the filter cannot necessarily be used for microblog data collection: the IP may have been blocked by the microblog site, or the proxy may be of poor quality and too slow. By accessing the microblog homepage through the proxy IP and checking whether the returned page contains the string "微博-随时随地发现新鲜事" ("Weibo: discover new things anytime, anywhere"), it can be determined whether the proxy IP can be used for microblog data collection. If yzm_input appears in the response page, the proxy IP has been identified as abnormal by the microblog site and is deleted directly. If the request times out, the score of the proxy IP is decremented by 1. If an error such as a closed port occurs, the proxy IP is deleted directly. For a proxy IP that passes verification, its score, most recent verification time and response speed in the proxy IP pool are updated and serve as the criteria by which the scheduler screens proxy IPs from the pool. The most recent verification time is used as a screening criterion because publicly available proxy IPs are short-lived, generally 3 to 10 minutes, so a proxy may stop working some time after passing a check. The score of a proxy IP is the measure of its stability, and the response speed is the measure of its request rate.
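The verification outcomes above can be sketched as an update on a per-proxy record. This is an illustration under stated assumptions: the record fields and function name are hypothetical, `HOME_MARKER` is assumed to be the original Chinese homepage title that the translated phrase stands for, and a page containing neither marker is left unchanged because the text does not cover that case. The text says the score is updated on a pass but not by how much, so the sketch leaves it as is.

```python
from dataclasses import dataclass

HOME_MARKER = "微博-随时随地发现新鲜事"  # assumed original homepage title
CAPTCHA_MARKER = "yzm_input"


@dataclass
class ProxyRecord:
    address: str
    score: int
    last_checked: float = 0.0
    response_time: float = float("inf")


def apply_weibo_check(record, now, page=None, elapsed=None,
                      timed_out=False, port_closed=False):
    """Update the record after one check; return None to delete the proxy."""
    if port_closed:
        return None
    if timed_out:
        record.score -= 1
        return record
    if page is not None and CAPTCHA_MARKER in page:
        return None  # flagged as abnormal by the site
    if page is not None and HOME_MARKER in page:
        # Passed: refresh the attributes the scheduler screens on.
        record.last_checked = now
        record.response_time = elapsed
    return record
```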

3. Proxy IP scheduling

Besides a good proxy IP verification strategy, the proxy IP scheduling strategy is also very important. According to preset values for three attributes of the proxy IPs in the pool (score, response time and most recent verification time), the proxy IP scheduler selects the proxy IPs that meet the specified requirements from the pool and sorts them, forming a circular linked list. Each time the crawler requests a microblog page, the proxy IP scheduler assigns it the proxy IP at the head node of the list. When a response is obtained successfully, that proxy IP is moved to the tail of the list; if the request fails, the proxy IP is deleted from the list. The proxy IP scheduler also runs a subprocess that periodically screens the proxy IP pool for proxy IPs meeting the above three conditions and places them at the tail of the list, so that the scheduler does not run short of proxy IPs.
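The head-to-tail rotation can be sketched with a deque standing in for the circular linked list. This is a minimal, single-threaded illustration with hypothetical names; screening by score, response time and verification time is assumed to have happened before proxies are handed to the scheduler. The proxy is taken off the list while in use, which has the same effect as moving it to the tail on success and deleting it on failure.

```python
from collections import deque


class ProxyScheduler:
    """Hands out the head proxy and re-queues it only after a success."""

    def __init__(self, proxies=()):
        self._ring = deque(proxies)

    def acquire(self):
        """Take the proxy at the head of the list for one request."""
        if not self._ring:
            raise LookupError("no proxy available")
        return self._ring.popleft()

    def release(self, proxy, success):
        """Put a working proxy at the tail; a failed one is simply dropped."""
        if success:
            self._ring.append(proxy)

    def refill(self, proxies):
        """Append newly screened proxies at the tail, skipping duplicates."""
        for proxy in proxies:
            if proxy not in self._ring:
                self._ring.append(proxy)

    def __len__(self):
        return len(self._ring)
```

The periodic refill described in the text would call `refill` from a timer or subprocess; synchronisation between workers is omitted here.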

After the IP proxy pool is connected, all HTTP requests are managed by a downloader middleware. For the microblog user data collection module, which requires higher access permissions, the downloader middleware takes a Cookie from the head of the Cookie queue and carries it when collecting data. For the microblog content collection module, which requires lower access permissions, the downloader middleware obtains a proxy IP from the proxy IP scheduler and collects data through that proxy IP while carrying the constructed visitor Cookie.
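The routing decision made by the downloader middleware can be sketched as a single function. This is a framework-independent illustration, not the API of any particular crawler library: the module labels, the function name and the shape of the returned dictionary are all hypothetical, and the two callables stand in for the Cookie queue and the proxy IP scheduler.

```python
USER_MODULE = "user_data"          # hypothetical label: needs a login Cookie
CONTENT_MODULE = "microblog_content"  # hypothetical label: visitor Cookie + proxy


def prepare_request(module, next_login_cookie, next_proxy, visitor_cookie):
    """Choose the credentials and route for one outgoing request.

    next_login_cookie and next_proxy are callables supplied by the caller,
    for example bound methods of a Cookie queue and a proxy scheduler.
    """
    if module == USER_MODULE:
        # Higher access permissions: logged-in Cookie, direct connection.
        return {"cookies": next_login_cookie(), "proxy": None}
    if module == CONTENT_MODULE:
        # Lower access permissions: visitor Cookie sent through a proxy IP.
        return {"cookies": visitor_cookie, "proxy": next_proxy()}
    raise ValueError(f"unknown module: {module}")
```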

Step 8: extract the microblog ID and generate the queue of comment information to be crawled;

After the acquisition program downloads the page at the microblog URL, it parses the page, extracts the microblog ID, and generates the queue of comment information to be crawled.

Step 9: crawl the microblog comment information and store it in the database.

The acquisition program accesses the set of comment information queues to be crawled, downloads the microblog comment pages using the visitor Cookie and the proxy IP pool, parses the pages, extracts the microblog comment information, and stores it in the database.

For Fig. 5, the user IDs of 10 "big V" microblog accounts, each with more than 1,000,000 followers, were selected as seed IDs. Follow relationships were then collected continuously through simulated login in three modes: a single thread with a single Cookie, multiple threads with multiple Cookies, and multiple Cookies with an adaptive number of concurrent threads. In the initial stage of acquisition, the efficiency of 10 concurrent threads is far higher than that of the other schemes. After 5 hours of continuous acquisition, however, the amount of data obtained by the 10-thread scheme is much smaller than that obtained by the adaptive concurrent-thread scheme. This is because the concurrent access volume of 10 threads is large: after the program has run for some time, the accounts are judged abnormal by the microblog system, which causes subsequent acquisition to fail.

Using all user IDs collected in one day with the adaptive number of concurrent threads as seeds, data acquisition was carried out on the seed users' profiles, all of their microblogs, and the microblog comments in four ways: API-based; simulated-login-based; API fused with constructed visitor Cookies; and simulated login fused with constructed visitor Cookies.

Figs. 6-8 show, in turn, the user information, microblog content and microblog comments collected by each scheme over 5 hours of continuous acquisition. Taken together, Figs. 6-8 show that the scheme that fuses simulated login with visitor Cookies and works with the proxy pool is far faster than the other schemes. This is because the bottleneck of this scheme is the quality of the proxy IPs in the proxy pool, and a good proxy IP verification and screening strategy ensures their availability. The other schemes are subject to many restrictions. The API-based scheme is limited in the number of requests allowed per authorized user and per IP; because these limits are very strict, it is the slowest. The simulated-login approach is limited by the access speed allowed to the logged-in user, so the adaptive method proposed by the invention is needed to find the threshold number of concurrent acquisition threads before stable and more efficient acquisition is possible. The scheme based on constructed visitor Cookies is limited only in the request frequency of a single IP to the microblog system within a fixed time window. The microblog system's limits on IPs are looser than its limits on accounts.
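The adaptive method referred to above has two stages, fast increase and slow adjustment, which are set out in claim 6. The following sketch shows one way the per-window update could look. It is an illustration under stated assumptions: the function names are hypothetical, the growth factor of 2 follows the "doubling" and "half" wording of the claim, the linear step of 1 and the floor of 1 thread are assumptions, and the replacement of abnormal Cookies is not modeled.

```python
from typing import Callable, List, Tuple


def next_thread_count(n: int, healthy: bool, stage: str,
                      step: int = 1, floor: int = 1) -> Tuple[int, str]:
    """Return (thread count for the next time window, next stage).

    stage is "fast" (exponential growth) or "slow" (linear adjustment).
    healthy says whether the accounts stayed normal in the current window.
    """
    if stage == "fast":
        if healthy:
            return n * 2, "fast"            # keep doubling
        return max(floor, n // 2), "slow"   # halve, then switch stage
    if healthy:
        return n + step, "slow"             # linear increase
    return max(floor, n - step), "slow"     # linear decrease


def simulate(windows: int, is_healthy: Callable[[int], bool], n: int = 1) -> List[int]:
    """Record the thread count used in each window for a given health model."""
    stage = "fast"
    history = []
    for _ in range(windows):
        history.append(n)
        n, stage = next_thread_count(n, is_healthy(n), stage)
    return history


if __name__ == "__main__":
    # Hypothetical environment in which more than 12 threads triggers an abnormal state.
    print(simulate(12, lambda n: n <= 12))
```

With the hypothetical limit of 12 threads, the count doubles until it overshoots, falls back to half, and then climbs one thread at a time until it hovers around the limit.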

Claims

1. A microblog data acquisition method based on multi-strategy fusion, characterized by comprising the following steps:

Step 1: perform a simulated login and obtain the Cookie of the successful login;

Step 2: save the Cookie obtained in step 1 into the Cookie queue;

Step 4: crawl user follow lists and user profiles using multi-account load balancing: access the initial microblog user set, construct the user profile URL from the user ID, and acquire the user profiles and user relationships;

Step 5: extract user IDs, generate the queue of follow relationships and user information to be crawled, and go to step 4; meanwhile, generate the queue of user microblogs to be crawled;

Step 6: construct the visitor Cookie: access the microblog homepage without logging in, and have the crawler acquire content from the microblog platform using the Cookie, generated by the microblog system, that denotes visitor identity;

Step 7: the acquisition program accesses the set of user microblog queues to be crawled, constructs the microblog URL from the microblog ID, accesses the constructed microblog URL using the visitor Cookie and the proxy IP pool, and acquires the microblog content;

Step 8: after the acquisition program downloads the page at the microblog URL, it parses the page, extracts the microblog ID, and generates the queue of comment information to be crawled;

Step 9: the acquisition program accesses the set of comment information queues to be crawled, downloads the microblog comment pages using the visitor Cookie and the proxy IP pool, parses the pages, extracts the microblog comment information, and stores it in the database.

2. The microblog data acquisition method based on multi-strategy fusion according to claim 1, characterized in that the simulated login is: using a program to simulate a user logging in to the server so as to obtain the Cookie of the logged-in account, with the following steps:

Step 2) obtain the nonce and servertime used for encryption: send a GET request to obtain the variables nonce, servertime, pubkey and rsakv; nonce and servertime are used to encrypt the login password, while pubkey and rsakv are fixed values written directly into the program;

Step 3) encrypt the login password using RSA2: using the nonce and rsakv obtained in step 2), together with the microblog public key Rsakt, encrypt the user password with the RSA2 algorithm to obtain the encrypted password;

Step 4) obtain the server credential: send the key parameters by the POST method; after the request completes, the server returns response information comprising two parts, retcode and arrURL;

Step 5) obtain the Cookie of the successful login: access arrURL by the GET method; the server returns the current user's personal information, and the Cookie returned with the request is a valid Cookie, which is used for data acquisition.

3. The microblog data acquisition method based on multi-strategy fusion according to claim 2, characterized in that the simulated login is performed periodically at intervals of less than 24 hours, and Cookies that are about to expire are replaced with the newest Cookies.

4. The microblog data acquisition method based on multi-strategy fusion according to claim 1, characterized in that, in the multi-account load balancing, the Cookies of multiple accounts are obtained by simulated login and saved in a queue; when a crawler makes a request, it takes a Cookie from the head of the queue and assigns it an initial TTL value; each time a crawler makes a request carrying that Cookie, the TTL value of the Cookie is reduced by 1; when the TTL value reaches 0, the Cookie is placed at the tail of the queue, and the Cookie at the head of the Cookie queue is then taken for page requests.

5. The microblog data acquisition method based on multi-strategy fusion according to claim 4, characterized in that, after requesting a page, the crawler sleeps for a random period of time to protect the security of the account.

6. The microblog data acquisition method based on multi-strategy fusion according to claim 4, characterized in that, in step 4, an adaptive concurrent acquisition strategy is used which, taking into account the current network environment and the Cookie queue length, finds for the simulated-login-based web crawler a threshold number of concurrent threads at which data can be crawled stably and quickly; the strategy comprises two stages, fast increase and slow adjustment:

In the fast-increase stage, the number of request threads is increased exponentially; after the threads are increased, the program determines within a time window whether the accounts in use are in a normal state; if normal, the thread count continues to be doubled, and Cookies are rotated according to the load balancing strategy; if abnormal, the abnormal Cookie is removed and a new account Cookie is added at the tail of the queue so that the Cookie queue length stays equal to its initial value, the thread count of the next time window is set to half the current thread count, and the slow-adjustment stage is entered;

where N_{t+1} denotes the number of acquisition threads in the next time window, and N_t denotes the number of acquisition threads in the current time window; state indicates whether the Cookie state was normal during data acquisition in the current time window, with 1 meaning normal and 0 meaning abnormal;

In the slow-adjustment stage, the number of request threads is increased linearly; after the threads are increased, the program determines within a time window whether the accounts in use are in a normal state; if normal, the thread count continues to be increased linearly, and Cookies are rotated according to the load balancing strategy; if abnormal, the abnormal Cookie is removed and a new Cookie is added at the tail of the queue so that the Cookie queue length stays equal to its initial value, and the current thread count is reduced linearly; this continues until the slow-adjustment stage ends, at which point the current thread count is the optimal number of threads with which microblog data can be acquired continuously and stably under the current network environment and Cookie queue length;

where N_{t+1} denotes the number of acquisition threads in the next time window, and N_t denotes the number of acquisition threads in the current time window; state indicates whether the Cookie state was normal during data acquisition in the current time window, with 1 meaning normal and 0 meaning abnormal.

7. The microblog data acquisition method based on multi-strategy fusion according to claim 1, characterized in that the visitor Cookie is constructed as follows:

Step a) obtain the three parameters tid, c and w

Analysis of the browser headers shows how tid is obtained: the parameters fp and cb must first be constructed; the fp parameter consists of browser-related information, including the parameters os, brower, fonts, plugins and screenInfo; the cb parameter is a fixed value, "gen_callback"; once fp and cb have been constructed, the parameter tid is obtained; at the same time, the server returns two parameters, new_tid and confidence, where the value of new_tid is true or false; when new_tid is true, w is 3; when new_tid is false, w is 2; the value of parameter c is the same as the value of confidence;

Step b) obtain the Cookie for the not-logged-in state

A new Cookie is first constructed from the tid obtained in step a); the Cookie contains one key-value pair, {"tid": tid + "__" + c}; the request is then completed by the GET method, and whether the not-logged-in Cookie was obtained successfully is determined by checking whether the value of the msg field in the returned content is succ; if the msg value is succ, the Cookie was obtained successfully, and the visitor Cookie can be taken from the headers of the response.

8. The microblog data acquisition method based on multi-strategy fusion according to claim 1, characterized in that the IP proxy pool comprises a proxy IP collector, a proxy IP checker and a proxy IP scheduler; the proxy IP collector periodically collects proxy IPs from proxy IP sources published on the network, including the proxy IP address, port and supported protocols; the proxy IP checker periodically verifies the collected proxy IP resources; the proxy IP scheduler supplies qualified proxy IPs to the crawler.

9. The microblog data acquisition method based on multi-strategy fusion according to claim 8, characterized in that the IP proxy pool is implemented as follows:

Step A) filter transparent IPs

When verifying a newly stored proxy IP, the proxy IP checker accesses an HTTP request-and-response service, which returns the IP from which the HTTP request arrived; if that IP is the same as the real IP of the crawler server, the proxy IP is discarded; if the returned content differs from the real IP of the server, the proxy IP is given an initial value; if a proxy port-closed error occurs, the proxy IP is deemed unavailable and is deleted directly;

Step B) verify against the microblog website itself

The microblog homepage is accessed through the proxy IP; if the page returned by the microblog site contains the string "microblogging-finds strange thing whenever and wherever possible", the proxy IP can be used for microblog data acquisition; if yzm_input appears in the response page, the proxy IP is deleted directly; if the request times out, the score of the proxy IP is reduced by 1; if a port-closed error occurs, the proxy IP is deleted directly; for a proxy IP that passes verification, its score, most recent verification time and response speed in the proxy IP pool are updated, to serve as the criteria by which the scheduler screens proxy IPs from the proxy IP pool;

Step C) proxy IP scheduling

Using preset values for three attributes of each proxy IP in the proxy IP pool (score, response time and most recent verification time), the proxy IP scheduler selects from the proxy IP pool the proxy IPs that meet the specified requirements, sorts them, and forms a circular linked list whose tail is joined to its head; each time a crawler requests a microblog page, the proxy IP scheduler assigns it the proxy IP at the head node of the list; when a response is obtained successfully, that proxy IP is moved to the tail; if the request fails, the proxy IP is deleted from the list;

Once the IP proxy pool has been connected, all HTTP requests are managed by a downloader middleware; for the microblog user data acquisition module, which requires higher access privileges, the downloader middleware takes a Cookie from the head of the Cookie queue and carries that Cookie when acquiring data; for the microblog content acquisition module, which requires lower access privileges, the downloader middleware obtains a proxy IP from the proxy IP scheduler and carries the visitor Cookie constructed through that proxy IP when acquiring data.
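By way of illustration only, the Cookie rotation recited in claim 4 can be sketched as follows. The class name `CookieRotator` and its method are hypothetical, and the sketch assumes a single shared TTL counter for the Cookie currently at the head of the queue; the claim does not specify how the TTL is stored.

```python
from collections import deque
from typing import Iterable


class CookieRotator:
    """Serve the head Cookie for `initial_ttl` requests, then move it to the tail."""

    def __init__(self, cookies: Iterable[str], initial_ttl: int):
        self._queue = deque(cookies)
        if not self._queue:
            raise ValueError("at least one Cookie is required")
        if initial_ttl < 1:
            raise ValueError("initial_ttl must be at least 1")
        self._initial_ttl = initial_ttl
        self._ttl = initial_ttl

    def next_cookie(self) -> str:
        """Return the Cookie for one request and account for its use."""
        cookie = self._queue[0]
        self._ttl -= 1                      # one request made with this Cookie
        if self._ttl == 0:
            self._queue.rotate(-1)          # head Cookie goes to the tail
            self._ttl = self._initial_ttl   # fresh TTL for the new head
        return cookie
```

With three Cookies and an initial TTL of 2, successive requests would carry the first Cookie twice, then the second twice, and so on, which spreads requests evenly across accounts.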