CN109933701A - A microblog data acquisition method based on multi-strategy fusion - Google Patents
- Publication number
- CN109933701A (application CN201910175559.0A)
- Authority
- CN
- China
- Prior art keywords
- cookie
- agent
- microblogging
- microblog
- queue
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a microblog data acquisition method based on multi-strategy fusion. First, a login is simulated and the Cookie of each successful login is obtained; the obtained Cookies are saved to a Cookie queue and the initial tasks are obtained. A multi-account load-balancing strategy is then used to crawl users' follow lists and profiles; user IDs are extracted to generate the queue of follow relationships and user information to be crawled, from which follow lists and profiles are crawled in turn, while the queue of user microblogs to be crawled is generated. A visitor Cookie is constructed, microblog content is crawled with acceleration from an IP proxy pool, and the information is stored in a database. Microblog IDs are extracted to generate the queue of comment information to be crawled; the microblog comments are crawled and stored in the database. Through an adaptive algorithm, the invention finds the number of concurrent requests suited to the current network environment and Cookie queue length, striking a balance between acquisition speed and account safety; it also implements a highly available proxy IP module to accelerate data acquisition, providing basic data support for online public opinion analysis.
Description
Technical field
The present invention relates to the technical field of network data collection, and specifically to a microblog data acquisition method based on multi-strategy fusion.
Background
The spread and development of the Internet have driven the flourishing of social networks. As one of today's most popular social network applications, the microblog is characterized by a large user base, frequent status updates and rapid information propagation; it has developed rapidly in recent years and has become one of China's main communication media. According to the 42nd Statistical Report on Internet Development in China, as of June 2018 microblogs ranked third among social applications with a user usage rate of 42.1%, up 1.2% from December 2017, and were further strengthened in areas such as fan interaction. Social networks have huge user bases, fast information spread, rich content and wide reach, and are therefore of great significance for online public opinion analysis.
Current microblog data collection methods generally acquire data either through the microblog application programming interface (Application Programming Interface, API) or by simulated login. Collection through the microblog API is subject to the platform's API authorization and call-count limits, so the volume of data that can be collected each day is small. Collection by simulated login breaks through the API limits, but it needs multiple accounts working under a load-balancing strategy to achieve reasonably fast collection, large-scale collection carries a risk of account bans, and keeping accounts alive is difficult. Existing microblog data collection methods usually rely on a single collection strategy, which makes the volume of collected data unstable and the efficiency low.
Summary of the invention
In view of the above problems, the object of the present invention is to provide a microblog data acquisition method based on multi-strategy fusion that collects microblog data stably and efficiently and provides basic data support for online public opinion analysis. The technical solution is as follows:
A microblog data acquisition method based on multi-strategy fusion, comprising the following steps:
Step 1: Simulate login and obtain the Cookie of the successful login;
Step 2: Save the Cookie obtained in step 1 to the Cookie queue;
Step 3: Obtain the initial tasks: select several microblog users with many followers as the initial nodes to crawl;
Step 4: Crawl users' follow lists and profiles using the multi-account load-balancing strategy: access the initial set of microblog users, construct the user profile URL from the user ID, and collect user profiles and user relationships;
Step 5: Extract user IDs, generate the queue of follow relationships and user information to be crawled, and go to step 4; at the same time, generate the queue of user microblogs to be crawled;
Step 6: Construct the visitor Cookie: access the microblog homepage without logging in; the crawler uses the Cookie that the microblog system generates to represent the visitor identity to collect microblog-related content;
Step 7: The collection program accesses the queue of user microblogs to be crawled, constructs the microblog URL from the microblog ID, and accesses the constructed URL using the visitor Cookie and the proxy IP pool to collect the microblog content;
Step 8: After downloading the microblog URL page, the collection program parses the page, extracts the microblog ID, and generates the queue of comment information to be crawled;
Step 9: The collection program accesses the queue of comment information to be crawled, downloads the microblog comment page using the visitor Cookie and the proxy IP pool, parses the page, extracts the microblog comment information, and stores the information in the database.
Further, the simulated login uses a program to imitate a user logging in to the server in order to obtain the Cookie of the logged-in account, with the following steps:
Step 1) Pre-login request: the program base64-encodes the user name and constructs the pre-login request address;
Step 2) Obtain the nonce and servertime used for encryption: send a GET request to obtain the variables nonce, servertime, pubkey and rsakv; nonce and servertime are used to encrypt the login password, while pubkey and rsakv are fixed values and are written directly into the program;
Step 3) Encrypt the login password with RSA2: using the nonce and rsakv obtained in step 2) together with the microblog public key rsakt, encrypt the user password with the RSA2 algorithm to obtain the encrypted password;
Step 4) Obtain the server credential: send the key parameters by POST; after the request completes, the server returns a response message containing two parts, retcode and arrURL;
Step 5) Obtain the Cookie of the successful login: access arrURL by GET; the server returns the personal information of the current user, and the Cookie returned by this request is the valid Cookie, which is used for data collection.
Further, the simulated login is repeated on a timer at intervals of less than 24 hours, and Cookies that are about to expire are replaced with the newest Cookies.
Further, in the multi-account load-balancing strategy, the Cookies of multiple accounts are obtained by simulated login and saved in a queue. When making a request, the crawler takes a Cookie from the head of the queue and assigns it an initial TTL value; each time the crawler makes a request carrying that Cookie, the Cookie's TTL value is decremented by 1; when the TTL value reaches 0 the Cookie is moved to the tail of the queue, and the Cookie now at the head of the queue is taken for page requests.
Further, after requesting a page, the crawler sleeps for a random period of time to protect the account.
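The Cookie rotation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names CookieQueue and random_pause, the sleep bounds, and the choice to reset the TTL at the moment a Cookie moves to the tail are all hypothetical.

```python
import random
from collections import deque


class CookieQueue:
    """Rotate account Cookies: use the head until its TTL is spent, then requeue it."""

    def __init__(self, cookies, initial_ttl=100):
        self.initial_ttl = initial_ttl
        # Each entry is [cookie, remaining_ttl]; the head is the Cookie in use.
        self._queue = deque([cookie, initial_ttl] for cookie in cookies)

    def acquire(self):
        """Return the Cookie for the next request and charge one unit of TTL."""
        entry = self._queue[0]
        cookie = entry[0]
        entry[1] -= 1
        if entry[1] == 0:
            # TTL spent: reset it (an assumption) and move this Cookie to the tail.
            entry[1] = self.initial_ttl
            self._queue.rotate(-1)
        return cookie


def random_pause(low=1.0, high=3.0):
    """Pick a random sleep length in seconds (the bounds are assumptions)."""
    return random.uniform(low, high)
```

A crawler thread would call acquire() before each request and then sleep for random_pause() seconds; with two Cookies and a TTL of 2 the Cookies are handed out in the order a, a, b, b, a.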
Further, step 4 uses an adaptive concurrent collection strategy which, given the current network environment and Cookie queue length, finds for the simulated-login crawler a concurrent thread count threshold at which data can be crawled stably and quickly. The strategy comprises two stages, fast increase and slow adjustment:
In the fast-increase stage the number of request threads grows exponentially. After each increase, the program checks within one time window whether the accounts in use are in a normal state. If normal, the thread count continues to double and Cookies are rotated according to the load-balancing strategy. If abnormal, the abnormal Cookie is removed and the Cookie of a new account is appended to the tail of the queue so that the Cookie queue length stays equal to its initial value; the thread count for the next time window is set to half the current thread count, and the slow-adjustment stage begins;
In the update rule for this stage, Nt+1 denotes the number of collection threads in the next time window and Nt the number of collection threads in the current time window; state indicates whether the Cookie state was normal during data collection in the current time window, with 1 meaning normal and 0 abnormal;
In the slow-adjustment stage the number of request threads grows linearly. After each increase, the program checks within one time window whether the accounts in use are in a normal state. If normal, the thread count continues to increase linearly and Cookies are rotated according to the load-balancing strategy. If abnormal, the abnormal Cookie is removed and a new Cookie is appended to the tail of the queue so that the Cookie queue length stays equal to its initial value, and the current thread count is reduced linearly. When the slow-adjustment stage ends, the current thread count is the optimal thread count for sustained, stable collection of microblog data under the current network environment and Cookie queue length;
In the update rule for this stage, Nt+1 denotes the number of collection threads in the next time window and Nt the number of collection threads in the current time window; state indicates whether the Cookie state was normal during data collection in the current time window, with 1 meaning normal and 0 abnormal.
Further, the visitor Cookie is constructed as follows:
Step a) Obtain the three parameters tid, c and w
The way to obtain tid is found by analyzing the browser headers. First the parameters fp and cb must be constructed: the fp parameter consists of browser-related information, including the parameters os, browser, fonts, plugins and screenInfo; the cb parameter is a fixed value, "gen_callback". Once fp and cb have been constructed, the parameter tid is obtained; at the same time the server returns two parameters, new_tid and confidence, where new_tid is true or false. When new_tid is true, w is 3; when new_tid is false, w is 2. The value of parameter c is the same as the value of confidence;
Step b) Obtain the Cookie for the not-logged-in state
First construct a new Cookie from the tid obtained in step a); the Cookie contains one key-value pair, {"tid": tid + "__" + c}. Then issue the request by GET and check whether the msg field of the returned content is succ to determine whether the not-logged-in Cookie was obtained; if msg is succ, the Cookie was obtained successfully and the visitor Cookie can be read from the response headers.
Further, the IP proxy pool comprises a proxy IP collector, a proxy IP checker and a proxy IP scheduler. The proxy IP collector periodically collects proxy IPs from publicly available proxy IP sources on the network, including the proxy IP address, port and supported protocols; the proxy IP checker periodically checks the collected proxy IP resources; the proxy IP scheduler supplies qualified proxy IPs to the crawler.
Further, the IP proxy pool is implemented as follows:
Step A) Filter transparent IPs
When checking a newly stored proxy IP, the proxy IP checker accesses the https://httpbin.org/ip service, which returns the IP address from which the HTTP request arrived. If that IP is the same as the real IP of the crawler server, the proxy IP is discarded; if the returned content differs from the server's real IP, the proxy IP is given an initial score; if an error such as a closed proxy port occurs, the proxy IP is considered unavailable and is deleted directly;
Step B) Check against the microblog site itself
The microblog homepage is accessed through the proxy IP. If the page returned by the microblog site contains the string "微博-随时随地发现新鲜事" ("Weibo: discover new things anytime, anywhere"), the proxy IP can be used for microblog data collection; if yzm_input appears in the response page, the proxy IP is deleted directly; if the request times out, 1 is subtracted from the proxy IP's score; if a port-closed error occurs, the proxy IP is deleted directly. For proxy IPs that pass the check, their score, time of latest check and response speed in the proxy IP pool are updated and serve as the criteria by which the scheduler screens proxy IPs from the pool;
Step C) Proxy IP scheduling
According to preset values for three attributes of the proxy IPs in the pool (score, response time and time of latest check), the proxy IP scheduler selects the proxy IPs that meet the requirements, sorts them, and forms a circular linked list. Each time the crawler requests a microblog page, the scheduler assigns it the proxy IP at the head node of the list; when a response is obtained successfully, that proxy IP is moved to the tail; if the request fails, the proxy IP is deleted from the list;
Once the IP proxy pool is in place, all HTTP requests are managed by a downloader middleware. For the user data collection module, which requires higher access rights, the downloader middleware takes a Cookie from the head of the Cookie queue and carries it when collecting data. For the microblog content collection module, which requires lower access rights, the downloader middleware obtains a proxy IP from the proxy IP scheduler and collects data through that proxy IP carrying the constructed visitor Cookie.
The beneficial effects of the present invention are:
(1) Based on the respective characteristics of crawling microblogs by simulated login and by a constructed visitor Cookie, the invention proposes an optimized method for each.
(2) The IP proxy pool proposed by the invention can also be used in data collection for other social networks, news sites, forums, blogs and the like, preventing the collection program from being interrupted by IP access restrictions.
(3) The multi-strategy fusion microblog data collection method designed by the invention collects microblog data stably and efficiently.
Brief description of the drawings
Fig. 1 is the architecture diagram of the microblog data acquisition system of the invention.
Fig. 2 is the flowchart of microblog data collection based on multi-strategy fusion of the invention.
Fig. 3 is the flowchart of simulated login to the microblog system of the invention.
Fig. 4 is the proxy IP crawling and checking flow of the invention.
Fig. 5 is the performance comparison chart for user follow-relationship collection.
Fig. 6 is the performance comparison chart for user information collection.
Fig. 7 is the performance comparison chart for microblog information collection.
Fig. 8 is the performance comparison chart for comment information collection.
Detailed description of embodiments
The invention is described in further detail below with reference to the drawings and a specific embodiment.
Microblog data to be collected includes users' personal information, user relationships, trending topic content, all microblogs under a trending topic, all microblogs of a user, and all comments on and reposts of a microblog. Because user profiles, microblog content and microblog comments are of great significance in public opinion analysis, the subsequent experiments of the invention select user profiles, user relationships, user microblogs and their comments as collection targets.
Based on the multi-strategy fusion acquisition method, the invention proposes a microblog data acquisition system whose architecture is shown in Fig. 1. The system uses a breadth-first strategy: seed nodes are first selected manually according to follower count and the follow lists of these initial users are collected; the follow list of every user that the current user follows is then obtained in turn, expanding outward layer by layer, while each user's information and all of that user's microblogs and their comments are collected. The specific collection flow is shown in Fig. 2.
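The breadth-first expansion from seed users can be illustrated with a short Python sketch. It is only an illustration: the function name breadth_first_users, the get_following callback, the max_users cap and the toy follow graph are hypothetical stand-ins for the real crawl of follow lists.

```python
from collections import deque


def breadth_first_users(seed_ids, get_following, max_users):
    """Return user IDs in breadth-first order starting from the seed users."""
    seen = set(seed_ids)
    frontier = deque(seed_ids)
    order = []
    while frontier and len(order) < max_users:
        user_id = frontier.popleft()
        order.append(user_id)
        for followed_id in get_following(user_id):
            if followed_id not in seen:  # skip users already queued (avoids cycles)
                seen.add(followed_id)
                frontier.append(followed_id)
    return order


# Toy follow graph standing in for crawled follow lists.
graph = {"u1": ["u2", "u3"], "u2": ["u3", "u4"], "u3": ["u1"], "u4": []}
print(breadth_first_users(["u1"], lambda uid: graph.get(uid, []), max_users=10))
# prints ['u1', 'u2', 'u3', 'u4']
```

The seen set is what keeps a user who is followed by several others from being queued more than once.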
The embodiment of the invention is described below with a concrete case.
Step 1: Simulate login and obtain the Cookie of the successful login;
Simulated login is the process of using a program to imitate a user logging in to the server in order to obtain the Cookie of the logged-in account. The simulated login flow for the microblog system is shown in Fig. 3.
1. Pre-login request
The program base64-encodes the user name and then constructs the pre-login request address as follows: http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=MTg3MDgxMDMwMzM%3D&rsakt=mod&checkpin=1&client=ssologin.js(v1.4.18)&_=1526959231, where the value of su, MTg3MDgxMDMwMzM%3D, is the base64-encoded user name (note: the base64-encoded user name is "MTg3MDgxMDMwMzM="; in the request URL "=" is replaced with "%3D"), and the value of _, 1526959231, is the current timestamp.
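As a hedged sketch of the pre-login step, the following Python builds the request address from the query parameters given in the text. The function names and the sample user name "alice" are hypothetical, and the client value assumes the reading ssologin.js(v1.4.18). Note that urlencode percent-escapes the parentheses in the client value as well as the "=" padding.

```python
import base64
import time
from urllib.parse import urlencode

PRELOGIN_URL = "http://login.sina.com.cn/sso/prelogin.php"


def encode_username(username):
    """Base64-encode the user name, as the pre-login step requires."""
    return base64.b64encode(username.encode("utf-8")).decode("ascii")


def build_prelogin_url(username, timestamp=None):
    """Assemble the pre-login address from the query parameters in the text."""
    params = {
        "entry": "weibo",
        "callback": "sinaSSOController.preloginCallBack",
        "su": encode_username(username),
        "rsakt": "mod",
        "checkpin": "1",
        "client": "ssologin.js(v1.4.18)",  # assumed reading of the client value
        "_": str(int(time.time()) if timestamp is None else timestamp),
    }
    # urlencode percent-escapes the values, so "=" padding in su becomes "%3D".
    return PRELOGIN_URL + "?" + urlencode(params)


print(build_prelogin_url("alice", timestamp=1526959231))
```

For the hypothetical user name "alice" the su value is "YWxpY2U=", which appears in the URL as su=YWxpY2U%3D.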
2. Obtain the nonce and servertime used for encryption
Sending the GET request yields variables including nonce, servertime, pubkey and rsakv. nonce and servertime are used in the next step to encrypt the login password; pubkey and rsakv are fixed values and can be written directly into the program.
3. Encrypt the login password with RSA2
The microblog system encrypts the login password with the RSA2 algorithm. Using the nonce and rsakv obtained in the previous step together with the microblog public key rsakt, the user password is encrypted with the RSA2 algorithm to obtain the encrypted password.
4. Obtain the server credential
Request http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.19) by POST. The key parameters to send are:
entry=weibo // login source
savestate=7 // whether to save the password
useticket=1 // whether to log in using a user credential
su=MTg3MDgxMDMwMzM= // base64-encoded user name
servertime=1526959231 // server timestamp obtained in the pre-login stage
nonce=ES6HQ1 // server random code obtained in pre-login
sp=password // the encrypted password
After the request completes, the server returns a response message containing two parts, retcode and arrURL. The value of arrURL is the URL needed for verification in the next step.
5. Obtain the Cookie of the successful login
Access arrURL by GET; the server returns the personal information of the current user, and the Cookie returned by this request is the valid Cookie, which can be used for data collection. In addition, the inventors found that a microblog Cookie expires after 24 hours, so to collect data efficiently and stably over the long term, the simulated login should be repeated on a timer at intervals of less than 24 hours and Cookies that are about to expire replaced with the newest Cookies.
Step 2: Save the Cookie obtained in step 1 to the Cookie queue;
To collect large amounts of microblog data quickly, multiple successfully logged-in Cookies must be obtained and saved to the Cookie queue, so that step 4 can collect microblog data with the multi-account load-balancing strategy.
Step 3: Obtain the initial tasks;
Select several microblog users with many followers as the initial nodes to crawl. The follow behaviour of microblog users is the manifestation of the microblog topology, and a user's follower count indirectly reflects that user's influence and the quality of the account. Selecting several users with many followers effectively prevents the collected user nodes from forming a cycle and avoids collecting large numbers of zombie users.
Step 4: Crawl users' follow lists and profiles using the multi-account load-balancing strategy;
Access the initial set of microblog users, construct the user profile URL from the user ID, and collect user profiles and user relationships.
In the logged-in state, the microblog system limits the number of requests a single account may make within a given period; if the request rate of the current account exceeds the server's limit, the account is marked as abnormal by the microblog anti-crawler system. To speed up microblog data collection, this embodiment uses 10 accounts and collects data under a defined access strategy.
First, the Cookies of the 10 accounts are obtained by simulated login and saved in a queue. When making a request, the crawler takes a Cookie from the head of the queue and gives it an initial TTL of 100; each time the crawler makes a request carrying that Cookie, the Cookie's TTL is decremented by 1; when the TTL reaches 0 the Cookie is moved to the tail of the queue, and the Cookie now at the head of the queue is taken for page requests. To imitate human behaviour realistically, after requesting a page the crawler sleeps for a random period of time to protect the account.
To improve collection efficiency, the invention issues concurrent requests from multiple threads. Too many threads increase the number of accesses within a time window and thus the risk of an account ban, so for a crawler based on simulated login it is necessary to find, given the current network environment and Cookie queue length, a concurrent request threshold at which data can be crawled stably and quickly. Drawing on the idea of the TCP congestion avoidance algorithm, the invention adopts an adaptive concurrent collection strategy to find the concurrent thread threshold at which stable and efficient collection is possible. The strategy comprises two stages: fast increase and slow adjustment.
1. Fast-increase stage
In this stage the number of request threads grows exponentially. After each increase, the program checks within one time window whether the accounts in use are in a normal state. If normal, the thread count continues to double and Cookies are rotated according to the load-balancing scheme above. If abnormal, the abnormal Cookie is removed and the Cookie of a new account is appended to the tail of the queue so that the Cookie queue length stays equal to its initial value; the thread count for the next time window is set to half the current thread count, and the slow-adjustment stage begins.
Nt+1 denotes the number of collection threads in the next time window and Nt the number of collection threads in the current time window; state indicates whether the Cookie state was normal during data collection in the current time window.
2. Slow-adjustment stage
In this stage the number of request threads grows linearly. After each increase, the program checks within one time window whether the accounts in use are in a normal state. If normal, the thread count continues to increase linearly and Cookies are rotated according to the load-balancing scheme above. If abnormal, the abnormal Cookie is removed and a new Cookie is appended to the tail of the queue so that the Cookie queue length stays equal to its initial value, and the current thread count is reduced linearly. When the slow-adjustment stage finally ends, the current thread count is the optimal thread count for sustained, stable collection of microblog data under that network environment and Cookie queue length.
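The two-stage adjustment can be sketched in Python as below. This is a reconstruction from the prose only, under stated assumptions: doubling and halving in the fast-increase stage follow the description, while the linear step of 1, the lower bound of 1 thread and the function name next_thread_count are hypothetical. The text does not say when the slow-adjustment stage ends, so the sketch leaves that decision to the caller.

```python
def next_thread_count(current, accounts_ok, stage, step=1, minimum=1):
    """Return (thread count, stage) for the next time window.

    stage is "fast" (fast increase) or "slow" (slow adjustment).
    """
    if stage == "fast":
        if accounts_ok:
            return current * 2, "fast"              # keep doubling
        return max(minimum, current // 2), "slow"   # halve and switch stage
    if accounts_ok:
        return current + step, "slow"               # linear increase
    return max(minimum, current - step), "slow"     # linear decrease


# Replay a hypothetical sequence of per-window account states (True = normal).
threads, stage = 1, "fast"
history = []
for ok in [True, True, True, False, True, True, False]:
    threads, stage = next_thread_count(threads, ok, stage)
    history.append(threads)
print(history)  # prints [2, 4, 8, 4, 5, 6, 5]
```

In this replay the thread count doubles until the first abnormal window, is halved once, and then moves up or down by one thread per window.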
Step 5: Extract user IDs, generate the queue of follow relationships and user information to be crawled, and go to step 4; at the same time, generate the queue of user microblogs to be crawled;
Step 6: Construct the visitor Cookie;
When the microblog homepage is accessed in the not-logged-in state, the microblog system generates a Cookie for the current visitor to represent the visitor identity, and the crawler can use this Cookie to collect microblog-related content. The visitor Cookie is constructed as follows:
1. Obtain the three parameters tid, c and w
The way to obtain tid can be found by analyzing the browser headers. First the parameters fp and cb must be constructed. fp consists of browser-related information, including the parameters os, browser, fonts, plugins and screenInfo; this information can be forged. A valid fp parameter looks like this:
{"os":"1","browser":"Chrome57,0,2110,104","fonts":"undefined","screenInfo":"1436*752*24","plugins":"Portable Document Format::internal-pdf-viewer::Chrome PDF Plugin|::mhjfbmdgcfjbbpaeojofohoefgiehjai::Chrome PDF Viewer|::internal-nacl-plugin::Native Client|Enables Widevine licenses for playback of HTML audio/video content.(version:1.4.8.1008)::widevinecdmadapter.dll::Widevine Content Decryption Module"}
The cb parameter is a fixed value, "gen_callback". Once fp and cb have been constructed, the tid parameter is obtained by a POST request to https://passport.weibo.com/visitor/genvisitor. At the same time the server returns two parameters, new_tid and confidence; new_tid is true or false. When new_tid is true, w is 3; when new_tid is false, w is 2. The value of parameter c is the same as the value of confidence.
2. Obtain the Cookie for the not-logged-in state
First construct a new Cookie from the tid obtained in the first step; the Cookie contains one key-value pair, {"tid": tid + "__" + c}. Then request https://passport.weibo.com/visitor/visitor?a=incarnate&t=tid&w=w&c=c&gc=&cb=cross_domain&from=weibo by GET. After the request completes, check whether the msg field of the returned content is succ to determine whether the not-logged-in Cookie was obtained. If msg is succ, the Cookie was obtained successfully and the visitor Cookie can be read from the response headers.
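The parameter handling in these two steps can be sketched in Python without any network access. The function names are hypothetical, and the sketch assumes the server responses have already been parsed into Python values; the wire format of those responses is not described in the text.

```python
import json


def build_genvisitor_form(fingerprint):
    """Form fields for the genvisitor POST: cb is fixed, fp is the JSON fingerprint."""
    return {"cb": "gen_callback", "fp": json.dumps(fingerprint)}


def build_incarnate_request(tid, new_tid, confidence):
    """Derive w and c from the genvisitor result and build the follow-up GET."""
    w = 3 if new_tid else 2       # w depends only on new_tid
    c = str(confidence)           # c takes the value of confidence
    cookie = {"tid": tid + "__" + c}
    params = {"a": "incarnate", "t": tid, "w": w, "c": c,
              "gc": "", "cb": "cross_domain", "from": "weibo"}
    return cookie, params


def visitor_cookie_ok(payload):
    """The visitor Cookie was issued only if the msg field equals 'succ'."""
    return payload.get("msg") == "succ"


cookie, params = build_incarnate_request("abc", new_tid=True, confidence=95)
print(cookie, params["w"])  # prints {'tid': 'abc__95'} 3
```

The returned cookie and params would then be attached to the GET request against the visitor endpoint with whatever HTTP client the crawler uses.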
Step 7: Crawl microblog content with acceleration from the IP proxy pool and store the information in the database;
The collection program accesses the queue of user microblogs to be crawled, constructs the microblog URL from the microblog ID, and accesses the constructed URL using the visitor Cookie and the proxy IP pool to collect the microblog content.
In the not-logged-in state the microblog system restricts crawler behaviour mainly by IP, so to accelerate data collection the invention designs and implements an IP proxy pool. The IP proxy pool consists of three parts: a proxy IP collector, a proxy IP checker and a proxy IP scheduler. The proxy IP collector periodically collects proxy IPs from publicly available proxy IP sources on the network, including the proxy IP address, port and supported protocols. The proxy IP checker periodically checks the collected proxy IP resources. The proxy IP scheduler supplies qualified proxy IPs to the crawler. The implementation flow of the IP proxy pool is shown in Fig. 4.
1. Filter transparent IPs
When checking a newly stored proxy IP, the proxy IP checker accesses an HTTP request and response service (at https://httpbin.org/ip; this is a free checking service on the Internet, and a private HTTP request and response service can also be set up on an intranet as needed). The service returns the IP address from which the HTTP request arrived. If that IP is the same as the real IP of the crawler server, the proxy IP is discarded; if the returned content differs from the server's real IP, the proxy IP is given an initial score, which the invention sets to 5; if the request times out, the invention gives the proxy IP an initial score of 4; if an error such as a closed proxy port occurs, the proxy IP is considered unavailable and is deleted directly.
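The scoring rule of this filter is small enough to write as a pure function. In the Python sketch below, the function name initial_score and the three outcome labels are hypothetical; the scores 5 and 4 and the discard cases come from the text, and performing the actual request through the proxy is left out.

```python
def initial_score(outcome, reported_ip=None, real_ip=None):
    """Score a newly stored proxy, or return None if it should be discarded.

    outcome is "ok", "timeout" or "error" (for example, proxy port closed).
    """
    if outcome == "error":
        return None        # unusable proxy: delete
    if outcome == "timeout":
        return 4           # slow proxy starts one point lower
    if reported_ip == real_ip:
        return None        # transparent proxy: the real IP leaks through
    return 5               # the proxy hides the crawler's IP


print(initial_score("ok", reported_ip="198.51.100.9", real_ip="203.0.113.7"))  # prints 5
```

The caller would obtain reported_ip from the checking service's response and compare it with the crawler server's own public address.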
2. Check against the microblog site itself
A proxy IP that passes the filter is not necessarily usable for microblog data collection: it may be blocked by the microblog site, or it may be of poor quality and too slow. The microblog homepage is accessed through the proxy IP, and whether the returned page contains the string "微博-随时随地发现新鲜事" ("Weibo: discover new things anytime, anywhere") determines whether the proxy IP can be used for microblog data collection. If yzm_input appears in the response page, the proxy IP has been identified as abnormal by the microblog site and is deleted directly; if the request times out, 1 is subtracted from the proxy IP's score; if an error such as a closed port occurs, the proxy IP is deleted directly. For proxy IPs that pass the check, their score, time of latest check and response speed in the proxy IP pool are updated and serve as the criteria by which the scheduler screens proxy IPs from the pool. The time of the latest check is used as a criterion because publicly listed proxy IPs are short-lived, typically 3 to 10 minutes, so a proxy IP that passes a check may well stop working a short time later. The score measures the stability of a proxy IP, and the response speed measures its request rate.
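A minimal Python sketch of this check follows. The names ProxyRecord and apply_site_check and the outcome labels are hypothetical; the text does not say how the score changes when a check passes or what to do when neither marker is found, so the sketch only refreshes the check time and response time on a pass and otherwise keeps the proxy unchanged.

```python
from dataclasses import dataclass


@dataclass
class ProxyRecord:
    address: str
    score: int
    last_checked: float = 0.0
    response_time: float = 0.0


def apply_site_check(proxy, outcome, body="", elapsed=0.0, now=0.0,
                     home_marker="微博-随时随地发现新鲜事",
                     captcha_marker="yzm_input"):
    """Update a proxy after one homepage check; return False if it must be deleted."""
    if outcome == "error":              # for example, proxy port closed
        return False
    if outcome == "timeout":
        proxy.score -= 1                # penalise, but keep for now
        return True
    if captcha_marker in body:          # the site has flagged this IP
        return False
    if home_marker in body:             # check passed: refresh its attributes
        proxy.last_checked = now
        proxy.response_time = elapsed
    return True
```

The checker would call this once per proxy per checking round and remove every proxy for which it returns False.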
3, Agent IP is dispatched
Besides a good proxy IP verification strategy, the proxy IP scheduling strategy is also very important. The proxy IP scheduler selects from the proxy IP pool the proxy IPs whose three attributes (score, response time and most recent verification time) satisfy preset values, sorts them, and links them into a circular linked list. Each time a crawler requests a microblog page, the scheduler assigns it the proxy IP at the head node of the list. When a response is obtained successfully, that proxy IP is moved to the tail; if the request fails, the proxy IP is deleted from the list. The scheduler also runs a subprocess that periodically screens the proxy IP pool for proxy IPs meeting the three conditions and appends them to the tail of the list, so that the scheduler does not run short of proxy IPs.
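The scheduling behaviour described above might be sketched as follows. This is an illustration under stated assumptions: the class and method names are hypothetical, a `deque` stands in for the circular linked list, the sort key (higher score first, then faster response) is an assumption because the text only says the proxy IPs are sorted, and the periodic subprocess is reduced to a `refill` method that the caller invokes on a timer.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Proxy:
    address: str
    score: int
    response_time: float  # seconds
    last_checked: float   # Unix timestamp


class ProxyScheduler:
    def __init__(self, min_score, max_response_time, max_age):
        self.min_score = min_score
        self.max_response_time = max_response_time
        self.max_age = max_age      # seconds since the last verification
        self._ring = deque()        # head = next proxy to hand out
        self._in_flight = set()     # addresses currently lent to crawlers

    def _qualifies(self, p, now):
        return (p.score >= self.min_score
                and p.response_time <= self.max_response_time
                and now - p.last_checked <= self.max_age)

    def refill(self, pool, now):
        """Append qualifying proxies that the scheduler does not hold yet."""
        known = {p.address for p in self._ring} | self._in_flight
        fresh = [p for p in pool
                 if self._qualifies(p, now) and p.address not in known]
        fresh.sort(key=lambda p: (-p.score, p.response_time))
        self._ring.extend(fresh)

    def acquire(self):
        """Hand out the proxy at the head of the list, or None if empty."""
        if not self._ring:
            return None
        proxy = self._ring.popleft()
        self._in_flight.add(proxy.address)
        return proxy

    def release(self, proxy, success):
        """Return a proxy to the tail on success; drop it on failure."""
        self._in_flight.discard(proxy.address)
        if success:
            self._ring.append(proxy)
```

A crawler would call `acquire()` before each request and `release(proxy, success)` afterwards. The sketch is not thread-safe; a real multi-threaded crawler would need a lock around these calls.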
Once the IP proxy pool is connected, all HTTP requests are managed by a downloader middleware. For the microblog user data collection module, which requires higher access privileges, the downloader middleware takes a Cookie from the head of the Cookie queue and carries that Cookie when collecting data. For the microblog content collection module, which requires lower access privileges, the downloader middleware obtains a proxy IP from the proxy IP scheduler and collects data carrying the visitor Cookie constructed through that proxy IP.
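The routing rule in this paragraph can be sketched independently of any crawler framework. The function below is a hypothetical illustration: `prepare_request`, the module constants, and the two callables `acquire_proxy` and `visitor_cookie_for` are assumed names, and a request is modelled as a plain dictionary rather than a framework object.

```python
from collections import deque

USER_DATA = "user_data"        # module that needs a logged-in Cookie
MICROBLOG_CONTENT = "content"  # module served by visitor Cookie plus proxy


def prepare_request(url, module, cookie_queue, acquire_proxy, visitor_cookie_for):
    """Attach credentials to a request according to the issuing module."""
    request = {"url": url, "cookies": None, "proxy": None}
    if module == USER_DATA:
        # Higher privileges: use the logged-in Cookie at the queue head.
        request["cookies"] = cookie_queue[0]
    elif module == MICROBLOG_CONTENT:
        # Lower privileges: a proxy IP and the visitor Cookie built through it.
        proxy = acquire_proxy()
        request["proxy"] = proxy
        request["cookies"] = visitor_cookie_for(proxy)
    else:
        raise ValueError(f"unknown module: {module!r}")
    return request
```

Logged-in requests deliberately go out without a proxy in this sketch, matching the text, which pairs proxies only with visitor Cookies.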
Step 8: Extract microblog IDs and generate the queue of comment information to be crawled;
After the collection program downloads the page at a microblog URL, it parses the page, extracts the microblog IDs, and generates the queue of comment information to be crawled.
Step 9: Crawl microblog comment information and store it in the database.
The collection program reads the queue of comment information to be crawled, downloads the microblog comment pages using the visitor Cookie and the proxy IP pool, parses the pages, extracts the microblog comment information, and stores it in the database.
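The ID extraction of Step 8, which feeds the queue read in Step 9, might look like the sketch below. Everything specific here is an assumption made for illustration: the text does not describe the page markup, so the `mid="<digits>"` pattern is hypothetical, as are the names `CrawlQueue` and `enqueue_microblog_ids`; skipping IDs that were already queued is also an added convenience rather than something the text states.

```python
import re
from collections import deque

# Hypothetical markup: each microblog is tagged with mid="<digits>".
MID_PATTERN = re.compile(r'mid="(\d+)"')


class CrawlQueue:
    """FIFO queue that ignores items it has already seen."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def push(self, item):
        if item not in self._seen:
            self._seen.add(item)
            self._queue.append(item)

    def pop(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


def enqueue_microblog_ids(page_html, comment_queue):
    """Parse a downloaded page and queue each microblog ID once."""
    for mid in MID_PATTERN.findall(page_html):
        comment_queue.push(mid)
```

The comment collector of Step 9 would then call `pop()` repeatedly and build one comment-page request per ID.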
For Fig. 5, 10 "big V" microblog user IDs, each with more than 1,000,000 followers, were selected as seed IDs. Using simulated login, the follow relations were collected continuously in three ways: a single thread with a single Cookie, multiple threads with multiple Cookies, and multiple Cookies with an adaptive number of concurrent threads. In the initial stage of collection, the 10-thread concurrent scheme is far more efficient than the other schemes. After 5 hours of continuous collection, however, the amount of data obtained by the 10-thread scheme is much smaller than that obtained by the adaptive concurrent-thread scheme. The reason is that the concurrent access volume of 10 threads is large, so after the program has run for a while the microblog system judges the accounts to be abnormal, which causes subsequent collection to fail.
All user IDs collected in one day with the adaptive number of concurrent threads were then used as seeds. The profiles, all microblogs and the microblog comments of these seed users were collected with four schemes: API-based; simulated-login-based; API fused with constructed visitor Cookies; and simulated login fused with constructed visitor Cookies.
Figs. 6-8 show, in turn, the user information, microblog content and microblog comments collected by each scheme over 5 hours of continuous collection. Taken together, Figs. 6-8 show that the scheme that fuses simulated login with visitor Cookies and works with the proxy pool collects far faster than the other schemes. This is because the bottleneck of this scheme is the quality of the proxy IPs in the proxy pool, and a good proxy IP verification and screening strategy can guarantee the availability of the proxy IPs. The other schemes are subject to more restrictions. The API-based scheme limits the number of requests per authorized user and per IP; because these limits are very strict, its collection speed is the slowest. The simulated-login-based scheme limits the access speed of the logged-in user, so the adaptive method proposed by the invention is needed to find the threshold on concurrent collection threads before collection can be stable and efficient. The scheme based on constructed visitor Cookies is limited by the request frequency that a single IP may send to the microblog system within a fixed time window. The microblog system's restrictions on IPs are looser than its restrictions on accounts.
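The adaptive search for a concurrent-thread threshold referred to above has a fast-increase stage and a slow-adjustment stage. One reading of that description is sketched below. The specifics are assumptions: the text says "exponential" growth and halving on failure, which the sketch interprets as doubling; the linear step of 1 and the floor of 1 thread are illustrative choices; and `next_thread_count` is a hypothetical name.

```python
def next_thread_count(current, cookies_normal, stage, step=1, floor=1):
    """Return (thread count, stage) for the next time window.

    current        -- thread count used in the window that just ended
    cookies_normal -- True if every Cookie stayed normal in that window
    stage          -- "fast" (fast increase) or "slow" (slow adjustment)
    """
    if stage == "fast":
        if cookies_normal:
            return current * 2, "fast"            # keep growing exponentially
        return max(floor, current // 2), "slow"   # halve, then fine-tune
    if stage == "slow":
        if cookies_normal:
            return current + step, "slow"         # linear increase
        return max(floor, current - step), "slow" # linear decrease
    raise ValueError(f"unknown stage: {stage!r}")


def simulate(window_states, start=1):
    """Replay a sequence of per-window Cookie states and record the counts."""
    count, stage, history = start, "fast", []
    for normal in window_states:
        count, stage = next_thread_count(count, normal, stage)
        history.append(count)
    return history
```

With the window states `[True, True, True, False, True, True, False]`, `simulate` yields `[2, 4, 8, 4, 5, 6, 5]`: the count doubles until an account is flagged, halves once, then oscillates in single steps around the threshold. Replacing the flagged Cookie and deciding when the slow stage ends are left to the caller.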
Claims (9)
1. A microblog data acquisition method based on multi-strategy fusion, characterized by comprising the following steps:
Step 1: perform a simulated login and obtain the Cookie of the successful login;
Step 2: save the Cookie obtained in step 1 into a Cookie queue;
Step 3: obtain the initial task: select several microblog users with many followers as the initial nodes to crawl;
Step 4: crawl users' follow lists and profiles using multi-account load balancing: access the initial set of microblog users, construct the user-profile URL from each user ID, and collect the user profiles and user relations;
Step 5: extract user IDs, generate the queue of follow relations and user information to be crawled, and go to step 4; at the same time, generate the queue of user microblogs to be crawled;
Step 6: construct a visitor Cookie: access the microblog homepage without logging in, and collect microblog content using the Cookie that the microblog system generates to mark the crawler as a visitor;
Step 7: the collection program reads the queue of user microblogs to be crawled, constructs microblog URLs from the microblog IDs, accesses the constructed URLs using the visitor Cookie and the proxy IP pool, and collects the microblog content;
Step 8: after the collection program downloads the page at a microblog URL, it parses the page, extracts the microblog IDs, and generates the queue of comment information to be crawled;
Step 9: the collection program reads the queue of comment information to be crawled, downloads the microblog comment pages using the visitor Cookie and the proxy IP pool, parses the pages, extracts the microblog comment information, and stores it in the database.
2. The microblog data acquisition method based on multi-strategy fusion according to claim 1, characterized in that the simulated login uses a program to simulate a user logging in to the server so as to obtain the Cookie of the login account, with the following steps:
Step 1) pre-login request: the program base64-encodes the user name and constructs the pre-login request address;
Step 2) obtain the nonce and servertime used for encryption: send a GET request to obtain the variables nonce, servertime, pubkey and rsakv; nonce and servertime are used to encrypt the login password, while pubkey and rsakv are fixed values written directly into the program;
Step 3) encrypt the login password with RSA2: using the nonce and rsakv obtained in step 2), together with the microblog's public key rsakt, encrypt the user password with the RSA2 algorithm to obtain the encrypted password;
Step 4) obtain the server credential: send the key parameters by POST; after the request completes, the server returns response information comprising two parts, retcode and arrURL;
Step 5) obtain the Cookie of the successful login: access arrURL by GET; the server returns the current user's personal information, and the Cookie returned with this request is a valid Cookie that is used for data collection.
3. The microblog data acquisition method based on multi-strategy fusion according to claim 2, characterized in that the simulated login is repeated on a timer at intervals of less than 24 hours, and Cookies that are about to expire are replaced with the newest Cookies.
4. The microblog data acquisition method based on multi-strategy fusion according to claim 1, characterized in that, in the multi-account load balancing, the Cookies of multiple accounts are obtained by simulated login and saved in a queue; when a crawler makes a request, it takes a Cookie from the head of the queue and assigns it an initial TTL value; each time a crawler makes a request carrying that Cookie, the TTL value of the Cookie is decremented by 1; when the TTL value reaches 0, the Cookie is put at the tail of the queue, and the Cookie then at the head of the Cookie queue is taken for subsequent page requests.
5. The microblog data acquisition method based on multi-strategy fusion according to claim 4, characterized in that, after requesting a page, the crawler sleeps for a random period of time to protect the account.
6. The microblog data acquisition method based on multi-strategy fusion according to claim 4, characterized in that step 4 uses an adaptive concurrent collection strategy which, given the current network environment and the Cookie queue length, finds for the simulated-login-based web crawler the threshold number of concurrent threads at which data can be captured stably and quickly; the strategy comprises two stages, fast increase and slow adjustment:
In the fast-increase stage, the number of request threads grows exponentially; after each increase, the program judges within a time window whether the accounts in use are in a normal state; if normal, the thread count continues to grow exponentially and Cookies are rotated according to the load-balancing strategy; if abnormal, the abnormal Cookie is removed and the Cookie of a new account is appended at the tail of the queue so that the Cookie queue length stays equal to its initial value, the thread count for the next time window is set to half the current thread count, and the slow-adjustment stage is entered;
where $N_{t+1}$ denotes the number of collection threads in the next time window, $N_t$ denotes the number of collection threads in the current time window, and state indicates whether the Cookie state was normal during data collection in the current time window, 1 being normal and 0 abnormal;
In the slow-adjustment stage, the number of request threads is increased linearly; after each increase, the program judges within a time window whether the accounts in use are in a normal state; if normal, the thread count continues to increase linearly and Cookies are rotated according to the load-balancing strategy; if abnormal, the abnormal Cookie is removed and a new Cookie is appended at the tail of the queue so that the Cookie queue length stays equal to its initial value, and the current thread count is reduced linearly; when the slow-adjustment stage ends, the current thread count is the optimal number of threads with which microblog data can be collected continuously and stably under the current network environment and Cookie queue length;
where $N_{t+1}$ denotes the number of collection threads in the next time window, $N_t$ denotes the number of collection threads in the current time window, and state indicates whether the Cookie state was normal during data collection in the current time window, 1 being normal and 0 abnormal.
7. The microblog data acquisition method based on multi-strategy fusion according to claim 1, characterized in that the visitor Cookie is constructed as follows:
Step a) obtain the three parameters tid, c and w
Analysis of the browser request headers shows how tid is obtained: first the parameters fp and cb must be constructed; the fp parameter consists of browser-related information, including the parameters os, brower, fonts, plugins and screenInfo; the cb parameter is a fixed value, "gen_callback"; once fp and cb are constructed, the parameter tid is obtained; at the same time, the server returns two parameters, new_tid and confidence, where the value of new_tid is true or false; when new_tid is true, w is 3; when new_tid is false, w is 2; the value of parameter c equals the value of confidence;
Step b) obtain the Cookie for the not-logged-in state
First construct a new Cookie from the tid obtained in step a); the Cookie contains one key-value pair, {"tid": tid + "__" + c}; then complete the request by GET and check whether the value of the msg field in the returned content is succ to judge whether the not-logged-in Cookie was obtained; if msg is succ, the Cookie was obtained successfully and the visitor Cookie can be read from the response headers.
8. The microblog data acquisition method based on multi-strategy fusion according to claim 1, characterized in that the IP proxy pool comprises a proxy IP collector, a proxy IP checker and a proxy IP scheduler; the proxy IP collector periodically collects proxy IPs from proxy IP sources published on the network, including the proxy IP address, port and supported protocols; the proxy IP checker periodically verifies the collected proxy IP resources; the proxy IP scheduler supplies qualified proxy IPs to the crawler.
9. The microblog data acquisition method based on multi-strategy fusion according to claim 8, characterized in that the IP proxy pool is implemented as follows:
Step A) filter out transparent IPs
When verifying a newly stored proxy IP, the proxy IP checker accesses an HTTP request-and-response service, which returns the IP from which the HTTP request arrived; if that IP is the same as the real IP of the crawler server, the proxy IP is discarded; if the returned IP differs from the real IP of the server, the proxy IP is given an initial value; if a proxy-port-closed error occurs, the proxy IP is considered unavailable and is deleted directly;
Step B) verify against the microblog website itself
The microblog homepage is accessed through the proxy IP; if the page returned by the microblog contains the string "微博-随时随地发现新鲜事" ("Microblog: discover new things anytime, anywhere"), the proxy IP can be used for microblog data collection; if yzm_input appears in the response page, the proxy IP is deleted directly; if the request times out, the score of the proxy IP is reduced by 1; if a port-closed error occurs, the proxy IP is deleted directly; for a proxy IP that passes verification, its score, most recent verification time and response speed in the proxy IP pool are updated, and these serve as the criteria by which the scheduler screens proxy IPs from the pool;
Step C) proxy IP scheduling
The proxy IP scheduler selects from the proxy IP pool the proxy IPs whose three attributes (score, response time and most recent verification time) satisfy preset values, sorts them, and links them into a circular linked list; each time a crawler requests a microblog page, the scheduler assigns it the proxy IP at the head node of the list; when a response is obtained successfully, that proxy IP is moved to the tail; if the request fails, the proxy IP is deleted from the list;
Once the IP proxy pool is connected, all HTTP requests are managed by a downloader middleware; for the microblog user data collection module, which requires higher access privileges, the downloader middleware takes a Cookie from the head of the Cookie queue and carries that Cookie when collecting data; for the microblog content collection module, which requires lower access privileges, the downloader middleware obtains a proxy IP from the proxy IP scheduler and collects data carrying the visitor Cookie constructed through that proxy IP.
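The TTL-based Cookie rotation recited in claim 4 can be illustrated with a short sketch. It is not the claimed implementation: `CookieRotator` and `next_cookie` are hypothetical names, Cookies are represented by opaque values, and the TTL counter is kept for the head Cookie only, which is one straightforward way to realise "take from the head, decrement per request, move to the tail at 0".

```python
from collections import deque


class CookieRotator:
    """Hand out the head Cookie until its TTL is used up, then rotate."""

    def __init__(self, cookies, initial_ttl):
        if initial_ttl < 1:
            raise ValueError("initial_ttl must be at least 1")
        self._queue = deque(cookies)
        self._initial_ttl = initial_ttl
        self._ttl = initial_ttl  # requests left for the current head Cookie

    def next_cookie(self):
        """Return the Cookie for one request and charge it one TTL unit."""
        cookie = self._queue[0]
        self._ttl -= 1
        if self._ttl == 0:
            self._queue.rotate(-1)         # head Cookie moves to the tail
            self._ttl = self._initial_ttl  # fresh TTL for the new head
        return cookie
```

With two Cookies "A" and "B" and an initial TTL of 2, five successive calls return A, A, B, B, A, so the load is spread evenly across accounts. The sketch is single-threaded; concurrent crawlers would need a lock around `next_cookie`.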
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910175559.0A CN109933701B (en) | 2019-03-08 | 2019-03-08 | Microblog data acquisition method based on multi-strategy fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910175559.0A CN109933701B (en) | 2019-03-08 | 2019-03-08 | Microblog data acquisition method based on multi-strategy fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109933701A (en) | 2019-06-25 |
CN109933701B (en) | 2019-12-31 |
Family
ID=66986839
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910175559.0A Active CN109933701B (en) | 2019-03-08 | 2019-03-08 | Microblog data acquisition method based on multi-strategy fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109933701B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN109933701B (en) | 2019-12-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||