CN109933701B - Microblog data acquisition method based on multi-strategy fusion - Google Patents


Publication number
CN109933701B
Authority
CN
China
Prior art keywords
cookie
microblog
proxy
queue
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910175559.0A
Other languages
Chinese (zh)
Other versions
CN109933701A (en)
Inventor
王文贤
陈兴蜀
王海舟
严丹
王培名
唐瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a microblog data acquisition method based on multi-strategy fusion. The method first simulates login and obtains the Cookies of successfully logged-in accounts, stores them in a Cookie queue, and obtains an initial task. It then crawls user following lists and user data with a multi-account load balancing strategy, extracts user IDs to generate the to-be-crawled queue for following relationships and user information, and generates the to-be-crawled queue for user microblogs. Next, it constructs a visitor Cookie, uses an IP proxy pool to accelerate the crawling of microblog content, and stores the information in a database. Finally, it extracts microblog IDs to generate the to-be-crawled comment queue, crawls the microblog comments, and stores them in the database. An adaptive algorithm finds the number of concurrent requests suited to the current network environment and Cookie queue length, balancing acquisition speed against account safety; a high-availability proxy IP module accelerates data acquisition and provides basic data support for network public opinion analysis.

Description

Microblog data acquisition method based on multi-strategy fusion
Technical Field
The invention relates to the technical field of network data acquisition, in particular to a microblog data acquisition method based on multi-strategy fusion.
Background
The popularization and development of the Internet have driven explosive growth in social networking. Microblogs, currently among the most popular social network applications, have developed rapidly in recent years thanks to their large user base, frequently updated status information, and rapid information dissemination, and have become one of the main communication media in China. According to the 42nd Statistical Report on Internet Development in China, as of June 2018 microblogs ranked third among social applications with a user usage rate of 42.1%, up 1.2% from December 2017, and were further strengthened in areas such as fan interaction and content distribution. Social network users are huge in number, and the information they share spreads quickly, is rich in content, and has a wide influence, which makes it highly significant for network public opinion analysis.
Current microblog data acquisition methods generally collect data either through the application programming interface (API) or through simulated login. Acquisition through the microblog API is limited by API authorization and by the daily call quota, so the amount of data obtained is small. Acquisition through simulated login breaks through the API limits, but rapid acquisition is only possible when multiple accounts are combined with a load balancing strategy; large-scale acquisition carries the risk of account bans, and keeping accounts alive is difficult. Existing microblog data acquisition methods usually rely on a single acquisition strategy, so the volume of acquired data is unstable and efficiency is low.
Disclosure of Invention
In view of the above problems, the invention aims to provide a microblog data acquisition method based on multi-strategy fusion that can acquire microblog data stably and efficiently and provide basic data support for network public opinion analysis. The technical scheme is as follows:
A microblog data acquisition method based on multi-strategy fusion comprises the following steps:
Step 1: simulating login and obtaining the Cookie of a successful login;
Step 2: saving the Cookie obtained in step 1 into a Cookie queue;
Step 3: acquiring an initial task: selecting several microblog users with a large number of fans as initial crawling nodes;
Step 4: crawling user following lists and user data by using a multi-account load balancing strategy: accessing the initial microblog user set, constructing a user data URL from each user ID, and collecting user data and user relationships;
Step 5: extracting the user IDs, generating the to-be-crawled queue for following relationships and user information, and jumping to step 4; meanwhile, generating the to-be-crawled queue for user microblogs;
Step 6: constructing a visitor Cookie: when the microblog home page is accessed in a non-login state, the microblog system generates a Cookie that marks the visitor's identity, and the crawler uses this Cookie to collect content from the microblog platform;
Step 7: the acquisition program accesses the to-be-crawled user microblog queue, constructs a microblog URL from each microblog ID, and accesses the constructed URL with the visitor Cookie through the IP proxy pool to collect microblog content;
Step 8: after downloading the microblog URL page, the acquisition program parses the page, extracts the microblog ID, and generates the to-be-crawled comment queue;
Step 9: the acquisition program accesses the to-be-crawled comment queue, downloads microblog comment pages by using the visitor Cookie and the IP proxy pool, parses the pages, and stores the microblog comment information in the database.
Further, the simulated login is as follows: a program simulates a user logging in to the server to obtain the Cookie of the logged-in account, with the following steps:
Step 1) pre-login request: the program base64-encodes the user name and constructs the pre-login request address;
Step 2) obtaining the nonce and servertime used for encryption: sending a GET request to obtain the nonce, servertime, pubkey and rsakv variables, wherein the nonce and servertime are used to encrypt the login password, and pubkey and rsakv are fixed values that can be written directly into the program;
Step 3) encrypting the login password using RSA2: encrypting the user password with the RSA2 algorithm, using the nonce and rsakv obtained in step 2) together with the microblog public key rsakt, to obtain the encrypted password;
Step 4) obtaining the server credential: sending the key parameters by the POST method; after the request completes, the server returns response information containing two parts, retcode and arrURL;
Step 5) obtaining the Cookie of a successful login: accessing the arrURL by the GET method, whereupon the server returns the personal information of the current user; the Cookie returned with this request is a valid Cookie and is used for data acquisition.
Further, the simulated login is repeated at intervals of less than 24 hours so that the latest Cookie replaces a Cookie that is about to expire.
Furthermore, in the multi-account load balancing strategy, the Cookies of multiple accounts are obtained through simulated login and stored in a queue. When the crawler makes a request, it takes the Cookie at the head of the queue, which is given an initial TTL value; each time the crawler sends a request carrying that Cookie, the Cookie's TTL is decremented by 1; when the TTL reaches 0, the Cookie is moved to the tail of the queue and the crawler takes the new head Cookie for subsequent page requests.
Furthermore, after requesting a page, the crawler sleeps for a random period of time to protect the account.
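The Cookie rotation described above can be illustrated with a minimal Python sketch. The class name CookieQueue, its method names, the TTL reset applied when a Cookie moves to the tail, and the sleep bounds are assumptions made for illustration; the invention does not specify them.

```python
import random
from collections import deque


class CookieQueue:
    """Round-robin Cookie queue with a per-Cookie TTL (illustrative only)."""

    def __init__(self, cookies, ttl=100):
        self._ttl = ttl
        # Each entry is [cookie, remaining_ttl]; the head entry is the one in use.
        self._queue = deque([cookie, ttl] for cookie in cookies)

    def acquire(self):
        """Return the head Cookie for one request and decrement its TTL."""
        entry = self._queue[0]
        entry[1] -= 1
        cookie = entry[0]
        if entry[1] == 0:
            # Assumption: the TTL is reset when the Cookie moves to the tail.
            entry[1] = self._ttl
            self._queue.rotate(-1)
        return cookie

    def replace(self, bad_cookie, new_cookie):
        """Drop an abnormal Cookie and append a fresh one at the tail."""
        self._queue = deque(e for e in self._queue if e[0] != bad_cookie)
        self._queue.append([new_cookie, self._ttl])

    def __len__(self):
        return len(self._queue)


def random_delay(low=1.0, high=3.0):
    """Pick a random pause in seconds to sleep after each page request."""
    return random.uniform(low, high)
```

With two Cookies and a TTL of 2, successive calls to acquire() would hand out the first Cookie twice, then the second twice, and then return to the first.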
Furthermore, step 4 adopts an adaptive concurrent acquisition strategy which, for the simulated-login-based web crawler, searches for a concurrent thread count threshold that allows stable and fast data capture given the current network environment and the Cookie queue length; the strategy comprises two stages, fast increase and slow adjustment:
in the fast increase stage, the number of request threads grows exponentially; after the threads are increased, the program judges within a time window whether the state of the accounts in use is normal; if it is normal, the number of threads is multiplied and the Cookies are rotated according to the load balancing strategy; if it is abnormal, the abnormal Cookie is removed, a new account Cookie is added at the tail of the queue so that the Cookie queue length stays equal to its initial value, the thread count of the next time window is set to half the current thread count, and the slow adjustment stage begins;
wherein N_{t+1} represents the number of acquisition threads in the next time window, N_t represents the number of acquisition threads in the current time window, and state indicates whether the Cookie state during data acquisition in the current time window is normal, with 1 meaning normal and 0 meaning abnormal;
in the slow adjustment stage, the number of request threads grows linearly; after the threads are increased, the program judges within a time window whether the state of the accounts in use is normal; if it is normal, the number of threads continues to grow linearly and the Cookies are rotated according to the load balancing strategy; if it is abnormal, the abnormal Cookie is removed, a new Cookie is added at the tail of the queue so that the Cookie queue length stays equal to its initial value, and the current thread count is reduced linearly; when the slow adjustment stage ends, the current thread count is the optimal number of concurrent threads for continuously and stably acquiring microblog data under the current network environment and Cookie queue length;
wherein N_{t+1} represents the number of acquisition threads in the next time window, N_t represents the number of acquisition threads in the current time window, and state indicates whether the Cookie state during data acquisition in the current time window is normal, with 1 meaning normal and 0 meaning abnormal.
Further, the visitor Cookie is constructed as follows:
Step a) obtaining the three parameters tid, c and w
How tid is obtained can be determined by analyzing the browser request headers: first, the parameters fp and cb are constructed, wherein fp is composed of browser-related information, including the parameters os, browser, fonts, plugins and screenInfo, and cb is a fixed value, "gen_callback"; after fp and cb are constructed, the parameter tid is requested; together with tid, the server returns two parameters, new_tid and confidence, wherein new_tid is true or false; when new_tid is true, w is 3; when new_tid is false, w is 2; the value of the parameter c is the same as the value of confidence;
Step b) obtaining the Cookie in the non-login state
First, a new Cookie is constructed from the tid obtained in step a); the Cookie consists of one key-value pair, { "tid": tid + "__" + c }; the request is then completed by the GET method, and the msg field of the returned content is checked to judge whether the non-login Cookie was obtained; if the value of msg is succ, the acquisition succeeded and the visitor Cookie can be read from the response headers.
Further, the IP proxy pool includes a proxy IP collector, a proxy IP checker and a proxy IP scheduler; the proxy IP collector periodically collects proxy IPs, each comprising a proxy IP address, a port and the supported protocol, from proxy IP sources published on the Internet; the proxy IP checker periodically checks the collected proxy IP resources; the proxy IP scheduler provides qualifying proxy IPs to the crawler.
Further, the specific implementation process of the IP proxy pool is as follows:
Step A) filtering transparent IPs
When checking a newly stored proxy IP, the proxy IP checker accesses an HTTP request and response service (https://httpbin.org/ip) through the proxy; the service returns the IP from which the HTTP request arrived; if this IP is the same as the real IP of the crawler server, the proxy IP is discarded; if the returned IP differs from the real IP of the server, the proxy IP is given an initial score; if a proxy port closed error occurs, the proxy IP is considered unavailable and is deleted directly;
Step B) checking against the microblog site
The proxy IP is used to access the microblog home page; if the page returned by the microblog contains the character string 'microblog-finding fresh things anytime and anywhere', the proxy IP can be used to acquire microblog data; if yzm_input appears in the response page, the proxy IP is deleted directly; if the request times out, the score of the proxy IP is reduced by 1; if a port closed error occurs, the proxy IP is deleted directly; for each proxy IP that passes the check, its score, latest check time and response speed are updated in the proxy IP pool and serve as the criteria by which the scheduler screens proxy IPs from the pool;
Step C) proxy IP scheduling
The proxy IP scheduler selects from the proxy IP pool the proxy IPs whose score, response time and latest check time satisfy preset values, and sorts them to form a circular (head-to-tail connected) linked list; when the crawler requests a microblog page, the proxy IP scheduler allocates the proxy IP at the head node of the linked list; if a response is obtained successfully, the proxy IP is moved to the tail of the list; if the request fails, the proxy IP is deleted from the linked list;
after the IP proxy pool is integrated, all HTTP requests are managed by downloader middleware; for the microblog user data acquisition module, which requires higher access privileges, the downloader middleware takes a Cookie from the head of the Cookie queue and carries it for data acquisition; for the microblog content acquisition module, which requires lower access privileges, the downloader middleware obtains a proxy IP from the proxy IP scheduler and acquires data through that proxy IP while carrying the constructed visitor Cookie.
The invention has the beneficial effects that:
(1) The invention provides optimization methods tailored to the respective characteristics of the two crawling approaches: crawling microblogs through simulated login and crawling microblogs with a constructed visitor Cookie.
(2) The IP proxy pool provided by the invention can also be used for data acquisition from other social networks, news websites, forums, blogs and the like, avoiding interruptions caused by IP access restrictions on the acquisition program.
(3) The multi-strategy fusion microblog data acquisition method can acquire microblog data stably and efficiently.
Drawings
FIG. 1 is an architecture diagram of the microblog data acquisition system of the invention.
FIG. 2 is the microblog data acquisition flow based on multi-strategy fusion.
FIG. 3 is a flowchart of simulated login to the microblog system.
FIG. 4 is the proxy IP crawling and checking flow of the invention.
FIG. 5 is a performance comparison graph for user following relationship acquisition.
FIG. 6 is a performance comparison graph for user information acquisition.
FIG. 7 is a performance comparison graph for microblog information acquisition.
FIG. 8 is a performance comparison graph for comment information acquisition.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments.
Microblog data that can be collected include users' personal information, user relationships, hot topic content, all microblogs under a hot topic, all microblogs of a user, and all comments on and reposts of a microblog. Because user data, microblog content and microblog comments are especially important for public opinion analysis, the experiments of the invention select user data, user relationships, user microblogs and their comments as the collection objects.
The invention provides a microblog data acquisition system based on the multi-strategy fusion acquisition method; its structure is shown in FIG. 1. The system adopts a breadth-first strategy: seed nodes are first selected manually according to their number of fans and the following list of each initial user is collected; the following list of every user followed by the current user is then obtained in turn, expanding outward layer by layer, while user information, all of each user's microblogs, and their comments are collected at the same time. The specific acquisition flow is shown in FIG. 2.
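The breadth-first expansion over following lists can be sketched as follows. This is only an illustration under stated assumptions: get_following is a hypothetical callback standing in for the page request that returns the IDs a user follows, and max_users is an illustrative stopping condition that the invention does not define.

```python
from collections import deque


def bfs_crawl(seeds, get_following, max_users):
    """Visit users layer by layer, starting from manually chosen seed users."""
    to_crawl = deque(seeds)   # to-be-crawled user queue
    seen = set(seeds)         # prevents revisiting users when relations form a ring
    visited = []
    while to_crawl and len(visited) < max_users:
        user_id = to_crawl.popleft()
        visited.append(user_id)
        for followee in get_following(user_id):
            if followee not in seen:
                seen.add(followee)
                to_crawl.append(followee)
    return visited


# Usage with an in-memory following graph instead of real page requests.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
order = bfs_crawl(["a"], lambda uid: graph.get(uid, []), max_users=10)
```

The seen set is what keeps a cycle such as c following a from re-queuing a user that has already been scheduled.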
Embodiments of the present invention are described below with reference to specific examples.
Step 1: simulating login and obtaining the Cookie of a successful login;
Simulated login refers to the process in which a program simulates a user logging in to the server in order to obtain the Cookie of the logged-in account. The simulated login flow for the microblog system is shown in FIG. 3.
1. Pre-login request
The program base64-encodes the user name and then constructs the pre-login request address: http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=MTg3MDgxMDMwMzM%3D&rsakt=mod&checkpt=1&client=ssologin.js(v1.4.18)&_=1526959231, where the su value MTg3MDgxMDMwMzM%3D is the base64-encoded user name (note: the base64-encoded user name is "MTg3MDgxMDMwMzM="; in the URL the "=" is replaced with "%3D", and "_=1526959231" is the timestamp at the time of the request).
2. Obtaining encrypted nonce and servertime
Variables such as nonce, servertime, pubkey, rsakv and the like can be obtained by sending the GET request, the nonce and the servertime are used for encrypting the login password in the next step, and the pubkey and the rsakv are fixed values and can be directly written into the program.
3. Encrypting login password using RSA2
The microblog system encrypts the login password with the RSA2 algorithm. The user password is encrypted with the RSA2 algorithm, using the nonce and rsakv obtained in the previous step together with the microblog public key rsakt, to obtain the encrypted password.
4. Obtaining server credentials
The address http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.19) is requested by the POST method, and the key parameters to be sent are as follows:
entry=weibo // login source
savestate=7 // whether to save the password
useticket=1 // log in with user credentials
su=MTg3MDgxMDMwMzM= // base64-encoded user name
servertime=1526959231 // server timestamp obtained in the pre-login phase
nonce=ES6HQ1 // server random code obtained in pre-login
sp=password // encrypted password
After the request completes, the server returns response information containing two parts, retcode and arrURL, where the value of arrURL is the URL required for the next verification step.
5. Obtaining Cookie successfully logged in
The arrURL is accessed by the GET method, and the server returns the personal information of the current user; the Cookie returned with this request is a valid Cookie that can be used for data acquisition. In addition, the inventors found that a microblog Cookie expires after 24 hours; to acquire data stably and efficiently over the long term, the simulated login must be repeated periodically at intervals of less than 24 hours so that the latest Cookie replaces the Cookie that is about to expire.
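The pre-login address construction in sub-step 1 and the periodic re-login rule can be sketched in Python as follows. The endpoint and query parameter spellings are assumptions about the pre-login address described above, the function names are hypothetical, and the 23-hour refresh interval is only one possible choice of "less than 24 hours". The RSA2 encryption and the POST request are deliberately left out.

```python
import base64
from urllib.parse import urlencode

# Assumed pre-login endpoint; treat the exact spelling as illustrative.
PRELOGIN_URL = "http://login.sina.com.cn/sso/prelogin.php"


def encode_username(username):
    """Base64-encode the user name, as done before the pre-login request."""
    return base64.b64encode(username.encode("utf-8")).decode("ascii")


def build_prelogin_url(username, timestamp):
    """Assemble the pre-login request address with the encoded user name."""
    params = {
        "entry": "weibo",
        "callback": "sinaSSOController.preloginCallBack",
        "su": encode_username(username),
        "rsakt": "mod",
        "checkpt": "1",
        "client": "ssologin.js(v1.4.18)",
        "_": str(timestamp),
    }
    # urlencode escapes the "=" padding of the base64 value as %3D;
    # safe="()" keeps the parentheses of the client value readable.
    return PRELOGIN_URL + "?" + urlencode(params, safe="()")


def needs_relogin(issued_at, now, max_age=23 * 3600):
    """True when a Cookie is old enough that it should be refreshed."""
    return now - issued_at >= max_age
```

Passing the query through urlencode rather than concatenating strings is what turns the trailing "=" of the base64 value into "%3D" in the final address.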
Step 2: saving the Cookie obtained in step 1 into a Cookie queue;
To acquire a large amount of microblog data quickly, multiple successfully logged-in Cookies must be obtained and stored in a Cookie queue, so that step 4 can acquire microblog data with multi-account load balancing.
Step 3: acquiring an initial task;
Several microblog users with many fans are selected as the initial crawling nodes. Following behavior is one expression of the microblog topology, and a user's number of fans indirectly indicates the user's influence and the quality of the account. Selecting several users with many fans effectively prevents the collected user nodes from forming a ring and avoids collecting large numbers of zombie users.
Step 4: crawling user following lists and user data by using a multi-account load balancing strategy;
The initial microblog user set is accessed, a user data URL is constructed from each user ID, and user data and user relationships are collected.
In the login state, the microblog system limits the number of requests a single account may make within a given time; if the request rate of an account exceeds the limit of the microblog server, the system marks that account as abnormal. To accelerate microblog data acquisition, the invention uses 10 accounts with the following access strategy.
First, the Cookies of the 10 accounts are obtained through simulated login and stored in a queue. When the crawler makes a request, it takes the Cookie at the head of the queue, whose initial TTL is set to 100; each time the crawler sends a request carrying that Cookie, the Cookie's TTL is decremented by 1; when the TTL reaches 0, the Cookie is moved to the tail of the queue and the crawler takes the new head Cookie for subsequent page requests. To simulate human operation more realistically, the crawler sleeps for a random period after each page request to protect the account. To improve acquisition efficiency, the invention issues concurrent requests from multiple threads. Too many threads increase the number of accesses within the same time window and thus the risk of account bans, so a concurrent request threshold that allows stable and fast data capture must be found for the simulated-login-based web crawler, taking into account the current network environment and the Cookie queue length. Drawing on the idea of the TCP congestion control algorithm, the invention adopts an adaptive concurrent acquisition strategy to find a concurrent thread threshold at which data can be acquired stably and efficiently. The strategy contains two phases: fast increase and slow adjustment.
1. Fast increase phase
In this phase, the number of request threads grows exponentially. After the threads are increased, the program judges within a time window whether the state of the accounts in use is normal. If it is normal, the number of threads is multiplied and the Cookies are rotated according to the load balancing scheme described above. If it is abnormal, the abnormal Cookie is removed, a new account Cookie is added at the tail of the queue so that the Cookie queue length stays equal to its initial value, the thread count of the next time window is set to half the current thread count, and the slow adjustment phase begins.
N_{t+1} represents the number of acquisition threads in the next time window, N_t represents the number of acquisition threads in the current time window, and state indicates whether the Cookie state during data acquisition in the current time window is normal.
2. Slow adjustment phase
In this phase, the number of request threads grows linearly. After the threads are increased, the program judges within a time window whether the state of the accounts in use is normal. If it is normal, the number of threads continues to grow linearly and the Cookies are rotated according to the load balancing scheme described above. If it is abnormal, the abnormal Cookie is removed, a new Cookie is added at the tail of the queue so that the Cookie queue length stays equal to its initial value, and the current thread count is reduced linearly. When the slow adjustment phase ends, the current thread count is the optimal number of threads for continuously and stably acquiring microblog data under the current network environment and Cookie queue length.
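The two phases can be summarized as an update rule for the thread count of the next time window. The sketch below is one possible reading, not the definitive rule of the invention: the doubling factor, the linear step of 1, the lower bound of one thread and all names are assumptions, and the condition that ends the slow adjustment phase is left to the caller because the text does not state it.

```python
from dataclasses import dataclass

FAST = "fast"   # fast increase phase
SLOW = "slow"   # slow adjustment phase


@dataclass(frozen=True)
class Concurrency:
    threads: int = 1
    phase: str = FAST


def next_window(current, state, step=1, min_threads=1):
    """Return the thread count and phase for the next time window.

    state is 1 when the Cookies stayed normal in the current window, else 0.
    """
    if current.phase == FAST:
        if state == 1:
            # Assumption: "multiplied" is read as doubling.
            return Concurrency(current.threads * 2, FAST)
        # Abnormal account seen: halve the threads and switch phases.
        return Concurrency(max(min_threads, current.threads // 2), SLOW)
    # Slow adjustment: move linearly up or down by `step`.
    delta = step if state == 1 else -step
    return Concurrency(max(min_threads, current.threads + delta), SLOW)


# Usage: replay the observed Cookie states of six consecutive time windows.
plan = Concurrency()
history = []
for observed in [1, 1, 1, 0, 1, 0]:
    plan = next_window(plan, observed)
    history.append(plan.threads)
```

As in TCP congestion control, the exponential phase probes quickly for the limit and the linear phase then settles near it; replacing the abnormal Cookie is handled separately by the Cookie queue.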
Step 5: extracting the user IDs, generating the to-be-crawled queue for following relationships and user information, and jumping to step 4; meanwhile, generating the to-be-crawled queue for user microblogs;
Step 6: constructing a visitor Cookie;
When the microblog home page is accessed in a non-login state, the microblog system generates a Cookie for the current visitor to mark the visitor's identity, and the crawler can use this Cookie to collect content from the microblog platform. The visitor Cookie is constructed as follows:
1. Obtaining the three parameters tid, c and w
How tid is obtained can be determined by analyzing the browser request headers. First, the parameters fp and cb are constructed. fp is composed of browser-related information, including the parameters os, browser, fonts, plugins and screenInfo, all of which can be forged. The content of a valid fp parameter is as follows:
{"os":"1","browser":"Chrome57,0,2110,104","fonts":"undefined","screenInfo":"1436*752*24","plugins":"Portable Document Format::internal-pdf-viewer::Chrome PDF Plugin|::mhjfbmdgcfjbbpaeojofohoefgiehjai::Chrome PDF Viewer|::internal-nacl-plugin::Native Client|Enables Widevine licenses for playback of HTML audio/video content.(version:1.4.8.1008)::widevinecdmadapter.dll::Widevine Content Decryption Module"}
The cb parameter is a fixed value, "gen_callback". After fp and cb are constructed, the parameter tid can be obtained by requesting https:// passport. Together with tid, the server returns two parameters, new_tid and confidence; new_tid is true or false. When new_tid is true, w is 3; when new_tid is false, w is 2. The value of the parameter c is the same as the value of confidence.
2. Obtaining Cookie in non-login state
First, a new Cookie is constructed from the tid obtained in the first step; the Cookie consists of one key-value pair, { "tid": tid + "__" + c }. Then https:// passport, window, com/viewer? a ═ include & t ═ tid & w ═ c & gc ═ c ═ cb ═ cross _ domain & frfre is requested by the GET method. After the request completes, the msg field of the returned content is checked to judge whether the non-login Cookie was obtained; if the value of msg is succ, the Cookie was acquired successfully.
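The parameter handling in this step can be illustrated with a short sketch. It covers only the pure data transformations (building fp, deriving w and c, composing the tid Cookie, and checking msg); the two HTTP requests are omitted. The function names and the dictionary layout of the server response are assumptions made for illustration.

```python
import json


def build_fp(os_id, browser, fonts, screen_info, plugins):
    """Serialize the forged browser fingerprint used as the fp parameter."""
    return json.dumps(
        {
            "os": os_id,
            "browser": browser,
            "fonts": fonts,
            "screenInfo": screen_info,
            "plugins": plugins,
        },
        separators=(",", ":"),
    )


def visitor_params(response):
    """Derive tid, w and c from the (assumed) parsed server response."""
    tid = response["tid"]
    w = 3 if response["new_tid"] else 2   # w depends only on new_tid
    c = str(response["confidence"])       # c takes the value of confidence
    return {"tid": tid, "w": w, "c": c, "cookie": {"tid": tid + "__" + c}}


def visitor_cookie_ok(body):
    """The visitor Cookie was issued only if the msg field equals succ."""
    return body.get("msg") == "succ"


# Usage with a made-up response; the values are placeholders.
params = visitor_params({"tid": "abc", "new_tid": True, "confidence": 95})
```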
Step 7: accelerating the crawling of microblog content by using the IP proxy pool, and storing the information in a database;
The acquisition program accesses the to-be-crawled user microblog queue, constructs a microblog URL from each microblog ID, and accesses the constructed URL with the visitor Cookie through the IP proxy pool to collect microblog content.
In the non-login state, the microblog system restricts crawler acquisition mainly by IP, so an IP proxy pool is designed and implemented to accelerate data acquisition. The IP proxy pool consists of 3 parts: a proxy IP collector, a proxy IP checker and a proxy IP scheduler. The proxy IP collector periodically collects proxy IPs, each comprising a proxy IP address, a port and the supported protocol, from proxy IP sources published on the Internet. The proxy IP checker periodically checks the collected proxy IP resources. The proxy IP scheduler provides qualifying proxy IPs to the crawler. The specific implementation flow of the IP proxy pool is shown in FIG. 4.
1. Filtering transparent IP
When checking a newly stored proxy IP, the proxy IP checker accesses an HTTP request and response service through the proxy (the URL is https://httpbin.org/ip, a free public checking service; a private HTTP request and response service can also be set up as needed). The service returns the IP from which the HTTP request arrived. If this IP is the same as the real IP of the crawler server, the proxy IP is discarded; if the returned IP differs from the real IP of the server, the proxy IP is given an initial score of 5; if a timeout occurs, the invention gives the proxy IP an initial score of 4; if a proxy port closed error occurs, the proxy IP is considered unavailable and is deleted directly.
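The decision rules of this first check can be written as a small pure function. The sketch assumes the network probe has already been made and summarized as an outcome string; the outcome labels, the function name and the use of None to mean "discard" are illustrative choices, not part of the invention.

```python
def initial_score(outcome, returned_ip, real_ip):
    """Score a newly collected proxy, or return None to discard it.

    outcome is "ok", "timeout" or "port_closed" (labels are assumptions).
    """
    if outcome == "port_closed":
        return None          # unavailable proxy: delete directly
    if outcome == "timeout":
        return 4             # slow proxy: lower initial score
    if returned_ip == real_ip:
        return None          # transparent proxy: the real IP leaks through
    return 5                 # the proxy hides the crawler's real IP


# Usage: the checking service reported the proxy's address, not ours.
score = initial_score("ok", returned_ip="203.0.113.7", real_ip="198.51.100.2")
```

The addresses in the usage line come from the documentation-only IP ranges and stand in for real values.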
2. Checking against microblog sites
A proxy IP that passes the filter is not necessarily usable for microblog data acquisition: it may already be blocked by the microblog system, or it may be of poor quality and slow. The proxy IP is therefore used to access the microblog home page, and the returned page is checked for the character string 'microblog-finding fresh things anytime and anywhere' to judge whether the proxy IP can be used for microblog data acquisition. If yzm_input appears in the response page, the proxy IP has been flagged as abnormal by the microblog system and is deleted directly; if the request times out, the score of the proxy IP is reduced by 1; if an error such as a closed port occurs, the proxy IP is deleted directly. For each proxy IP that passes the check, its score, latest check time and response speed are updated in the proxy IP pool and serve as the criteria by which the scheduler screens proxy IPs from the pool. The latest check time is used as a screening criterion because proxy IPs published on the Internet are short-lived, generally lasting 3 to 10 minutes, so a proxy IP may fail some time after passing the check. The score of a proxy IP measures its stability, and the response speed measures its request rate.
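The second check can be sketched in the same style. The record layout, the outcome labels and the home_marker argument (the home page title string to look for) are assumptions; the treatment of a page that contains neither marker is also an assumption, since the text does not cover that case.

```python
def recheck(record, outcome, page, elapsed, now, home_marker):
    """Re-validate a proxy against the microblog home page.

    Returns an updated copy of the record, or None if the proxy is deleted.
    """
    if outcome == "port_closed":
        return None                                   # delete directly
    if outcome == "timeout":
        return {**record, "score": record["score"] - 1}
    if "yzm_input" in page:
        return None                                   # captcha page: proxy flagged
    if home_marker in page:
        # Passed: refresh the attributes the scheduler later screens on.
        return {**record, "checked_at": now, "response_time": elapsed}
    # Assumption: an unrecognized page is penalized like a timeout.
    return {**record, "score": record["score"] - 1}


# Usage with a stand-in marker instead of the real home page title.
proxy = {"score": 5, "checked_at": 0.0, "response_time": 9.9}
passed = recheck(proxy, "ok", "<title>HOME</title>", 0.3, 10.0, "HOME")
```

Returning a new dictionary instead of mutating the record keeps the pool update an explicit step for the caller.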
3. Proxy IP scheduling
Besides a good proxy IP verification strategy, the proxy IP scheduling strategy is also very important. Based on preset values for three attributes of the proxy IPs in the proxy IP pool (score, response time, and latest verification time), the proxy IP scheduler selects the proxy IPs that meet the specified requirements and sorts them to form a linked list joined end to end. When the crawler requests a microblog page, the proxy IP scheduler assigns the proxy IP at the head node of the linked list; if a response is obtained successfully, the proxy IP is placed at the tail of the list; if the request fails, the proxy IP is deleted from the linked list. In addition, the proxy IP scheduler runs a subprocess that periodically screens the proxy IP pool for proxy IPs meeting the three conditions and places them at the tail of the linked list, so that the scheduler does not run short of proxy IPs.
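The scheduling behaviour described above could look roughly like the following sketch, which uses a `collections.deque` in place of the linked list. The class, field, and threshold names are hypothetical, and the sort order (higher score first, then faster response) is an assumption, since the text only says that the selected proxy IPs are sorted.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Optional


@dataclass(frozen=True)
class Proxy:
    address: str
    score: int
    response_time: float   # seconds
    last_checked: float    # UNIX timestamp


class ProxyScheduler:
    def __init__(self, min_score: int, max_response: float, max_age: float):
        self.min_score = min_score
        self.max_response = max_response
        self.max_age = max_age              # seconds since last verification
        self._ring: Deque[Proxy] = deque()

    def _eligible(self, p: Proxy, now: float) -> bool:
        return (p.score >= self.min_score
                and p.response_time <= self.max_response
                and now - p.last_checked <= self.max_age)

    def refill(self, pool: Iterable[Proxy], now: float) -> None:
        """Append eligible proxies that are not queued yet to the tail."""
        queued = {p.address for p in self._ring}
        fresh = [p for p in pool
                 if self._eligible(p, now) and p.address not in queued]
        # Assumed ordering: best score first, then lowest response time.
        fresh.sort(key=lambda p: (-p.score, p.response_time))
        self._ring.extend(fresh)

    def acquire(self) -> Optional[Proxy]:
        """Hand out the proxy at the head, or None if the list is empty."""
        return self._ring.popleft() if self._ring else None

    def release(self, proxy: Proxy, success: bool) -> None:
        """Put a proxy back at the tail on success; drop it on failure."""
        if success:
            self._ring.append(proxy)
```

In this sketch the periodic subprocess corresponds to calling `refill` on a timer with the current contents of the proxy IP pool.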
After the IP proxy pool is connected, all HTTP requests are managed by downloader middleware. For the microblog user data acquisition module, which requires higher access authority, the downloader middleware takes a Cookie from the head of the Cookie queue and carries that Cookie when acquiring data. For the microblog content acquisition module, which requires lower access authority, the downloader middleware obtains a proxy IP from the proxy IP scheduler and sends the request through that proxy IP, carrying the constructed visitor Cookie.
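For the user data path, the middleware draws Cookies from the head of the Cookie queue. The TTL-based rotation that the claims recite for the multi-account load balancing strategy might be sketched as below. `CookieRotator` is a hypothetical name, Cookies are represented as plain strings, and the initial TTL value is a free parameter because the text does not fix it.

```python
from collections import deque
from typing import Deque, Iterable


class CookieRotator:
    """Serve the head Cookie until its TTL is used up, then rotate."""

    def __init__(self, cookies: Iterable[str], ttl: int):
        if ttl < 1:
            raise ValueError("ttl must be at least 1")
        self._queue: Deque[str] = deque(cookies)
        if not self._queue:
            raise ValueError("at least one Cookie is required")
        self._ttl = ttl
        self._remaining = ttl   # TTL left for the Cookie at the head

    def next_cookie(self) -> str:
        """Return the Cookie for the next request and use one TTL unit."""
        cookie = self._queue[0]
        self._remaining -= 1
        if self._remaining == 0:
            self._queue.rotate(-1)        # move the head Cookie to the tail
            self._remaining = self._ttl   # fresh TTL for the new head
        return cookie
```

With two Cookies and a TTL of 2, five consecutive calls to `next_cookie()` yield the first Cookie twice, the second Cookie twice, and then the first Cookie again, so each account carries the same share of requests.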
Step 8: extracting the microblog ID and generating the comment information queue to be crawled;
After the acquisition program downloads the microblog URL page, it parses the page, extracts the microblog ID, and generates the comment information queue to be crawled.
Step 9: crawling the microblog comment information and storing the information in the database.
The acquisition program accesses the comment information queue to be crawled, downloads the microblog comment pages based on the visitor Cookie and the proxy IP pool, parses the pages, and stores the microblog comment information in the database.
For Fig. 5, 10 microblog "big V" user IDs, each with more than 1 million followers, were selected as seed IDs, and the follow relationships were collected continuously through simulated login in three modes: single thread with a single Cookie, multiple threads with multiple Cookies, and multiple Cookies with an adaptive number of concurrent threads. In the initial stage of collection, the 10-thread concurrent scheme is far more efficient than the other schemes. After 5 hours of continuous collection, however, the amount of data obtained by the 10-thread scheme is far smaller than that obtained by the adaptive scheme: because 10 threads generate a large volume of concurrent accesses, the microblog system marks the accounts as abnormal after the program has run for some time, and subsequent collection fails.
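The adaptive scheme compared here adjusts the thread count once per time window, in a fast increase stage followed by a slow adjustment stage. The sketch below shows one possible update rule. The growth factor of 2 and the linear step of 1 are assumptions made for illustration; the text only states exponential growth, halving on the first abnormal window, and linear adjustment afterwards. The condition that ends the slow adjustment stage is not modelled.

```python
from enum import Enum
from typing import Tuple


class Phase(Enum):
    FAST = "fast increase"
    SLOW = "slow adjustment"


def next_thread_count(current: int, healthy: bool, phase: Phase,
                      factor: int = 2, step: int = 1) -> Tuple[int, Phase]:
    """Return (thread count, phase) for the next time window.

    healthy is True when all Cookies stayed normal in the current window.
    factor and step are assumed values, not taken from the text.
    """
    if phase is Phase.FAST:
        if healthy:
            return current * factor, Phase.FAST   # exponential growth
        return max(1, current // 2), Phase.SLOW   # halve, switch stage
    if healthy:
        return current + step, Phase.SLOW         # linear increase
    return max(1, current - step), Phase.SLOW     # linear decrease


# Example run: three healthy windows, one abnormal, one healthy, one abnormal.
n, phase = 1, Phase.FAST
for ok in [True, True, True, False, True, False]:
    n, phase = next_thread_count(n, ok, phase)
```

In this run the thread count goes 1, 2, 4, 8, then drops to 4 on the first abnormal window and moves between 5 and 4 afterwards. Replacing abnormal Cookies so that the queue length stays constant is left to the caller.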
All user IDs collected in one day with the adaptive number of concurrent threads were then used as seeds, and the user data, all microblogs, and the microblog comments of the seed users were collected using four schemes: API-based; simulated-login-based; API combined with visitor Cookie construction; and simulated login combined with visitor Cookie construction.
Figs. 6 to 8 show, in order, the user information, microblog content, and microblog comments collected continuously for 5 hours under each scheme. As can be seen from Figs. 6-8, the scheme that combines simulated login with the visitor Cookie and collects through the proxy pool is far faster than the other schemes. This is because the bottleneck of this scheme is the quality of the proxy IPs in the proxy pool, and a good proxy IP verification and screening strategy keeps the proxy IPs usable. The other schemes have several limitations. The API-based scheme limits the number of requests per authorized user and per IP, and this strict limit makes it the slowest. The simulated-login-based scheme limits the access rate of the logged-in user, so stable and efficient collection is possible only after the threshold of concurrent collection threads has been found by the adaptive method proposed by the invention. The scheme based on visitor Cookie construction is limited by the frequency of requests a single IP may send to the microblog system within a fixed time window. The microblog system restricts IPs more loosely than it restricts accounts.

Claims (9)

1. A microblog data acquisition method based on multi-strategy fusion is characterized by comprising the following steps:
step 1: simulating login and acquiring Cookie successfully logged in;
step 2: saving the Cookie obtained in the step 1 into a Cookie queue;
step 3: acquiring an initial task: selecting a plurality of microblog users with a large number of followers as initial crawling nodes;
step 4: crawling the user follow list and user data by using a multi-account load balancing strategy: accessing the initial microblog user set, constructing a user data URL according to the user ID, and collecting user data and user relationships;
step 5: extracting the user ID, generating the queue to be crawled for follow relationships and user information, and jumping to step 4; meanwhile, generating the user microblog queue to be crawled;
step 6: constructing a visitor Cookie: accessing the microblog home page in a non-login state to obtain the Cookie that the microblog system generates to identify a visitor, the crawler using this Cookie to collect the related content of the microblog platform;
step 7: the acquisition program accesses the user microblog queue to be crawled, constructs a microblog URL according to the microblog ID, and accesses the constructed microblog URL with the visitor Cookie through the proxy IP pool to acquire the microblog content;
step 8: after the acquisition program downloads the microblog URL page, parsing the page, extracting the microblog ID, and generating the comment information queue to be crawled;
step 9: the acquisition program accesses the comment information queue to be crawled, downloads the microblog comment pages based on the visitor Cookie and the proxy IP pool, parses the pages, and stores the microblog comment information in the database.
2. The method for acquiring microblog data based on multi-strategy fusion according to claim 1, wherein the simulated login is as follows: simulating a user to log in a server by using a program so as to obtain a Cookie of a login account, wherein the steps are as follows:
step 1) pre-login request: the program carries out base64 coding on the user name and constructs a pre-login request address;
step 2) obtaining the encrypted nonce and servertime: sending a GET request to obtain the nonce, servertime, pubkey and rsakv variables, wherein the nonce and the servertime are used for encrypting the login password, and the pubkey and the rsakv are fixed values and are written directly in the program;
step 3) encrypting the login password using RSA2: encrypting the user password by using the nonce and the rsakv obtained in the step 2) and combining the public key rsakt of the microblog by using an RSA2 algorithm to obtain an encrypted password;
step 4) obtaining a server certificate: sending the key parameters by a POST request; after the request is completed, the server returns response information comprising two parts, namely a retcode and an arrURL;
step 5) obtaining the Cookie of a successful login: accessing the arrURL by a GET method, whereupon the server returns the personal information of the current user; the Cookie returned for this request is a valid Cookie, and data acquisition is performed using the valid Cookie.
3. The method for acquiring microblog data based on multi-strategy fusion according to claim 2, wherein the simulated login is performed periodically at intervals of less than 24 hours, and the latest Cookie replaces the Cookie that is about to expire.
4. The method for acquiring microblog data based on multi-strategy fusion according to claim 1, wherein in the multi-account load balancing strategy, the Cookies of a plurality of accounts are obtained through simulated login and are stored in a queue; when the crawler makes a request, a Cookie is taken from the head of the queue and given an initial TTL value; each time the crawler makes a request carrying the Cookie, the TTL value of that Cookie is decreased by 1; when the TTL value reaches 0, the Cookie is put at the tail of the queue, and the first Cookie is then taken from the Cookie queue to request pages.
5. The method for acquiring microblog data based on multi-strategy fusion of claim 4, wherein after a page is requested, the crawler sleeps randomly for a certain time to ensure the security of the account.
6. The method for acquiring microblog data based on multi-strategy fusion according to claim 4, wherein a self-adaptive concurrent acquisition strategy is adopted in the step 4, by which the web crawler based on simulated login searches, in combination with the current network environment and the Cookie queue length, for a threshold number of concurrent threads at which data can be captured stably and quickly; the strategy comprises two stages, fast increase and slow adjustment:
in the fast increase stage, the number of request threads is increased exponentially, and after the threads are increased, the program judges whether the state of the accounts in use is normal within a time window; if it is normal, the number of threads is multiplied, and the Cookies are rotated according to the load balancing strategy; if it is abnormal, the abnormal Cookie is removed, a new account Cookie is added at the tail of the queue so that the length of the Cookie queue stays consistent with the initial value, the thread number of the next time window is set to half of the current thread number, and the slow adjustment stage is entered;
wherein N_{t+1} represents the number of acquisition threads of the next time window, and N_t represents the number of acquisition threads of the current time window; state indicates whether the Cookie state is normal during data acquisition in the current time window, 1 being normal and 0 being abnormal;
in the slow adjustment stage, the number of request threads is increased linearly, and after the threads are increased, the program judges whether the state of the accounts in use is normal within a time window; if it is normal, the number of threads continues to be increased linearly, and the Cookies are rotated according to the load balancing strategy; if it is abnormal, the abnormal Cookie is removed, a new Cookie is added at the tail of the queue so that the length of the Cookie queue stays consistent with the initial value, and the current thread number is reduced linearly; when the slow adjustment stage ends, the current thread number is the optimal number of concurrent threads with which microblog data can be acquired continuously and stably under the current network environment and Cookie queue length;
wherein N_{t+1} represents the number of acquisition threads of the next time window, and N_t represents the number of acquisition threads of the current time window; state indicates whether the Cookie state is normal during data acquisition in the current time window, 1 being normal and 0 being abnormal.
7. The method for acquiring microblog data based on multi-strategy fusion according to claim 1, wherein the visitor Cookie is constructed in the following manner:
step a) obtaining the three parameters tid, c and w
The way in which tid is obtained is determined by analyzing the browser request headers: first, the parameters fp and cb need to be constructed, wherein the fp parameter is formed from information about the browser and comprises the parameters os, browser, fonts, plugins and screenInfo; the cb parameter is a fixed value, "gen_callback"; after the fp and cb parameters are constructed, the parameter tid is obtained; at the same time, the server returns two parameters, new_tid and confidence, wherein the value of new_tid is true or false; when new_tid is true, w is 3; when new_tid is false, w is 2; the value of the parameter c is the same as the value of confidence;
step b) obtaining the Cookie in the non-login state
First, a new Cookie is constructed from the tid obtained in the step a), the content of the Cookie being a key-value pair, { "tid": tid + "__" + c }; then a request is completed by a GET method, and whether the acquisition of the non-login Cookie succeeded is judged by checking whether the value of the msg field in the returned content is succ; if the msg value is succ, the acquisition of the Cookie succeeded, and the visitor Cookie can be obtained from the header of the response.
8. The method for acquiring microblog data based on multi-strategy fusion according to claim 1, wherein the IP proxy pool comprises: a proxy IP collector, a proxy IP checker and a proxy IP scheduler; the proxy IP collector is responsible for periodically collecting proxy IPs from proxy IP sources published on the network, each proxy IP comprising a proxy IP address, a port and a supported protocol; the proxy IP checker is responsible for periodically checking the collected proxy IP resources; the proxy IP scheduler is responsible for providing eligible proxy IPs to the crawler for use.
9. The method for acquiring microblog data based on multi-strategy fusion according to claim 8, wherein the specific implementation process of the IP proxy pool is as follows:
step A) filtering transparent IPs
When a newly stored proxy IP is checked, the proxy IP checker accesses an HTTP request and response service, and the service returns the IP corresponding to the HTTP request; if the IP is the same as the real IP of the crawler server, the proxy IP is discarded; if the returned content is different from the real IP of the server, the proxy IP is given an initial score; if a proxy IP port closed error occurs, the proxy IP is considered unavailable and is deleted directly;
step B) checking against the microblog site
The proxy IP is used to access the microblog home page, wherein if the page returned by the microblog contains the character string 'microblog-finding fresh things anytime and anywhere', the proxy IP can be used for acquiring microblog data; if yzm_input appears in the response page, the proxy IP is deleted directly; if the request times out, 1 is subtracted from the score of the proxy IP; if a port closed error occurs, the proxy IP is deleted directly; the score, the latest verification time and the response speed of each proxy IP that passes verification are updated in the proxy IP pool as the criteria by which the scheduler screens proxy IPs from the proxy IP pool;
step C) proxy IP scheduling
The proxy IP scheduler selects, according to preset values of three attributes of the proxy IPs in the proxy IP pool, namely the score, the response time and the latest verification time, the proxy IPs meeting the specified requirements and sorts them to form a linked list joined end to end; when the crawler requests a microblog page, the proxy IP scheduler assigns the proxy IP at the head node of the linked list, and when a response is obtained successfully, the proxy IP is placed at the tail of the list; if the request fails, the proxy IP is deleted from the linked list;
after the IP proxy pool is connected, all HTTP requests are managed by downloader middleware; for the microblog user data acquisition module, which requires higher access authority, the downloader middleware takes a Cookie from the head of the Cookie queue and carries the Cookie to perform data acquisition; for the microblog content acquisition module, which requires lower access authority, the downloader middleware obtains a proxy IP from the proxy IP scheduler and sends the request through the proxy IP, carrying the constructed visitor Cookie, to acquire data.
CN201910175559.0A 2019-03-08 2019-03-08 Microblog data acquisition method based on multi-strategy fusion Active CN109933701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910175559.0A CN109933701B (en) 2019-03-08 2019-03-08 Microblog data acquisition method based on multi-strategy fusion

Publications (2)

Publication Number Publication Date
CN109933701A CN109933701A (en) 2019-06-25
CN109933701B true CN109933701B (en) 2019-12-31

Family

ID=66986839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910175559.0A Active CN109933701B (en) 2019-03-08 2019-03-08 Microblog data acquisition method based on multi-strategy fusion

Country Status (1)

Country Link
CN (1) CN109933701B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110995691A (en) * 2019-11-28 2020-04-10 佛山科学技术学院 Method and system for acquiring webpage data
CN111083136B (en) * 2019-12-12 2022-03-08 北京百分点科技集团股份有限公司 Account resource management device and method and data acquisition system and method
CN111538590A (en) * 2020-04-17 2020-08-14 姜海强 Distributed data acquisition method and system based on CS framework
CN111538593A (en) * 2020-04-21 2020-08-14 夏邦泽 Data acquisition method based on industrial internet operating system
CN111859072A (en) * 2020-07-22 2020-10-30 广州兆和电力技术有限公司 Automatic form declaration and score query method and system for intranet
CN112380467A (en) * 2020-11-26 2021-02-19 厦门市美亚柏科信息股份有限公司 Website data extraction method based on mobile phone, terminal device and storage medium
CN112769777B (en) * 2020-12-28 2022-12-02 上海蓝云网络科技有限公司 Data integration method and device based on cloud platform and electronic equipment
CN112765438B (en) * 2021-01-25 2024-03-26 北京星汉博纳医药科技有限公司 Automatic crawler management method based on micro-service
CN113420234B (en) * 2021-07-02 2022-08-02 青海师范大学 Microblog data acquisition method and system
CN114547418A (en) * 2022-02-25 2022-05-27 哈尔滨工程大学 Fatigue simulation model-based anthropomorphic crawler method
CN116150542B (en) * 2023-04-21 2023-07-14 河北网新数字技术股份有限公司 Dynamic page generation method and device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186670A (en) * 2013-03-27 2013-07-03 中金数据系统有限公司 Method and system for integrally acquiring webpage information
CN103902386A (en) * 2014-04-11 2014-07-02 复旦大学 Multi-thread network crawler processing method based on connection proxy optimal management
CN104954234A (en) * 2015-05-19 2015-09-30 中国地质大学(北京) Microblog data acquisition method, microblog data acquisition device and public opinion analysis method
CN107395782A (en) * 2017-07-19 2017-11-24 北京理工大学 A kind of IP limitation controlled source information extraction methods based on agent pool

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
一种基于模拟登录的微博数据采集方案;孙青云等;《计算机技术与发展》;20140330;第24卷(第3期);第6-10 *

Also Published As

Publication number Publication date
CN109933701A (en) 2019-06-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant