WO2019200784A1 - Web crawler method, terminal, and storage medium - Google Patents

Web crawler method, terminal, and storage medium

Info

Publication number
WO2019200784A1
WO2019200784A1 PCT/CN2018/100162 CN2018100162W WO2019200784A1 WO 2019200784 A1 WO2019200784 A1 WO 2019200784A1 CN 2018100162 W CN2018100162 W CN 2018100162W WO 2019200784 A1 WO2019200784 A1 WO 2019200784A1
Authority
WO
WIPO (PCT)
Prior art keywords
proxy
validity
access
preset
response time
Application number
PCT/CN2018/100162
Inventor
阮晓雯
徐亮
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L63/0876 Network architectures or network communication protocols for network security for authentication of entities based on the identity of the terminal or configuration, e.g. MAC address, hardware or software configuration or device fingerprint
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/50 Network service management, e.g. ensuring proper service fulfilment according to agreements
    • H04L41/5003 Managing SLA; Interaction between SLA and QoS
    • H04L41/5009 Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/101 Access control lists [ACL]

Definitions

  • The present application relates to the field of web crawler technology, and in particular to a web crawler method, a terminal, and a storage medium.
  • A web crawler is a very important part of a search engine system. It is responsible for collecting web pages and information from the Internet; these web pages are used to build the indexes that support the search engine. The performance of the web crawler directly affects the performance of the search engine. As the amount of network information grows geometrically, the requirements on the performance and efficiency of web crawler page collection keep rising.
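  • A first aspect of the present application provides a web crawler method, the method comprising: storing the obtained plurality of proxy IPs in a preset proxy IP pool; verifying each proxy IP in the proxy IP pool one by one to determine its validity; recording the proxy IPs determined to be valid in a whitelist in the proxy IP pool and the proxy IPs determined to be invalid in a blacklist in the proxy IP pool; when it is detected that the current proxy IP satisfies a preset proxy replacement condition, selecting a proxy IP from the whitelist in the proxy IP pool; and using the selected proxy IP as the new proxy IP for data crawling.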
  • A second aspect of the present application provides a terminal comprising a processor and a memory, the processor implementing the web crawler method when executing the computer-readable instructions stored in the memory.
  • A third aspect of the present application provides a non-volatile readable storage medium storing computer-readable instructions that, when executed by a processor, implement the web crawler method.
  • The web crawler method, terminal, and storage medium described in the present application can effectively solve the IP restriction problem that arises when the same proxy IP crawls data for a long time, many times in quick succession, or in large amounts.
  • Proxy IPs can be obtained continuously and the proxy IP pool updated in real time, ensuring that the pool holds enough proxies.
  • This not only effectively solves the IP restriction problem of the same proxy IP during long, rapid and repeated, or high-volume data crawling, but also ensures that the most effective proxy IP is selected for data crawling.
  • FIG. 1 is a flowchart of a web crawling method provided in Embodiment 1 of the present application.
  • FIG. 2 is a flowchart of a web crawling method provided in Embodiment 2 of the present application.
  • FIG. 3 is a structural diagram of a network crawler device according to Embodiment 3 of the present application.
  • FIG. 4 is a structural diagram of a web crawler device provided in Embodiment 4 of the present application.
  • FIG. 5 is a schematic diagram of a terminal provided in Embodiment 5 of the present application.
  • The web crawler method of the embodiments of the present application is applied to one or more terminals.
  • The web crawler method can also be applied to a hardware environment composed of a terminal and a server connected to the terminal through a network.
  • The web crawler method of the embodiments of the present application may be executed by a server, by a terminal, or jointly by a server and a terminal.
  • The web crawling function provided by the method of the present application may be integrated directly on the terminal, or a client implementing the method may be installed on the terminal.
  • The method provided by the present application can also run on a device such as a server in the form of a software development kit (SDK) that exposes an interface to the web crawler function; a terminal or other device can then implement the web crawler function through the provided interface.
  • FIG. 1 is a flowchart of a web crawling method provided in Embodiment 1 of the present application.
  • A proxy IP pool is preset in the local database, and the obtained proxy IPs are added to the proxy IP pool for use by the crawler.
  • Proxy IPs can be found on websites on the Internet that publish proxy IPs. The specific list can be collected manually or by another small crawler. It is also possible to purchase multiple proxy IPs from a third-party service organization and add them to the preset proxy IP pool.
  • The proxy information of a proxy IP may include, but is not limited to, an IP address, a name, and a port.
  • Proxy IPs may be obtained every preset time period (for example, every other day), either manually or automatically by another small crawler from websites that provide proxy IPs, or by purchasing multiple proxy IPs from a third-party service organization. The obtained proxy IPs are stored in the proxy IP pool so that the pool always holds enough IPs. Because proxy IPs are obtained without interruption, the proxy IP pool stays up to date for use by the crawler.
  • The proxy IP undergoing validity verification is referred to as the proxy IP to be verified. The proxy IP to be verified is used to access a search engine (for example, Google or Baidu) for the first time, to check whether a response from the search engine is obtained. If the first access receives a response, the proxy IP to be verified is valid; if it does not, the proxy IP to be verified is invalid. Obtaining a response means that the proxy IP to be verified can crawl data from the accessed search engine on first access, that is, it is not restricted by that search engine.
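  • The first-access check can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the names `is_proxy_valid` and `fetch_via_urllib`, the probe URL, and the timeout are hypothetical. The network call is passed in as a callable so the decision logic can be exercised without a real proxy.

```python
import urllib.request
from typing import Callable

# Hypothetical probe target; the text only says "a search engine (for example, Google or Baidu)".
PROBE_URL = "https://www.baidu.com"


def fetch_via_urllib(url: str, proxy: str, timeout: float = 5.0) -> bool:
    """Assumed helper: request `url` through `proxy` (e.g. "http://1.2.3.4:8080")."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return 200 <= resp.status < 400


def is_proxy_valid(proxy: str,
                   fetch: Callable[[str, str], bool] = fetch_via_urllib,
                   url: str = PROBE_URL) -> bool:
    """Return True if a first access through `proxy` obtains a response."""
    try:
        return bool(fetch(url, proxy))
    except OSError:
        # Timeouts, refused connections, and HTTP errors all count as "no response".
        return False
```

  • Treating every `OSError` as "no response" is a simplification; a production crawler would probably distinguish a ban from a transient network error.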
  • A whitelist and a blacklist are set up in advance in the proxy IP pool.
  • The whitelist records the proxy IPs in the proxy IP pool that are determined to be valid.
  • The blacklist records the proxy IPs in the proxy IP pool that are determined to be invalid.
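  • One way to represent the two lists is sketched below. The class name `ProxyPool`, the method `record`, and the helper `verify_all` are hypothetical; the text does not specify a data structure, and moving a proxy between lists on re-verification is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Set


@dataclass
class ProxyPool:
    """Proxy IP pool holding a whitelist (valid) and a blacklist (invalid)."""
    whitelist: Set[str] = field(default_factory=set)
    blacklist: Set[str] = field(default_factory=set)

    def record(self, proxy: str, valid: bool) -> None:
        # Assumption: a proxy lives in exactly one list, so re-verification moves it.
        if valid:
            self.blacklist.discard(proxy)
            self.whitelist.add(proxy)
        else:
            self.whitelist.discard(proxy)
            self.blacklist.add(proxy)


def verify_all(pool: ProxyPool, proxies: Iterable[str],
               check: Callable[[str], bool]) -> None:
    """Verify each proxy one by one and record the outcome in the pool."""
    for proxy in proxies:
        pool.record(proxy, check(proxy))
```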
  • The proxy replacement condition is set in advance. When it is detected that the current proxy IP satisfies the preset proxy replacement condition, a proxy IP is selected from the whitelist in the proxy IP pool.
  • The current proxy IP satisfies the preset proxy replacement condition when one or more of the following hold: its access time exceeds a preset access time threshold, its access frequency exceeds a preset access frequency threshold, or its number of accesses exceeds a preset access count threshold.
  • For example, the preset access time threshold may be 10 minutes. When the current proxy IP has been accessing for more than 10 minutes, it is detected that the current proxy IP satisfies the preset proxy replacement condition, and a proxy IP is selected from the proxy IP pool. This effectively solves the IP restriction problem caused by the same proxy IP crawling data for a long time.
  • For example, the preset access frequency threshold may be 100 times/second. When the access frequency of the current proxy IP exceeds 100 times/second, it is detected that the current proxy IP satisfies the preset proxy replacement condition, and a proxy IP is selected from the proxy IP pool. This effectively solves the IP restriction problem caused by the same proxy IP crawling data rapidly and repeatedly.
  • For example, the preset access count threshold may be 200. When the number of accesses of the current proxy IP exceeds 200, it is detected that the current proxy IP satisfies the preset proxy replacement condition, and a proxy IP is selected from the proxy IP pool. This effectively solves the IP restriction problem caused by the same proxy IP crawling a large amount of data.
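  • The three example thresholds can be combined into a single check, sketched below. The names `ProxyUsage` and `needs_replacement` are hypothetical, and measuring frequency as the average request rate since the proxy was first used is an assumption; the text does not say over what window the frequency is computed.

```python
from dataclasses import dataclass

MAX_ACCESS_SECONDS = 10 * 60      # example access time threshold: 10 minutes
MAX_REQUESTS_PER_SECOND = 100     # example access frequency threshold
MAX_REQUEST_COUNT = 200           # example access count threshold


@dataclass
class ProxyUsage:
    started_at: float       # time (in seconds) at which the proxy was first used
    request_count: int = 0  # number of accesses made through the proxy so far


def needs_replacement(usage: ProxyUsage, now: float) -> bool:
    """Return True if any one of the replacement conditions is met."""
    elapsed = now - usage.started_at
    if elapsed > MAX_ACCESS_SECONDS:
        return True
    if usage.request_count > MAX_REQUEST_COUNT:
        return True
    # Assumed definition of frequency: average accesses per second so far.
    if elapsed > 0 and usage.request_count / elapsed > MAX_REQUESTS_PER_SECOND:
        return True
    return False
```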
  • the selected proxy IP is used as a new proxy IP for data crawling.
  • The current proxy IP is replaced with the proxy IP selected from the whitelist in the proxy IP pool, and the selected proxy IP is used as the new proxy IP for data crawling.
  • The proxy IP replacement is performed automatically, which eliminates the trouble of frequent manual replacement and makes crawling more efficient.
  • In summary, the web crawling method described in the present application stores the obtained plurality of proxy IPs in a preset proxy IP pool; verifies each proxy IP in the proxy IP pool one by one to determine its validity; records the proxy IPs determined to be valid in the whitelist in the proxy IP pool and the proxy IPs determined to be invalid in the blacklist in the proxy IP pool; selects a proxy IP from the whitelist when it is detected that the current proxy IP satisfies the preset proxy replacement condition; and uses the selected proxy IP as the new proxy IP for data crawling. This effectively solves the IP restriction problem of the same proxy IP during long, rapid and repeated, or high-volume data crawling.
  • FIG. 2 is a flowchart of a web crawling method provided in Embodiment 2 of the present application.
  • Step 201 in this embodiment is the same as step 101 in the first embodiment, and details are not described herein again.
  • The proxy IP undergoing the first validity verification is referred to as the proxy IP to be verified. A search engine (for example, Google or Baidu) is accessed using the proxy IP to be verified, to check whether a response from the search engine is obtained. If a response is obtained, the proxy IP to be verified has the first validity; if not, it does not have the first validity. Obtaining a response means that the proxy IP to be verified can crawl data from the accessed search engine, that is, it is not restricted by that search engine.
  • When it is determined that a proxy IP in the proxy IP pool has the first validity, step 203 is performed; when it is determined that a proxy IP in the proxy IP pool does not have the first validity, step 205 is performed.
  • the access success rate and the access response time of the proxy IP may be further used as a criterion for verifying whether the proxy IP is valid.
  • Determining, according to the access success rate and the access response time of the proxy IP, whether the proxy IP having the first validity has the second validity specifically includes:
  • The access success rate is the ratio of the number of successful accesses to the total number of accesses within a preset time period. For example, if within one hour a proxy IP having the first validity accesses the search engine 100 times in total, of which 97 accesses succeed and 3 fail, the access success rate is calculated as 97%.
  • The access response time is the time elapsed from when the access request is issued until the access response is received.
  • For example, if a proxy IP having the first validity sends an access request at 9:55:54 and receives the access response at 9:55:55, its access response time is calculated as 1 second.
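  • The two metrics can be computed as in the sketch below, which reproduces the figures from the examples above. The function names are hypothetical, and the calendar date used in the usage line is arbitrary because the text gives only times of day.

```python
from datetime import datetime


def access_success_rate(successes: int, total: int) -> float:
    """Ratio of successful accesses to total accesses in the observation period."""
    if total <= 0:
        raise ValueError("total number of accesses must be positive")
    return successes / total


def access_response_time(sent: datetime, received: datetime) -> float:
    """Seconds elapsed between issuing the request and receiving the response."""
    return (received - sent).total_seconds()


# Figures from the text: 97 successes out of 100 accesses; request sent at
# 9:55:54 and response received at 9:55:55 (the date itself is arbitrary).
rate = access_success_rate(97, 100)
latency = access_response_time(datetime(2018, 1, 1, 9, 55, 54),
                               datetime(2018, 1, 1, 9, 55, 55))
```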
  • the preset access success threshold may be, for example, 80%.
  • the preset access response time threshold may be, for example, 1 second.
  • When the access success rate of the proxy IP having the first validity is greater than the preset access success threshold and its access response time is less than the preset access response time threshold, it is determined that the proxy IP has the second validity; when its access success rate is less than or equal to the preset access success threshold, or its access response time is greater than or equal to the preset access response time threshold, it is determined that the proxy IP having the first validity does not have the second validity.
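  • A minimal sketch of this decision, using the example thresholds of 80% and 1 second; the function and constant names are hypothetical.

```python
SUCCESS_RATE_THRESHOLD = 0.80   # example preset access success threshold
RESPONSE_TIME_THRESHOLD = 1.0   # example preset access response time threshold, in seconds


def has_second_validity(success_rate: float, response_time: float) -> bool:
    """Both conditions must hold strictly; equality on either side fails the check."""
    return (success_rate > SUCCESS_RATE_THRESHOLD
            and response_time < RESPONSE_TIME_THRESHOLD)
```

  • Note that with these example thresholds a proxy IP with a 97% success rate but a response time of exactly 1 second does not have the second validity, because the response time must be strictly below the threshold.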
  • Once it is determined that the proxy IP to be verified can access the search engine, its access success rate and access response time are used to determine whether it has a higher degree of validity. In this way, it is possible to determine not only that the proxy IP is valid but also the quality of the proxy IP to be verified.
  • The higher the access success rate and the faster the access response time, the better the quality of the proxy IP; the lower the access success rate and the slower the access response time, the worse the quality of the proxy IP.
  • When it is determined that the proxy IP having the first validity has the second validity, step 204 is performed; when it is determined that the proxy IP having the first validity does not have the second validity, step 205 is performed.
  • a whitelist is set in advance in the proxy IP pool, and the whitelist is used to record the proxy IP in the proxy IP pool determined to have the second validity.
  • The access success rate and the access response time of each proxy IP are recorded in the whitelist so that they can guide later selection. For example, preferentially selecting a proxy IP with a higher access success rate and/or a faster access response time makes data crawling more efficient and increases the amount of data crawled.
  • The method may further include: assigning one of a plurality of effective levels to each proxy IP according to its access success rate and access response time, and recording the effective levels and their corresponding proxy IPs in the whitelist.
  • The plurality of effective levels may include, but is not limited to, a first effective level, a second effective level, and a third effective level.
  • The first effective level corresponds to the first access success rate and the first response time.
  • The second effective level corresponds to the second access success rate and the second response time.
  • The third effective level corresponds to the third access success rate and the third response time, and so on.
  • The first effective level is the highest level and the third effective level is the lowest.
  • The present application does not specifically limit the number of effective levels; two or more can be set according to actual needs.
  • For example, a proxy IP whose access success rate is greater than a preset first access success rate (for example, 95%) and whose access response time is less than a preset first access response time (for example, 0.5 seconds) is treated as a proxy IP of the first effective level; a proxy IP whose access success rate is less than the preset first access success rate but greater than a preset second access success rate (for example, 90%), and whose access response time is greater than the preset first access response time but less than a preset second access response time (for example, 1 second), is treated as a proxy IP of the second effective level; and a proxy IP whose access success rate is less than the preset second access success rate and whose access response time is greater than the preset second access response time is treated as a proxy IP of the third effective level.
  • The effective level of a proxy IP is determined according to its access success rate and access response time, so that when a proxy IP is subsequently selected, one can be quickly chosen from the proxy IPs of the first effective level for data crawling.
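  • The level assignment can be sketched as below, using the example thresholds (95% and 0.5 seconds, 90% and 1 second) and assuming that a faster response corresponds to a higher level. The function name is hypothetical. The text leaves mixed cases unspecified (for example, a high success rate with a slow response); letting such proxies fall through to the next lower level is an assumption of this sketch.

```python
FIRST_SUCCESS_RATE = 0.95    # example preset first access success rate
FIRST_RESPONSE_TIME = 0.5    # example preset first access response time, seconds
SECOND_SUCCESS_RATE = 0.90   # example preset second access success rate
SECOND_RESPONSE_TIME = 1.0   # example preset second access response time, seconds


def effective_level(success_rate: float, response_time: float) -> int:
    """Return 1 (highest), 2, or 3 (lowest) for a proxy IP in the whitelist."""
    if success_rate > FIRST_SUCCESS_RATE and response_time < FIRST_RESPONSE_TIME:
        return 1
    if success_rate > SECOND_SUCCESS_RATE and response_time < SECOND_RESPONSE_TIME:
        return 2
    # Assumption: everything else, including mixed cases, is the third level.
    return 3
```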
  • The method may further include: recording, in the whitelist, the type of search engine that each proxy IP having the second validity can access.
  • That is, the type of search engine that the proxy IP can access is recorded in the whitelist.
  • For example, the current proxy IP may be able to access only Baidu, or only Sogou, or it may be able to access any search engine.
  • The search engine type recorded for each proxy IP in the whitelist allows a targeted replacement when a proxy IP is subsequently selected, based on the type of search engine the current proxy IP is accessing. For example, if the current proxy IP is accessing Baidu and needs to be replaced, a proxy IP can be selected for data crawling from the proxy IPs recorded as corresponding to Baidu.
  • The whitelist also records, for each proxy IP having the second validity, its access success rate, access response time, type of search engine accessed, time of acquisition, number of accesses, and so on.
  • A proxy IP determined not to have the first validity, or to have the first validity but not the second validity, is recorded in a blacklist in the proxy IP pool.
  • A blacklist is set up in advance in the proxy IP pool; it records the proxy IPs in the pool that are determined not to have the first validity, or to have the first validity but not the second validity.
  • The preset access failure rate threshold may be, for example, 50%.
  • A proxy IP that cannot successfully access the search engine at the first verification is identified as not having the first validity. If subsequent repeated verification shows that its access failure rate is less than the preset access failure rate threshold, it is considered only temporarily invalid and is moved from the second blacklist sublist to the first blacklist sublist.
  • A proxy IP does not work well at all times. It may be unstable and therefore unusable for a certain period; verification may fail because of problems with the search engine itself (for example, degraded search engine performance) or a slow network, even though the proxy IP can still be used later; or the proxy IP may be banned because of frequent access, with the access restriction lifted after a period of time.
  • Only when repeated verification shows that the access failure rate is greater than the preset access failure rate threshold is a proxy IP without the first validity considered truly invalid. This avoids mistaking a proxy IP for permanently invalid after a single verification failure, which would waste proxy IPs. Subsequently, if all the proxy IPs having the second validity in the whitelist are unavailable, a proxy IP may be selected from the first blacklist sublist.
  • Proxy IPs that do not have the first validity and whose access failure rate is greater than the preset access failure rate threshold are recorded in the second blacklist sublist. When proxy IPs are later obtained from free websites or purchased from a third-party service organization, they can be matched directly against the permanently invalid proxy IPs in the second blacklist sublist. This quickly determines whether a newly obtained proxy IP is permanently invalid and avoids verifying every newly recorded proxy IP in the pool one by one, saving time.
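  • The split into temporarily and permanently invalid proxy IPs can be sketched as follows. The function names, the number of verification attempts, and the handling of a failure rate exactly equal to the threshold (treated here as permanent) are assumptions; the text only distinguishes "less than" and "greater than" the 50% example threshold.

```python
from typing import Callable, Iterable, List, Set

FAILURE_RATE_THRESHOLD = 0.5   # example preset access failure rate threshold


def classify_invalid_proxy(proxy: str, check: Callable[[str], bool],
                           attempts: int = 10) -> str:
    """Re-verify a proxy that failed its first access and classify it.

    Returns "temporary" (first blacklist sublist) or "permanent"
    (second blacklist sublist).
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = sum(1 for _ in range(attempts) if not check(proxy))
    failure_rate = failures / attempts
    return "temporary" if failure_rate < FAILURE_RATE_THRESHOLD else "permanent"


def filter_new_proxies(candidates: Iterable[str],
                       permanent_blacklist: Set[str]) -> List[str]:
    """Drop newly obtained proxies already known to be permanently invalid."""
    return [p for p in candidates if p not in permanent_blacklist]
```

  • Matching new proxy IPs against the permanent sublist first means only the remaining candidates need the slower one-by-one verification.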
  • The proxy selection rule is preset and includes one or more of the following, alone or in combination:
  • 1) The search engine type accessed by each proxy IP is recorded in the whitelist. If the search engine currently being accessed is Baidu, a proxy IP is selected from the proxy IPs corresponding to Baidu in the whitelist.
  • 2) The number of crawls of each proxy IP is recorded in the whitelist; the proxy IPs are sorted by number of crawls, and a proxy IP with fewer crawls is selected.
  • 3) The search engine types accessed by each proxy IP are recorded in the whitelist; the number of search engine types accessed by each proxy IP is counted and sorted, and a proxy IP that accesses more search engine types is selected.
  • 4) A proxy IP that was newly recorded in the whitelist is selected.
  • A proxy IP is selected from the whitelist according to rules 1) to 4) above.
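  • One possible way to combine the four rules is sketched below. The text allows any combination, so the order used here (filter by search engine, then prefer fewer crawls, then more supported search engine types, then the most recently recorded entry) is an assumption, as are the names `WhitelistEntry` and `select_proxy`.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class WhitelistEntry:
    proxy: str
    engines: FrozenSet[str]   # search engine types this proxy can access
    crawl_count: int          # number of crawls performed so far
    recorded_at: float        # time at which the entry was added to the whitelist


def select_proxy(entries: Iterable[WhitelistEntry],
                 current_engine: str) -> Optional[str]:
    """Pick a replacement proxy for the search engine currently being crawled."""
    # Rule 1: keep only proxies recorded for the current search engine.
    candidates = [e for e in entries if current_engine in e.engines]
    if not candidates:
        return None
    # Rules 2-4 as tie-breakers: fewer crawls, more engine types, newer entry.
    best = min(candidates,
               key=lambda e: (e.crawl_count, -len(e.engines), -e.recorded_at))
    return best.proxy
```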
  • The web crawling method may further include: providing a user option that lets the user add, delete, or change entries according to actual needs.
  • In this way the whitelist can be updated in time, ensuring that the proxy IPs in the whitelist are available and valid proxies and eliminating the effect of proxy changes on data crawling.
  • The web crawling method may further include: cascading a plurality of proxy devices (user agents) that can communicate with each other, each of which stores the preset proxy IP pool in local storage.
  • When the access time of one user agent exceeds the preset time, another user agent takes over data crawling.
  • The web crawling method may further include: storing the constructed proxy IP pool on a preset server, which assigns an IP to each proxy device for data crawling.
  • The proxy device can include a mobile device, a web device, and the like.
  • The web crawling method may further include: registering a plurality of accounts and, when using the same IP, performing data crawling by switching between accounts.
  • Obtaining an IP from the proxy IP pool and crawling data with multiple proxy devices (user agents), as provided by the present application, may be performed together; that is, the proxy IP may be replaced at the same time as the proxy device.
  • In summary, the web crawling method described in the present application, first, continuously obtains proxy IPs and updates the proxy IP pool in real time, ensuring that the pool holds enough high-quality proxies; second, uses the access success rate and access response time of each proxy IP to determine whether it has the second validity, thereby determining its degree of validity; third, further subdivides proxy IPs determined to have the second validity into effective levels; and fourth, verifies proxy IPs that failed the first access multiple times and subdivides them into permanently invalid and temporarily invalid, which avoids wasting proxy IPs.
  • This not only effectively solves the IP restriction problem of the same proxy IP during long, rapid and repeated, or high-volume data crawling, but also ensures that the most effective proxy IP is selected for data crawling.
  • FIG. 3 is a functional block diagram of a preferred embodiment of the web crawler device of the present application.
  • the web crawler device 30 operates in a terminal.
  • the web crawler device 30 can include a plurality of functional modules consisting of program code segments.
  • Program code for each of the program segments in the web crawler device 30 may be stored in a memory and executed by at least one processor to perform web crawling (see FIG. 1 and its associated description).
  • The web crawler device 30 may be divided into a plurality of functional modules according to the functions it performs.
  • the function module may include: a storage module 301, a determination module 302, a recording module 303, a selection module 304, and a crawl module 305.
  • a module as referred to in this application refers to a series of computer readable instruction segments that are executable by at least one processor and capable of performing a fixed function, which are stored in the memory.
  • the storage module 301 is configured to store the plurality of proxy IPs acquired every preset time period in a preset proxy IP pool.
  • A proxy IP pool is preset in the local database, and the obtained proxy IPs are added to the proxy IP pool for use by the crawler.
  • Proxy IPs can be found on websites on the Internet that publish proxy IPs. The specific list can be collected manually or by another small crawler. It is also possible to purchase multiple proxy IPs from a third-party service organization and add them to the preset proxy IP pool.
  • The proxy information of a proxy IP may include, but is not limited to, an IP address, a name, and a port.
  • Proxy IPs may be obtained every preset time period (for example, every other day), either manually or automatically by another small crawler from websites that provide proxy IPs, or by purchasing multiple proxy IPs from a third-party service organization. The obtained proxy IPs are stored in the proxy IP pool so that the pool always holds enough IPs. Because proxy IPs are obtained without interruption, the proxy IP pool stays up to date for use by the crawler.
  • the determining module 302 is configured to verify each proxy IP in the proxy IP pool one by one, and determine the validity of the obtained proxy IP.
  • The proxy IP undergoing validity verification is referred to as the proxy IP to be verified. The proxy IP to be verified is used to access a search engine (for example, Google or Baidu) for the first time, to check whether a response from the search engine is obtained. If the first access receives a response, the proxy IP to be verified is valid; if it does not, the proxy IP to be verified is invalid. Obtaining a response means that the proxy IP to be verified can crawl data from the accessed search engine on first access, that is, it is not restricted by that search engine.
  • the recording module 303 is configured to record the proxy IP determined to be valid in the whitelist in the proxy IP pool, and record the proxy IP determined to be invalid in the blacklist in the proxy IP pool.
  • A whitelist and a blacklist are set up in advance in the proxy IP pool.
  • The whitelist records the proxy IPs in the proxy IP pool that are determined to be valid.
  • The blacklist records the proxy IPs in the proxy IP pool that are determined to be invalid.
  • the selecting module 304 is configured to: when detecting that the current proxy IP meets a preset proxy replacement condition, select a proxy IP from the whitelist in the proxy IP pool.
  • The proxy replacement condition is set in advance. When it is detected that the current proxy IP satisfies the preset proxy replacement condition, a proxy IP is selected from the whitelist in the proxy IP pool.
  • The current proxy IP satisfies the preset proxy replacement condition when one or more of the following hold: its access time exceeds a preset access time threshold, its access frequency exceeds a preset access frequency threshold, or its number of accesses exceeds a preset access count threshold.
  • For example, the preset access time threshold may be 10 minutes. When the current proxy IP has been accessing for more than 10 minutes, it is detected that the current proxy IP satisfies the preset proxy replacement condition, and a proxy IP is selected from the proxy IP pool. This effectively solves the IP restriction problem caused by the same proxy IP crawling data for a long time.
  • For example, the preset access frequency threshold may be 100 times/second. When the access frequency of the current proxy IP exceeds 100 times/second, it is detected that the current proxy IP satisfies the preset proxy replacement condition, and a proxy IP is selected from the proxy IP pool. This effectively solves the IP restriction problem caused by the same proxy IP crawling data rapidly and repeatedly.
  • For example, the preset access count threshold may be 200. When the number of accesses of the current proxy IP exceeds 200, it is detected that the current proxy IP satisfies the preset proxy replacement condition, and a proxy IP is selected from the proxy IP pool. This effectively solves the IP restriction problem caused by the same proxy IP crawling a large amount of data.
  • the crawling module 305 is configured to perform data crawling by using the selected proxy IP as a new proxy IP.
  • The current proxy IP is replaced with the proxy IP selected from the whitelist in the proxy IP pool, and the selected proxy IP is used as the new proxy IP for data crawling.
  • The proxy IP replacement is performed automatically, which eliminates the trouble of frequent manual replacement and makes crawling more efficient.
  • In summary, the storage module 301 stores the acquired plurality of proxy IPs in a preset proxy IP pool; the determining module 302 verifies each proxy IP in the proxy IP pool one by one to determine its validity; the recording module 303 records the proxy IPs determined to be valid in the whitelist in the proxy IP pool and the proxy IPs determined to be invalid in the blacklist in the proxy IP pool; the selecting module 304 selects a proxy IP from the whitelist in the proxy IP pool when it detects that the current proxy IP satisfies the preset proxy replacement condition; and the crawling module 305 uses the selected proxy IP as the new proxy IP for data crawling. This effectively solves the IP restriction problem of the same proxy IP during long, rapid and repeated, or high-volume data crawling.
  • FIG. 4 is a functional block diagram of a preferred embodiment of the web crawler device of the present application.
  • The web crawler device 40 operates in a terminal.
  • The web crawler device 40 can include a plurality of functional modules consisting of program code segments.
  • Program code for each of the program segments in the web crawler device 40 may be stored in a memory and executed by at least one processor to perform web crawling (see FIG. 2 and its associated description).
  • The web crawler device 40 may be divided into a plurality of functional modules according to the functions it performs.
  • the function module may include: a storage module 401, a first determining module 402, a second determining module 403, a first recording module 404, a second recording module 405, a selecting module 406, and a crawling module 407.
  • the storage module 401 is configured to store the plurality of proxy IPs acquired every preset time period in a preset proxy IP pool.
  • In this embodiment, proxy IPs may be obtained every preset time period (for example, at intervals of one day or one week), either manually or automatically by a separate small crawler from websites on the Internet that provide proxy IPs, or by purchasing multiple proxy IPs from a third-party service provider. The obtained proxy IPs are stored in the proxy IP pool, which ensures that the pool contains enough IPs; because proxy IPs are obtained continuously, the proxy IP pool is kept up to date for use by the crawler.
  • the first determining module 402 is configured to verify each proxy IP in the proxy IP pool one by one, and determine whether the obtained proxy IP has the first validity.
  • In this embodiment, a proxy IP undergoing first-validity verification is referred to as the proxy IP to be verified. The proxy IP to be verified is used to access a search engine (for example, Google or Baidu) to check whether a response is obtained from the search engine. If a response is obtained, the proxy IP to be verified has the first validity; if no response is obtained, it does not have the first validity. Obtaining a response from the search engine means that the proxy IP to be verified can crawl data from the accessed search engine, that is, its access is not restricted by that search engine.
  • the second determining module 403 is configured to: when the first determining module 402 determines that a proxy IP in the proxy IP pool has the first validity, determine, according to the access success rate and the access response time of the proxy IP, whether the proxy IP having the first validity also has a second validity.
  • To obtain more effective proxy IPs that can complete access tasks, the access success rate and the access response time of a proxy IP may further be used as criteria for verifying whether the proxy IP is valid.
  • The determining, by the second determining module 403, of whether a proxy IP having the first validity has the second validity according to the access success rate and the access response time of the proxy IP specifically includes:
  • 1) using the proxy IP having the first validity to access multiple search engines multiple times, and calculating the access success rate and the access response time of the proxy IP having the first validity;
  • The access success rate is the ratio of the number of successful accesses to the total number of accesses within a preset time period. For example, if within a one-hour period the proxy IP having the first validity accesses search engines 100 times in total, of which 97 accesses succeed and 3 fail, the access success rate of the proxy IP is calculated as 97/100 = 97%.
  • The access response time is the time from when the access request is issued until the access response is received.
  • For example, if the proxy IP having the first validity issues an access request at 9:55:54 and receives the access response at 9:55:55, its access response time is calculated as 1 second.
  • 2) determining whether the access success rate of the proxy IP having the first validity is greater than a preset access success rate threshold, and determining whether its access response time is less than a preset access response time threshold;
  • The preset access success rate threshold may be, for example, 80%.
  • The preset access response time threshold may be, for example, 1 second.
  • 3) when the access success rate of the proxy IP having the first validity is greater than the preset access success rate threshold and its access response time is less than the preset access response time threshold, determining that the proxy IP has the second validity; when its access success rate is less than or equal to the preset access success rate threshold, or its access response time is greater than or equal to the preset access response time threshold, determining that the proxy IP having the first validity does not have the second validity.
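  • The second-validity test can be sketched as follows. The function names and the default parameter values are illustrative assumptions; the defaults simply reuse the example thresholds given above (80% and 1 second).

```python
def access_success_rate(successes: int, total: int) -> float:
    """Ratio of successful accesses to total accesses in the observation period."""
    if total <= 0:
        raise ValueError("total number of accesses must be positive")
    return successes / total


def has_second_validity(
    success_rate: float,
    response_time_s: float,
    min_success_rate: float = 0.80,   # example threshold from the description
    max_response_time_s: float = 1.0,  # example threshold from the description
) -> bool:
    # Both conditions must hold strictly: rate above and response time below the thresholds.
    return success_rate > min_success_rate and response_time_s < max_response_time_s
```

  • Note that, with these example thresholds, a proxy IP whose response time is exactly 1 second does not have the second validity, because the comparison on the response time is strict.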
  • In this embodiment, it is first determined whether the proxy IP to be verified can access a search engine; when it can, the access success rate and the access response time are then used to determine whether the proxy IP has a higher degree of validity. In this way, it is possible not only to determine that the proxy IP is valid, but also to further determine its quality.
  • The higher the access success rate and the faster the access response time, the better the quality of the proxy IP; the lower the access success rate and the slower the access response time, the worse the quality of the proxy IP.
  • the first recording module 404 is configured to record the proxy IP determined to have the second validity in a whitelist in the proxy IP pool.
  • a whitelist is set in advance in the proxy IP pool, and the whitelist is used to record the proxy IP in the proxy IP pool determined to have the second validity.
  • The access success rate and the access response time of each proxy IP are recorded in the whitelist so that proxy IPs can be selected in a targeted manner later. For example, preferentially selecting a proxy IP with a higher access success rate and/or a faster access response time makes data crawling more efficient and increases the amount of data crawled.
  • Further, the first recording module 404 is also configured to assign one of a plurality of effective levels to each proxy IP according to its access success rate and access response time, and to record the effective levels and their corresponding proxy IPs in the whitelist.
  • the plurality of valid levels may include, but are not limited to, a first effective level, a second effective level, and a third effective level.
  • the first effective level corresponds to the first access success rate and the first response time
  • the second effective level corresponds to the second access success rate and the second response time
  • the third effective level corresponds to the third access success rate and the third response time, and so on.
  • the first effective level has the highest level and the third effective level has the lowest level.
  • the application does not specifically limit the number of effective levels to be set, and two or more can be set according to actual needs.
  • Specifically, a proxy IP whose access success rate is greater than a preset first access success rate (for example, 95%) and whose access response time is less than a preset first access response time (for example, 0.5 seconds) is taken as a proxy IP of the first effective level; a proxy IP whose access success rate is less than the preset first access success rate but greater than a preset second access success rate (for example, 90%), and whose access response time is greater than the preset first access response time but less than a preset second access response time (for example, 1 second), is taken as a proxy IP of the second effective level; and a proxy IP whose access success rate is less than the preset second access success rate and whose access response time is greater than the preset second access response time is taken as a proxy IP of the third effective level.
  • In this embodiment, the effective level of a proxy IP is determined according to its access success rate and access response time, so that when a proxy IP is selected later, one can quickly be chosen from the proxy IPs of the first effective level for data crawling.
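  • A level assignment of this kind might look like the sketch below. It assumes that a higher success rate and a shorter response time mean a higher level, in line with the quality statement earlier in this embodiment, and it reuses the example values 95%, 90%, 0.5 seconds, and 1 second. The function name and the handling of mixed cases (a proxy IP that meets the rate of one level but the response time of another falls to the lower level) are assumptions, since the description does not cover them.

```python
def effective_level(success_rate: float, response_time_s: float) -> int:
    """Return 1 (highest), 2, or 3 (lowest) for a proxy IP that has the second validity."""
    # First level: very reliable and very fast.
    if success_rate > 0.95 and response_time_s < 0.5:
        return 1
    # Second level: reliable and reasonably fast.
    if success_rate > 0.90 and response_time_s < 1.0:
        return 2
    # Everything else is placed in the third level.
    return 3
```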
  • Further, the first recording module 404 is also configured to record, in the whitelist, the types of search engine accessed by each proxy IP having the second validity.
  • The types of search engine that each proxy IP can access are recorded in the whitelist.
  • For example, a given proxy IP may be able to access only Baidu, or only Sogou, or it may be able to access any search engine.
  • In this embodiment, the types of search engine accessed by each proxy IP are recorded in the whitelist so that, when a proxy IP is selected later, the replacement can be chosen according to the type of search engine the current proxy IP is accessing. For example, if the current proxy IP is accessing Baidu and needs to be replaced, a proxy IP can be selected for data crawling from the proxy IPs recorded as able to access Baidu.
  • In this embodiment, the whitelist also records each proxy IP having the second validity together with its access success rate, access response time, types of search engine accessed, acquisition time, number of accesses, and so on.
  • the second recording module 405 is configured to record, in the blacklist in the proxy IP pool, the proxy IPs determined not to have the first validity and the proxy IPs that have the first validity but not the second validity.
  • In this embodiment, a blacklist is set in the proxy IP pool in advance; the blacklist records the proxy IPs in the proxy IP pool that are determined not to have the first validity and those that have the first validity but not the second validity.
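  • Further, the recording by the second recording module 405 of these proxy IPs in the blacklist in the proxy IP pool may also include:
  • 1) recording the proxy IPs that have the first validity but not the second validity in a first blacklist sublist of the blacklist, and recording the proxy IPs that do not have the first validity in a second blacklist sublist of the blacklist;
  • 2) using each proxy IP that does not have the first validity to access multiple search engines multiple times, and calculating its access failure rate;
  • 3) determining whether the access failure rate of the proxy IP that does not have the first validity is less than a preset access failure rate threshold;
  • The preset access failure rate threshold may be, for example, 50%.
  • 4) when the access failure rate is less than the preset access failure rate threshold, confirming the proxy IP as a temporarily invalid proxy IP and recording it in the first blacklist sublist; when the access failure rate is greater than the preset access failure rate threshold, confirming the proxy IP as a permanently invalid proxy IP and recording it in the second blacklist sublist.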
  • In this embodiment, a proxy IP that cannot successfully access the search engine at the first verification is confirmed as not having the first validity; however, if subsequent repeated verification shows that its access failure rate is less than the preset failure rate threshold, it is regarded as a temporarily invalid proxy IP and is moved from the second blacklist sublist to the first blacklist sublist.
  • The reason is that a proxy IP does not work well at all times: the proxy IP may be unstable and therefore unusable for a certain period; the verification may fail because of problems with the search engine itself (for example, degraded search engine performance) or a slow network, even though the proxy IP can still be used later; or the proxy IP may have been banned for frequent access, with the ban possibly lifted after a period of time.
  • Secondly, only when repeated verification shows that the access failure rate of a proxy IP that does not have the first validity is greater than the preset access failure rate threshold is that proxy IP regarded as truly invalid. This avoids treating a proxy IP as permanently invalid after a single failed verification, which would waste proxy IPs. Subsequently, if none of the proxy IPs having the second validity in the whitelist can be used, a proxy IP may be selected from the first blacklist sublist.
  • In addition, because the second blacklist sublist records the proxy IPs that do not have the first validity and whose access failure rate is greater than the preset access failure rate threshold, proxy IPs obtained later from free websites or purchased from a third-party service provider can be matched directly against the permanently invalid proxy IPs in the second blacklist sublist. This quickly determines whether a newly obtained proxy IP is permanently invalid and avoids verifying every proxy IP newly recorded in the proxy pool one by one, which saves time.
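  • The split between temporarily and permanently invalid proxy IPs can be sketched as a single comparison. The function name and the return labels are hypothetical; the 50% default is the example threshold above. The description does not say what happens when the failure rate equals the threshold, so treating that case as permanent is an assumption.

```python
def classify_failed_proxy(
    failures: int,
    attempts: int,
    failure_rate_threshold: float = 0.50,  # example threshold from the description
) -> str:
    """Classify a proxy IP that failed its first verification after repeated re-checks."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failure_rate = failures / attempts
    # Below the threshold: likely a transient problem, so keep it in the first sublist.
    if failure_rate < failure_rate_threshold:
        return "temporary"
    # Otherwise: treat as permanently invalid and keep it in the second sublist.
    return "permanent"
```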
  • the selecting module 406 is configured to: when detecting that the current proxy IP satisfies the preset proxy replacement condition, select a proxy IP from the whitelist in the proxy IP pool according to a preset proxy selection rule.
  • In this embodiment, the proxy selection rule is set in advance and includes one or a combination of the following:
  • 1) selecting, according to the type of search engine currently being accessed, from the proxy IPs in the whitelist that correspond to that search engine type. For example, the whitelist records the search engine types accessed by each proxy IP; if the search engine currently being accessed is Baidu, a proxy IP is selected from the proxy IPs in the whitelist that correspond to Baidu.
  • 2) selecting according to the number of crawls recorded in the whitelist. For example, the whitelist records the number of crawls of each proxy IP; the proxy IPs are sorted by number of crawls, and a proxy IP with a small number of crawls is selected.
  • 3) selecting according to the number of search engine types accessed, as recorded in the whitelist. For example, the number of search engine types accessed by each proxy IP is counted and sorted, and a proxy IP that accesses a larger number of search engine types is selected.
  • 4) selecting according to the acquisition time recorded in the whitelist. For example, the whitelist records the acquisition time of each proxy IP, and the proxy IP most recently recorded in the whitelist is selected.
  • 5) selecting a proxy IP after a preset delay. When it is detected that the current proxy IP satisfies the preset proxy replacement condition, a proxy IP is selected from the whitelist according to rules 1) to 4) above after a preset time period has elapsed.
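  • One possible way to combine rules 1) to 4) is sketched below. The record type `WhitelistEntry`, its field names, and the priority order (engine type first, then fewest crawls, then most engine types, then most recent acquisition) are assumptions made for illustration; the description allows any one rule or any combination.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class WhitelistEntry:
    """Hypothetical whitelist record for one proxy IP."""

    proxy: str
    engines: FrozenSet[str]  # search engine types this proxy can access
    crawl_count: int         # number of crawls performed so far
    acquired_at: float       # acquisition time, e.g. a Unix timestamp


def select_proxy(entries: List[WhitelistEntry], engine: str) -> Optional[str]:
    # Rule 1: keep only proxies recorded for the engine currently being accessed.
    candidates = [e for e in entries if engine in e.engines]
    if not candidates:
        return None
    # Rules 2-4 as tie-breakers: fewest crawls, most engine types, newest acquisition.
    best = min(
        candidates,
        key=lambda e: (e.crawl_count, -len(e.engines), -e.acquired_at),
    )
    return best.proxy


entries = [
    WhitelistEntry("10.0.0.1:8080", frozenset({"baidu"}), 5, 300.0),
    WhitelistEntry("10.0.0.2:8080", frozenset({"baidu", "sogou"}), 2, 100.0),
    WhitelistEntry("10.0.0.3:8080", frozenset({"baidu"}), 2, 200.0),
    WhitelistEntry("10.0.0.4:8080", frozenset({"sogou"}), 0, 400.0),
]
```

  • With these sample entries, a request for a Baidu proxy picks `10.0.0.2:8080`: it ties on crawl count with `10.0.0.3:8080` but covers more search engine types.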
  • the crawling module 407 is configured to perform data crawling by using the selected proxy IP as a new proxy IP.
  • Preferably, the whitelist may also provide a user option that lets the user add, delete, or modify entries according to actual needs. This allows the whitelist to be updated promptly, ensures that the proxy IPs in the whitelist are all available and effective, and eliminates the impact of proxy changes on data crawling.
  • In summary, the web crawler device 40 described in the present application, first, continuously obtains proxy IPs and updates the proxy IP pool in real time to ensure that the pool contains enough high-quality proxies; second, further determines, according to the access success rate and access response time of each proxy IP, whether the proxy IP has the second validity, thereby determining its degree of validity; third, further subdivides the proxy IPs determined to have the second validity into effective levels; and fourth, verifies multiple times the proxy IPs whose first access failed and subdivides them into permanently invalid and temporarily invalid, which avoids wasting proxy IPs.
  • FIG. 5 is a schematic diagram of a terminal according to Embodiment 5 of the present application.
  • the terminal 5 comprises a memory 51, at least one processor 52, computer readable instructions 53 stored in the memory 51 and operable on the at least one processor 52, and at least one communication bus 54.
  • the at least one processor 52 implements the steps in the above-described web crawler method embodiment when the computer readable instructions 53 are executed.
  • the terminal 5 can be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. Those skilled in the art will understand that FIG. 5 is only an example of the terminal 5 and does not limit it; the terminal 5 may include more or fewer components than illustrated, may combine some components, or may use different components.
  • the terminal 5 may further include an input/output device, a network access device, a bus, and the like.
  • the at least one processor 52 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the processor 52 may be a microprocessor or the processor 52 may be any conventional processor or the like.
  • the processor 52 is the control center of the terminal 5 and connects the various parts of the entire terminal 5 through various interfaces and lines.
  • the memory 51 can be used to store the computer readable instructions 53 and/or modules/units; the processor 52 implements the various functions of the terminal 5 by running or executing the computer readable instructions and/or modules/units stored in the memory 51 and by calling the data stored in the memory 51.
  • the memory 51 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function (such as a sound playing function, an image playing function, etc.); and the storage data area may be Data (such as audio data, phone book, etc.) created according to the use of the terminal 5 is stored.
  • the memory 51 may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or other volatile solid state storage device.
  • the modules/units integrated by the terminal 5 can be stored in a non-volatile readable storage medium if implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the present application implements all or part of the processes of the above-described embodiments, and can also be accomplished by instructing related hardware through computer readable instructions.
  • the computer readable instructions comprise computer readable instruction code, which may be in the form of source code, an object code form, an executable file or some intermediate form or the like.
  • the non-volatile readable medium may include any entity or device capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium.
  • the contents of the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, non-volatile readable media do not include electrical carrier signals and telecommunication signals.
  • each functional unit in each embodiment of the present application may be integrated in the same processing unit, or each unit may exist physically separately, or two or more units may be integrated in the same unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software function modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Power Engineering (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

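A web crawler method, comprising: storing a plurality of proxy IPs, acquired every preset time period, in a preset proxy IP pool; verifying each proxy IP in the proxy IP pool one by one to determine the validity of the acquired proxy IPs; recording the proxy IPs determined to be valid in a whitelist in the proxy IP pool, and recording the proxy IPs determined to be invalid in a blacklist in the proxy IP pool; when it is detected that the current proxy IP satisfies a preset proxy replacement condition, selecting a proxy IP from the whitelist in the proxy IP pool; and using the selected proxy IP as a new proxy IP for data crawling. The present application further provides a terminal and a storage medium. The present application can effectively solve the IP restriction problem that arises when the same proxy IP crawls data for a long time, at high frequency, or in large volume.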

Description

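Web crawler method, terminal, and storage medium
This application claims priority to Chinese patent application No. 201810349987.6, entitled "Web crawler method, terminal, and storage medium" and filed with the Chinese Patent Office on April 18, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of web crawlers, and in particular to a web crawler method, a terminal, and a storage medium.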
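Background
A web crawler is a very important component of a search engine system. It collects web pages and gathers information from the Internet; this page information is used to build indexes that support the search engine, so the crawler's performance directly affects the quality of the search engine. As the amount of information on the network grows geometrically, the requirements on the performance and efficiency of page collection by web crawlers keep rising.
We always want to obtain more data in less time, but this places a very high load on websites and also brings problems such as increased network traffic and leakage of private data. Many websites therefore use crawler detection techniques that analyze web access logs; when a crawler is identified, its address is banned and it is refused further access. In particular, when crawling the Baidu Index, the Weibo Index, and the like in batches, the number or frequency of crawls from the same account and the same IP is restricted.
Therefore, to prevent the crawler from being detected, it is necessary to provide a mechanism for countering anti-crawler measures.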
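Summary
In view of the above, it is necessary to provide a web crawler method, a terminal, and a storage medium that, in combination with depth information, construct a proxy IP pool and select proxy IPs from the pool for crawling according to preset selection rules or policies, thereby effectively solving the problem that the number or frequency of crawls from the same proxy IP is restricted.
A first aspect of the present application provides a web crawler method, the method comprising:
storing a plurality of proxy IPs, acquired every preset time period, in a preset proxy IP pool;
verifying each proxy IP in the proxy IP pool one by one to determine the validity of the acquired proxy IPs;
recording the proxy IPs determined to be valid in a whitelist in the proxy IP pool, and recording the proxy IPs determined to be invalid in a blacklist in the proxy IP pool;
when it is detected that the current proxy IP satisfies a preset proxy replacement condition, selecting a proxy IP from the whitelist in the proxy IP pool; and
using the selected proxy IP as a new proxy IP for data crawling.
A second aspect of the present application provides a terminal, the terminal comprising a processor and a memory, the processor being configured to implement the web crawler method when executing computer-readable instructions stored in the memory.
A third aspect of the present application provides a non-volatile readable storage medium storing computer-readable instructions which, when executed by a processor, implement the web crawler method.
The web crawler method, terminal, and storage medium described in the present application can effectively solve the IP restriction problem that arises when the same proxy IP crawls data for a long time, at high frequency, or in large volume. First, proxy IPs are acquired continuously and the proxy IP pool is updated in real time, ensuring that the pool contains enough high-quality proxies; second, whether a proxy IP has the second validity is further determined according to its access success rate and access response time, thereby determining its degree of validity; third, the proxy IPs determined to have the second validity are further subdivided into effective levels; fourth, proxy IPs whose first access failed are verified multiple times and subdivided into permanently invalid and temporarily invalid, which avoids wasting proxy IPs. The application thus not only solves the IP restriction problem described above, but also ensures that the most effective proxy IP is selected for data crawling.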
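Brief Description of the Drawings
FIG. 1 is a flowchart of the web crawler method provided in Embodiment 1 of the present application.
FIG. 2 is a flowchart of the web crawler method provided in Embodiment 2 of the present application.
FIG. 3 is a structural diagram of the web crawler device provided in Embodiment 3 of the present application.
FIG. 4 is a structural diagram of the web crawler device provided in Embodiment 4 of the present application.
FIG. 5 is a schematic diagram of the terminal provided in Embodiment 5 of the present application.
The following detailed description further explains the present application with reference to the above drawings.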
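Detailed Description
The web crawler method of the embodiments of the present application is applied in one or more terminals. The web crawler method may also be applied in a hardware environment consisting of a terminal and a server connected to the terminal through a network. The web crawler method of the embodiments of the present application may be executed by the server, by the terminal, or jointly by the server and the terminal.
For a terminal that needs to perform the web crawler method, the web crawler function provided by the method of the present application may be integrated directly on the terminal, or a client implementing the method of the present application may be installed. As another example, the method provided by the present application may run on a device such as a server in the form of a Software Development Kit (SDK), with an interface to the web crawler function provided in the form of the SDK; a terminal or other device can then implement the web crawler function through the provided interface.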
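Embodiment 1
FIG. 1 is a flowchart of the web crawler method provided in Embodiment 1 of the present application.
101: Store a plurality of proxy IPs, acquired every preset time period, in a preset proxy IP pool.
In this embodiment, a proxy IP pool is set up in advance in a local database, and the acquired proxy IPs are added to the proxy IP pool for use by the crawler. Proxy IPs can be found on websites on the Internet that provide proxy IPs; the specific lists can be obtained manually or automatically by a separate small crawler. Multiple proxy IPs can also be purchased from a third-party service provider and added to the preset proxy IP pool.
In this embodiment, the proxy information of a proxy IP may include, but is not limited to, the IP address, the name, and the port.
In this embodiment, proxy IPs may be obtained every preset time period (for example, at intervals of one day or one week), either manually or automatically by a separate small crawler from websites on the Internet that provide proxy IPs, or by purchasing multiple proxy IPs from a third-party service provider. The obtained proxy IPs are stored in the proxy IP pool, which ensures that the pool contains enough IPs; because proxy IPs are obtained continuously, the proxy IP pool is kept up to date for use by the crawler.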
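102: Verify each proxy IP in the proxy IP pool one by one to determine the validity of the acquired proxy IPs.
In this embodiment, a proxy IP undergoing validity verification is referred to as the proxy IP to be verified. The proxy IP to be verified is used to access a search engine (for example, Google or Baidu) for the first time to check whether a response is obtained from the search engine. If the first access obtains a response, the proxy IP to be verified is valid; if the first access obtains no response, the proxy IP to be verified is invalid. Obtaining a response from the search engine means that the proxy IP to be verified can crawl data from the accessed search engine on its first access, that is, its first access is not restricted by that search engine.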
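A first-access check of this kind could be sketched in Python with the standard library as follows. The function names, the default URL, the timeout, and the use of HTTP status 200 as the sign of "a response was obtained" are all assumptions for illustration; the application does not specify them. The `fetch` parameter is injectable so that the check can be exercised without network access.

```python
import urllib.request
from typing import Callable


def fetch_via_proxy(proxy: str, url: str, timeout: float = 5.0) -> int:
    """Request `url` through `proxy` (e.g. "http://10.0.0.1:8080") and return the HTTP status."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.status


def is_valid_on_first_access(
    proxy: str,
    url: str = "https://www.baidu.com",  # assumed target; any search engine could be used
    fetch: Callable[[str, str], int] = fetch_via_proxy,
) -> bool:
    # A proxy counts as valid if the first access yields a successful response.
    try:
        return fetch(proxy, url) == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```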
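103: Record the proxy IPs determined to be valid in a whitelist in the proxy IP pool, and record the proxy IPs determined to be invalid in a blacklist in the proxy IP pool.
In this embodiment, a whitelist and a blacklist are set in the proxy IP pool in advance. The whitelist records the proxy IPs in the proxy IP pool that are determined to be valid, and the blacklist records the proxy IPs in the proxy IP pool that are determined to be invalid.
104: When it is detected that the current proxy IP satisfies a preset proxy replacement condition, select a proxy IP from the whitelist in the proxy IP pool.
In this embodiment, the proxy replacement condition is set in advance; when it is detected that the current proxy IP satisfies the preset proxy replacement condition, a proxy IP is selected from the whitelist in the proxy IP pool.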
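In this embodiment, the current proxy IP satisfies the preset proxy replacement condition in one or a combination of the following cases:
1) the access time of the current proxy IP exceeds a preset access time threshold;
The preset access time threshold may be 10 minutes. When the access time of the current proxy IP exceeds 10 minutes, the current proxy IP is detected as satisfying the preset proxy replacement condition and a proxy IP is selected from the proxy IP pool. This effectively solves the IP restriction problem caused by the same proxy IP crawling data for a long time.
2) the access frequency of the current proxy IP exceeds a preset access frequency threshold;
The preset access frequency threshold may be 100 accesses per second. When the access frequency of the current proxy IP exceeds 100 accesses per second, the current proxy IP is detected as satisfying the preset proxy replacement condition and a proxy IP is selected from the proxy IP pool. This effectively solves the IP restriction problem caused by the same proxy IP crawling data repeatedly at high speed.
3) the number of accesses of the current proxy IP exceeds a preset access count threshold;
The preset access count threshold may be 200. When the number of accesses of the current proxy IP exceeds 200, the current proxy IP is detected as satisfying the preset proxy replacement condition and a proxy IP is selected from the proxy IP pool. This effectively solves the IP restriction problem caused by the same proxy IP crawling a large amount of data.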
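The three replacement conditions can be combined in a small predicate, sketched here in Python. The type `ProxyUsage`, its field names, and the function name are hypothetical; the defaults are the example thresholds from this embodiment (10 minutes, 100 accesses per second, 200 accesses), and the average rate over the whole session is used as a simple stand-in for the access frequency.

```python
from dataclasses import dataclass


@dataclass
class ProxyUsage:
    """Hypothetical usage counters for the proxy IP currently in use."""

    elapsed_s: float    # how long this proxy IP has been in use, in seconds
    request_count: int  # how many accesses it has made so far


def should_replace(
    usage: ProxyUsage,
    max_elapsed_s: float = 600.0,   # 10 minutes
    max_rate_per_s: float = 100.0,  # 100 accesses per second
    max_requests: int = 200,        # 200 accesses
) -> bool:
    # Average access frequency since this proxy IP was put into use.
    rate = usage.request_count / usage.elapsed_s if usage.elapsed_s > 0 else 0.0
    # Any single condition being met is enough to trigger a replacement.
    return (
        usage.elapsed_s > max_elapsed_s
        or rate > max_rate_per_s
        or usage.request_count > max_requests
    )
```

A crawler loop would typically call `should_replace` before each request and, when it returns `True`, pick a new proxy IP from the whitelist and reset the counters.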
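105: Use the selected proxy IP as a new proxy IP for data crawling.
In this embodiment, the current proxy IP is replaced with the proxy IP selected from the whitelist in the proxy IP pool, and the selected proxy IP is used as the new proxy IP for data crawling. As long as the current proxy IP satisfies the preset proxy replacement condition during data crawling, the proxy IP is replaced automatically, which eliminates the trouble of frequent manual replacement and makes crawling more efficient.
In summary, the web crawler method described in the present application stores the acquired proxy IPs in a preset proxy IP pool; verifies each proxy IP in the proxy IP pool one by one to determine its validity; records the proxy IPs determined to be valid in the whitelist in the proxy IP pool and the proxy IPs determined to be invalid in the blacklist in the proxy IP pool; selects a proxy IP from the whitelist when it is detected that the current proxy IP satisfies the preset proxy replacement condition; and uses the selected proxy IP as the new proxy IP for data crawling. This effectively solves the IP restriction problem that arises when the same proxy IP crawls data for a long time, at high frequency, or in large volume.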
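Embodiment 2
FIG. 2 is a flowchart of the web crawler method provided in Embodiment 2 of the present application.
201: Store a plurality of proxy IPs, acquired every preset time period, in a preset proxy IP pool.
Step 201 in this embodiment is the same as step 101 in Embodiment 1 and is not described again here.
202: Verify each proxy IP in the proxy IP pool one by one to determine whether the acquired proxy IP has a first validity.
In this embodiment, a proxy IP undergoing first-validity verification is referred to as the proxy IP to be verified. The proxy IP to be verified is used to access a search engine (for example, Google or Baidu) to check whether a response is obtained from the search engine. If a response is obtained, the proxy IP to be verified has the first validity; if no response is obtained, it does not have the first validity. Obtaining a response from the search engine means that the proxy IP to be verified can crawl data from the accessed search engine, that is, its access is not restricted by that search engine.
When it is determined that a proxy IP in the proxy IP pool has the first validity, step 203 is performed; when it is determined that a proxy IP in the proxy IP pool does not have the first validity, step 205 is performed.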
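203: Determine, according to the access success rate and the access response time of the proxy IP, whether the proxy IP having the first validity has a second validity.
To obtain more effective proxy IPs that can complete access tasks, the access success rate and the access response time of a proxy IP may further be used as criteria for verifying whether the proxy IP is valid.
Determining whether the proxy IP having the first validity has the second validity according to its access success rate and access response time specifically includes:
1) using the proxy IP having the first validity to access multiple search engines multiple times, and calculating the access success rate and the access response time of the proxy IP having the first validity;
The access success rate is the ratio of the number of successful accesses to the total number of accesses within a preset time period. For example, if within a one-hour period the proxy IP having the first validity accesses search engines 100 times in total, of which 97 accesses succeed and 3 fail, the access success rate of the proxy IP is calculated as 97/100 = 97%.
The access response time is the time from when the access request is issued until the access response is received. For example, if the proxy IP having the first validity issues an access request at 9:55:54 and receives the access response at 9:55:55, its access response time is calculated as 1 second.
2) determining whether the access success rate of the proxy IP having the first validity is greater than a preset access success rate threshold, and determining whether its access response time is less than a preset access response time threshold;
The preset access success rate threshold may be, for example, 80%. The preset access response time threshold may be, for example, 1 second.
3) when the access success rate of the proxy IP having the first validity is greater than the preset access success rate threshold and its access response time is less than the preset access response time threshold, determining that the proxy IP has the second validity; when its access success rate is less than or equal to the preset access success rate threshold, or its access response time is greater than or equal to the preset access response time threshold, determining that the proxy IP having the first validity does not have the second validity.
In this embodiment, it is first determined whether the proxy IP to be verified can access a search engine; when it can, the access success rate and the access response time are then used to determine whether the proxy IP has a higher degree of validity. In this way, it is possible not only to determine that the proxy IP is valid, but also to further determine its quality. The higher the access success rate and the faster the access response time, the better the quality of the proxy IP; the lower the access success rate and the slower the access response time, the worse the quality of the proxy IP.
When it is determined that the proxy IP having the first validity has the second validity, step 204 is performed; when it is determined that the proxy IP having the first validity does not have the second validity, step 205 is performed.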
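204: Record the proxy IPs determined to have the second validity in a whitelist in the proxy IP pool.
In this embodiment, a whitelist is set in the proxy IP pool in advance; the whitelist records the proxy IPs in the proxy IP pool that are determined to have the second validity.
The access success rate and the access response time of each proxy IP are recorded in the whitelist so that proxy IPs can be selected in a targeted manner later. For example, preferentially selecting a proxy IP with a higher access success rate and/or a faster access response time makes data crawling more efficient and increases the amount of data crawled.
Further, the method may also include: assigning one of a plurality of effective levels to each proxy IP according to its access success rate and access response time, and recording the effective levels and their corresponding proxy IPs in the whitelist.
The plurality of effective levels may include, but are not limited to, a first effective level, a second effective level, and a third effective level. The first effective level corresponds to a first access success rate and a first response time, the second effective level corresponds to a second access success rate and a second response time, the third effective level corresponds to a third access success rate and a third response time, and so on. The first effective level is the highest and the third effective level is the lowest. The present application does not limit the number of effective levels; two or more may be set according to actual needs.
Specifically, a proxy IP whose access success rate is greater than a preset first access success rate (for example, 95%) and whose access response time is less than a preset first access response time (for example, 0.5 seconds) is taken as a proxy IP of the first effective level; a proxy IP whose access success rate is less than the preset first access success rate but greater than a preset second access success rate (for example, 90%), and whose access response time is greater than the preset first access response time but less than a preset second access response time (for example, 1 second), is taken as a proxy IP of the second effective level; and a proxy IP whose access success rate is less than the preset second access success rate and whose access response time is greater than the preset second access response time is taken as a proxy IP of the third effective level.
In this embodiment, the effective level of a proxy IP is determined according to its access success rate and access response time, so that when a proxy IP is selected later, one can quickly be chosen from the proxy IPs of the first effective level for data crawling.
Further, the method may also include: recording, in the whitelist, the types of search engine accessed by each proxy IP having the second validity.
The types of search engine that each proxy IP can access are recorded in the whitelist. For example, a given proxy IP may be able to access only Baidu, or only Sogou, or it may be able to access any search engine.
In this embodiment, the types of search engine accessed by each proxy IP are recorded in the whitelist so that, when a proxy IP is selected later, the replacement can be chosen according to the type of search engine the current proxy IP is accessing. For example, if the current proxy IP is accessing Baidu and needs to be replaced, a proxy IP can be selected for data crawling from the proxy IPs recorded as able to access Baidu.
In this embodiment, the whitelist also records each proxy IP having the second validity together with its access success rate, access response time, types of search engine accessed, acquisition time, number of accesses, and so on.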
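205: Record, in a blacklist in the proxy IP pool, the proxy IPs determined not to have the first validity and the proxy IPs that have the first validity but not the second validity.
In this embodiment, a blacklist is set in the proxy IP pool in advance; the blacklist records the proxy IPs in the proxy IP pool that are determined not to have the first validity and those that have the first validity but not the second validity.
Further, recording these proxy IPs in the blacklist in the proxy IP pool may also include:
1) recording the proxy IPs that have the first validity but not the second validity in a first blacklist sublist of the blacklist, and recording the proxy IPs that do not have the first validity in a second blacklist sublist of the blacklist.
2) using each proxy IP that does not have the first validity to access multiple search engines multiple times, and calculating its access failure rate.
3) determining whether the access failure rate of the proxy IP that does not have the first validity is less than a preset access failure rate threshold.
The preset access failure rate threshold may be, for example, 50%.
4) when the access failure rate of the proxy IP that does not have the first validity is less than the preset access failure rate threshold, confirming the proxy IP as a temporarily invalid proxy IP and recording it in the first blacklist sublist; when its access failure rate is greater than the preset access failure rate threshold, confirming the proxy IP as a permanently invalid proxy IP and recording it in the second blacklist sublist.
In this embodiment, a proxy IP that cannot successfully access the search engine at the first verification is confirmed as not having the first validity; however, if subsequent repeated verification shows that its access failure rate is less than the preset failure rate threshold, it is regarded as a temporarily invalid proxy IP and is moved from the second blacklist sublist to the first blacklist sublist. The reason is that a proxy IP does not work well at all times: the proxy IP may be unstable and therefore unusable for a certain period; the verification may fail because of problems with the search engine itself (for example, degraded search engine performance) or a slow network, even though the proxy IP can still be used later; or the proxy IP may have been banned for frequent access, with the ban possibly lifted after a period of time.
Secondly, only when repeated verification shows that the access failure rate of a proxy IP that does not have the first validity is greater than the preset access failure rate threshold is that proxy IP regarded as truly invalid. This avoids treating a proxy IP as permanently invalid after a single failed verification, which would waste proxy IPs. Subsequently, if none of the proxy IPs having the second validity in the whitelist can be used, a proxy IP may be selected from the first blacklist sublist.
In addition, because the second blacklist sublist records the proxy IPs that do not have the first validity and whose access failure rate is greater than the preset access failure rate threshold, proxy IPs obtained later from free websites or purchased from a third-party service provider can be matched directly against the permanently invalid proxy IPs in the second blacklist sublist. This quickly determines whether a newly obtained proxy IP is permanently invalid and avoids verifying every proxy IP newly recorded in the proxy pool one by one, which saves time.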
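206: When it is detected that the current proxy IP satisfies the preset proxy replacement condition, select a proxy IP from the whitelist in the proxy IP pool according to a preset proxy selection rule.
In this embodiment, the proxy selection rule is set in advance and includes one or a combination of the following:
1) selecting, according to the type of search engine currently being accessed, from the proxy IPs in the whitelist that correspond to that search engine type;
For example, the whitelist records the search engine types accessed by each proxy IP; if the search engine currently being accessed is Baidu, a proxy IP is selected from the proxy IPs in the whitelist that correspond to Baidu.
2) selecting according to the number of crawls of each proxy IP recorded in the whitelist;
For example, the whitelist records the number of crawls of each proxy IP; the proxy IPs are sorted by number of crawls, and a proxy IP with a small number of crawls is selected.
3) selecting according to the number of search engine types accessed by each proxy IP, as recorded in the whitelist;
For example, the whitelist records the search engine types accessed by each proxy IP; the number of search engine types accessed by each proxy IP is counted and sorted, and a proxy IP that accesses a larger number of search engine types is selected.
4) selecting according to the acquisition time of each proxy IP recorded in the whitelist;
For example, the whitelist records the acquisition time of each proxy IP, and the proxy IP most recently recorded in the whitelist is selected.
5) selecting a proxy IP after a preset delay.
When it is detected that the current proxy IP satisfies the preset proxy replacement condition, a proxy IP is selected from the whitelist according to rules 1) to 4) above after a preset time period has elapsed.
207: Use the selected proxy IP as a new proxy IP for data crawling.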
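Preferably, the web crawler method may further include: providing a user option that lets the user add, delete, or modify entries according to actual needs. This allows the whitelist to be updated promptly, ensures that the proxy IPs in the whitelist are all available and effective, and eliminates the impact of proxy changes on data crawling.
Preferably, the web crawler method may further include: cascading a plurality of proxy devices (user agents) such that the proxy devices can communicate with one another, with the preset proxy IP pool stored in the local memory of each proxy device; when the access time of one user agent exceeds a preset time, another user agent takes over data crawling.
Preferably, the web crawler method may further include: storing the constructed proxy IP pool on a preset server, which allocates IPs to the proxy devices for data crawling. The proxy devices may include mobile devices, web devices, and the like.
Preferably, the web crawler method may further include: registering multiple accounts, so that when the same IP is used, data can be crawled by switching between different accounts.
It should be noted that the process of obtaining IPs from the proxy IP pool and the crawling of data by multiple proxy devices (user agents) provided in the present application can proceed in parallel; that is, the proxy IP can be replaced at the same time as the proxy device is replaced.
In summary, the web crawler method described in the present application, first, continuously obtains proxy IPs and updates the proxy IP pool in real time to ensure that the pool contains enough high-quality proxies; second, further determines, according to the access success rate and access response time of each proxy IP, whether the proxy IP has the second validity, thereby determining its degree of validity; third, further subdivides the proxy IPs determined to have the second validity into effective levels; and fourth, verifies multiple times the proxy IPs whose first access failed and subdivides them into permanently invalid and temporarily invalid, which avoids wasting proxy IPs. The method thus not only solves the IP restriction problem that arises when the same proxy IP crawls data for a long time, at high frequency, or in large volume, but also ensures that the most effective proxy IP is selected for data crawling.
Regarding the above description of the steps in the flows of FIG. 1 and FIG. 2, the execution order in the flowcharts may be changed and certain steps may be omitted according to different requirements.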
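The functional modules and hardware structure of the terminal implementing the above web crawler method are described below with reference to FIGS. 3 to 5.
Embodiment 3
FIG. 3 is a functional block diagram of a preferred embodiment of the web crawler device of the present application.
In some embodiments, the web crawler device 30 runs in a terminal. The web crawler device 30 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the web crawler device 30 may be stored in a memory and executed by at least one processor to perform the web crawler method (see FIG. 1 and its associated description).
In this embodiment, the web crawler device 30 of the terminal may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a storage module 301, a determining module 302, a recording module 303, a selecting module 304, and a crawling module 305. A module as referred to in the present application is a series of computer-readable instruction segments that can be executed by at least one processor and can perform a fixed function, and that are stored in the memory.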
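The storage module 301 is configured to store a plurality of proxy IPs, acquired every preset time period, in a preset proxy IP pool.
In this embodiment, a proxy IP pool is set up in advance in a local database, and the acquired proxy IPs are added to the proxy IP pool for use by the crawler. Proxy IPs can be found on websites on the Internet that provide proxy IPs; the specific lists can be obtained manually or automatically by a separate small crawler. Multiple proxy IPs can also be purchased from a third-party service provider and added to the preset proxy IP pool.
In this embodiment, the proxy information of a proxy IP may include, but is not limited to, the IP address, the name, and the port.
In this embodiment, proxy IPs may be obtained every preset time period (for example, at intervals of one day or one week), either manually or automatically by a separate small crawler from websites on the Internet that provide proxy IPs, or by purchasing multiple proxy IPs from a third-party service provider. The obtained proxy IPs are stored in the proxy IP pool, which ensures that the pool contains enough IPs; because proxy IPs are obtained continuously, the proxy IP pool is kept up to date for use by the crawler.
The determining module 302 is configured to verify each proxy IP in the proxy IP pool one by one to determine the validity of the acquired proxy IPs.
In this embodiment, a proxy IP undergoing validity verification is referred to as the proxy IP to be verified. The proxy IP to be verified is used to access a search engine (for example, Google or Baidu) for the first time to check whether a response is obtained from the search engine. If the first access obtains a response, the proxy IP to be verified is valid; if the first access obtains no response, the proxy IP to be verified is invalid. Obtaining a response from the search engine means that the proxy IP to be verified can crawl data from the accessed search engine on its first access, that is, its first access is not restricted by that search engine.
The recording module 303 is configured to record the proxy IPs determined to be valid in a whitelist in the proxy IP pool, and to record the proxy IPs determined to be invalid in a blacklist in the proxy IP pool.
In this embodiment, a whitelist and a blacklist are set in the proxy IP pool in advance. The whitelist records the proxy IPs in the proxy IP pool that are determined to be valid, and the blacklist records the proxy IPs in the proxy IP pool that are determined to be invalid.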
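The selecting module 304 is configured to select a proxy IP from the whitelist in the proxy IP pool when it is detected that the current proxy IP satisfies a preset proxy replacement condition.
In this embodiment, the proxy replacement condition is set in advance; when it is detected that the current proxy IP satisfies the preset proxy replacement condition, a proxy IP is selected from the whitelist in the proxy IP pool.
In this embodiment, the current proxy IP satisfies the preset proxy replacement condition in one or a combination of the following cases:
1) the access time of the current proxy IP exceeds a preset access time threshold;
The preset access time threshold may be 10 minutes. When the access time of the current proxy IP exceeds 10 minutes, the current proxy IP is detected as satisfying the preset proxy replacement condition and a proxy IP is selected from the proxy IP pool. This effectively solves the IP restriction problem caused by the same proxy IP crawling data for a long time.
2) the access frequency of the current proxy IP exceeds a preset access frequency threshold;
The preset access frequency threshold may be 100 accesses per second. When the access frequency of the current proxy IP exceeds 100 accesses per second, the current proxy IP is detected as satisfying the preset proxy replacement condition and a proxy IP is selected from the proxy IP pool. This effectively solves the IP restriction problem caused by the same proxy IP crawling data repeatedly at high speed.
3) the number of accesses of the current proxy IP exceeds a preset access count threshold;
The preset access count threshold may be 200. When the number of accesses of the current proxy IP exceeds 200, the current proxy IP is detected as satisfying the preset proxy replacement condition and a proxy IP is selected from the proxy IP pool. This effectively solves the IP restriction problem caused by the same proxy IP crawling a large amount of data.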
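The crawling module 305 is configured to use the selected proxy IP as a new proxy IP for data crawling.
In this embodiment, the current proxy IP is replaced with the proxy IP selected from the whitelist in the proxy IP pool, and the selected proxy IP is used as the new proxy IP for data crawling. As long as the current proxy IP satisfies the preset proxy replacement condition during data crawling, the proxy IP is replaced automatically, which eliminates the trouble of frequent manual replacement and makes crawling more efficient.
In summary, in the web crawler device 30 described in the present application, the storage module 301 stores the acquired proxy IPs in a preset proxy IP pool; the determining module 302 verifies each proxy IP in the proxy IP pool one by one to determine its validity; the recording module 303 records the proxy IPs determined to be valid in the whitelist in the proxy IP pool and the proxy IPs determined to be invalid in the blacklist in the proxy IP pool; when the selecting module 304 detects that the current proxy IP satisfies the preset proxy replacement condition, it selects a proxy IP from the whitelist in the proxy IP pool; and the crawling module 305 uses the selected proxy IP as the new proxy IP for data crawling. This effectively solves the IP restriction problem that arises when the same proxy IP crawls data for a long time, at high frequency, or in large volume.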
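Embodiment 4
FIG. 4 is a functional block diagram of a preferred embodiment of the web crawler device of the present application.
In some embodiments, the web crawler device 40 runs in a terminal. The web crawler device 40 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the web crawler device 40 may be stored in a memory and executed by at least one processor to perform the web crawler method (see FIG. 2 and its associated description).
In this embodiment, the web crawler device 40 of the terminal may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a storage module 401, a first determining module 402, a second determining module 403, a first recording module 404, a second recording module 405, a selecting module 406, and a crawling module 407.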
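The storage module 401 is configured to store a plurality of proxy IPs, acquired every preset time period, in a preset proxy IP pool.
In this embodiment, proxy IPs may be obtained every preset time period (for example, at intervals of one day or one week), either manually or automatically by a separate small crawler from websites on the Internet that provide proxy IPs, or by purchasing multiple proxy IPs from a third-party service provider. The obtained proxy IPs are stored in the proxy IP pool, which ensures that the pool contains enough IPs; because proxy IPs are obtained continuously, the proxy IP pool is kept up to date for use by the crawler.
The first determining module 402 is configured to verify each proxy IP in the proxy IP pool one by one to determine whether the acquired proxy IP has a first validity.
In this embodiment, a proxy IP undergoing first-validity verification is referred to as the proxy IP to be verified. The proxy IP to be verified is used to access a search engine (for example, Google or Baidu) to check whether a response is obtained from the search engine. If a response is obtained, the proxy IP to be verified has the first validity; if no response is obtained, it does not have the first validity. Obtaining a response from the search engine means that the proxy IP to be verified can crawl data from the accessed search engine, that is, its access is not restricted by that search engine.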
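The second determining module 403 is configured to: when the first determining module 402 determines that a proxy IP in the proxy IP pool has the first validity, determine, according to the access success rate and the access response time of the proxy IP, whether the proxy IP having the first validity has a second validity.
To obtain more effective proxy IPs that can complete access tasks, the access success rate and the access response time of a proxy IP may further be used as criteria for verifying whether the proxy IP is valid.
The determining, by the second determining module 403, of whether the proxy IP having the first validity has the second validity according to its access success rate and access response time specifically includes:
1) using the proxy IP having the first validity to access multiple search engines multiple times, and calculating the access success rate and the access response time of the proxy IP having the first validity;
The access success rate is the ratio of the number of successful accesses to the total number of accesses within a preset time period. For example, if within a one-hour period the proxy IP having the first validity accesses search engines 100 times in total, of which 97 accesses succeed and 3 fail, the access success rate of the proxy IP is calculated as 97/100 = 97%.
The access response time is the time from when the access request is issued until the access response is received. For example, if the proxy IP having the first validity issues an access request at 9:55:54 and receives the access response at 9:55:55, its access response time is calculated as 1 second.
2) determining whether the access success rate of the proxy IP having the first validity is greater than a preset access success rate threshold, and determining whether its access response time is less than a preset access response time threshold;
The preset access success rate threshold may be, for example, 80%. The preset access response time threshold may be, for example, 1 second.
3) when the access success rate of the proxy IP having the first validity is greater than the preset access success rate threshold and its access response time is less than the preset access response time threshold, determining that the proxy IP has the second validity; when its access success rate is less than or equal to the preset access success rate threshold, or its access response time is greater than or equal to the preset access response time threshold, determining that the proxy IP having the first validity does not have the second validity.
In this embodiment, it is first determined whether the proxy IP to be verified can access a search engine; when it can, the access success rate and the access response time are then used to determine whether the proxy IP has a higher degree of validity. In this way, it is possible not only to determine that the proxy IP is valid, but also to further determine its quality. The higher the access success rate and the faster the access response time, the better the quality of the proxy IP; the lower the access success rate and the slower the access response time, the worse the quality of the proxy IP.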
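The first recording module 404 is configured to record proxy IPs determined to have the second validity in a whitelist in the proxy IP pool.
In this embodiment, a whitelist is set up in advance in the proxy IP pool; the whitelist records the proxy IPs in the proxy IP pool that are determined to have the second validity.
The access success rate and access response time of each proxy IP are recorded in the whitelist so that proxy IPs can later be selected in a targeted manner. For example, preferentially selecting proxy IPs with a higher access success rate and/or a shorter access response time makes data crawling more efficient and yields more crawled data.
Further, the first recording module 404 is also configured to set multiple validity levels for proxy IPs according to their access success rate and access response time, and to record the validity levels and their corresponding proxy IPs in the whitelist.
The validity levels may include, but are not limited to, a first validity level, a second validity level, and a third validity level. The first validity level corresponds to a first access success rate and a first response time, the second validity level corresponds to a second access success rate and a second response time, the third validity level corresponds to a third access success rate and a third response time, and so on. The first validity level is the highest and the third validity level is the lowest. This application does not limit the number of validity levels; two or more may be set according to actual needs.
Specifically, a proxy IP whose access success rate is greater than a preset first access success rate (for example, 95%) and whose access response time is less than a preset first access response time (for example, 0.5 seconds) is taken as a proxy IP of the first validity level; a proxy IP whose access success rate is less than the preset first access success rate but greater than a preset second access success rate (for example, 90%), and whose access response time is greater than the preset first access response time but less than a preset second access response time (for example, 1 second), is taken as a proxy IP of the second validity level; and a proxy IP whose access success rate is less than the preset second access success rate and whose access response time is greater than the preset second access response time is taken as a proxy IP of the third validity level.
In this embodiment, determining the validity level of a proxy IP from its access success rate and access response time makes it possible, when a proxy IP is later selected, to quickly pick one from the proxy IPs of the first validity level for data crawling.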
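A small Python sketch of the level assignment follows. It assumes that a shorter response time is better, consistent with the statement that a higher success rate and a faster response indicate better quality. The function name is hypothetical, the defaults are the example values from the text (95% and 0.5 seconds; 90% and 1 second), and placing proxies with mixed metrics (for example, a high success rate but a slow response) in the lower of the two levels is an assumption, since the text does not cover that case.

```python
def validity_level(success_rate, response_time,
                   rate_1=0.95, time_1=0.5, rate_2=0.90, time_2=1.0):
    """Return 1, 2 or 3, where 1 is the highest validity level."""
    if success_rate > rate_1 and response_time < time_1:
        return 1
    # Assumption: anything that misses level 1 but still clears both
    # second-level thresholds is treated as level 2.
    if success_rate > rate_2 and response_time < time_2:
        return 2
    return 3
```

For example, a proxy with a 97% success rate and a 0.3 second response time is level 1, while one with 92% and 0.8 seconds is level 2.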
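Furthermore, the first recording module 404 is also configured to record in the whitelist the types of search engine visited by proxy IPs having the second validity.
The whitelist records the types of search engine each proxy IP is able to access; for example, the current proxy IP can access only Baidu, or only Sogou, or can access any search engine.
In this embodiment, recording the search engine types visited by each proxy IP in the whitelist allows a replacement proxy IP to be chosen in a targeted way according to the type of search engine the current proxy IP is visiting. For example, if the current proxy IP is visiting Baidu and later needs to be replaced, a proxy IP can be selected from the proxy IPs recorded for the search engine type Baidu and used for data crawling.
In this embodiment, the whitelist also records, for each proxy IP having the second validity, its access success rate, access response time, types of search engine visited, acquisition time, number of visits, and so on.
The second recording module 405 is configured to record, in a blacklist in the proxy IP pool, proxy IPs determined not to have the first validity and proxy IPs determined to have the first validity but not the second validity.
In this embodiment, a blacklist is set up in advance in the proxy IP pool; the blacklist records the proxy IPs in the proxy IP pool that are determined not to have the first validity, and those determined to have the first validity but not the second validity.
Further, the recording by the second recording module 405 of these proxy IPs in the blacklist in the proxy IP pool may also include:
1) Record proxy IPs determined to have the first validity but not the second validity in a first blacklist sublist of the blacklist, and record proxy IPs determined not to have the first validity in a second blacklist sublist of the blacklist.
2) Use a proxy IP that does not have the first validity to visit multiple search engines multiple times, and calculate the access failure rate of that proxy IP.
3) Judge whether the access failure rate of the proxy IP that does not have the first validity is less than a preset access failure rate threshold.
The preset access failure rate threshold may be, for example, 50%.
4) When the access failure rate of the proxy IP that does not have the first validity is less than the preset access failure rate threshold, confirm that proxy IP as a temporarily invalid proxy IP and record it in the first blacklist sublist; when its access failure rate is greater than the preset access failure rate threshold, confirm that proxy IP as a permanently invalid proxy IP and record it in the second blacklist sublist.
In this embodiment, a proxy IP that fails to access the search engine in the first verification is determined not to have the first validity. If subsequent repeated verification shows that its access failure rate is below the preset access failure rate threshold, it is regarded as a temporarily invalid proxy IP and moved from the second blacklist sublist to the first blacklist sublist. The reason is that a proxy IP does not work well at all times: it may be unstable and unusable for a period; or the verification may have failed because of a problem with the search engine itself (for example, degraded search engine performance) or a slow network, while the proxy IP can still be used later; or the proxy IP may have been banned for reasons such as frequent access, but the ban may be lifted after some time.
Moreover, a proxy IP that does not have the first validity is regarded as truly invalid only when repeated verification shows that its access failure rate is greater than the preset access failure rate threshold. This avoids mistaking a proxy IP for permanently invalid after a single failed verification and thereby wasting proxy IPs. Later, if none of the proxy IPs having the second validity in the whitelist can be used, proxy IPs may be selected from the first blacklist sublist.
In addition, because the second blacklist sublist records the proxy IPs that do not have the first validity and whose access failure rate is greater than the preset access failure rate threshold, proxy IPs later acquired from free websites or purchased from a third-party service provider can be matched directly against the permanently invalid proxy IPs in the second blacklist sublist to quickly determine whether a newly acquired proxy IP is permanently invalid. This avoids verifying every proxy IP newly recorded in the proxy IP pool one by one and saves time.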
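The split between temporarily and permanently invalid proxies can be sketched in Python as follows. The function name and return labels are hypothetical, and the default threshold is the 50% example from the text. The text only covers failure rates strictly below or strictly above the threshold; treating a rate exactly equal to the threshold as permanent is an assumption made here.

```python
def classify_failed_proxy(outcomes, max_failure_rate=0.50):
    """Classify a proxy that failed first-validity verification.

    `outcomes` is a list of booleans from repeated re-verification,
    where True means the visit succeeded.
    Returns "temporary" (first blacklist sublist) or
    "permanent" (second blacklist sublist).
    """
    if not outcomes:
        raise ValueError("at least one re-verification outcome is required")

    failure_rate = outcomes.count(False) / len(outcomes)
    # Assumption: a failure rate equal to the threshold counts as permanent.
    return "temporary" if failure_rate < max_failure_rate else "permanent"
```

A proxy that fails 3 of 10 retries would be kept as temporarily invalid and remain available as a fallback, while one that fails 9 of 10 would be recorded as permanently invalid.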
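The selection module 406 is configured to, when it detects that the current proxy IP meets a preset proxy replacement condition, select a proxy IP from the whitelist in the proxy IP pool according to preset proxy selection rules.
In this embodiment, proxy selection rules are set in advance; the preset proxy selection rules include one or a combination of the following:
1) Select from the proxy IPs in the whitelist that correspond to the type of search engine currently being visited;
For example, the whitelist records the search engine types visited by each proxy IP; if the search engine currently being visited is Baidu, a proxy IP is selected from the proxy IPs in the whitelist recorded as visiting Baidu.
2) Select according to the crawl counts of the proxy IPs recorded in the whitelist;
For example, the whitelist records the crawl count of each proxy IP; the proxy IPs are sorted by crawl count and a proxy IP with a low crawl count is selected.
3) Select according to the number of search engine types visited by the proxy IPs recorded in the whitelist;
For example, the whitelist records the search engine types visited by each proxy IP; the number of search engine types visited by each proxy IP is counted and sorted, and a proxy IP that visits a larger number of search engine types is selected.
4) Select according to the acquisition time of the proxy IPs recorded in the whitelist;
For example, the whitelist records the acquisition time of each proxy IP; the proxy IP most recently recorded in the whitelist is selected.
5) Select a proxy IP after a preset delay.
When it is detected that the current proxy IP meets the preset proxy replacement condition, wait for a preset period and then select a proxy IP from the whitelist according to rules 1) to 4) above.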
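One way rules 1) to 4) might be combined is sketched below in Python. The record type `WhitelistEntry`, the function `select_proxy`, and the tie-breaking order (fewest crawls first, then more search engine types, then most recent acquisition) are illustrative assumptions; the text allows any single rule or combination. Rule 5) is left to the caller, which could simply sleep for the preset delay before calling the function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WhitelistEntry:
    """Hypothetical whitelist record; field names are assumptions."""

    proxy: str
    engines: frozenset   # search engine types the proxy is recorded as visiting
    crawl_count: int     # how many times the proxy has been used for crawling
    acquired_at: float   # acquisition time as a Unix timestamp


def select_proxy(whitelist, current_engine, exclude=None):
    """Pick a replacement proxy for `current_engine`, or None if none fits."""
    # Rule 1: keep only proxies recorded for the engine being visited,
    # skipping the proxy that is being replaced.
    candidates = [
        entry for entry in whitelist
        if current_engine in entry.engines and entry.proxy != exclude
    ]
    if not candidates:
        return None
    # Rules 2 to 4 as a sort key: fewest crawls, then most engine types,
    # then most recently acquired.
    return min(
        candidates,
        key=lambda e: (e.crawl_count, -len(e.engines), -e.acquired_at),
    )
```

Returning `None` when no whitelisted proxy matches leaves room for the fallback described earlier, in which proxies are taken from the first blacklist sublist.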
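The crawling module 407 is configured to use the selected proxy IP as the new proxy IP for data crawling.
Preferably, the whitelist may also provide user options so that users can add, delete, or modify entries according to their actual needs. This keeps the whitelist up to date, ensures that the proxy IPs in the whitelist are all usable and valid proxies, and eliminates the effect of proxy changes on data crawling.
In summary, the web crawler apparatus 40 described in this application continuously acquires proxy IPs and updates the proxy IP pool in real time, ensuring that the pool contains enough good proxies. Second, it further determines according to the access success rate and access response time whether a proxy IP has the second validity, thereby determining the proxy IP's degree of validity. Third, proxy IPs determined to have the second validity are further subdivided into validity levels. Fourth, proxy IPs that fail on the first access are verified multiple times and further divided into permanently invalid and temporarily invalid, which avoids wasting proxy IPs.
Embodiment 5
FIG. 5 is a schematic diagram of the terminal provided in Embodiment 5 of this application.
The terminal 5 includes: a memory 51, at least one processor 52, computer-readable instructions 53 stored in the memory 51 and executable on the at least one processor 52, and at least one communication bus 54.
When executing the computer-readable instructions 53, the at least one processor 52 implements the steps in the above embodiments of the web crawling method.
The terminal 5 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. Those skilled in the art will understand that FIG. 5 is only an example of the terminal 5 and does not limit it; the terminal may include more or fewer components than shown, combine certain components, or have different components. For example, the terminal 5 may also include input/output devices, network access devices, buses, and the like.
The at least one processor 52 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 52 may be a microprocessor or any conventional processor. The processor 52 is the control center of the terminal 5 and connects the parts of the entire terminal 5 through various interfaces and lines.
The memory 51 may be used to store the computer-readable instructions 53 and/or modules/units. The processor 52 implements the various functions of the terminal 5 by running or executing the computer-readable instructions and/or modules/units stored in the memory 51 and by calling data stored in the memory 51. The memory 51 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the terminal 5 (such as audio data or a phone book). In addition, the memory 51 may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
If the modules/units integrated in the terminal 5 are implemented as software functional units and sold or used as independent products, they may be stored in a non-volatile readable storage medium. On this understanding, all or part of the processes of the methods in the above embodiments may also be completed by computer-readable instructions instructing the relevant hardware. The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, or some intermediate form. The non-volatile readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, software distribution media, and so on. Note that the content covered by the non-volatile readable medium may be expanded or narrowed as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, under legislation and patent practice, the non-volatile readable medium does not include electrical carrier signals and telecommunication signals.
In addition, the functional units in the embodiments of this application may be integrated in the same processing unit, may each exist physically on their own, or two or more units may be integrated in the same unit. The integrated unit may be implemented in hardware, or in hardware plus software functional modules.
Furthermore, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or apparatuses recited in a system claim may also be implemented by a single unit or apparatus through software or hardware. Words such as "first" and "second" denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments only illustrate the technical solutions of this application and do not limit them. Although this application has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or equivalently replaced without departing from their spirit and scope.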

Claims (20)

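  1. A web crawling method, characterized in that the method comprises:
    storing, in a preset proxy IP pool, a plurality of proxy IPs acquired at preset intervals;
    verifying each proxy IP in the proxy IP pool one by one to judge the validity of the acquired proxy IPs;
    recording proxy IPs determined to be valid in a whitelist in the proxy IP pool, and recording proxy IPs determined to be invalid in a blacklist in the proxy IP pool;
    when it is detected that the current proxy IP meets a preset proxy replacement condition, selecting a proxy IP from the whitelist in the proxy IP pool; and
    using the selected proxy IP as the new proxy IP for data crawling.
  2. The method of claim 1, characterized in that verifying each proxy IP in the proxy IP pool one by one to judge the validity of the acquired proxy IPs comprises:
    verifying each proxy IP in the proxy IP pool one by one to judge whether the acquired proxy IP has first validity;
    judging, according to the access success rate and access response time of the proxy IP, whether the proxy IP having the first validity has second validity;
    recording proxy IPs determined to have the second validity in the whitelist in the proxy IP pool;
    recording, in the blacklist in the proxy IP pool, proxy IPs determined not to have the first validity and proxy IPs determined to have the first validity but not the second validity.
  3. The method of claim 2, characterized in that judging, according to the access success rate and access response time of the proxy IP, whether the proxy IP having the first validity has second validity comprises:
    using the proxy IP having the first validity to visit multiple search engines multiple times, and calculating the access success rate and access response time of the proxy IP having the first validity;
    judging whether the access success rate of the proxy IP having the first validity is greater than a preset access success rate threshold, and at the same time judging whether the access response time of the proxy IP having the first validity is less than a preset access response time threshold;
    when the access success rate of the proxy IP having the first validity is greater than the preset access success rate threshold and its access response time is less than the preset access response time threshold, determining that the proxy IP having the first validity has second validity;
    when the access success rate of the proxy IP having the first validity is less than or equal to the preset access success rate threshold, or its access response time is greater than or equal to the preset access response time threshold, determining that the proxy IP having the first validity does not have second validity.
  4. The method of claim 2, characterized in that recording, in the blacklist in the proxy IP pool, proxy IPs determined not to have the first validity and proxy IPs determined to have the first validity but not the second validity comprises:
    recording proxy IPs determined to have the first validity but not the second validity in a first blacklist sublist of the blacklist, and recording proxy IPs determined not to have the first validity in a second blacklist sublist of the blacklist;
    using a proxy IP that does not have the first validity to visit multiple search engines multiple times, and calculating the access failure rate of the proxy IP that does not have the first validity;
    judging whether the access failure rate of the proxy IP that does not have the first validity is less than a preset access failure rate;
    when the access failure rate of the proxy IP that does not have the first validity is less than the preset access failure rate, confirming the corresponding proxy IP as a temporarily invalid proxy IP and recording it in the first blacklist sublist; or
    when the access failure rate of the proxy IP that does not have the first validity is greater than the preset access failure rate, confirming the corresponding proxy IP as a permanently invalid proxy IP and recording it in the second blacklist sublist.
  5. The method of any one of claims 1 to 4, characterized in that the method further comprises:
    setting multiple validity levels for proxy IPs according to their access success rate and access response time, and recording the validity levels and their corresponding proxy IPs in the whitelist.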
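  6. The method of claim 5, characterized in that setting multiple validity levels for proxy IPs comprises:
    taking a proxy IP whose access success rate is greater than a preset first access success rate and whose access response time is less than a preset first access response time as a proxy IP of a first validity level;
    taking a proxy IP whose access success rate is less than the preset first access success rate but greater than a preset second access success rate, and whose access response time is greater than the preset first access response time but less than a preset second access response time, as a proxy IP of a second validity level;
    taking a proxy IP whose access success rate is less than the preset second access success rate and whose access response time is greater than the preset second access response time as a proxy IP of a third validity level.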
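  7. The method of claim 1, characterized in that the proxy IP is selected from the whitelist in the proxy IP pool according to preset proxy selection rules, the preset proxy selection rules comprising one or a combination of the following:
    selecting from the proxy IPs in the whitelist that correspond to the type of search engine currently being visited; selecting according to the crawl counts of the proxy IPs recorded in the whitelist; selecting according to the number of search engine types visited by the proxy IPs recorded in the whitelist; selecting according to the acquisition time of the proxy IPs recorded in the whitelist; selecting a proxy IP after a preset delay.
  8. The method of claim 1, characterized in that the method further comprises:
    providing user options for the whitelist, and updating the whitelist according to the user's add, delete, or modify operations.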
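  9. A terminal, characterized in that the terminal comprises a processor and a memory, the processor implementing the following steps when executing computer-readable instructions stored in the memory:
    storing, in a preset proxy IP pool, a plurality of proxy IPs acquired at preset intervals;
    verifying each proxy IP in the proxy IP pool one by one to judge the validity of the acquired proxy IPs;
    recording proxy IPs determined to be valid in a whitelist in the proxy IP pool, and recording proxy IPs determined to be invalid in a blacklist in the proxy IP pool;
    when it is detected that the current proxy IP meets a preset proxy replacement condition, selecting a proxy IP from the whitelist in the proxy IP pool; and
    using the selected proxy IP as the new proxy IP for data crawling.
  10. The terminal of claim 9, characterized in that verifying each proxy IP in the proxy IP pool one by one to judge the validity of the acquired proxy IPs comprises:
    verifying each proxy IP in the proxy IP pool one by one to judge whether the acquired proxy IP has first validity;
    judging, according to the access success rate and access response time of the proxy IP, whether the proxy IP having the first validity has second validity;
    recording proxy IPs determined to have the second validity in the whitelist in the proxy IP pool;
    recording, in the blacklist in the proxy IP pool, proxy IPs determined not to have the first validity and proxy IPs determined to have the first validity but not the second validity.
  11. The terminal of claim 10, characterized in that judging, according to the access success rate and access response time of the proxy IP, whether the proxy IP having the first validity has second validity comprises:
    using the proxy IP having the first validity to visit multiple search engines multiple times, and calculating the access success rate and access response time of the proxy IP having the first validity;
    judging whether the access success rate of the proxy IP having the first validity is greater than a preset access success rate threshold, and at the same time judging whether the access response time of the proxy IP having the first validity is less than a preset access response time threshold;
    when the access success rate of the proxy IP having the first validity is greater than the preset access success rate threshold and its access response time is less than the preset access response time threshold, determining that the proxy IP having the first validity has second validity;
    when the access success rate of the proxy IP having the first validity is less than or equal to the preset access success rate threshold, or its access response time is greater than or equal to the preset access response time threshold, determining that the proxy IP having the first validity does not have second validity.
  12. The terminal of claim 10, characterized in that recording, in the blacklist in the proxy IP pool, proxy IPs determined not to have the first validity and proxy IPs determined to have the first validity but not the second validity comprises:
    recording proxy IPs determined to have the first validity but not the second validity in a first blacklist sublist of the blacklist, and recording proxy IPs determined not to have the first validity in a second blacklist sublist of the blacklist;
    using a proxy IP that does not have the first validity to visit multiple search engines multiple times, and calculating the access failure rate of the proxy IP that does not have the first validity;
    judging whether the access failure rate of the proxy IP that does not have the first validity is less than a preset access failure rate;
    when the access failure rate of the proxy IP that does not have the first validity is less than the preset access failure rate, confirming the corresponding proxy IP as a temporarily invalid proxy IP and recording it in the first blacklist sublist; or
    when the access failure rate of the proxy IP that does not have the first validity is greater than the preset access failure rate, confirming the corresponding proxy IP as a permanently invalid proxy IP and recording it in the second blacklist sublist.
  13. The terminal of any one of claims 9 to 12, characterized in that the processor further implements the following step when executing the computer-readable instructions:
    setting multiple validity levels for proxy IPs according to their access success rate and access response time, and recording the validity levels and their corresponding proxy IPs in the whitelist.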
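  14. The terminal of claim 13, characterized in that setting multiple validity levels for proxy IPs comprises:
    taking a proxy IP whose access success rate is greater than a preset first access success rate and whose access response time is less than a preset first access response time as a proxy IP of a first validity level;
    taking a proxy IP whose access success rate is less than the preset first access success rate but greater than a preset second access success rate, and whose access response time is greater than the preset first access response time but less than a preset second access response time, as a proxy IP of a second validity level;
    taking a proxy IP whose access success rate is less than the preset second access success rate and whose access response time is greater than the preset second access response time as a proxy IP of a third validity level.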
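  15. A non-volatile readable storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement the following steps:
    storing, in a preset proxy IP pool, a plurality of proxy IPs acquired at preset intervals;
    verifying each proxy IP in the proxy IP pool one by one to judge the validity of the acquired proxy IPs;
    recording proxy IPs determined to be valid in a whitelist in the proxy IP pool, and recording proxy IPs determined to be invalid in a blacklist in the proxy IP pool;
    when it is detected that the current proxy IP meets a preset proxy replacement condition, selecting a proxy IP from the whitelist in the proxy IP pool; and
    using the selected proxy IP as the new proxy IP for data crawling.
  16. The storage medium of claim 15, characterized in that verifying each proxy IP in the proxy IP pool one by one to judge the validity of the acquired proxy IPs comprises:
    verifying each proxy IP in the proxy IP pool one by one to judge whether the acquired proxy IP has first validity;
    judging, according to the access success rate and access response time of the proxy IP, whether the proxy IP having the first validity has second validity;
    recording proxy IPs determined to have the second validity in the whitelist in the proxy IP pool;
    recording, in the blacklist in the proxy IP pool, proxy IPs determined not to have the first validity and proxy IPs determined to have the first validity but not the second validity.
  17. The storage medium of claim 16, characterized in that judging, according to the access success rate and access response time of the proxy IP, whether the proxy IP having the first validity has second validity comprises:
    using the proxy IP having the first validity to visit multiple search engines multiple times, and calculating the access success rate and access response time of the proxy IP having the first validity;
    judging whether the access success rate of the proxy IP having the first validity is greater than a preset access success rate threshold, and at the same time judging whether the access response time of the proxy IP having the first validity is less than a preset access response time threshold;
    when the access success rate of the proxy IP having the first validity is greater than the preset access success rate threshold and its access response time is less than the preset access response time threshold, determining that the proxy IP having the first validity has second validity;
    when the access success rate of the proxy IP having the first validity is less than or equal to the preset access success rate threshold, or its access response time is greater than or equal to the preset access response time threshold, determining that the proxy IP having the first validity does not have second validity.
  18. The storage medium of claim 16, characterized in that recording, in the blacklist in the proxy IP pool, proxy IPs determined not to have the first validity and proxy IPs determined to have the first validity but not the second validity comprises:
    recording proxy IPs determined to have the first validity but not the second validity in a first blacklist sublist of the blacklist, and recording proxy IPs determined not to have the first validity in a second blacklist sublist of the blacklist;
    using a proxy IP that does not have the first validity to visit multiple search engines multiple times, and calculating the access failure rate of the proxy IP that does not have the first validity;
    judging whether the access failure rate of the proxy IP that does not have the first validity is less than a preset access failure rate;
    when the access failure rate of the proxy IP that does not have the first validity is less than the preset access failure rate, confirming the corresponding proxy IP as a temporarily invalid proxy IP and recording it in the first blacklist sublist; or
    when the access failure rate of the proxy IP that does not have the first validity is greater than the preset access failure rate, confirming the corresponding proxy IP as a permanently invalid proxy IP and recording it in the second blacklist sublist.
  19. The storage medium of any one of claims 15 to 18, characterized in that the computer-readable instructions, when executed by a processor, further implement the following step:
    setting multiple validity levels for proxy IPs according to their access success rate and access response time, and recording the validity levels and their corresponding proxy IPs in the whitelist.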
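  20. The storage medium of claim 19, characterized in that setting multiple validity levels for proxy IPs comprises:
    taking a proxy IP whose access success rate is greater than a preset first access success rate and whose access response time is less than a preset first access response time as a proxy IP of a first validity level;
    taking a proxy IP whose access success rate is less than the preset first access success rate but greater than a preset second access success rate, and whose access response time is greater than the preset first access response time but less than a preset second access response time, as a proxy IP of a second validity level;
    taking a proxy IP whose access success rate is less than the preset second access success rate and whose access response time is greater than the preset second access response time as a proxy IP of a third validity level.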
PCT/CN2018/100162 2018-04-18 2018-08-13 网络爬虫方法、终端及存储介质 WO2019200784A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810349987.6 2018-04-18
CN201810349987.6A CN108551452B (zh) 2018-04-18 2018-04-18 网络爬虫方法、终端及存储介质

Publications (1)

Publication Number Publication Date
WO2019200784A1 true WO2019200784A1 (zh) 2019-10-24

Family

ID=63515403

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/100162 WO2019200784A1 (zh) 2018-04-18 2018-08-13 网络爬虫方法、终端及存储介质

Country Status (2)

Country Link
CN (1) CN108551452B (zh)
WO (1) WO2019200784A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110798470A (zh) * 2019-10-31 2020-02-14 北京金堤科技有限公司 代理ip地址管理方法及系统

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8560604B2 (en) 2009-10-08 2013-10-15 Hola Networks Ltd. System and method for providing faster and more efficient data communication
US9241044B2 (en) 2013-08-28 2016-01-19 Hola Networks, Ltd. System and method for improving internet communication by using intermediate nodes
US11057446B2 (en) 2015-05-14 2021-07-06 Bright Data Ltd. System and method for streaming content from multiple servers
LT3767493T (lt) 2017-08-28 2023-03-10 Bright Data Ltd. Būdas pagerinti turinio parsisiuntimą, naudojant tunelinius įrenginius
CN109413153B (zh) * 2018-09-26 2022-09-02 深圳壹账通智能科技有限公司 数据爬取方法、装置、计算机设备和存储介质
CN109446762A (zh) * 2018-09-26 2019-03-08 深圳壹账通智能科技有限公司 云平台访问方法、装置、计算机设备和存储介质
CN111355693B (zh) * 2018-12-24 2023-10-31 北京奇虎科技有限公司 代理服务的实现方法、装置、电子设备和存储介质
CN109815385A (zh) * 2019-01-31 2019-05-28 无锡火球普惠信息科技有限公司 基于app客户端的爬虫及爬取方法
LT4075304T (lt) 2019-02-25 2023-07-25 Bright Data Ltd. Turinio parsisiuntimo, naudojant url bandymų mechanizmą, sistema ir būdas
CN109948026A (zh) * 2019-03-28 2019-06-28 深信服科技股份有限公司 一种网页数据爬取方法、装置、设备及介质
EP4027618B1 (en) 2019-04-02 2024-07-31 Bright Data Ltd. Managing a non-direct url fetching service
CN110147271B (zh) * 2019-05-15 2020-04-28 重庆八戒传媒有限公司 提升爬虫代理质量的方法、装置及计算机可读存储介质
CN110287395A (zh) * 2019-07-01 2019-09-27 杭州安恒信息技术股份有限公司 一种爬虫方法、装置、系统、设备及可读存储介质
CN110677510A (zh) * 2019-09-11 2020-01-10 苏州朗动网络科技有限公司 Ip代理池的管理方法、设备和存储介质
US10637956B1 (en) 2019-10-01 2020-04-28 Metacluster It, Uab Smart proxy rotator
CN111683163A (zh) * 2020-06-11 2020-09-18 杭州安恒信息技术股份有限公司 代理ip地址分配方法、装置、计算机设备和可读存储介质
CN113422777B (zh) * 2021-06-28 2022-08-19 安天科技集团股份有限公司 基于白名单的渗透测试方法、装置、计算设备及存储介质
CN113836355A (zh) * 2021-10-20 2021-12-24 盐城金堤科技有限公司 视频推荐方法及其装置、计算机存储介质、电子设备
CN113901297A (zh) * 2021-10-25 2022-01-07 杭州安恒信息技术股份有限公司 一种代理ip池的维护方法、装置及设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080183889A1 (en) * 2007-01-31 2008-07-31 Dmitry Andreev Method and system for preventing web crawling detection
CN105426415A (zh) * 2015-10-30 2016-03-23 Tcl集团股份有限公司 网站访问请求的管理方法、装置及系统
CN105827619A (zh) * 2016-04-25 2016-08-03 无锡中科富农物联科技有限公司 高访问情况下的爬虫封禁方法
CN106547793A (zh) * 2015-09-22 2017-03-29 北京国双科技有限公司 获取代理服务器地址的方法和装置

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103581328A (zh) * 2013-11-14 2014-02-12 广州品唯软件有限公司 产品属性数据的获取方法和系统
CN103902386B (zh) * 2014-04-11 2017-05-10 复旦大学 一种基于连接代理优化管理的多线程网络爬虫处理方法
CN104506525B (zh) * 2014-12-22 2018-04-20 北京奇安信科技有限公司 防止恶意抓取的方法和防护装置
CN106534244B (zh) * 2015-09-14 2020-01-17 中国移动通信集团公司 一种代理资源的调度方法及装置
CN106210050A (zh) * 2016-07-12 2016-12-07 安徽天达网络科技有限公司 一种智能反屏蔽网络爬虫系统
CN106776983B (zh) * 2016-12-06 2019-03-26 深圳市小满科技有限公司 搜索引擎优化装置和方法
CN107169006A (zh) * 2017-03-31 2017-09-15 北京奇艺世纪科技有限公司 一种管理爬虫代理的方法及装置
CN107395782A (zh) * 2017-07-19 2017-11-24 北京理工大学 一种基于代理池的ip限制受控源信息抓取方法
CN107635026B (zh) * 2017-09-26 2019-01-22 马上消费金融股份有限公司 一种获取ip的方法及装置

Also Published As

Publication number Publication date
CN108551452A (zh) 2018-09-18
CN108551452B (zh) 2021-01-08

Similar Documents

Publication Publication Date Title
WO2019200784A1 (zh) 网络爬虫方法、终端及存储介质
US8595847B2 (en) Systems and methods to control web scraping
US9031946B1 (en) Processor engine, integrated circuit and method therefor
EP3533199B1 (en) Detection of fraudulent account usage in distributed computing systems
US10133673B2 (en) Cache optimization based on predictive routing
EP3763097B1 (en) System and method for restricting access to web resources from web robots
US8713010B1 (en) Processor engine, integrated circuit and method therefor
IE20170207A1 (en) System and method of managing application updates
KR102150530B1 (ko) 분산 웹 크롤러에 대한 방어 방법 및 장치
WO2024027328A1 (zh) 基于零信任数据访问控制系统的数据处理方法
CN110619214A (zh) 一种监控软件正常运行的方法和装置
WO2021099959A1 (en) Cluster security based on virtual machine content
CN109145621B (zh) 文档管理方法及装置
US20240256668A1 (en) Detecting and Preventing Installation and Execution of Malicious Browser Extensions
US20150067766A1 (en) Application service management device and application service management method
CN110879773A (zh) 一种基于CGroup的内存监控方法和装置
US11394748B2 (en) Authentication method for anonymous account and server
CN111309264B (zh) 一种使目录配额兼容快照的方法、系统、设备及介质
US10255174B2 (en) Common cache pool for applications
US20230069845A1 (en) Using a threat intelligence framework to populate a recursive dns server cache
WO2023045575A1 (zh) 区块链中的权限管控
US10972477B1 (en) Systems and methods for performing micro-segmenting
CN114070596A (zh) Web应用防护系统的性能优化方法、系统、终端及介质
CN111124566A (zh) 一种bmc用户界面操作的管理方法、设备及可读介质
US12099565B2 (en) Systems and method for caching shortcodes and database queries

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 01/02/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18915100

Country of ref document: EP

Kind code of ref document: A1