CN114036364A

CN114036364A - Method, apparatus, device, medium and product for identifying a crawler

Info

Publication number: CN114036364A
Application number: CN202111316197.6A
Authority: CN
Inventors: 何永玄; 薛志方; 谭瑞兴
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2022-02-11
Anticipated expiration: 2041-11-08
Also published as: CN114036364B

Abstract

The present disclosure provides a method, apparatus, device, medium, and product for identifying a crawler, relating to the field of computer technology and in particular to information security. A specific implementation is as follows: acquiring request information that requests access to page data; determining, according to a preset crawler identification order, a target anti-crawler operation for the request information from a preset anti-crawler operation set; performing crawler identification on the request information based on the target anti-crawler operation to obtain an identification result; and, in response to determining that the identification result indicates that the request information was sent by a crawler, determining the identification result as the target crawler identification result. This implementation can improve the data security of web-version applets.

Description

Method, apparatus, device, medium and product for identifying a crawler

Technical Field

The present disclosure relates to the field of computer technology, and more particularly, to the field of information security technology.

Background

A crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. A web-version applet usually exposes some public data for users to browse, and a crawler attack can cause that public data to be used maliciously.

However, no corresponding anti-crawler measures are configured for web-version applets, so the public data in them is exposed to a security risk.

Disclosure of Invention

The present disclosure provides a method, apparatus, device, medium, and product for identifying a crawler.

According to an aspect of the present disclosure, there is provided a method for identifying a crawler, including: acquiring request information that requests access to page data; determining, according to a preset crawler identification order, a target anti-crawler operation for the request information from a preset anti-crawler operation set; performing crawler identification on the request information based on the target anti-crawler operation to obtain an identification result; and, in response to determining that the identification result indicates that the request information was sent by a crawler, determining the identification result as the target crawler identification result.

According to another aspect of the present disclosure, there is provided an apparatus for identifying a crawler, including: an information acquisition unit configured to acquire request information that requests access to page data; an operation determination unit configured to determine, according to a preset crawler identification order, a target anti-crawler operation for the request information from a preset anti-crawler operation set; a crawler identification unit configured to perform crawler identification on the request information based on the target anti-crawler operation to obtain an identification result; and a result determination unit configured to determine the identification result as the target crawler identification result in response to determining that the identification result indicates that the request information was sent by a crawler.

According to another aspect of the present disclosure, there is provided an electronic device including: one or more processors; and a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for identifying a crawler according to any one of the above.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method for identifying a crawler as any one of the above.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method for identifying a crawler as any one of the above.

According to the technology of the present disclosure, a method for identifying a crawler is provided that can improve the data security of web-version applets.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;

FIG. 2 is a flow diagram of one embodiment of a method for identifying crawlers according to the present disclosure;

FIG. 3 is a schematic diagram of one application scenario of a method for identifying crawlers according to the present disclosure;

FIG. 4 is a flow diagram of another embodiment of a method for identifying crawlers according to the present disclosure;

FIG. 5 is a schematic block diagram of one embodiment of an apparatus for identifying crawlers according to the present disclosure;

FIG. 6 is a block diagram of an electronic device for implementing a method for crawler identification according to an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, a netpage applet proxy server 105, a network 106 and a developer server 107. The network 104 serves as a medium for providing a communication link between the terminal devices 101, 102, 103 and the netpage applet proxy server 105, and the network 106 serves as a medium for providing a communication link between the netpage applet proxy server 105 and the developer server 107. The networks 104, 106 may include various connection types, such as wired or wireless communication links, or fiber optic cables, among others.

The user may use the terminal devices 101, 102, 103 to interact with the netpage applet proxy server 105 over the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 may be installed with an applet client, and by running the applet client, a user may obtain corresponding services provided by the applet proxy server 105 and the developer server 107 for the applet client.

The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, mobile phones, computers, tablets, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No particular limitation is imposed here.

The netpage applet proxy server 105 may be a server that provides various applet proxy services. For example, the netpage applet proxy server 105 may acquire request information corresponding to the applet client transmitted by the terminal devices 101, 102, 103, transmit the request information to the developer server 107 through the network 106, receive service contents corresponding to the request information returned by the developer server 107, and return the service contents to the terminal devices 101, 102, 103.

After the applet proxy server 105 acquires the request information sent by the terminal devices 101, 102, 103 and before the request information is sent to the developer server 107 through the network 106, in order to improve data security, a target anti-crawler operation for the request information may be determined from a preset anti-crawler operation set according to a preset crawler identification order, and crawler identification may be performed on the request information based on the target anti-crawler operation to obtain an identification result. If the identification result indicates that the request information was sent by a crawler, the identification result is determined as the target crawler identification result. Optionally, when the target crawler identification result indicates that the request information was sent by a crawler, the applet proxy server 105 may intercept the request information, or may send a prompt message to the developer server 107 so that the developer server 107 handles the request information identified as coming from a crawler.

The netpage applet proxy server 105 and the developer server 107 may be hardware or software. When the netpage applet proxy server 105 and the developer server 107 are hardware, they may be implemented as a distributed server cluster consisting of a plurality of servers, or as a single server. When the netpage applet proxy server 105 and the developer server 107 are software, they may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No particular limitation is imposed here.

The developer server 107 may be a server that provides various services, for example, the developer server 107 may receive request information transmitted by the netpage applet proxy server 105 over the network 106 and respond to the request information.

It should be noted that the method for identifying a crawler provided by the embodiments of the present disclosure is generally performed by the netpage applet proxy server 105, and the apparatus for identifying a crawler is generally disposed in the netpage applet proxy server 105.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow 200 of one embodiment of a method for identifying crawlers according to the present disclosure is shown. The method for identifying a crawler of this embodiment includes the following steps:

Step 201, acquiring request information that requests access to page data.

In this embodiment, the execution body (e.g., an electronic device such as the netpage applet proxy server 105 shown in FIG. 1) may acquire request information that requests access to page data, check the request information, and identify whether it was sent by a normal user or by a crawler, so that crawlers can be intercepted and the data security of the page data ensured. The page data may be page data of a netpage applet or of another application, which is not limited in this embodiment. A netpage applet is the H5 (HTML5, a collection of technologies for building interactive pages) version of an applet.

Step 202, determining, according to a preset crawler identification order, a target anti-crawler operation for the request information from a preset anti-crawler operation set.

In this embodiment, the preset crawler identification order is the order in which the anti-crawler operations in the preset anti-crawler operation set are applied. For example, if the set includes anti-crawler operation A, anti-crawler operation B, anti-crawler operation C and anti-crawler operation D, the preset crawler identification order may be: perform crawler identification first with operation D, then with operation C, then with operation B, and finally with operation A.

Each anti-crawler operation in the preset anti-crawler operation set may be an operation adopted to deal with crawlers of a different level. The execution body may establish in advance a correspondence between each anti-crawler operation and its crawler scene, and then sort the anti-crawler operations by the level of their crawler scenes to obtain the preset crawler identification order. The level of a crawler scene may be determined by the complexity of its scene features: the higher the complexity, the higher the level. For example, the anti-crawler operations may be sorted from the lowest-level crawler scene to the highest to obtain the preset crawler identification order. Selecting the target anti-crawler operation and performing crawler identification in this order strengthens crawler protection step by step and provides higher security.
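The ordering described above can be illustrated with a minimal Python sketch. The class name, the scene descriptions and the numeric scene levels are hypothetical values chosen for illustration; the disclosure only states that operations are sorted by the level of their crawler scenes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AntiCrawlerOperation:
    name: str
    scene: str        # crawler scene this operation is meant to handle
    scene_level: int  # hypothetical complexity level; higher means more complex


def build_identification_order(operations):
    """Sort operations from the lowest-level crawler scene to the highest."""
    return sorted(operations, key=lambda op: op.scene_level)


# Hypothetical operation set, deliberately listed out of order.
operations = [
    AntiCrawlerOperation("data_analysis", "crawlers missed by earlier checks", 4),
    AntiCrawlerOperation("end_feature", "forged end features", 1),
    AntiCrawlerOperation("human_machine", "automation scripts in a real browser", 3),
    AntiCrawlerOperation("token", "forged end features with request replay", 2),
]

order = [op.name for op in build_identification_order(operations)]
print(order)
```

Sorting once, ahead of time, keeps the per-request path to a simple walk over an already ordered list.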

The execution body may determine a target anti-crawler operation from the preset anti-crawler operation set according to the preset crawler identification order, and perform crawler identification on the request information using that operation. Optionally, the execution body may first take the anti-crawler operation that comes first in the crawler identification order as the target anti-crawler operation. If its identification result indicates that the request information was sent by a crawler, no further target anti-crawler operation is determined. If its identification result indicates that the request information was sent by a user, the anti-crawler operation that comes second in the order is taken from the set as the new target anti-crawler operation. The execution body may repeat this process of determining a target anti-crawler operation and performing crawler identification until the request information is judged to come from a crawler, or until every anti-crawler operation in the set has been applied and each has judged the request information to come from a user.
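The traversal just described amounts to a short-circuiting loop, sketched below in Python. The two check functions and the request field names (`user_agent`, `token`) are hypothetical stand-ins for real anti-crawler operations.

```python
from typing import Callable, Iterable, Mapping

CRAWLER = "crawler"
USER = "user"

Check = Callable[[Mapping[str, str]], str]


def identify(request: Mapping[str, str], ordered_checks: Iterable[Check]) -> str:
    """Apply each anti-crawler operation in the preset order.

    Stops at the first operation that reports a crawler; reports a user only
    after every operation in the set has been applied.
    """
    for check in ordered_checks:
        if check(request) == CRAWLER:
            return CRAWLER
    return USER


# Hypothetical checks used only to exercise the loop.
def check_user_agent(request: Mapping[str, str]) -> str:
    return USER if request.get("user_agent") else CRAWLER


def check_token(request: Mapping[str, str]) -> str:
    return USER if request.get("token") == "expected" else CRAWLER


checks = [check_user_agent, check_token]
result_ok = identify({"user_agent": "Mozilla/5.0", "token": "expected"}, checks)
result_bad = identify({"user_agent": "Mozilla/5.0", "token": "replayed"}, checks)
print(result_ok, result_bad)
```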

In some optional implementations of this embodiment, determining, according to a preset crawler identification order, a target anti-crawler operation for the request information from a preset anti-crawler operation set may include: parsing the request parameters of the request information to determine the scene features corresponding to the request information; and determining the target anti-crawler operation for the request information from the anti-crawler operation set based on the similarity between those scene features and the features of each crawler scene, together with the preset crawler identification order. The preset crawler identification order may be an order of anti-crawler operations determined from the features of the crawler scenes. The execution body may find the crawler scene whose features are most similar to the scene features of the request information, and take the anti-crawler operation that corresponds to that crawler scene in the preset crawler identification order as the target anti-crawler operation. With this optional implementation, the target anti-crawler operation that best matches the features of the request information can be selected, and performing crawler identification with it can improve identification accuracy.
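As a rough illustration of the similarity-based selection, the Python sketch below represents scene features as sets of labels and uses Jaccard similarity. Both choices are assumptions: the disclosure does not specify a feature representation or a similarity measure, and the scene names and feature labels are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical crawler scenes, listed in the preset identification order.
SCENES = [
    ("end_feature", {"missing_referer", "generic_user_agent"}),
    ("token", {"repeated_request", "stale_token"}),
    ("human_machine", {"headless_browser", "no_pointer_events"}),
]


def select_target_operation(request_features: set) -> str:
    """Pick the operation whose crawler scene is most similar to the request."""
    best_name, best_score = SCENES[0][0], -1.0
    for name, scene_features in SCENES:
        score = jaccard(request_features, scene_features)
        if score > best_score:  # strict: ties keep the earlier operation
            best_name, best_score = name, score
    return best_name


target = select_target_operation(
    {"repeated_request", "stale_token", "generic_user_agent"}
)
print(target)
```

Breaking ties in favour of the earlier entry is one way to let the preset identification order still play a role when several scenes match equally well.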

Step 203, performing crawler identification on the request information based on the target anti-crawler operation to obtain an identification result.

In this embodiment, after the execution body determines the target anti-crawler operation, it may perform crawler identification on the request information according to that operation to obtain an identification result. The identification result may indicate that the request information was sent by a user or that it was sent by a crawler. Different anti-crawler operations correspond to different crawler identification means, and applying different means to the request information further strengthens data security.

Step 204, in response to determining that the identification result indicates that the request information was sent by a crawler, determining the identification result as the target crawler identification result.

In this embodiment, if the identification result indicates that the request information was sent by a crawler, the identification result is directly determined as the final target crawler identification result. When the target crawler identification result indicates a crawler, the execution body may intercept the request information to prevent abnormal access by the crawler. If the identification result indicates that the request information was sent by a user, the execution body may repeat steps 202 to 204 until a target crawler identification result indicating a crawler is obtained, or until all anti-crawler operations in the anti-crawler operation set have been traversed, which yields a target crawler identification result indicating a user.

With continued reference to FIG. 3, a schematic diagram of one application scenario of a method for identifying crawlers according to the present disclosure is shown. In the application scenario of FIG. 3, the execution body may execute step 301: when a user or a crawler requests access to a web page (e.g., an applet web page), it acquires the request information, determines a target anti-crawler operation 302 from a preset anti-crawler operation set according to a preset crawler identification order, and performs crawler identification on the request information using the target anti-crawler operation 302 to obtain an identification result 303. If the identification result 303 indicates that the request information was sent by a crawler, the identification result 303 is determined as the target crawler identification result 304. If the identification result 303 indicates that the request information was sent by a user, a target anti-crawler operation 302 is determined again from the preset anti-crawler operation set according to the preset crawler identification order, until a target crawler identification result indicating a crawler is obtained, or until every anti-crawler operation in the set has been traversed and a target crawler identification result indicating a user is obtained.

In the method for identifying a crawler according to the embodiments of the present disclosure, a crawler identification order and an anti-crawler operation set may be preset, a target anti-crawler operation may be determined from the anti-crawler operation set according to the crawler identification order, and crawler identification may be performed on the request information based on the target anti-crawler operation. This protects page data (e.g., the page data of a web-version applet) and thereby improves the data security of web-version applets.

With continued reference to FIG. 4, a flow 400 of another embodiment of a method for identifying crawlers according to the present disclosure is shown. As shown in FIG. 4, the method for identifying a crawler of this embodiment may include the following steps:

Step 401, acquiring request information that requests access to page data, where the request information is used to request access to page data of the netpage applet.

In this embodiment, the request information is used to access page data of the netpage applet. A netpage applet is the H5 (HTML5, a collection of technologies for building interactive pages) version of an applet. For a detailed description of step 401, refer to the detailed description of step 201, which is not repeated here.

In some optional implementations of this embodiment, the following steps may also be performed: determining an encrypted network address corresponding to the request information; determining a first encryption index and a second encryption index in the encrypted network address; decrypting the encrypted network address based on the first encryption index and the second encryption index to obtain a decrypted network address; and performing network access based on the decrypted network address.

In this implementation, to protect third-party services from crawler attacks, the URL (Uniform Resource Locator) corresponding to the third-party service is encrypted. When performing network access, the execution body may first decrypt the encrypted URL in the request information to obtain a decrypted network address, and then perform network access based on the decrypted network address.

The encrypted network address in the request information sent by the user is obtained through the following steps: acquiring two randomly generated random numbers to obtain the first encryption index and the second encryption index; dividing the initial network address into a first network sub-address and a second network sub-address based on the first encryption index and the second encryption index; for each character in the first network sub-address, offsetting the character according to the offset corresponding to the first encryption index and the offset corresponding to the position of the character in the first network sub-address, to obtain the offset-processed first network sub-address; for each character in the second network sub-address, offsetting the character according to the offset corresponding to the second encryption index and the offset corresponding to the position of the character in the second network sub-address, to obtain the offset-processed second network sub-address; and splicing the offset-processed first network sub-address and second network sub-address to obtain the encrypted network address.

For example, suppose the initial network address is "https://api.tusij.com/v2/get-categorytester key & source=baidu_app" and the two randomly generated random numbers are 22 and 1. Mapping each random number to a letter by its position in the alphabet gives the first encryption index w (a offset backward by 22) and the second encryption index B (a offset backward by 1). The first encryption index and the second encryption index are then combined into "/wB", and "/wB" is inserted into the initial network address, dividing it into a first network sub-address and a second network sub-address. For example, the initial network address with "/wB" inserted is "https://api.tusij.com/wB/v2/get-categorytoken=source=baidu_app", in which case the first network sub-address is "https://api.tusij.com" and the second network sub-address is "/v2/get-categorytoken=source=baidu_app". Each character in the first network sub-address is offset according to the offset corresponding to the first encryption index (22) and the offset corresponding to the position of the character in the first network sub-address (for example, the letter a of "api" is at position 8), which yields the offset-processed first network sub-address "dqros://euo.bdctv.qdc". Each character in the second network sub-address is offset according to the offset corresponding to the second encryption index (1) and the offset corresponding to the position of the character in the second network sub-address, which yields the offset-processed second network sub-address "/x2/lka-lkeqtchezj=hhqme_mcd". The first network sub-address and the second network sub-address are then spliced, finally obtaining an encrypted network address of "dqros://euo.

Further, when decrypting the URL, the execution body may determine the encrypted network address corresponding to the request information, parse the encrypted network address, and determine the first encryption index and the second encryption index in it. The encrypted network address is then offset in the reverse direction using the first encryption index and the second encryption index to obtain the decrypted network address. The execution body may access the third-party service using the decrypted network address.
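The per-character offset can be sketched in Python as follows. The rule used here is inferred from the worked example rather than stated explicitly in the disclosure: each lowercase letter is shifted by the index offset plus its zero-based position in the sub-address, wrapping within the 26-letter alphabet, and every other character is left unchanged. The function name and the treatment of non-lowercase characters are assumptions.

```python
import string

ALPHABET = string.ascii_lowercase


def shift_sub_address(sub_address: str, index_offset: int, decrypt: bool = False) -> str:
    """Shift each lowercase letter by (index_offset + position), modulo 26.

    Assumption: digits, punctuation and uppercase letters pass through as-is.
    Decryption applies the same shift in the opposite direction.
    """
    sign = -1 if decrypt else 1
    out = []
    for position, char in enumerate(sub_address):
        if char in ALPHABET:
            shift = sign * (index_offset + position)
            out.append(ALPHABET[(ALPHABET.index(char) + shift) % 26])
        else:
            out.append(char)
    return "".join(out)


# The first encryption index w corresponds to an offset of 22 (a shifted by 22).
first_offset = ALPHABET.index("w")

encrypted = shift_sub_address("https://api.tusij.com", first_offset)
decrypted = shift_sub_address(encrypted, first_offset, decrypt=True)
print(encrypted)  # dqros://euo.bdctv.qdc
print(decrypted)  # https://api.tusij.com
```

Under these assumptions the sketch reproduces the offset-processed first network sub-address given in the example, and decryption only needs the encryption index recovered from the address.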

Step 402, determining, according to a preset crawler identification order, a target anti-crawler operation for the request information from a preset anti-crawler operation set, where the anti-crawler operations in the preset anti-crawler operation set include at least one of the following: an end feature identification operation, a token identification operation, a human-machine feature identification operation, a data analysis identification operation, and a signature identification operation.

In this embodiment, the end feature identification operation may be an identification operation for crawler scenes that forge end features. End features are specific parameters in the request information, which may include, but are not limited to, the UA (User Agent, environment information of the terminal), the referer (address of the previous page), headers, and the like. Specifically, the end feature identification operation may parse the specific parameters in the request information and check whether they match the parameters expected of a user. If they do not, the request information is determined to come from a crawler; if they do, it is determined to come from a user.
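A minimal Python sketch of such an end feature check is shown below. The required headers, the allowed referer prefix and the blocked User-Agent markers are hypothetical policy values; the disclosure does not list concrete rules.

```python
from typing import Mapping

# Hypothetical policy values, for illustration only.
REQUIRED_HEADERS = ("User-Agent", "Referer")
ALLOWED_REFERER_PREFIX = "https://applet.example.com/"
BLOCKED_UA_MARKERS = ("python-requests", "curl", "scrapy")


def end_feature_check(headers: Mapping[str, str]) -> str:
    """Return "crawler" when the end features do not look like a user's.

    Assumes header names have already been normalized to the spelling above.
    """
    for name in REQUIRED_HEADERS:
        if not headers.get(name):
            return "crawler"
    user_agent = headers["User-Agent"].lower()
    if any(marker in user_agent for marker in BLOCKED_UA_MARKERS):
        return "crawler"
    if not headers["Referer"].startswith(ALLOWED_REFERER_PREFIX):
        return "crawler"
    return "user"


browser_result = end_feature_check({
    "User-Agent": "Mozilla/5.0 (Linux; Android 12) Mobile",
    "Referer": "https://applet.example.com/home",
})
script_result = end_feature_check({
    "User-Agent": "python-requests/2.31",
    "Referer": "https://applet.example.com/home",
})
print(browser_result, script_result)
```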

Further, the token identification operation may be an identification operation for crawler scenes that forge end features and replay requests. Request replay refers to repeatedly retrying the same request. Specifically, to deal with this crawler scene, the request information may carry encrypted token information, and the execution body may decrypt the token information to verify the token. If the token check passes, the request information is determined to come from a user; if the token check fails, the request information is determined to come from a crawler.

Further, the human-machine feature identification operation may be an identification operation for crawler scenes in which an automation script is run in a real browser. Specifically, the execution body may obtain the identification result by checking the device identifier corresponding to the request information, the Internet Protocol (IP) address corresponding to the request information, the user agent corresponding to the request information, and the like. For example, if the device identifier indicates a device associated with a crawler, the identification result is that the request information comes from a crawler.

Further, the data analysis identification operation may be an identification operation for crawler scenes that the preceding operations cannot identify, that is, scenes not covered by forged end features, forged end features with request replay, or automation scripts run in a real browser. Specifically, the execution body may perform statistical processing based on information such as historical crawler data and the scene features of the current request information, score the request information, and determine from the score whether the request information comes from a user or a crawler.
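The scoring step could look roughly like the Python sketch below. The statistics collected, the weights, the normalization constants and the threshold are all hypothetical; the disclosure only says that the request is scored from historical crawler data and current scene features.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    requests_last_minute: int        # hypothetical rate feature
    distinct_pages_last_minute: int  # hypothetical breadth feature
    seen_in_crawler_history: bool    # source matched historical crawler data


THRESHOLD = 0.5  # hypothetical decision threshold


def crawler_score(stats: RequestStats) -> float:
    """Combine normalized features into a score between 0 and 1."""
    rate = min(stats.requests_last_minute / 120, 1.0)
    breadth = min(stats.distinct_pages_last_minute / 60, 1.0)
    history = 1.0 if stats.seen_in_crawler_history else 0.0
    return 0.4 * rate + 0.2 * breadth + 0.4 * history


def classify(stats: RequestStats) -> str:
    return "crawler" if crawler_score(stats) >= THRESHOLD else "user"


casual = classify(RequestStats(6, 3, False))
heavy = classify(RequestStats(240, 90, False))
print(casual, heavy)
```

A weighted sum is only one possible scoring rule; a trained classifier over the same features would fit the description equally well.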

Further, the signature identification operation may be an identification operation for crawler scenes that bypass the execution body (the applet proxy server) and attack the developer server directly. Specifically, to deal with this crawler scene, the execution body and the developer server may use the same signature generation algorithm. When the signature carried in the request information is verified, the developer server can perform the verification in the same manner even if the execution body has been bypassed, and can return the verification result to the execution body. If the verification result indicates that the signature check passed, the identification result is that the request information comes from a user; if it indicates that the signature check failed, the identification result is that the request information comes from a crawler. Furthermore, both the NA-side applet and the netpage applet proxy server send request information to the developer server, so the same signature generation algorithm can be used by the NA-side applet and the netpage applet proxy server. (NA stands for Native App, also called a local app: a third-party application written natively for and run on the local operating system of a smartphone, such as iOS, Android or WP, and commonly developed in languages such as Java, C++ and Objective-C.) The developer server can thus receive signatures from both sources, the NA-side applet and the netpage applet proxy server, verify them, and return the verification results to the NA-side applet and the netpage applet proxy server.

In addition, sorting the anti-crawler operations in the above anti-crawler operation set according to the preset crawler identification order may give the following result: the end feature identification operation, the token identification operation, the human-machine feature identification operation, the data analysis identification operation, and the signature identification operation. The execution body may take out the target anti-crawler operation in turn according to this ordering, thereby traversing the anti-crawler operation set.
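A shared signature scheme of this kind can be sketched in Python with an HMAC. HMAC-SHA256, the shared secret, and the signed fields (method, path, timestamp) are assumptions made for illustration; the disclosure does not name a specific signature generation algorithm.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, timestamp: str) -> str:
    """Signature generation shared by the proxy server and the developer server."""
    message = "\n".join((method.upper(), path, timestamp)).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, method: str, path: str,
                     timestamp: str, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_request(secret, method, path, timestamp)
    return hmac.compare_digest(expected, signature)


SECRET = b"hypothetical-shared-secret"

# The proxy server signs the request it forwards.
signature = sign_request(SECRET, "GET", "/v2/items", "1700000000")

# The developer server verifies with the same algorithm; a request that
# bypassed the proxy and carries a made-up signature fails the check.
forwarded_ok = verify_signature(SECRET, "GET", "/v2/items", "1700000000", signature)
bypass_ok = verify_signature(SECRET, "GET", "/v2/items", "1700000000", "0" * 64)
print(forwarded_ok, bypass_ok)
```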

Step 403, in a case where the target anti-crawler operation is a token identification operation, determining token index information corresponding to the request information; determining a target character based on the token index information; and in response to determining that the target character does not match a preset character, determining that the identification result is that the request information is a crawler.

In this embodiment, if the target anti-crawler operation is a token identification operation, the token information (token) carried in the request information may be checked. Specifically, the execution body may include a webpage rendering module (web-xrender) and a webpage interface management module (webapi). When a user requests to render a page, the webpage rendering module in the execution body may send corresponding token index information to the user; for example, it may send the user unpadded token information encoded with base64url (a URL-safe variant of the base64 binary-to-text encoding), where the token information includes the above token index information. When the user requests to access page data, the token information is carried in the request information. The webpage interface management module in the execution body can determine, from the token index information, the target character that needs to be checked, and match the target character against the preset character. If the target character does not match the preset character, the identification result is determined to be that the request information is a crawler; if the target character matches the preset character, the identification result is determined to be that the request information comes from a user.

The preset character may be token character information issued when the user requests to render the page. If the request information is sent by a user, the token character information in the request information is the same as the preset character. If the request information is sent by a crawler, the token character information in the request information differs from the preset character. Through the token identification operation, the request information can be verified against the preset character held in the execution body; if the scheme is cracked, verification can be restored simply by modifying the preset character, which makes security changes more convenient. Optionally, the execution body may also obfuscate the code of the content displayed at the applet front end, to prevent the front-end code from being cracked and further improve the security of the applet.

In some optional implementations of this embodiment, the following steps may also be performed: determining a target applet identifier and a target timestamp corresponding to the request information; and in response to determining that the target applet identifier does not match a preset applet identifier or that the target timestamp has expired, determining that the identification result is that the request information is a crawler.

In this implementation, the request information may correspond to a target applet identifier and a target timestamp. Optionally, the target applet identifier, the target timestamp, and the token information may be stored in association with one another. The target applet identifier is the unique identification information of the applet, and the target timestamp describes the timeliness of the token information. The execution body can parse the current webpage domain name to obtain the preset applet identifier. If the target applet identifier does not match the preset applet identifier, the identification result is that the request information is a crawler. If the target applet identifier matches the preset applet identifier, the identification result is that the request information comes from a user. The execution body may also store a validity period in advance; if the difference between the current time and the target timestamp is greater than the preset validity period, the request information has expired, the identification result can be determined to be a crawler, and the request information is intercepted.

For example, when a user requests to render an applet webpage, the webpage rendering module issues unpadded, base64url-encoded token information to the browser requesting the rendering, where the token information may correspond to a target applet identifier and a target timestamp. The user can then send request information for requesting access to page data to the webpage interface management module, the request information carrying the token information, the target applet identifier, and the target timestamp. The execution body decodes the unpadded, base64url-encoded token information and verifies whether the target applet identifier is correct and whether the target timestamp has expired. If the decoded token information is correct, the target applet identifier is correct, and the target timestamp has not expired, the identification result is determined to be a user. If the decoded token information is wrong, or the target applet identifier is wrong, or the target timestamp has expired, the identification result can be determined to be a crawler.

The encoded token information may be generated through the following steps: generating a random number and converting it into binary form; generating a target timestamp based on the current time; generating a target applet identifier based on the applet identifier requested by the user; concatenating the random number, the target timestamp, and the target applet identifier, and computing the unpadded base64url character string of the result; determining, according to a preset index, the position in the character string at which a character is to be inserted; and inserting the preset character at that position to obtain the unpadded, base64url-encoded token information with the inserted character.

Further, the execution body decodes and checks the token information as follows: determining the token information in the request information and the token index information for that token information, where the token index information may be index information used for reading characters and may be stored in the execution body in advance; and determining the target character at the corresponding position in the token information based on the token index information. The target character is then matched against the preset character pre-stored in the execution body. If the characters match, the target applet identifier in the token information is correct, and the target timestamp has not expired, the identification result is determined to be that the request information comes from a user. If the characters do not match, or the target applet identifier in the token information is incorrect, or the target timestamp has expired, the identification result is determined to be that the request information is a crawler.
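The token generation and checking steps described above can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the names `PRESET_CHAR`, `TOKEN_INDEX`, `VALID_SECONDS`, `issue_token`, and `is_crawler`, the `|` separator, and the hexadecimal form of the random number are all assumptions made for illustration.

```python
import base64
import os
import time
from typing import Optional

# Hypothetical values; the disclosure only says that a preset character,
# a preset index and a validity period exist.
PRESET_CHAR = "x"
TOKEN_INDEX = 5
VALID_SECONDS = 300


def _b64url_nopad(raw: bytes) -> str:
    """Encode bytes as base64url and strip the '=' padding."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode unpadded base64url text by restoring the padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(applet_id: str, now: Optional[int] = None) -> str:
    """Build the token: random number + timestamp + applet identifier."""
    now = int(time.time()) if now is None else now
    nonce = os.urandom(8).hex()  # random number (assumed hex form)
    payload = f"{nonce}|{now}|{applet_id}".encode("utf-8")
    encoded = _b64url_nopad(payload)
    # Insert the preset character at the position given by the preset index.
    return encoded[:TOKEN_INDEX] + PRESET_CHAR + encoded[TOKEN_INDEX:]


def is_crawler(token: str, expected_applet_id: str,
               now: Optional[int] = None) -> bool:
    """Return True when the token check identifies the request as a crawler."""
    now = int(time.time()) if now is None else now
    # 1. The target character at the indexed position must match.
    if len(token) <= TOKEN_INDEX or token[TOKEN_INDEX] != PRESET_CHAR:
        return True
    # 2. Remove the inserted character and decode the remaining string.
    encoded = token[:TOKEN_INDEX] + token[TOKEN_INDEX + 1:]
    try:
        text = _b64url_decode(encoded).decode("utf-8")
        _nonce, issued_at, applet_id = text.split("|", 2)
        issued = int(issued_at)
    except ValueError:  # bad base64, bad UTF-8, or malformed fields
        return True
    # 3. The applet identifier must match and the timestamp must be fresh.
    if applet_id != expected_applet_id:
        return True
    return now - issued > VALID_SECONDS
```

In this sketch a caller would issue the token when rendering the page with `issue_token("applet-123")` and later call `is_crawler(token, "applet-123")` on each data request; changing `PRESET_CHAR` or `TOKEN_INDEX` invalidates all previously issued tokens, which mirrors the point that security changes only require modifying the preset character.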

Step 404, acquiring crawler analysis data under the condition that the target anti-crawler operation is a data analysis and identification operation; and performing crawler identification on the request information based on crawler analysis data to obtain an identification result corresponding to the request information.

In this embodiment, the execution body may obtain crawler analysis data in advance, where the crawler analysis data may be data obtained by analyzing historical crawler data, features of the current request information, and features of different crawler scenarios. Based on the crawler analysis data, crawler identification can be performed on the request information to obtain an identification result. The identification result may indicate that the request information is a crawler, or may indicate that the request information comes from a user. Optionally, the execution body may also generate a corresponding rating score based on the crawler analysis data and the request information; for example, the higher the probability that the request information is a crawler, the higher the corresponding rating score may be.

Step 405, determining signature information in the request information under the condition that the target anti-crawler operation is a signature synchronous identification operation; and obtaining an identification result based on the signature information and preset signature information.

In this embodiment, the execution body may share a signature generation algorithm with the developer server. When the request information is verified, the signature information in the request information can be compared with signature information generated by the preset signature generation algorithm shared by the execution body and the developer server. If the signatures are the same, the identification result is determined to be that the request information comes from a user; if the signatures are different, the identification result is determined to be that the request information is a crawler. The developer server may identify the request information using the same signature comparison method.

The shared signature generation algorithm generates the signature through the following steps: acquiring an applet key; computing the md5 (a widely used cryptographic hash function) value of the applet key; decoding a designated part of the URL (for example, a number of characters counted from the end) to obtain a first decoded value; decoding the key-value pairs in the query information and sorting the decoded key-value pairs to obtain a second decoded value; concatenating the md5 value, the first decoded value, the second decoded value, and the timestamp to obtain a concatenated character string; and computing the md5 hash of the concatenated character string to generate the signature.

Further, when the signature information in the request information is verified, the execution body may decrypt the signature information through md5 to obtain decrypted signature information, and then compare the key part of the decrypted signature information with the md5 value of the applet key, compare the URL decoded value in the decrypted signature information with the first decoded value, compare the sorted key-value pairs in the decrypted signature information with the second decoded value, and compare the timestamp with the current time. If the key part of the decrypted signature information is the same as the md5 value of the applet key, the URL decoded value in the decrypted signature information is the same as the first decoded value, the sorted key-value pairs in the decrypted signature information are the same as the second decoded value, and the timestamp has not expired, the signatures are determined to be the same and the identification result is that the request information comes from a user. If any of these conditions is not met, the signatures are determined to be different and the identification result is that the request information is a crawler.
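The signature steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `make_signature`, `is_crawler_by_signature`, `TAIL_LENGTH`, and `MAX_AGE_SECONDS` are hypothetical, the "designated part" of the URL is assumed to be the last few characters of the decoded path, and, because md5 is a one-way hash, the check here recomputes the signature from the same inputs and compares the two values rather than reversing the hash.

```python
import hashlib
import hmac
from urllib.parse import parse_qsl, unquote, urlsplit

TAIL_LENGTH = 8        # hypothetical: characters taken from the end of the path
MAX_AGE_SECONDS = 300  # hypothetical validity period for the timestamp


def _md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def make_signature(applet_key: str, url: str, timestamp: int) -> str:
    """Generate the signature shared by the proxy and the developer server."""
    key_digest = _md5_hex(applet_key)                   # md5 of the applet key
    parts = urlsplit(url)
    first = unquote(parts.path)[-TAIL_LENGTH:]          # first decoded value
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    second = "&".join(f"{k}={v}" for k, v in pairs)     # second decoded value
    return _md5_hex(key_digest + first + second + str(timestamp))


def is_crawler_by_signature(signature: str, applet_key: str, url: str,
                            timestamp: int, now: int) -> bool:
    """Return True when the signature check identifies a crawler."""
    if now - timestamp > MAX_AGE_SECONDS:               # expired timestamp
        return True
    expected = make_signature(applet_key, url, timestamp)
    # Constant-time comparison of the received and recomputed signatures.
    return not hmac.compare_digest(signature.encode("utf-8"),
                                   expected.encode("utf-8"))
```

Because the query key-value pairs are sorted before concatenation, the same request yields the same signature regardless of parameter order, so the proxy server and the developer server can each run `make_signature` independently and reach the same result.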

In some optional implementations of this embodiment, the following steps may also be performed: determining crawler score information corresponding to the request information based on the identification result; and outputting crawler score information.

In this implementation, the execution body may generate crawler score information corresponding to the request information based on the recognition result, the rating score corresponding to the data analysis recognition operation, and the recognition result of the human-machine feature recognition operation, and output the crawler score information to the developer server, so that the developer server can handle the request with other anti-crawler means. The recognition result here may be the recognition result of at least one anti-crawler operation in the preset anti-crawler operation set. The crawler score information describes the probability that the request information is a crawler.

And step 406, in response to determining that the identification result indicates that the request information is a crawler, determining the identification result as a target crawler identification result.

In this embodiment, for the detailed description of step 406, please refer to the detailed description of step 204, which is not repeated herein.

Step 407, in response to determining that the identification result indicates that the request information is not a crawler and that the preset anti-crawler operation set is not traversed, re-determining a target anti-crawler operation for the request information from the preset anti-crawler operation set according to a preset crawler identification sequence.

In this embodiment, if the identification result indicates that the request information is not a crawler (that is, it comes from a user) and the preset anti-crawler operation set has not been fully traversed, that is, an unused anti-crawler operation remains in the preset anti-crawler operation set, the target anti-crawler operation is determined again from the preset anti-crawler operation set according to the preset crawler identification order, and identification continues. Optionally, the execution body may store the current identification result; if the identification result carries information such as a crawler level, the execution body may store that information together with the identification result. The target anti-crawler operation is then re-determined.

Step 408, in response to determining that the identification result indicates that the request information is not a crawler and that traversal of the preset anti-crawler operation set is complete, determining the identification result as a target crawler identification result.

In this embodiment, if the preset anti-crawler operation set has been fully traversed and every identification result indicates that the request information is not a crawler, the identification result indicating that the request information is not a crawler is determined as the target crawler identification result.
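The control flow of steps 406 to 408 can be summarized with a short Python sketch. The function `identify_crawler`, the `Request` type, and the two placeholder checks are hypothetical; a real deployment would plug in the end feature, token, human-machine feature, data analysis, and signature operations in the preset order.

```python
from typing import Callable, Dict, List, Optional, Tuple

Request = Dict[str, str]
# A check returns True when it identifies the request as a crawler.
Check = Callable[[Request], bool]


def identify_crawler(request: Request,
                     ordered_checks: List[Tuple[str, Check]]
                     ) -> Tuple[bool, Optional[str]]:
    """Run the anti-crawler operations in the preset order.

    Stops at the first operation that identifies a crawler; if every
    operation passes, the request is treated as coming from a user.
    """
    for name, check in ordered_checks:
        if check(request):
            return True, name      # crawler identified: target result
    return False, None             # set fully traversed, no crawler found


# Placeholder checks standing in for the real operations (assumptions).
def missing_token(request: Request) -> bool:
    return "token" not in request


def missing_signature(request: Request) -> bool:
    return "signature" not in request


CHECKS: List[Tuple[str, Check]] = [
    ("token", missing_token),
    ("signature", missing_signature),
]
```

Returning the name of the operation that fired is a design choice of this sketch; it makes it easy to log which stage rejected a request or to feed that information into crawler score information.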

According to the method for identifying a crawler described above, when the identification result indicates that the request information is not a crawler and the anti-crawler operation set has not been fully traversed, the target anti-crawler operation can be re-determined until a crawler is identified or the whole anti-crawler operation set has been traversed without identifying a crawler. Protection against crawlers is thus strengthened step by step until all anti-crawler operations have been used, which improves the accuracy of crawler identification. In addition, the token identification operation can handle crawler scenarios in which end features are forged and requests are replayed; the data analysis identification operation can handle crawler scenarios that the end feature identification operation, the token identification operation, the human-machine feature identification operation, and the like cannot identify; and the signature synchronization identification operation can handle crawler scenarios in which a third-party website attacks the third-party service directly, which the end feature identification operation, the token identification operation, the human-machine feature identification operation, and the like also cannot identify. Targeted protection for different crawler scenarios is thereby achieved. Crawler score information can also be generated for further processing by developers, which improves the flexibility of crawler handling.

With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for identifying a crawler, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to a netpage applet proxy server.

As shown in fig. 5, the apparatus 500 for recognizing a crawler of the present embodiment includes: an information acquisition unit 501, an operation determination unit 502, a crawler recognition unit 503, and a result determination unit 504.

An information acquisition unit 501 configured to acquire request information requesting access to page data.

An operation determining unit 502 configured to determine a target anti-crawler operation for the request information from a preset anti-crawler operation set according to a preset crawler recognition order.

And a crawler recognition unit 503 configured to perform crawler recognition on the request information based on the target anti-crawler operation, and obtain a recognition result.

A result determination unit 504 configured to determine the recognition result as the target crawler recognition result in response to determining that the recognition result indicates that the request information is a crawler.

In some optional implementations of the present embodiment, the operation determining unit 502 is further configured to: in response to determining that the identification result indicates that the request information is not a crawler and that the preset anti-crawler operation set has not been fully traversed, re-determine the target anti-crawler operation for the request information from the preset anti-crawler operation set according to the preset crawler identification order.

In some optional implementations of the present embodiment, the result determining unit 504 is further configured to: in response to determining that the identification result indicates that the request information is not a crawler and that traversal of the preset anti-crawler operation set is complete, determine the identification result as the target crawler identification result.

In some optional implementations of this embodiment, the target anti-crawler operation includes at least a token identification operation; and the crawler recognition unit 503 is further configured to: determine token index information corresponding to the request information; determine a target character based on the token index information; and in response to determining that the target character does not match the preset character, determine that the identification result is that the request information is a crawler.

In some optional implementations of this embodiment, the crawler recognition unit 503 is further configured to: determine a target applet identifier and a target timestamp corresponding to the request information; and in response to determining that the target applet identifier does not match the preset applet identifier or that the target timestamp has expired, determine that the identification result is that the request information is a crawler.

In some optional implementations of this embodiment, the apparatus further includes: a network access unit configured to determine an encrypted network address corresponding to the request information; determine a first encryption index and a second encryption index in the encrypted network address; decrypt the encrypted network address based on the first encryption index and the second encryption index to obtain a decrypted network address; and perform network access based on the decrypted network address.

In some optional implementations of this embodiment, the target anti-crawler operation includes at least a data analysis recognition operation; and, the crawler recognition unit 503 is further configured to: obtaining crawler analysis data; and performing crawler identification on the request information based on crawler analysis data to obtain an identification result corresponding to the request information.

In some optional implementations of this embodiment, the target anti-crawler operation includes at least a signature synchronization identification operation; and, the crawler recognition unit 503 is further configured to: determining signature information in the request information; and obtaining an identification result based on the signature information and preset signature information.

In some optional implementations of this embodiment, the apparatus further includes: a score output unit configured to determine crawler score information corresponding to the request information based on the recognition result, and to output the crawler score information.

In some optional implementations of the embodiment, the request information is for requesting access to page data of the netpage applet.

In some optional implementations of the present embodiment, the anti-crawler operations in the preset anti-crawler operation set include at least one of: an end feature identification operation, a token identification operation, a human-machine feature identification operation, a data analysis identification operation, and a signature identification operation.

It should be understood that the units 501 to 504, which are described in the apparatus 500 for identifying a crawler, correspond to the respective steps in the method described with reference to fig. 2, respectively. Thus, the operations and features described above for the method for identifying a crawler are equally applicable to the apparatus 500 and the units contained therein and will not be described in detail here.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the various methods and processes described above, such as a method for identifying crawlers. For example, in some embodiments, the method for identifying a crawler may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM603 and executed by the computing unit 601, one or more steps of the method for identifying crawlers described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured by any other suitable means (e.g., by means of firmware) to perform the method for identifying crawlers.

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method for identifying a crawler, comprising:

acquiring request information for requesting to access page data;

determining target anti-crawler operation aiming at the request information from a preset anti-crawler operation set according to a preset crawler identification sequence;

performing crawler identification on the request information based on the target anti-crawler operation to obtain an identification result;

in response to determining that the identification result indicates that the request information is a crawler, determining the identification result as a target crawler identification result.

2. The method of claim 1, further comprising:

in response to determining that the identification result indicates that the request information is not a crawler and that the preset anti-crawler operation set has not been fully traversed, re-determining the target anti-crawler operation for the request information from the preset anti-crawler operation set according to the preset crawler identification sequence.

3. The method of claim 1, further comprising:

in response to determining that the identification result indicates that the request information is not a crawler and that traversal of the preset anti-crawler operation set is complete, determining the identification result as the target crawler identification result.

4. The method of claim 1, wherein the target anti-crawler operation includes at least a token identification operation; and

wherein performing crawler identification on the request information based on the target anti-crawler operation to obtain an identification result comprises:

determining token index information corresponding to the request information;

determining a target character based on the token index information; and

in response to determining that the target character does not match a preset character, determining that the identification result is that the request information is a crawler.

5. The method of claim 4, further comprising:

determining a target applet identifier and a target timestamp corresponding to the request information;

in response to determining that the target applet identifier does not match a preset applet identifier or that the target timestamp has expired, determining that the identification result is that the request information is a crawler.
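The token checks of claims 4 and 5 might look roughly like the following sketch. The token layout, the single index position, the preset character, the applet identifier, and the validity window are all hypothetical values chosen for illustration; the disclosure does not fix them here.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenPolicy:
    # All values below are illustrative assumptions.
    index: int = 3                   # token index information: position to inspect
    preset_char: str = "k"           # preset character expected at that position
    applet_id: str = "demo-applet"   # preset applet identifier
    ttl_seconds: int = 300           # how long a timestamp is treated as valid


def token_flags_crawler(token: str, applet_id: str, timestamp: float,
                        policy: TokenPolicy, now: Optional[float] = None) -> bool:
    """Return True when the token checks indicate a crawler."""
    now = time.time() if now is None else now
    # Token index check: the character at the indexed position must match.
    if policy.index >= len(token) or token[policy.index] != policy.preset_char:
        return True
    # Applet identifier check: the identifier must equal the preset one.
    if applet_id != policy.applet_id:
        return True
    # Timestamp check: an expired timestamp is treated as a crawler.
    return now - timestamp > policy.ttl_seconds
```

Passing `now` explicitly keeps the expiry check deterministic when the function is exercised in tests.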

6. The method of claim 1, further comprising:

determining an encrypted network address corresponding to the request information;

determining a first encryption index and a second encryption index in the encrypted network address;

decrypting the encrypted network address based on the first encryption index and the second encryption index to obtain a decrypted network address;

and performing network access based on the decrypted network address.

7. The method of claim 1, wherein the target anti-crawler operation includes at least a data analysis identification operation; and

wherein performing crawler identification on the request information based on the target anti-crawler operation to obtain an identification result comprises:

obtaining crawler analysis data; and

performing crawler identification on the request information based on the crawler analysis data to obtain the identification result corresponding to the request information.

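8. The method of claim 1, wherein the target anti-crawler operation includes at least a signature synchronization identification operation; and

wherein performing crawler identification on the request information based on the target anti-crawler operation to obtain an identification result comprises:

determining signature information in the request information; and

obtaining the identification result based on the signature information and preset signature information.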
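One possible reading of the signature comparison in claim 8 is sketched below. The claim does not state how the preset signature information is produced; the HMAC-SHA256 scheme, the shared secret, and the `path` and `signature` field names are assumptions swapped in purely for illustration.

```python
import hashlib
import hmac

# Hypothetical shared secret; the source does not say how signatures are generated.
SHARED_SECRET = b"demo-secret"


def expected_signature(payload: str, secret: bytes = SHARED_SECRET) -> str:
    """Compute the preset signature for a payload (assumed HMAC-SHA256, hex encoded)."""
    return hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def signature_flags_crawler(request: dict) -> bool:
    """Return True when the supplied signature differs from the preset one."""
    supplied = request.get("signature", "")
    preset = expected_signature(request.get("path", ""))
    # compare_digest performs a constant-time comparison on the encoded bytes.
    return not hmac.compare_digest(supplied.encode("utf-8"), preset.encode("utf-8"))
```

A request that omits the signature or carries a stale one is flagged in this sketch, whereas a request signed with the current secret passes.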

9. The method of claim 1, further comprising:

determining crawler score information corresponding to the request information based on the identification result;

and outputting the crawler score information.

10. The method of claim 1, wherein the request information is for requesting access to page data of a web page version applet.

11. The method of claim 1, wherein the anti-crawler operations in the preset anti-crawler operation set include at least one of: an end feature identification operation, a token identification operation, a human-machine feature identification operation, a data analysis identification operation, and a signature identification operation.

12. An apparatus for identifying a crawler, comprising:

an information acquisition unit configured to acquire request information requesting access to page data;

an operation determination unit configured to determine a target anti-crawler operation for the request information from a preset anti-crawler operation set according to a preset crawler identification sequence;

a crawler identification unit configured to perform crawler identification on the request information based on the target anti-crawler operation to obtain an identification result; and

a result determination unit configured to determine the identification result as a target crawler identification result in response to determining that the identification result indicates that the request information is a crawler.

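13. The apparatus of claim 12, wherein the operation determination unit is further configured to:

in response to determining that the identification result indicates that the request information is not a crawler and that the preset anti-crawler operation set has not been fully traversed, re-determine the target anti-crawler operation for the request information from the preset anti-crawler operation set according to the preset crawler identification sequence.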

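14. The apparatus of claim 12, wherein the result determination unit is further configured to:

in response to determining that the identification result indicates that the request information is not a crawler and that the preset anti-crawler operation set has been fully traversed, determine the identification result as the target crawler identification result.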

15. The apparatus of claim 12, wherein the target anti-crawler operation comprises at least a token identification operation; and

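the crawler identification unit is further configured to:

determine token index information corresponding to the request information;

determine a target character based on the token index information; and

in response to determining that the target character does not match a preset character, determine that the identification result is that the request information is a crawler.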

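16. The apparatus of claim 15, wherein the crawler identification unit is further configured to:

determine a target applet identifier and a target timestamp corresponding to the request information; and

in response to determining that the target applet identifier does not match a preset applet identifier or that the target timestamp has expired, determine that the identification result is that the request information is a crawler.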

17. The apparatus of claim 12, further comprising:

a network access unit configured to: determine an encrypted network address corresponding to the request information; determine a first encryption index and a second encryption index in the encrypted network address; decrypt the encrypted network address based on the first encryption index and the second encryption index to obtain a decrypted network address; and perform network access based on the decrypted network address.

18. The apparatus of claim 12, wherein the target anti-crawler operation comprises at least a data analytics recognition operation; and

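the crawler identification unit is further configured to:

obtain crawler analysis data; and

perform crawler identification on the request information based on the crawler analysis data to obtain the identification result corresponding to the request information.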

19. The apparatus of claim 12, wherein the target anti-crawler operation comprises at least a signature synchronization identification operation; and

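the crawler identification unit is further configured to:

determine signature information in the request information; and

obtain the identification result based on the signature information and preset signature information.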

20. The apparatus of claim 12, further comprising:

a score output unit configured to: determine crawler score information corresponding to the request information based on the identification result; and output the crawler score information.

21. The apparatus of claim 12, wherein the request information is for requesting access to page data of a web page version applet.

22. The apparatus of claim 12, wherein the anti-crawler operations in the preset anti-crawler operation set comprise at least one of: an end feature identification operation, a token identification operation, a human-machine feature identification operation, a data analysis identification operation, and a signature identification operation.

23. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.

24. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-11.

25. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-11.