WO2020062655A1 - Crawler identification method, apparatus and device, and non-volatile readable storage medium - Google Patents
Crawler identification method, apparatus and device, and non-volatile readable storage medium
- Publication number
- WO2020062655A1 (PCT/CN2018/123184)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- crawler
- server
- request instruction
- request
- information
- Prior art date
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/40—Network security protocols
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Definitions
- the present application relates to the field of computer technology, and in particular to a crawler identification method, apparatus, device, and non-volatile readable storage medium.
- a crawler refers to a web crawler (also known as a web spider, web robot, etc.), which is a program or script that automatically captures network information on the Internet in accordance with certain rules. After the crawler crawls a webpage, if the webpage has hyperlinks to various other webpages, it can crawl to another webpage to obtain other data.
- Method 1: the crawler end submits a request for network information, downloads the webpage code, and parses it into a page.
- Method 2: the crawler end simulates a browser to send a request, extracts the useful data, and stores it in a database.
- many websites need to restrict the crawlers of crawler companies to prevent wholesale leakage of website information.
- however, the crawler end is difficult to identify, which makes crawlers difficult to restrict and leaves website information prone to wholesale leakage.
- the main purpose of the present application is to provide a crawler identification method, apparatus, device, and non-volatile readable storage medium, aimed at solving the technical problem that the crawler end is difficult to identify and crawlers are difficult to restrict.
- the present application provides a crawler identification method.
- An identification code is provided on the server side to which the crawler identification method is applied, and the identification display content corresponding to the identification code is in a hidden display state on the corresponding page of the server side;
- the crawler identification method includes:
- if the request instruction requests network information that includes the identification display content, determining that the client corresponding to the request instruction poses a risk of crawling the server, and adding the client that sent the request instruction to the blacklist of the server.
- the present application also provides a crawler identification device.
- the crawler identification device includes:
- a collection module configured to receive a request instruction, sent by a client to the server, requesting network information;
- a judging module configured to judge whether the request instruction requests network information that includes the identification display content;
- a first determining module configured to: if the request instruction requests network information that includes the identification display content, determine that the client corresponding to the request instruction poses a risk of crawling the server, and add the client that sent the request instruction to the blacklist of the server.
- the present application further provides a crawler identification device, the crawler identification device includes: a memory, a processor, a communication bus, and computer-readable instructions stored on the memory,
- the communication bus is used to implement a communication connection between the processor and the memory
- the processor is configured to execute the computer-readable instructions to implement the following steps:
- if the request instruction requests network information that includes the identification display content, determining that the client corresponding to the request instruction poses a risk of crawling the server.
- the present application further provides a non-volatile readable storage medium, where the non-volatile readable storage medium stores one or more programs, and the one or more programs may be executed by one or more processors to perform:
- if the request instruction requests network information that includes the identification display content, determining that the client corresponding to the request instruction poses a risk of crawling the server, and adding the client that sent the request instruction to the blacklist of the server.
- in the present application, each time a request instruction requesting network information, sent by a client to the server, is received, it is determined whether the request instruction requests network information that includes the identification display content; if so, it is determined that the client corresponding to the request instruction poses a risk of crawling the server, and the client that sent the request instruction is added to the blacklist of the server.
- a crawler client mainly obtains network information by acquiring the code content of the target website and displaying that code content, whereas an ordinary user mainly obtains network information by browsing online and does not acquire the code content corresponding to the network information.
- the website is provided with a "honey", that is, an identification code.
- the identification code includes hidden attributes and display attributes.
- the display content of the identification code is in a hidden, non-displayed state on the display page of the website server. For ordinary users, since the display content corresponding to the identification code is not displayed, they will not request it.
- a crawler client, however, obtains network information by acquiring the code content of the target website and displaying it, so the crawler client will request the "honey", that is, the display content corresponding to the identification code. Therefore, if a client's request instruction requests network information that includes the identification display content, it is determined that the corresponding client poses a risk of crawling the server. This addresses the technical problem that the crawler end is difficult to identify and crawlers are difficult to restrict.
- FIG. 1 is a schematic flowchart of a first embodiment of a crawler identification method of the present application
- FIG. 2 is a detailed flowchart of steps in the method for identifying a crawler of the present application in which a client that sends the request instruction is included in a blacklist of the server;
- FIG. 3 is a schematic diagram of a device structure of a hardware operating environment involved in a method according to an embodiment of the present application.
- This application provides a crawler identification method.
- the server side to which the crawler identification method is applied is provided with an identification code, and the identification display content corresponding to the identification code is in a hidden display state on the corresponding page of the server side.
- the crawler identification method includes:
- Step S10: each time a request instruction requesting network information, sent by a client to the server, is received, determine whether the request instruction requests network information that includes the identification display content;
- for an ordinary user, the corresponding network information is obtained through online browsing, mainly by entering keywords and the like, without acquiring the code content corresponding to the network information.
- a crawler client, in order to crawl a large amount of network information in a short time, mainly obtains the code content of the target website and displays that code content to obtain the corresponding network information. For example, if a crawler client wants to obtain the network information on Taobao, it will not do so the way an ordinary client does, through link clicks, keyword input, and the like, but will directly crawl the code content of the Taobao back-end server to get the display content of each Taobao page.
- a "honey", that is, an identification code, is provided on the website server.
- the identification code includes hidden attributes and display attributes.
- the identification code exists on the website back-end server, but on the pages served by the website server it is in a hidden, non-displayed state; that is, the website back-end server has the identification code, but the display content corresponding to the identification code is not shown on the front-end page.
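One way to picture such a hidden identification code is a link that is present in the page source but never rendered. The sketch below is only an illustration under stated assumptions: the patent does not say how the hiding is done, so the `display:none` style, the `render_page` helper, and the `/item/hidden-identification` path are all hypothetical.

```python
from html import escape

# Hypothetical path for the identification display content (not from the patent).
HONEY_PATH = "/item/hidden-identification"


def render_page(visible_links, honey_path=HONEY_PATH):
    """Build a navigation fragment containing visible links plus one hidden link.

    The hidden link exists in the page source, so a crawler that parses the
    code content can find it, but a browser does not display it.
    Hiding via display:none is an assumption made for this sketch.
    """
    items = [
        f'<a href="{escape(href)}">{escape(text)}</a>'
        for href, text in visible_links
    ]
    items.append(
        f'<a href="{escape(honey_path)}" style="display:none" '
        f'tabindex="-1" aria-hidden="true">hidden</a>'
    )
    return "<nav>" + "".join(items) + "</nav>"


page = render_page([("/item/huawei", "Huawei"), ("/item/apple", "Apple")])
```

An ordinary user never sees the last link, so a request for `HONEY_PATH` is a strong hint that the requester read the code content rather than the rendered page.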
- when a request instruction requesting network information sent by the client is obtained, the user information of the client is obtained.
- the request instruction carries user information, and the user information includes IP information, protocol stack information, and user agent (UA) information. Each time a new request instruction arrives, new IP information, new protocol stack information, and new UA information need to be obtained.
- the user information also includes identification information of the client, which may be the International Mobile Subscriber Identity (IMSI) of the client, its electronic serial number (ESN), and the like.
- after detecting a request instruction, the server determines whether the content pointed to by the request instruction, or by the corresponding user information, is network information that includes the identification display content. It should be noted that, in this embodiment, for the server to respond to a detected request instruction by determining whether the content it points to is the identification display content, a program segment needs to be set in the built-in processor in advance. The program segment represents the processing logic for a detected request instruction: it triggers the processor to respond to the request instruction event and determine whether the content pointed to by the request instruction is the identification display content.
- Step S20: if the request instruction requests network information that includes the identification display content, determine that the client corresponding to the request instruction poses a risk of crawling the server, and add the client that sent the request instruction to the blacklist of the server.
- if the request instruction does not request network information that includes the identification display content, the client corresponding to the user information may or may not pose a risk of crawling the server.
- for example, JD.com sells Huawei, Apple, and other mobile phones.
- code for selling VIVO mobile phones has been added to the back-end server page content corresponding to the JD.com sales page. That is, the hidden page content for the VIVO mobile phone cannot be seen on the JD.com sales page, but the back end does contain the code for selling the VIVO mobile phone. It is therefore impossible for ordinary users to click or touch the page content of the VIVO mobile phone.
- a crawler end, by contrast, will generate a request to further obtain the sales information corresponding to the VIVO mobile phone. Therefore, if a request to obtain the sales information corresponding to the VIVO mobile phone is detected, the requesting end is determined to be a crawler end and is added to the blacklist.
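Steps S10 and S20 can be sketched as a small request check. This is a minimal sketch, not the patented implementation: the path set `HONEY_PATHS`, the `/item/hidden-vivo` path, and the `check_request` function are hypothetical names chosen to mirror the VIVO example.

```python
# Hypothetical paths whose content is the hidden identification display content.
HONEY_PATHS = {"/item/hidden-vivo"}

# Clients judged to pose a crawl risk (Step S20 adds to this set).
blacklist = set()


def check_request(client_id, path):
    """Step S10: test whether the request targets the identification display content.
    Step S20: if it does, treat the client as a crawl risk and blacklist it.

    Returns True when the request is judged to come from a crawler.
    """
    if path in HONEY_PATHS:
        blacklist.add(client_id)
        return True
    # A request for ordinary content is not conclusive either way.
    return False


first = check_request("client-1", "/item/huawei")
second = check_request("client-2", "/item/hidden-vivo")
```

A request for visible content leaves the client's status open, which is why the later aggregation-ratio steps are applied to those requests.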
- adding the client that sent the request instruction to the blacklist of the server includes:
- Step S21: acquire the unique identification code information of the client corresponding to the request instruction, where the unique identification code information includes the electronic serial number (ESN) of the client;
- the unique identification code information may also be the International Mobile Equipment Identity (IMEI), the International Mobile Subscriber Identity (IMSI), and so on.
- Step S22: add the ESN of the client to the blacklist of the server.
- after the ESN of the client is obtained, it is added to the blacklist of the server, and the server no longer responds to requests sent to it by any client on the blacklist.
- the server can also establish the blacklist based on the client's IMEI or IMSI.
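Steps S21 and S22 amount to keying the blacklist on a device identifier rather than on an IP address, which a crawler can change easily. A minimal sketch follows; the `ClientIdentity` and `Blacklist` classes and the sample identifier values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClientIdentity:
    """Unique identification code information of a client (any field may be absent)."""
    esn: Optional[str] = None
    imei: Optional[str] = None
    imsi: Optional[str] = None


class Blacklist:
    """Blacklist keyed by (identifier kind, value) pairs."""

    def __init__(self):
        self._entries = set()

    @staticmethod
    def _keys(identity):
        pairs = (("esn", identity.esn), ("imei", identity.imei), ("imsi", identity.imsi))
        return {(kind, value) for kind, value in pairs if value}

    def add(self, identity):
        # Step S22: record every identifier the client presented.
        self._entries |= self._keys(identity)

    def is_blocked(self, identity):
        # The server stops responding when any presented identifier is listed.
        return bool(self._entries & self._keys(identity))


bl = Blacklist()
bl.add(ClientIdentity(esn="80A1B2C3"))  # made-up ESN value
```

Matching on any one identifier is a design choice of this sketch; the text only says the blacklist may be built from the ESN, IMEI, or IMSI.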
- the request instruction carries user information
- the user information includes IP information, protocol stack information, and user agent UA information
- the method further includes:
- Step S30: if the request instruction does not request network information that includes the identification display content, then, based on the user information corresponding to the request instruction, update the first aggregation ratio of the IP information collected by the server within a preset time period preceding the current time point, update the second aggregation ratio of the protocol stack information collected by the server within that past preset time period, and update the third aggregation ratio of the user agent (UA) information collected by the server within that past preset time period;
- the user information received by the server during the past preset time period is also collected.
- the collected user information is used to determine whether a crawler end exists.
- a crawler may continuously switch proxy IPs and UA information while crawling, in order to obtain the network information corresponding to the server.
- therefore, whenever a new request instruction is detected, the first aggregation ratio, the second aggregation ratio, and the third aggregation ratio all need to be updated.
- the website server obtains all three of the first, second, and third aggregation ratios rather than only one of them, which helps prevent crawler clients from being misidentified.
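Updating the ratios on every new request implies keeping the user information from the past preset time period at hand. A minimal sketch of such a sliding window is shown below; the `RequestWindow` class, its method names, and the 60-second window are assumptions for illustration, since the patent does not specify a data structure or a period length.

```python
from collections import deque


class RequestWindow:
    """Keeps the user information of requests seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._records = deque()  # (timestamp, user_info), oldest first

    def add(self, timestamp, user_info):
        """Record a new request, then drop records older than the window."""
        self._records.append((timestamp, user_info))
        cutoff = timestamp - self.window_seconds
        while self._records and self._records[0][0] < cutoff:
            self._records.popleft()

    def snapshot(self):
        """User information currently inside the window, oldest first."""
        return [info for _, info in self._records]


window = RequestWindow(window_seconds=60)
window.add(0, {"ip": "119.123.67.249"})
window.add(30, {"ip": "119.123.67.250"})
window.add(100, {"ip": "116.30.198.37"})
```

Each aggregation ratio can then be recomputed from `window.snapshot()` whenever a request arrives.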
- the step of updating the first aggregation ratio of the IP information collected by the server within the preset time period preceding the current time point includes:
- Step S31: based on the user information corresponding to the request instruction, obtain the IP segment of each client collected by the server within the preset time period preceding the current time point;
- when it is detected that a client sends a request instruction requesting network information to the server, the front-end JS obtains the IP segment information of that client in real time. After obtaining the IP segment information of the client for this first request, it is also necessary to obtain the IP segments of the other clients corresponding to the other request instructions collected by the server within the preset time period preceding the current time point.
- Step S32: arrange the IP segments of the clients in order, to obtain the first proportion of consecutive IP segments among the IP segments of the clients;
- the IP segments of the clients are sorted by the size of the number in the first position of the IP segment; when the numbers in the first position are the same, they are sorted by the size of the number in the second position, and so on, until the IP segments of all clients are sorted.
- for example, if the IP segment of client A is 119.123.67.249 and the IP segment of client D is 116.30.198.37, then 119.123.67.249 is sorted first and 116.30.198.37 is sorted last, because 119 is greater than 116 in the first position (9 is greater than 6).
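The position-by-position ordering can be sketched with a tuple sort key. This sketch assumes that a "position" is one dotted-decimal octet compared numerically and that larger values come first, as in the example; `ip_key` and `sort_ip_segments` are hypothetical helper names.

```python
def ip_key(ip):
    """Split a dotted IPv4 string into a tuple of integers, one per position."""
    return tuple(int(part) for part in ip.split("."))


def sort_ip_segments(ips):
    """Sort by the first position, then the second, and so on, largest first."""
    return sorted(ips, key=ip_key, reverse=True)


ordered = sort_ip_segments(["116.30.198.37", "119.123.67.249", "119.123.67.250"])
```

Comparing integers rather than strings matters here: as text, "9.0.0.1" would sort after "119.0.0.1", which is not the numeric order the description relies on.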
- after sorting, determine whether the IP segments of the clients are consecutive, where identical IP segments also count as consecutive.
- for example, if the IP segment of client A is 119.123.67.249, that of client B is 119.123.67.250, and that of client C is 119.123.67.251, then the IPs of clients A, B, and C are consecutive. In addition to determining whether the IP segments of the clients are consecutive, it is also necessary to determine whether any are identical, so as to obtain the first proportion of consecutive IP segments (including identical IP segments) among the IP segments of the clients.
- for example, if 5 consecutive IP segments access the server during the past preset time period and the total number of clients is 10, the first proportion is 50%.
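The first proportion can be sketched as the share of addresses that sit next to, or coincide with, another address in the window. The function name `consecutive_ratio` and the rule "adjacent means the integer values differ by at most 1" are assumptions of this sketch; with 5 consecutive addresses out of 10 clients it gives 5/10 = 50%.

```python
import ipaddress


def consecutive_ratio(ips):
    """Share of addresses that are consecutive with (or identical to) a neighbour."""
    if not ips:
        return 0.0
    values = sorted(int(ipaddress.IPv4Address(ip)) for ip in ips)
    flagged = [False] * len(values)
    for i in range(1, len(values)):
        # A gap of 0 means identical addresses, a gap of 1 means consecutive ones.
        if values[i] - values[i - 1] <= 1:
            flagged[i - 1] = flagged[i] = True
    return sum(flagged) / len(values)


sample = [
    "119.123.67.249", "119.123.67.250", "119.123.67.251",
    "119.123.67.252", "119.123.67.253",
    "116.30.198.37", "10.0.0.1", "172.16.5.9", "8.8.8.8", "203.0.113.77",
]
first_proportion = consecutive_ratio(sample)
```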
- in the process of obtaining the first proportion, it is also possible to classify the clients of the past preset time period by region, according to the area information of their IP segments, to obtain the IP segments of the clients in the same area.
- after the IP segments of the clients in the same area are obtained, they are arranged in order, and the first sub-proportion of consecutive IP segments among the IP segments of the clients in that area during the past preset time period is obtained. In this way the first sub-proportions corresponding to the different areas during the past preset time period are obtained.
- from the first sub-proportions, the first proportion of consecutive IP segments among the IP segments of the clients during the past preset time period can be obtained.
- Step S33: set the first proportion as the first aggregation ratio of the IP information collected by the server during the past preset time period.
- the second aggregation ratio of the protocol stack information collected by the website server during the past preset time period is also obtained. The protocol stack on the client side is matched and checked to determine whether the first requests received by the website server during the past preset time period contain multiple regular requests from the same client, or multiple regular requests made by the same client after switching IP addresses to disguise itself.
- specifically, the checking process is: determine the second aggregation ratio of the protocol stack information collected by the server according to the receiving time intervals between the different first requests and the packet loss rate of the request packets corresponding to each client's first request.
- the third aggregation ratio of the user agent (UA) information collected by the server during the past preset time period is also obtained, where the step of updating the third aggregation ratio of the UA information collected by the server during the past preset time period includes:
- Step S34: acquire the operating system and version, CPU type, and browser and version used by each client collected by the server during the past preset time period, to obtain the number of clients whose operating system and version, CPU type, and browser and version are exactly the same;
- the user agent (UA) information includes the operating system and version, CPU type, and browser and version used by the client.
- from the UA information collected by the server, obtain the operating system and operating system version of each client, the CPU type of each client, and the browser and browser version of each client, and determine the number of clients for which all of these are exactly the same.
- Step S35: obtain the ratio of the number of exactly identical clients to the number of all clients collected by the server during the past preset time period, to obtain the third proportion.
- Step S36: set the third proportion as the third aggregation ratio of the user agent (UA) information collected by the server during the past preset time period.
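Steps S34 to S36 can be sketched as a count of identical user agent fingerprints. The text does not say how to handle several distinct groups of identical clients, so this sketch assumes the largest group is the one counted; `UAInfo` and `ua_aggregation_ratio` are hypothetical names, and the sample values are made up.

```python
from collections import Counter
from typing import NamedTuple


class UAInfo(NamedTuple):
    """Fields the description lists for user agent (UA) information."""
    os: str
    os_version: str
    cpu_type: str
    browser: str
    browser_version: str


def ua_aggregation_ratio(clients):
    """Largest group of exactly identical UA fingerprints over all clients."""
    if not clients:
        return 0.0
    largest_group = Counter(clients).most_common(1)[0][1]
    # A group of one is not "the same as" any other client.
    return largest_group / len(clients) if largest_group > 1 else 0.0


same = UAInfo("Windows", "10", "x86_64", "Chrome", "70.0")
other = UAInfo("macOS", "10.14", "x86_64", "Safari", "12.0")
third_proportion = ua_aggregation_ratio([same, same, same, other])
```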
- the order in which the first aggregation ratio, the second aggregation ratio, and the third aggregation ratio are obtained may be changed and is not fixed.
- Step S40: if the first aggregation ratio is greater than the first preset value, the second aggregation ratio is greater than the second preset value, and the third aggregation ratio is greater than the third preset value, determine that the server is at risk of being crawled.
- the website server stores a first preset value, a second preset value, and a third preset value, which can be adjusted according to the actual situation.
- because a crawl risk is determined only when all three aggregation ratios exceed their preset values, this can effectively counter a crawler that keeps switching proxy IPs, UA information, and the like while crawling to obtain the network information corresponding to the server.
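The Step S40 decision is a conjunction of three threshold comparisons. In the sketch below the preset values of 0.5 are placeholders only, since the text leaves them to be set according to the actual situation; `PresetValues` and `server_at_crawl_risk` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresetValues:
    """Preset values stored on the website server (placeholder numbers)."""
    first: float = 0.5
    second: float = 0.5
    third: float = 0.5


def server_at_crawl_risk(first_ratio, second_ratio, third_ratio, presets=PresetValues()):
    """True only when every aggregation ratio strictly exceeds its preset value."""
    return (
        first_ratio > presets.first
        and second_ratio > presets.second
        and third_ratio > presets.third
    )


risk = server_at_crawl_risk(0.6, 0.7, 0.8)
```

Requiring all three conditions means a single high ratio, for example many office users sharing one browser build, does not trigger the determination on its own.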
- the step of updating the second aggregation ratio of the protocol stack information collected by the server during the past preset time period includes:
- Step A1: obtain the receiving time intervals corresponding to the different first requests received by the server during the past preset time period;
- the receiving time intervals may change regularly; to count as regularly changing, at least 4 such receiving time intervals are needed.
- for example, the receiving time intervals may all be the same: if a first request is received every second, or every two seconds, the receiving time interval is one second or two seconds.
- the receiving time intervals may also not change regularly. For example, if after one first request is received another first request is received two seconds later, and yet another four seconds after that, the receiving time interval is not fixed, so the receiving time intervals do not change regularly.
- Step A2: determine whether the receiving time intervals change regularly in part. If so, obtain the second proportion of the regularly changing time intervals among all the receiving time intervals;
- for example, if the number of regularly changing receiving time intervals is 10 and the total number of receiving time intervals is 20, the second proportion is 50%.
- Step A3: set the second proportion as the second aggregation ratio of the protocol stack information collected by the server during the past preset time period.
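Steps A1 to A3 can be sketched by measuring how many receiving time intervals belong to a run of equal intervals. This sketch treats "regular" as "equal within a small tolerance" and requires a run of at least 4 intervals, following the minimum mentioned in the description; `regular_interval_ratio`, `min_run`, and `tolerance` are hypothetical names, and other regular patterns (for example steadily growing intervals) are not covered.

```python
def regular_interval_ratio(timestamps, min_run=4, tolerance=1e-6):
    """Share of receiving time intervals that lie in a run of equal intervals."""
    intervals = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if not intervals:
        return 0.0
    regular = 0
    run = 1  # length of the current run of equal intervals
    for i in range(1, len(intervals) + 1):
        if i < len(intervals) and abs(intervals[i] - intervals[i - 1]) <= tolerance:
            run += 1
        else:
            # The run ended: count it only if it is long enough to be "regular".
            if run >= min_run:
                regular += run
            run = 1
    return regular / len(intervals)


# Intervals: 1, 1, 1, 1, 2, 4, 1, 4 -> four of the eight form a regular run.
second_proportion = regular_interval_ratio([0, 1, 2, 3, 4, 6, 10, 11, 15])
```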
- the step of setting the second proportion as the second aggregation ratio of the protocol stack information collected by the server during the past preset time period includes:
- Step B1: acquire the request packets corresponding to the different first requests received by the server during the past preset time period;
- each request packet is composed of request sub-packets; for example, one request packet may include 5 sub-packets.
- Step B2: identify the lost sub-packets in each request packet, obtain the packet loss rate of each request packet, and acquire the request packets with the same packet loss rate;
- Step B3: from the request packets with the same packet loss rate, obtain the sending sequence numbers of the lost sub-packets in each request packet;
- Step B4: determine whether the sending sequence numbers are the same. If so, obtain the proportion of such request packets among all the request packets with the same packet loss rate;
- Step B5: determine whether this request packet proportion is greater than the preset request packet proportion. If it is, set the second proportion as the second aggregation ratio of the protocol stack information collected by the server during the past preset time period; if it is less than the preset request packet proportion, set the request packet proportion as the second aggregation ratio of the protocol stack information collected by the server during the past preset time period.
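Steps B1 to B5 can be sketched as follows. Several details are assumptions made to keep the sketch concrete: a request packet is modelled as a pair of (number of sub-packets, set of lost sending sequence numbers), packets with no loss are ignored, and only the largest group of packets sharing a loss rate is examined. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def request_packet_proportion(packets):
    """Steps B2-B4: among packets with the same loss rate, the share whose lost
    sub-packets have identical sending sequence numbers."""
    by_loss_rate = defaultdict(list)
    for total_subpackets, lost_sequence_numbers in packets:
        if lost_sequence_numbers:  # assumption: loss-free packets are skipped
            rate = Fraction(len(lost_sequence_numbers), total_subpackets)
            by_loss_rate[rate].append(frozenset(lost_sequence_numbers))
    if not by_loss_rate:
        return 0.0
    group = max(by_loss_rate.values(), key=len)  # largest same-loss-rate group
    identical = Counter(group).most_common(1)[0][1]
    return identical / len(group)


def second_aggregation_ratio(second_proportion, packet_proportion, preset_proportion):
    """Step B5: keep the interval-based proportion when the packet evidence is
    strong enough, otherwise fall back to the request packet proportion."""
    if packet_proportion > preset_proportion:
        return second_proportion
    return packet_proportion


# Five packets of 5 sub-packets each; three lost sub-packet 2, one lost 4, one lost none.
packets = [(5, {2}), (5, {2}), (5, {2}), (5, {4}), (5, set())]
proportion = request_packet_proportion(packets)
```

Repeatedly losing the sub-packet with the same sequence number suggests the same protocol stack, and therefore possibly the same client behind different IP addresses.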
- in this embodiment, it is determined whether the receiving time intervals change regularly in part; if so, the second proportion of the regularly changing time intervals among all the receiving time intervals is obtained, and the second proportion is set as the second aggregation ratio of the protocol stack information collected by the server during the past preset time period. Because the second aggregation ratio is derived from the proportion of regularly changing time intervals among all receiving time intervals, it can effectively reveal whether a client is crawling the server after disguising itself.
- FIG. 3 is a schematic diagram of a device structure of a hardware operating environment according to a solution of an embodiment of the present application.
- the crawler identification device in the embodiment of the present application may be a PC, or may be terminal equipment such as a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, or a portable computer.
- the crawler identification device may include: a processor 1001, such as a CPU, a memory 1005, and a communication bus 1002.
- the communication bus 1002 is used to implement connection and communication between the processor 1001 and the memory 1005.
- the memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as disk storage.
- the memory 1005 may optionally be a storage device independent of the foregoing processor 1001.
- the crawler identification device may further include a target user interface, a network interface, a camera, an RF (Radio Frequency) circuit, sensors, an audio circuit, a WiFi module, and the like.
- the target user interface may include a display screen and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface.
- the network interface can optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
- the structure of the crawler identification device shown in FIG. 3 does not constitute a limitation on the crawler identification device, which may include more or fewer components than shown in the figure, combine some components, or use a different component layout.
- the memory 1005 as a computer storage medium may include an operating system, a network communication module, and computer-readable instructions.
- the operating system is a program that manages and controls the hardware and software resources of the crawler identification device, and supports the operation of computer-readable instructions and other software and / or programs.
- the network communication module is used to implement communication between components in the memory 1005 and to communicate with other hardware and software in the crawler identification device.
- the processor 1001 is configured to execute computer-readable instructions stored in the memory 1005 to implement the steps of the crawler recognition method according to any one of the foregoing.
- the specific implementation manner of the crawler identification device of the present application is basically the same as each embodiment of the crawler identification method described above, and details are not described herein again.
- the present application also provides a crawler identification device.
- the crawler identification device includes:
- a collection module configured to receive a request instruction for requesting network information sent by a user end to a server end;
- a judging module configured to judge whether the request instruction is a request instruction for requesting network information including the identification display content;
- a first determining module configured to: if the request instruction is a request instruction for requesting network information including the identification display content, determine that the user end corresponding to the request instruction poses a risk of crawling the server end, and add the user end that sent the request instruction to a blacklist of the server end.
- the specific implementation manner of the crawler identification device of the present application is basically the same as each embodiment of the crawler identification method described above, and details are not described herein again.
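The three modules listed above can be sketched as a single small class. This is an illustrative outline only: the source does not say how the hidden identification display content is addressed, so representing it as a set of URL paths (`hidden_paths`), identifying a user end by a string `client_id`, and the class and method names are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class CrawlerIdentifier:
    # Paths reachable only through the hidden identification display content.
    # Hypothetical representation; a normal visitor never sees these links.
    hidden_paths: Set[str]
    # Server-end blacklist of user ends judged to pose a crawling risk.
    blacklist: Set[str] = field(default_factory=set)

    def requests_hidden_content(self, path: str) -> bool:
        """Judging module: does the request target the hidden content?"""
        return path in self.hidden_paths

    def handle_request(self, client_id: str, path: str) -> bool:
        """Collection module plus first determining module.

        Receives one request instruction and returns True when the user end
        is judged to pose a risk of crawling the server end.
        """
        if self.requests_hidden_content(path):
            # Only an automated client following hidden links gets here,
            # so the user end is added to the blacklist.
            self.blacklist.add(client_id)
            return True
        return False
```

For example, with `CrawlerIdentifier(hidden_paths={"/trap/hidden-link"})`, a request for `/index.html` is passed through, whereas a request for `/trap/hidden-link` flags the sender and adds it to `blacklist`.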
- the non-volatile readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps of the crawler identification method according to any one of the above.
- the specific implementation of the non-volatile readable storage medium of the present application is basically the same as the embodiments of the crawler identification method described above, and details are not described herein again.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Transfer Between Computers (AREA)
Abstract
Disclosed are a crawler identification method, apparatus and device, and a non-volatile readable storage medium. A server end using the crawler identification method is provided with an identification code, and the identification display content is in a "hidden display" state on the corresponding page of the server end. The crawler identification method comprises: whenever a request instruction sent by a user end to the server end to request network information is received, determining whether the request instruction is a request instruction for requesting network information that includes the identification display content; and if so, determining that the user end corresponding to the request instruction poses a risk of crawling the server end. The present application solves the existing technical problems that crawlers are difficult to identify and difficult to restrict.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811128989.9 | 2018-09-26 | ||
CN201811128989.9A CN109670093A (zh) | 2018-09-26 | 2018-09-26 | Crawler identification method, apparatus, device, and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020062655A1 (fr) | 2020-04-02 |
Family
ID=66142000
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/123184 WO2020062655A1 (fr) | Crawler identification method, apparatus and device, and non-volatile readable storage medium | 2018-09-26 | 2018-12-24 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109670093A (fr) |
WO (1) | WO2020062655A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113890762A (zh) * | 2021-09-29 | 2022-01-04 | 中孚安全技术有限公司 | Method and system for detecting web crawler behavior based on traffic data |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111641643A (zh) * | 2020-05-29 | 2020-09-08 | 深圳壹账通智能科技有限公司 | Web crawler detection method, web crawler detection apparatus, and terminal device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103279516A (zh) * | 2013-05-27 | 2013-09-04 | 百度在线网络技术(北京)有限公司 | Web crawler identification method |
US20150180899A1 (en) * | 2006-07-10 | 2015-06-25 | Websense, Inc. | System and method of analyzing web content |
CN105187396A (zh) * | 2015-08-11 | 2015-12-23 | 小米科技有限责任公司 | Method and apparatus for identifying web crawlers |
CN107196968A (zh) * | 2017-07-12 | 2017-09-22 | 深圳市活力天汇科技股份有限公司 | Crawler identification method |
CN107341395A (zh) * | 2016-05-03 | 2017-11-10 | 北京京东尚科信息技术有限公司 | Method for intercepting crawlers |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9154364B1 (en) * | 2009-04-25 | 2015-10-06 | Dasient, Inc. | Monitoring for problems and detecting malware |
CN104967632B (zh) * | 2014-04-22 | 2017-02-15 | 腾讯科技(深圳)有限公司 | Method for processing abnormal web page data, data server, and system |
CN104601601B (zh) * | 2015-02-25 | 2018-09-04 | 小米科技有限责任公司 | Method and apparatus for detecting web crawlers |
CN105930727B (zh) * | 2016-04-25 | 2018-11-09 | 无锡中科富农物联科技有限公司 | Web-based crawler identification method |
CN108282443B (zh) * | 2017-01-05 | 2021-04-23 | 阿里巴巴集团控股有限公司 | Method and apparatus for identifying crawler behavior |
-
2018
- 2018-09-26 CN CN201811128989.9A patent/CN109670093A/zh active Pending
- 2018-12-24 WO PCT/CN2018/123184 patent/WO2020062655A1/fr active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150180899A1 (en) * | 2006-07-10 | 2015-06-25 | Websense, Inc. | System and method of analyzing web content |
CN103279516A (zh) * | 2013-05-27 | 2013-09-04 | 百度在线网络技术(北京)有限公司 | Web crawler identification method |
CN105187396A (zh) * | 2015-08-11 | 2015-12-23 | 小米科技有限责任公司 | Method and apparatus for identifying web crawlers |
CN107341395A (zh) * | 2016-05-03 | 2017-11-10 | 北京京东尚科信息技术有限公司 | Method for intercepting crawlers |
CN107196968A (zh) * | 2017-07-12 | 2017-09-22 | 深圳市活力天汇科技股份有限公司 | Crawler identification method |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
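CN113890762A (zh) * | 2021-09-29 | 2022-01-04 | 中孚安全技术有限公司 | Method and system for detecting web crawler behavior based on traffic data |
CN113890762B (zh) * | 2021-09-29 | 2023-09-29 | 中孚安全技术有限公司 | Method and system for detecting web crawler behavior based on traffic data |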
Also Published As
Publication number | Publication date |
---|---|
CN109670093A (zh) | 2019-04-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18934941; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 08/07/2021) |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 18934941; Country of ref document: EP; Kind code of ref document: A1 |