WO2017190641A1

WO2017190641A1 - Crawler interception method and device, server terminal and computer readable medium

Info

Publication number: WO2017190641A1
Application number: PCT/CN2017/082707
Authority: WO
Inventors: 王向维; 韩笑跃; 王飞; 谢刚; 费艳茹; 韩勇; 马顺风
Original assignee: 北京京东尚科信息技术有限公司
Priority date: 2016-05-03
Filing date: 2017-05-02
Publication date: 2017-11-09
Also published as: CN107341160A; CN107341160B

Abstract

Proposed are a crawler interception method and device, a server and a medium. The method comprises: after receiving an access request, sent by a client, for accessing a page, a server end generating a current field value for recognizing a crawler and generating a picture attribute value for saving the field value in a picture; saving a picture uniform resource locator (URL) path that contains the picture attribute value in the requested page; the server end determining whether a current page to be accessed belongs to a direct access allowed page; if so, returning the requested page to the client; if not, further determining whether the access request contains a valid field value for recognizing the crawler; if there is a valid field value, returning the requested page to the client; and if no field value is contained for recognizing the crawler, or a contained field value is invalid, confirming that same is the crawler, and returning a first classified page of the page to be accessed to the client. By means of the present invention, crawler access can be effectively intercepted.

Description

Method, device, server terminal and computer readable medium for intercepting crawler

Technical field

The present invention relates to network technologies, and in particular, to a method, an apparatus, a server terminal, and a computer readable medium for intercepting a crawler.

Background art

Web crawlers are a fundamental part of search engine technology. A web crawler starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs found on those pages. While crawling web page information, it continuously extracts new URLs from the current page according to its crawling strategy and places them in a queue until some stop condition is met. The crawled web page information is then stored on the search engine's server.

In the prior art, in order to ensure normal user access, some websites intercept access from web crawlers by filtering client IP addresses or by filtering on a specific User-Agent header of the HTTP request. However, when the volume of access is very large and many normal users share one IP address, these normal users are mistaken for web crawlers and filtered out. On the other hand, according to the HTTP protocol specification, the value of the User-Agent header can be set arbitrarily, so many web crawlers set their User-Agent header to that of an ordinary browser in order to evade filtering. As a result, the efficiency of intercepting web crawlers is low.

Summary of the invention

An object of the embodiments of the present invention is to provide a method, an apparatus, a server terminal, and a computer readable medium for intercepting a crawler, which can effectively intercept crawler access.

To achieve the above object, an embodiment of the present invention provides a method for intercepting a crawler, the method comprising:

After receiving an access request for a page sent by a client, the server generates a current field value for identifying a crawler, and generates a picture attribute value used for saving the field value into a picture; the server saves a picture uniform resource locator (URL) path containing the picture attribute value into the requested page;

The server determines whether the page currently to be accessed is a page that allows direct access. If so, the server returns the requested page to the client. If not, the server further determines whether the access request contains a valid field value for identifying a crawler: if it contains a valid field value, the requested page is returned to the client; if it contains no field value for identifying a crawler, or the contained field value is invalid, the client is confirmed to be a crawler, and the first page of the category of the page to be accessed is returned to the client.

In order to achieve the above object, the embodiment of the present invention further provides a device for intercepting a crawler, and the device is applied to a server, and includes:

A generating and saving unit, configured to, after receiving an access request for a page sent by a client, generate a current field value for identifying a crawler, generate a picture attribute value used for saving the field value into a picture, and save a picture uniform resource locator (URL) path containing the picture attribute value into the requested page;

A processing unit, configured to determine whether the page currently to be accessed is a page that allows direct access, and if so, return the requested page to the client; if not, further determine whether the access request contains a valid field value for identifying a crawler; if it contains a valid field value, return the requested page to the client; if it contains no field value for identifying a crawler, or the contained field value is invalid, confirm that the client is a crawler and return the first page of the category of the page to be accessed to the client.

In order to achieve the above object, the embodiment of the present invention further provides a device for intercepting a crawler, and the device is applied to a client as a browser, including:

A download unit, configured to download the picture to the browser according to the picture URL path contained in the page returned by the server;

An extracting unit, configured to parse the picture, extract the field value for identifying a crawler, and save it so that the field value is carried in the access request when the browser accesses other pages.

In order to achieve the above object, the embodiment of the present invention further provides a server terminal, where the server terminal includes:

one or more processors; and

a storage device for storing one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for intercepting a crawler according to an embodiment of the present invention.

In order to achieve the above object, an embodiment of the present invention further provides a computer readable medium having stored thereon a computer program, the program being executed by a processor to implement a method for intercepting a crawler according to an embodiment of the present invention.

In the embodiments of the present invention, after receiving an access request for a page sent by a client, the server generates a current field value for identifying a crawler, and generates a picture attribute value used for saving the field value into a picture; the picture uniform resource locator (URL) path containing the picture attribute value is saved into the requested page. The server determines whether the page currently to be accessed is a page that allows direct access, and if so, returns the requested page to the client; if not, it further determines whether the access request contains a valid field value for identifying a crawler. If it contains a valid field value, the requested page is returned to the client; if it contains no field value for identifying a crawler, or the contained field value is invalid, the client is confirmed to be a crawler, and the first page of the category of the page to be accessed is returned to the client. It can be seen that the present invention exploits the fact that a crawler neither executes JavaScript (JS) nor downloads the pictures in a web page: the server saves the generated field value (a cookie value) for identifying a crawler into a picture, and the crawler does not download that picture. Therefore, applying the invention effectively improves the crawler interception rate, reduces the load on the server, and ensures the stability and high concurrency of the website, while normal user access is not blocked.

Drawings

FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present invention may be applied.

FIG. 2 is a schematic flow chart of a method for intercepting a crawler according to an embodiment of the present invention.

FIG. 3 is a schematic structural diagram of a device for intercepting a crawler, applied to the above method, according to an embodiment of the present invention.

FIG. 4 is a block diagram showing the structure of a computer system suitable for implementing a terminal device or a server of an embodiment of the present invention.

Detailed description

In order to make the objects, the technical solutions and the advantages of the present invention more comprehensible, the embodiments of the present invention will be further described in detail below with reference to the accompanying drawings.

FIG. 1 illustrates an exemplary system architecture 100 to which the method for intercepting a crawler or the device for intercepting a crawler of the present application can be applied.

As shown in FIG. 1, the system architecture 100 can include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 provides a medium for communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various types of connections, such as wired or wireless communication links, fiber optic cables, and the like.

A user can use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 in order to receive or send messages and the like. Various communication client applications, such as a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like (examples only), can be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop computers, desktop computers, and the like.

The server 105 may be a server that provides various services, such as a background management server (example only) that provides support for a shopping website browsed by the user using the terminal devices 101, 102, 103. The background management server may analyze and process received data such as a product information query request, and feed back the processing result (for example, target push information or product information; examples only) to the terminal device.

It should be noted that the method for intercepting a crawler provided by the embodiments of the present invention is generally performed by the server 105. Accordingly, the device for intercepting a crawler is generally disposed in the server 105.

It should be understood that the number of terminal devices, networks, and servers in Figure 1 is merely illustrative. Depending on the implementation needs, there can be any number of terminal devices, networks, and servers.

The invention preserves normal browser access while effectively blocking crawlers. It exploits the fact that a crawler neither executes JS nor downloads the pictures in a web page: the server saves the generated field value (a cookie value) for identifying a crawler into a picture, and the crawler does not download the picture. The access requests sent by the crawler therefore do not carry the cookie value, so crawler requests and browser requests can be distinguished by whether the access request carries the cookie value, and crawlers are thereby effectively intercepted.

An embodiment of the invention discloses a method for intercepting a crawler, which comprises the following steps. A schematic flow chart of the method is shown in FIG. 2.

Step 21: After receiving an access request for a page sent by the client, the server generates a current field value for identifying a crawler, and generates a picture attribute value used for saving the field value into a picture; the picture URL path containing the picture attribute value is saved into the requested page.

The field value used to identify the crawler may be a cookie value, and the picture attribute value may be a picture name. Briefly, after receiving an access request for a page sent by the client, for example an HTTP request, the server generates a cookie value and a picture name, and then saves the picture URL path containing the picture name into the requested page. Specifically:

The method by which the server generates the current cookie value for identifying the crawler includes: the server selects a value of the current timestamp according to the validity period of the cookie value, and encrypts the string formed from the selected timestamp value and the configured current first key, for example by an md5 message digest operation, to obtain the current cookie value.

The method by which the server generates the picture name includes: the server selects a value of the current timestamp according to the validity period of the cookie value, and encrypts the string formed from the selected timestamp value and the configured current second key, for example by an md5 message digest operation, to obtain the picture name.

It should be noted that there are various methods for generating the cookie value and the picture name, including but not limited to the above method. Since the cookie value in the present invention is time-sensitive, its generation is tied to the timestamp, and any other method of obtaining the cookie value and the picture name from the timestamp is within the scope of the present invention.
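The generation of the cookie value and the picture name described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names, the order in which the timestamp prefix and the key are concatenated, and the example keys are hypothetical, and the 11-character timestamp prefix is borrowed from the 10-minute scenario described later in this description.

```python
import hashlib
from datetime import datetime

# Assumption: the timestamp is formatted as YYYYMMDDHHMMSS and truncated to
# 11 characters, so all requests within the same 10-minute window share a prefix.
WINDOW_PREFIX_LEN = 11


def window_prefix(now: datetime, length: int = WINDOW_PREFIX_LEN) -> str:
    """Return the part of the timestamp that identifies the validity window."""
    return now.strftime("%Y%m%d%H%M%S")[:length]


def md5_hex(text: str) -> str:
    """md5 message digest of a string, as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def make_cookie_value(now: datetime, first_key: str) -> str:
    # Assumption: prefix and key are simply concatenated in this order.
    return md5_hex(window_prefix(now) + first_key)


def make_picture_name(now: datetime, second_key: str) -> str:
    # A different key keeps the picture name from revealing the cookie value.
    return md5_hex(window_prefix(now) + second_key)
```

Because both values depend only on the window prefix and a configured key, the server can recompute them for any request in the same window without storing per-client state.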

A URL is an identification method for completely describing the addresses of web pages and other resources on the Internet. Correspondingly, each web page on the Internet has a unique URL. When the client needs to access the webpage in the server, the URL of the webpage is first obtained.

In this embodiment, the HTTP request for the page sent by the client carries the URL path information of the page. It should be noted that the picture URL path is additionally saved in the page; the specific location where it is saved may be set according to the specific implementation. In one embodiment, the picture URL path may be saved in an image tag of the page.
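Saving the picture URL path in an image tag of the page might look like the sketch below. The helper name `embed_picture_url`, the choice to insert the tag just before `</body>`, and the 1x1 size attributes are assumptions made for illustration; the description only states that the path may be saved in an image tag.

```python
from html import escape


def embed_picture_url(page_html: str, picture_url: str) -> str:
    """Insert an <img> tag pointing at the picture into the requested page."""
    # Escape the URL so it cannot break out of the attribute value.
    tag = f'<img src="{escape(picture_url, quote=True)}" alt="" width="1" height="1">'
    marker = "</body>"
    if marker in page_html:
        # Place the tag at the end of the body (hypothetical placement).
        return page_html.replace(marker, tag + marker, 1)
    # Fall back to appending when the page has no closing body tag.
    return page_html + tag
```

A browser rendering the returned page fetches the picture automatically, whereas a client that only reads the HTML never requests it.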

Step 22: The server determines whether the page currently to be accessed is a page that allows direct access, and if so, returns the requested page to the client; if not, it further determines whether the access request contains a valid field value for identifying a crawler. If it contains a valid field value, the requested page is returned to the client; if it contains no field value for identifying a crawler, or the contained field value is invalid, the client is confirmed to be a crawler, and the first page of the category of the page to be accessed is returned to the client.

The method by which the server determines whether the page currently to be accessed is a page that allows direct access includes: a range of pages that allow direct access is preset on the server; the server determines whether the page currently to be accessed is within that range, and if so, the page is a page that allows direct access.

The method by which the server determines whether the HTTP request contains a valid cookie value includes: the server compares the cookie value it generates with the cookie value carried in the HTTP request; if the two are equal, the cookie value carried in the HTTP request is determined to be valid. Obviously, if the two are not equal, the cookie value is invalid.

It should be noted that, in the present invention, in order to prevent crawlers from imitating a browser, the cookie value generated by the server changes at every predetermined interval. In other words, assuming the predetermined interval is 10 minutes, the cookie value generated by the server stays the same within each 10-minute period. The server returns the page containing the cookie value to the client, so as long as the client is a browser, the cookie value can be parsed out, carried in the next HTTP request, and sent to the server. As long as this happens within the same 10-minute period, the cookie value received by the server is the same as the cookie value generated by the server itself, which indicates that the cookie value is valid. If, in the next 10-minute period, the client still sends an HTTP request carrying the previous cookie value, the server will have generated a new cookie value, so the received cookie value is inconsistent with the cookie value generated by the server, which means that the cookie value is invalid.
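The validity check and its time dependence can be illustrated with a short sketch. The names `expected_cookie` and `is_valid_cookie`, the 11-character timestamp prefix, and the concatenation order are assumptions; the use of `hmac.compare_digest` for a constant-time comparison is an addition of this sketch and is not stated in the description, which only requires that the two values be equal.

```python
import hashlib
import hmac
from datetime import datetime
from typing import Optional


def expected_cookie(now: datetime, first_key: str) -> str:
    """Recompute the cookie value for the 10-minute window containing `now`."""
    prefix = now.strftime("%Y%m%d%H%M%S")[:11]  # assumption: YYYYMMDDHHM
    return hashlib.md5((prefix + first_key).encode("utf-8")).hexdigest()


def is_valid_cookie(carried: Optional[str], now: datetime, first_key: str) -> bool:
    """Return True only if the carried value equals the server's current value."""
    if not carried:
        return False  # no cookie value in the request
    current = expected_cookie(now, first_key)
    # Compare as bytes so that non-ASCII input cannot raise an error.
    return hmac.compare_digest(carried.encode("utf-8"), current.encode("utf-8"))
```

A value issued at 8:12 is accepted until 8:19:59 and rejected from 8:20:00 onward, which matches the behaviour described above for a client that keeps sending the previous cookie value.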

If the client is a crawler, after receiving the HTTP request from the crawler, the server likewise saves the picture URL path into the requested page. The server then determines whether the page currently to be accessed is a page that allows direct access, and if so, returns the requested page to the crawler. This is because, in practical applications, crawlers are generally allowed to access a limited number of pages, which in one embodiment may be pages 1-10 of the same category. If the server determines that the page currently to be accessed is not a page that allows direct access, for example because the crawler wants to access page 11, it further determines whether the HTTP request contains a valid cookie value. On finding that the crawler's HTTP request carries no cookie value, the server intercepts the request and returns the first page of the current category to the crawler. In this way, beyond the directly accessible range the crawler always gets the first page of the current category and cannot obtain more pages.

If the client is a browser, after receiving the HTTP request from the browser, the server saves the picture URL path into the requested page. The server then determines whether the page currently to be accessed is a page that allows direct access, and if so, returns the requested page to the browser. At this point, the browser downloads the picture according to the picture URL path contained in the page returned by the server, parses the picture with Javascript, extracts the cookie value, and saves it so that the cookie value is carried in the HTTP request when the browser accesses other pages. Suppose the browser accesses page 11 and carries the parsed cookie value in the HTTP request. After receiving the HTTP request, the server determines whether the cookie value is valid. If it is valid, access to page 11 is allowed; if it is invalid, the first page of the current category is returned to the browser.
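The server-side decision of step 22 for both kinds of client can be condensed into one function. This is a sketch only: `Decision`, `decide`, and the constants are hypothetical names, and the direct-access range of pages 1-10 is the example range mentioned above.

```python
from dataclasses import dataclass
from typing import Optional

DIRECT_ACCESS_PAGES = range(1, 11)  # example range: pages 1-10 of a category
FIRST_PAGE = 1


@dataclass
class Decision:
    served_page: int   # page number actually returned to the client
    intercepted: bool  # True when the request was treated as a crawler


def decide(requested_page: int,
           carried_cookie: Optional[str],
           current_cookie: str) -> Decision:
    """Apply the two checks of step 22 in order."""
    # 1) Pages inside the preset range are returned without any cookie check.
    if requested_page in DIRECT_ACCESS_PAGES:
        return Decision(requested_page, False)
    # 2) Outside the range, the request must carry the current cookie value.
    if carried_cookie is not None and carried_cookie == current_cookie:
        return Decision(requested_page, False)
    # 3) Missing or stale cookie value: serve the first page of the category.
    return Decision(FIRST_PAGE, True)
```

For example, `decide(11, None, "c")` yields the first page with `intercepted=True`, while `decide(11, "c", "c")` yields page 11.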

In addition, in the present invention, in order to further relieve the load on the server, the pages that allow direct access are cached on a CDN (Content Delivery Network) server, and when the client requests a page that allows direct access, the CDN server returns the requested page to the client. CDN technology forms a layer of intelligent virtual network on top of the existing Internet by placing CDN servers throughout the network. Usually, a large amount of data can be cached on the CDN server; when the user accesses the stored content, the CDN server can provide the data directly and give the user a fast response. In this way, crawler traffic is directed to the CDN servers of each province and city, thereby protecting the server and ensuring normal access by users.

In order to clearly illustrate the present invention, a specific scenario will be described below.

In this embodiment, it is assumed that the cookie value generated by the server changes every 10 minutes, that is, the cookie value is valid for 10 minutes. After receiving the HTTP request for a page sent by the client, the server takes the first 11 digits of the current timestamp, for example 20160101081, which denotes the 10 minutes from 8:10 to 8:19 on January 1, 2016. The string formed by combining 20160101081 with the current first key is subjected to the md5 message digest operation to obtain the current cookie value. The string formed by combining 20160101081 with the current second key is subjected to the md5 message digest operation to obtain the picture name. The server puts the obtained cookie value into the description information of a picture, generates a new picture, and saves the new picture under the obtained picture name; the server then saves the picture URL path containing the picture name into the requested page. Here, the description information of a picture includes, but is not limited to, the time of photographing, the resolution of the photo, the type of the camera, and the like. The new picture saved under the picture name thus contains the cookie value.
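Putting the cookie value into the description information of a picture, and reading it back, can be sketched with a hand-built PNG file. Several things here are assumptions: the description does not name a picture format, so a PNG `tEXt` chunk with the keyword `Description` stands in for "description information"; the function names are hypothetical; and Python is used for both sides, although the description has the browser parse the picture with JS.

```python
import struct
import zlib
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type and data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def build_picture(cookie_value: str, keyword: str = "Description") -> bytes:
    """Create a 1x1 grayscale PNG whose tEXt chunk carries the cookie value."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)  # width, height, depth, ...
    text = keyword.encode("latin-1") + b"\x00" + cookie_value.encode("latin-1")
    pixels = zlib.compress(b"\x00\x00")  # filter byte 0 followed by one black pixel
    return (PNG_SIGNATURE + _chunk(b"IHDR", ihdr) + _chunk(b"tEXt", text)
            + _chunk(b"IDAT", pixels) + _chunk(b"IEND", b""))


def extract_cookie(picture: bytes, keyword: str = "Description") -> Optional[str]:
    """Walk the chunk list and return the text stored under `keyword`, if any."""
    if not picture.startswith(PNG_SIGNATURE):
        return None
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(picture):
        (length,) = struct.unpack(">I", picture[pos:pos + 4])
        kind = picture[pos + 4:pos + 8]
        data = picture[pos + 8:pos + 8 + length]
        if kind == b"tEXt":
            key, _, value = data.partition(b"\x00")
            if key.decode("latin-1") == keyword:
                return value.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return None
```

In this sketch the server would store the bytes returned by `build_picture` under the generated picture name, and a client that actually downloads and parses the file recovers the value with `extract_cookie`.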

Embodiment 1, in one embodiment,

1) The browser sends an HTTP request to the server to request the first page of the current category;

The server generates the picture URL path of the picture containing the cookie value and saves it into the first page;

The range of pages that allow direct access preset on the server is pages 1-10, and the server determines that the first page is within the direct access range, and therefore returns the first page containing the picture URL path to the browser;

The browser automatically downloads the picture according to the picture URL path contained in the first page of the current category, parses the picture with the JS method, extracts the cookie value, and saves it; the cookie value is carried when turning pages.

2) The browser sends an HTTP request carrying the cookie value to the server, requesting page 10 of the current category;

The server generates the picture URL path of the picture containing the cookie value and saves it into page 10; since the request falls within the 10-minute validity period, the cookie value generated by the server is the same as the cookie value carried in the HTTP request;

The range of pages that allow direct access preset on the server is pages 1-10, and the server determines that page 10 is within the direct access range. Therefore, it is not necessary to determine whether the cookie value is valid at this time, and page 10 containing the picture URL path is returned directly to the browser.

The browser automatically downloads the picture according to the picture URL path contained in page 10 of the current category, parses the picture with the JS method, extracts the cookie value, and saves it; the cookie value is carried when turning pages.

3) The browser sends an HTTP request carrying the cookie value to the server, requesting page 11 of the current category;

The server generates the picture URL path of the picture containing the cookie value and saves it into page 11; since the request falls within the 10-minute validity period, the cookie value generated by the server is the same as the cookie value carried in the HTTP request;

The range of pages that allow direct access preset on the server is pages 1-10, and the server determines that page 11 is not within the direct access range. Therefore, it is further determined whether the cookie value is valid. As explained above, since the request falls within the 10-minute validity period, the cookie value generated by the server is the same as the cookie value carried in the HTTP request, so the cookie value is determined to be valid, and page 11 containing the picture URL path is returned to the browser.

The browser automatically downloads the picture according to the picture URL path contained in page 11 of the current category, parses the picture with the JS method, extracts the cookie value, and saves it; the cookie value is carried when turning pages.

Normal browser access is thereby achieved.

Embodiment 2, in another embodiment,

If the browser receives a link to page 10 of the category,

The browser sends an HTTP request to the server to request page 10 of the current category;

The server generates the picture URL path of the picture containing the cookie value and saves it into page 10;

The range of pages that allow direct access preset on the server is pages 1-10, and the server determines that page 10 is within the direct access range. Therefore, although the HTTP request carries no cookie value at this time, page 10 containing the picture URL path is returned directly to the browser.

Embodiment 3, in another embodiment,

If the browser receives a link to page 11 of the category,

The browser sends an HTTP request to the server to request page 11 of the current category;

The server generates the picture URL path of the picture containing the cookie value and saves it into page 11;

The server determines that page 11 is not within the direct access range, and therefore further determines whether the HTTP request carries a cookie value. Since the link was received directly by the browser, the HTTP request carries no cookie value, so the first page of the current category is returned to the browser.

Next, to continue accessing other pages, the operations in Embodiment 1 can be repeated to achieve normal access to the pages.

Embodiment 4

In another embodiment,

The crawler sends an HTTP request to the server to request the first page of the current category;

The range of pages that allow direct access preset on the server is pages 1-10, and the server determines that the first page is within the direct access range, and therefore returns the first page containing the picture URL path to the crawler;

It should be noted that, in the prior art, a crawler neither downloads pictures nor uses the JS method to parse pictures, because doing so would greatly increase the cost of the crawler, including CPU and bandwidth costs. Therefore, unlike a browser, the crawler does not extract the cookie value from the picture and carry it when accessing other pages, and so it is intercepted by the server.

Embodiment 5

In another embodiment,

The crawler sends an HTTP request to the server to request page 11 of the current category;

The server determines that page 11 is not within the direct access range, and therefore further determines whether the HTTP request carries a cookie value. Since the HTTP request sent by the crawler cannot carry a cookie value, the server returns the first page of the current category to the crawler.

It can be seen from the above that, with the solution of the invention, a web crawler can only crawl a limited number of pages, while normal browser access is ensured.
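Embodiments 1 to 5 can be replayed with a toy end-to-end simulation. Everything in it is hypothetical scaffolding: `serve`, `Browser`, `Crawler`, and the fixed `CURRENT_COOKIE` string are illustrative names, the cookie value is handed back directly instead of being hidden in a real picture, and the validity window is assumed not to expire during the run.

```python
from typing import Optional, Tuple

DIRECT_ACCESS_PAGES = range(1, 11)          # pages 1-10 allow direct access
CURRENT_COOKIE = "cookie-for-this-window"   # hypothetical current cookie value


def serve(page: int, cookie: Optional[str]) -> Tuple[int, str]:
    """Return (page number served, cookie value hidden in that page's picture)."""
    allowed = page in DIRECT_ACCESS_PAGES or cookie == CURRENT_COOKIE
    return (page if allowed else 1, CURRENT_COOKIE)


class Browser:
    """Downloads and parses the picture, so it learns the cookie value."""

    def __init__(self) -> None:
        self.cookie: Optional[str] = None

    def visit(self, page: int) -> int:
        served, hidden = serve(page, self.cookie)
        self.cookie = hidden  # stands in for downloading and parsing the picture
        return served


class Crawler:
    """Never downloads the picture, so it never carries a cookie value."""

    def visit(self, page: int) -> int:
        served, _ignored = serve(page, None)
        return served
```

A browser that starts on page 1 can reach pages 10 and 11; a browser that opens page 11 from a link is first sent to page 1 and succeeds on the next attempt; a crawler requesting page 11 or beyond keeps receiving page 1.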

Based on the same inventive concept, an embodiment of the present invention also provides a device for intercepting a crawler, which is applied to a server, as shown in FIG. 3. The device includes:

The generating and saving unit 301, which, after receiving an access request for a page sent by the client, generates a current field value for identifying a crawler, generates a picture attribute value used for saving the field value into a picture, and saves the picture uniform resource locator (URL) path containing the picture attribute value into the requested page;

The processing unit 302, which determines whether the page currently to be accessed is a page that allows direct access, and if so, returns the requested page to the client; if not, further determines whether the access request contains a valid field value for identifying a crawler; if it contains a valid field value, returns the requested page to the client; if it contains no field value for identifying a crawler, or the contained field value is invalid, confirms that the client is a crawler and returns the first page of the category of the page to be accessed to the client.

The invention also proposes a device for intercepting a crawler, which is applied to a client as a browser, comprising:

Referring now to FIG. 4, there is shown a block diagram of a computer system 400 suitable for implementing a terminal device in accordance with an embodiment of the present invention. The terminal device shown in FIG. 4 is merely an example and does not impose any limitation on the function and scope of use of the embodiments of the present invention.

As shown in FIG. 4, the computer system 400 includes a central processing unit (CPU) 401, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage portion 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the system 400. The CPU 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404. An input/output (I/O) interface 405 is also coupled to the bus 404.

The following components are connected to the I/O interface 405: an input portion 406 including a keyboard, a mouse, and the like; an output portion 407 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like; a storage portion 408 including a hard disk or the like; and a communication portion 409 including a network interface card such as a LAN card, a modem, or the like. The communication portion 409 performs communication processing via a network such as the Internet. A drive 410 is also coupled to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 410 as needed so that a computer program read from it is installed into the storage portion 408 as needed.

In particular, the processes described above with reference to the flowcharts may be implemented as a computer software program in accordance with embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for executing the method illustrated in the flowchart. In such an embodiment, the computer program can be downloaded and installed from the network via the communication portion 409, and/or installed from the removable medium 411. When the computer program is executed by the central processing unit (CPU) 401, the above-described functions defined in the system of the present invention are performed.

It should be noted that the computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. Computer More specific examples of readable storage media may include, but are not limited to, electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable Read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain or store a program, which can be used by or in connection with an instruction execution system, apparatus or device. In the present invention, a computer readable signal medium may include a data signal that is propagated in the baseband or as part of a carrier, in which computer readable program code is carried. Such propagated data signals can take a variety of forms including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing. The computer readable signal medium can also be any computer readable medium other than a computer readable storage medium, which can transmit, propagate, or transport a program for use by or in connection with the instruction execution system, apparatus, or device. . Program code embodied on a computer readable medium can be transmitted by any suitable medium, including but not limited to wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products in accordance with various embodiments of the invention. In this regard, each block of the flowcharts or block diagrams can represent a module, a program segment, or a portion of code that includes one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in a different order than that illustrated in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or by a combination of dedicated hardware and computer instructions.

The units involved in the embodiments of the present invention may be implemented by software or by hardware. The described units may also be provided in a processor; for example, it can be described that a processor includes a generating and saving unit and a processing unit. In some cases, the names of these units do not constitute a limitation on the units themselves. For example, the generating and saving unit may also be described as "a unit that generates the current field value for identifying a crawler after receiving an access request, sent by a client, for accessing a page."

In another aspect, the present invention also provides a computer readable medium, which may be included in the device described in the above embodiments, or may exist separately without being incorporated in the device. The computer readable medium carries one or more programs. When the one or more programs are executed by the device, the device is caused to perform the following: after the server receives an access request, sent by a client, for accessing a page, generating the current field value for identifying a crawler, and generating a picture attribute value for saving the field value into a picture; saving the picture uniform resource locator (URL) path containing the picture attribute value into the requested page; determining whether the page currently to be accessed is a page for which direct access is allowed, and if so, returning the requested page to the client; if not, further determining whether the access request contains a valid field value for identifying a crawler; if the field value is valid, returning the requested page to the client; if the access request does not contain a field value for identifying a crawler, or the contained field value is invalid, confirming that the requester is a crawler and returning to the client the first page of the category of the page to be accessed.
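The decision flow described above can be illustrated with a minimal Python sketch. The names handle_access and Response, the parameter names, and the example paths are hypothetical and do not come from the embodiments; the plain equality check stands in for whatever validity test an implementation uses.

```python
from typing import Collection, NamedTuple, Optional


class Response(NamedTuple):
    is_crawler: bool  # True when the request is treated as coming from a crawler
    page: str         # path of the page actually returned to the client


def handle_access(path: str,
                  request_field_value: Optional[str],
                  current_field_value: str,
                  direct_access_pages: Collection[str],
                  category_first_page: str) -> Response:
    # Step 1: pages for which direct access is allowed are returned unchecked.
    if path in direct_access_pages:
        return Response(False, path)
    # Step 2: a field value equal to the server's current one is treated as valid.
    if request_field_value is not None and request_field_value == current_field_value:
        return Response(False, path)
    # Step 3: missing or invalid field value, so the requester is treated as a
    # crawler and receives the first page of the category instead.
    return Response(True, category_first_page)
```

For instance, a request for a hypothetical "/item/42" without any field value would be answered with the category's first page, while the same request carrying the current field value would receive "/item/42" itself.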

The beneficial effects of the present invention are as follows:

First, crawlers are intercepted effectively. Even when the volume of normal user access is very large, normal users are not blocked by mistake, so normal browser access is ensured.

Second, during website promotion peaks, intercepting crawler access reduces server pressure and keeps the site stable under high concurrency. Malicious attacks can also be intercepted.

Third, with CDN servers configured, crawler traffic is directed to the CDN servers in each province and city, which further protects the server and ensures that users can access the site normally.

The above is only the preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and scope of the present invention are intended to be included within the scope of the present invention.

Claims

A method of intercepting a crawler, characterized in that the method comprises:

After receiving an access request, sent by a client, for accessing a page, the server generates the current field value for identifying a crawler, and generates a picture attribute value for saving the field value into a picture; the picture uniform resource locator (URL) path containing the picture attribute value is saved into the requested page;

The server determines whether the page currently to be accessed is a page for which direct access is allowed, and if so, returns the requested page to the client; if not, the server further determines whether the access request contains a valid field value for identifying a crawler; if the field value is valid, the requested page is returned to the client; if the access request does not contain a field value for identifying a crawler, or the contained field value is invalid, the requester is confirmed as a crawler, and the first page of the category of the page to be accessed is returned to the client.
The method of claim 1, wherein when the client is a browser, the method further comprises:

The browser downloads the picture according to the picture URL path contained in the page returned by the server; parses the picture, extracts the field value for identifying a crawler, and saves it, so that when the browser accesses other pages, the field value for identifying a crawler is carried in the access request.
The method of claim 1, wherein the field value for identifying the crawler is a cookie value; and the method for generating the cookie value comprises:

The server selects the value of the current timestamp according to the valid time of the cookie value, and encrypts the string of the selected current timestamp with the configured current first key to obtain the current cookie value.
The method according to claim 1, wherein the field value for identifying a crawler is a cookie value; the picture attribute value is a picture name; and the method of generating the picture name comprises:

The server selects the value of the current timestamp according to the valid time of the cookie value, and encrypts the string of the selected current timestamp with the configured current second key to obtain the name of the picture.
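For illustration only, the generation of the cookie value and of the picture name can be sketched in Python as follows. Several points are assumptions rather than statements of the claimed method: "selecting the timestamp according to the valid time" is read here as rounding the current time down to the start of a validity window; an HMAC-SHA256 keyed digest is swapped in for the unspecified encryption; and the key constants and function names are hypothetical.

```python
import hashlib
import hmac

FIRST_KEY = b"hypothetical-first-key"    # assumed key for the cookie value
SECOND_KEY = b"hypothetical-second-key"  # assumed key for the picture name


def select_timestamp(now_seconds: int, valid_seconds: int) -> int:
    """Round the current time down to the start of its validity window."""
    return now_seconds - (now_seconds % valid_seconds)


def derive_value(now_seconds: int, valid_seconds: int, key: bytes) -> str:
    """Keyed digest of the selected timestamp string (stand-in for encryption)."""
    selected = str(select_timestamp(now_seconds, valid_seconds))
    return hmac.new(key, selected.encode("ascii"), hashlib.sha256).hexdigest()


def make_cookie_value(now_seconds: int, valid_seconds: int) -> str:
    # Uses the first key, as in the cookie value generation step.
    return derive_value(now_seconds, valid_seconds, FIRST_KEY)


def make_picture_name(now_seconds: int, valid_seconds: int) -> str:
    # Uses the second key, as in the picture name generation step.
    return derive_value(now_seconds, valid_seconds, SECOND_KEY)
```

Under this reading, every request arriving within the same validity window yields the same cookie value and picture name, so the server can regenerate them on demand instead of storing them.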
The method of claim 1, wherein the method by which the server determines whether the page currently to be accessed is a page for which direct access is allowed comprises:

A range of pages for which direct access is allowed is preset on the server;

The server determines whether the page currently to be accessed is within the range, and if so, the page is a page for which direct access is allowed.
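As a non-authoritative sketch, the preset range could be expressed as a set of exact paths plus a tuple of path prefixes. The specific paths, the constant names, and the function is_direct_access_page are hypothetical; the claim does not say how the range is represented.

```python
from urllib.parse import urlsplit

# Hypothetical preset range: exact paths and path prefixes allowing direct access.
DIRECT_ACCESS_EXACT = frozenset({"/", "/index.html"})
DIRECT_ACCESS_PREFIXES = ("/help/",)


def is_direct_access_page(url: str) -> bool:
    # Compare only the path, ignoring scheme, host, and query string.
    path = urlsplit(url).path or "/"
    return path in DIRECT_ACCESS_EXACT or path.startswith(DIRECT_ACCESS_PREFIXES)
```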
The method according to claim 1, wherein the field value for identifying the crawler is a cookie value; and determining whether the access request includes a valid field value for identifying the crawler comprises:

The server compares the cookie value generated by itself with the cookie value carried in the access request. If the two are equal, it determines that the cookie value carried in the access request is a valid cookie value.
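The comparison step can be sketched in Python as below. The function name is_valid_cookie is hypothetical, and the use of a constant-time comparison (hmac.compare_digest) is a design choice of this sketch, not something the claim requires; a plain equality test would give the same result.

```python
import hmac
from typing import Optional


def is_valid_cookie(server_value: str, request_value: Optional[str]) -> bool:
    # A request without the cookie can never be valid.
    if request_value is None:
        return False
    # Compare as bytes so that non-ASCII input is handled without raising.
    return hmac.compare_digest(server_value.encode("utf-8"),
                               request_value.encode("utf-8"))
```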
The method of claim 1, further comprising: caching the pages for which direct access is allowed on a CDN server, and when the client requests a page for which direct access is allowed, the CDN server returns the requested page to the client.
A device for intercepting a crawler, characterized in that the device is applied to a server and comprises:

A generating and saving unit, which, after receiving an access request, sent by a client, for accessing a page, generates the current field value for identifying a crawler, and generates a picture attribute value for saving the field value into a picture; the picture uniform resource locator (URL) path containing the picture attribute value is saved into the requested page;

A processing unit, which determines whether the page currently to be accessed is a page for which direct access is allowed, and if so, returns the requested page to the client; if not, further determines whether the access request contains a valid field value for identifying a crawler; if the field value is valid, the requested page is returned to the client; if the access request does not contain a field value for identifying a crawler, or the contained field value is invalid, the requester is confirmed as a crawler, and the first page of the category of the page to be accessed is returned to the client.
A device for intercepting a crawler, characterized in that the device is applied to a client as a browser, comprising:

A download unit, which downloads the picture according to the picture URL path contained in the page returned by the server;

An extracting unit, which parses the picture, extracts the field value for identifying a crawler, and saves it, so that when the browser accesses other pages, the field value for identifying a crawler is carried in the access request.
A server terminal, comprising:

One or more processors;

a storage device for storing one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method of any of claims 1-7.
A computer readable medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the method of any of claims 1-7.