WO2017190641A1 - Crawler interception method and device, server terminal and computer readable medium - Google Patents

Crawler interception method and device, server terminal and computer readable medium

Info

Publication number
WO2017190641A1
Authority
WO
WIPO (PCT)
Prior art keywords
page
crawler
value
server
field value
Prior art date
Application number
PCT/CN2017/082707
Other languages
French (fr)
Chinese (zh)
Inventor
王向维
韩笑跃
王飞
谢刚
费艳茹
韩勇
马顺风
Original Assignee
北京京东尚科信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京京东尚科信息技术有限公司
Publication of WO2017190641A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Definitions

  • the present invention relates to network technologies, and in particular, to a method, an apparatus, a server terminal, and a computer readable medium for intercepting a crawler.
  • Web crawlers are a fundamental part of search engine technology.
  • Web crawling starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs found on those pages. While crawling, it continuously extracts new URLs from the current page and places them in the queue until some stop condition is met.
  • the crawled web page information is then stored in the search engine's server.
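  • The queue-driven crawl described above can be sketched as follows. This is a minimal illustration only: the `LINKS` table is a hypothetical in-memory stand-in for real pages, and the page budget `max_pages` is one assumed example of a stop condition.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs found on that page.
LINKS = {
    "/list?page=1": ["/list?page=2", "/item/1"],
    "/list?page=2": ["/list?page=3", "/item/2"],
    "/list?page=3": [],
    "/item/1": [],
    "/item/2": ["/list?page=1"],
}

def crawl(seeds, max_pages):
    """Breadth-first crawl: pop a URL, record it, queue its unseen outgoing URLs."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:  # stop condition: queue empty or budget used
        url = queue.popleft()
        visited.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

  • Calling `crawl(["/list?page=1"], 10)` walks every reachable URL once; a smaller `max_pages` stops the crawl early.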
  • An object of the embodiments of the present invention is to provide a method, an apparatus, a server terminal, and a computer readable medium for intercepting a crawler, which can effectively intercept crawler access.
  • an embodiment of the present invention provides a method for intercepting a crawler, the method comprising:
  • After receiving an access request for a page sent by a client, the server generates a current field value for identifying a crawler and generates a picture attribute value for the picture in which the field value is saved; the server saves a picture uniform resource locator (URL) path containing the picture attribute value to the requested page;
  • the server determines whether the page currently being accessed is a directly accessible page, and if so, returns the requested page to the client; if not, the server further determines whether the access request includes a valid field value for identifying a crawler. If the field value is valid, the requested page is returned to the client; if the request contains no such field value, or the included field value is invalid, the client is confirmed to be a crawler, and the first page of the category being accessed is returned to the client.
  • the embodiment of the present invention further provides a device for intercepting a crawler, and the device is applied to a server, and includes:
  • a generating and saving unit that, after receiving an access request for a page sent by a client, generates a current field value for identifying a crawler, generates a picture attribute value for the picture in which the field value is saved, and saves a picture uniform resource locator (URL) path containing the picture attribute value to the requested page;
  • a processing unit that determines whether the page currently being accessed is a directly accessible page, and if so, returns the requested page to the client; if not, the processing unit further determines whether the access request includes a valid field value for identifying a crawler. If the field value is valid, the requested page is returned to the client; if the request contains no such field value, or the included field value is invalid, the client is confirmed to be a crawler, and the first page of the category being accessed is returned to the client.
  • An embodiment of the present invention further provides a device for intercepting a crawler, the device being applied to a client that is a browser and including:
  • a download unit that downloads the picture to the browser according to the picture URL path included in the page returned by the server;
  • an extracting unit that parses the picture, extracts the field value for identifying a crawler, and carries that field value in the access request when the browser accesses other pages.
  • the embodiment of the present invention further provides a server terminal, where the server terminal includes:
  • one or more processors;
  • a storage device for storing one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the method for intercepting a crawler of an embodiment of the present invention.
  • an embodiment of the present invention further provides a computer readable medium having stored thereon a computer program, the program being executed by a processor to implement a method for intercepting a crawler according to an embodiment of the present invention.
  • In the embodiments of the present invention, after receiving an access request for a page sent by a client, the server generates a current field value for identifying a crawler and generates a picture attribute value for the picture in which the field value is saved; the picture URL path containing the picture attribute value is saved to the requested page. The server then determines whether the page currently being accessed is a directly accessible page, and if so, returns the requested page to the client; if not, it further determines whether the access request includes a valid field value for identifying a crawler. If the field value is valid, the requested page is returned to the client; if the request contains no such field value, or the included field value is invalid, the client is confirmed to be a crawler, and the first page of the category being accessed is returned to the client.
  • The present invention exploits the fact that a crawler neither executes JavaScript (JS) methods nor downloads the images in a web page: the server saves the generated field value (a cookie value) for identifying a crawler into a picture, and the crawler does not download that picture.
  • Therefore, applying the invention effectively improves the crawler interception rate, reduces the pressure on the server, and ensures the stability and high concurrency of the website, while normal user access is not blocked.
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present invention may be applied.
  • FIG. 2 is a schematic flow chart of a method for intercepting a crawler according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of an apparatus for intercepting a crawler, applied to the above method, according to an embodiment of the present invention.
  • FIG. 4 is a block diagram showing the structure of a computer system suitable for implementing a terminal device or a server of an embodiment of the present invention.
  • FIG. 1 illustrates an exemplary system architecture 100 in which the intercept crawler method or intercept crawler device of the present application can be applied.
  • system architecture 100 can include terminal devices 101, 102, 103, network 104, and server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • Network 104 may include various types of connections, such as wired, wireless communication links, fiber optic cables, and the like.
  • the user can interact with the server 105 over the network 104 using the terminal devices 101, 102, 103 to receive or transmit messages and the like.
  • Various communication client applications such as a shopping application, a web browser application, a search application, an instant communication tool, a mailbox client, a social platform software, and the like can be installed on the terminal devices 101, 102, and 103 (for example only).
  • the terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop portable computers, desktop computers, and the like.
  • The server 105 may be a server that provides various services, such as a background management server that supports a shopping website browsed by the user using the terminal devices 101, 102, and 103 (for example only).
  • The background management server may analyze and process data such as a received product information query request, and feed back the processing result (for example, target push information or product information; these are only examples) to the terminal device.
  • The crawler interception method provided by the embodiment of the present invention is generally performed by the server 105. Accordingly, the crawler interception device is generally disposed in the server 105.
  • The number of terminal devices, networks, and servers in FIG. 1 is merely illustrative. Depending on implementation needs, there can be any number of terminal devices, networks, and servers.
  • The invention preserves normal browser access while effectively blocking crawlers. It exploits the fact that a crawler neither executes JS methods nor downloads the images in a web page: the server saves the generated cookie value for identifying a crawler into a picture.
  • Because the crawler does not download the picture, the access requests it sends do not carry the cookie value. Crawler requests and browser requests can then be distinguished by whether the access request carries the cookie value, which finally achieves effective interception of the crawler.
  • The embodiment of the invention discloses a method for intercepting a crawler, which comprises the following steps.
  • the schematic diagram of the process is shown in FIG. 2 .
  • Step 21: After receiving an access request for a page sent by the client, the server generates a current field value for identifying a crawler and generates a picture attribute value for the picture in which the field value is saved; the picture URL path containing the picture attribute value is saved to the requested page.
  • the field value used to identify the crawler may be a cookie value; the image attribute value may be a picture name.
  • After receiving the access request for a page sent by the client, for example an HTTP request, the server generates a cookie value and a picture name, and then saves the picture URL path containing the picture name to the requested page. Specifically:
  • The method by which the server generates the current cookie value for identifying a crawler includes: the server selects part of the current timestamp according to the valid time of the cookie value; the string formed from the selected timestamp value and the configured current first key is then processed, for example by an md5 message digest operation, to obtain the current cookie value.
  • The method by which the server generates the picture name includes: the server selects part of the current timestamp according to the valid time of the cookie value; the string formed from the selected timestamp value and the configured current second key is then processed, for example by an md5 message digest operation, to obtain the picture name.
  • The cookie value in the present invention is time-sensitive, and its generation is tied to the timestamp. Other methods of obtaining the cookie value and the picture name from the timestamp are also within the scope of the present invention.
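  • A minimal sketch of the two generators, under stated assumptions: the timestamp is formatted as yyyyMMddHHmm and truncated to 11 digits (a 10-minute valid time), the key is simply appended to the truncated timestamp, and `FIRST_KEY` and `SECOND_KEY` are hypothetical placeholder values. The source does not fix the concatenation order or the key format.

```python
import hashlib
from datetime import datetime

# Hypothetical keys: the source only says a "first key" and a "second key" are configured.
FIRST_KEY = "first-key-example"
SECOND_KEY = "second-key-example"

def window_prefix(now, digits=11):
    """Truncate a yyyyMMddHHmm timestamp; 11 digits gives a 10-minute window."""
    return now.strftime("%Y%m%d%H%M")[:digits]

def md5_hex(text):
    """md5 message digest of a string, as a hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def cookie_value(now):
    """Cookie value: digest of the truncated timestamp merged with the first key."""
    return md5_hex(window_prefix(now) + FIRST_KEY)

def picture_name(now):
    """Picture name: digest of the truncated timestamp merged with the second key."""
    return md5_hex(window_prefix(now) + SECOND_KEY)
```

  • Using two different keys means the picture name in the page does not reveal the cookie value, even though both are derived from the same time window.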
  • a URL is an identification method for completely describing the addresses of web pages and other resources on the Internet.
  • each web page on the Internet has a unique URL.
  • When a page is accessed, the URL path information of the page is carried in the request. It should be noted that the picture URL path is additionally saved in the page; the specific location can be set according to the implementation. In one embodiment, the picture URL path may be saved in an image tag of the page.
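  • As an illustration of saving the picture URL path in an image tag, the sketch below appends an `<img>` element to a page. The function name `add_picture_tag`, the `/img/` base path, and the `.png` extension are assumptions made for the example, not details from the source.

```python
from html import escape

def add_picture_tag(page_html, picture_name, base_path="/img/"):
    """Insert an <img> tag whose URL path contains the picture name before </body>."""
    src = escape(f"{base_path}{picture_name}.png", quote=True)
    tag = f'<img src="{src}" alt="">'
    if "</body>" in page_html:
        # Place the tag at the end of the body so the existing page layout is untouched.
        return page_html.replace("</body>", tag + "</body>", 1)
    return page_html + tag
```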
  • Step 22: The server determines whether the page currently being accessed is a directly accessible page, and if so, returns the requested page to the client; if not, it further determines whether the access request includes a valid field value for identifying a crawler. If the field value is valid, the requested page is returned to the client; if the request contains no such field value, or the included field value is invalid, the client is confirmed to be a crawler, and the first page of the category being accessed is returned to the client.
  • The method by which the server determines whether the page to be accessed is directly accessible includes: the server is preset with a range of pages that may be accessed directly; the server determines whether the page currently being accessed is within that range, and if so, the page is a directly accessible page.
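  • The decision in Step 22 can be sketched as a single function. This is an illustrative reading of the step, assuming pages are identified by number and the directly accessible range is pages 1-10 as in the embodiment; `page_to_return` is a hypothetical name.

```python
DIRECT_PAGES = range(1, 11)  # pages 1-10 may be accessed directly (value from the embodiment)

def page_to_return(requested_page, request_cookie, current_cookie):
    """Return the page number the server sends back for one request."""
    if requested_page in DIRECT_PAGES:
        return requested_page  # directly accessible: the cookie value is not checked
    if request_cookie is not None and request_cookie == current_cookie:
        return requested_page  # valid field value: treated as a browser
    return 1                   # missing or invalid field value: treated as a crawler
```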
  • The method by which the server determines whether the HTTP request includes a valid cookie value includes: the server compares the cookie value it generated with the cookie value carried in the HTTP request; if the two are equal, the cookie value carried in the HTTP request is valid. If the two are not equal, the cookie value is invalid.
  • The cookie value generated by the server changes every predetermined time. In other words, assuming that the predetermined time is 10 minutes, the cookie value generated by the server stays the same within each 10-minute window. The server returns the page containing the cookie value to the client, so as long as the client is a browser, the cookie value can be parsed, carried in the next HTTP request, and sent to the server. Within the same 10 minutes, the cookie value received by the server is then the same as the cookie value generated by the server itself, which indicates that the cookie value is valid.
  • If, in the next 10-minute window, the client still sends an HTTP request carrying the previous cookie value, the server will by then have generated a new cookie value. The cookie value received by the server is therefore inconsistent with the cookie value generated by the server itself, which means the received cookie value is invalid.
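  • The comparison and the expiry at the window boundary can be sketched as below. The key `KEY`, the yyyyMMddHHmm format, and the concatenation order are assumptions; the use of `hmac.compare_digest` for a constant-time comparison is an addition of this sketch and is not described in the source.

```python
import hashlib
import hmac
from datetime import datetime

KEY = "first-key-example"  # hypothetical key value

def current_cookie(now):
    """Cookie value the server generates for the 10-minute window containing `now`."""
    prefix = now.strftime("%Y%m%d%H%M")[:11]
    return hashlib.md5((prefix + KEY).encode("utf-8")).hexdigest()

def is_valid(request_cookie, now):
    """A cookie value is valid only if it equals the value for the current window."""
    if request_cookie is None:
        return False
    return hmac.compare_digest(
        request_cookie.encode("utf-8"),
        current_cookie(now).encode("utf-8"),
    )
```

  • A value issued at 08:12 is accepted at 08:19 and rejected at 08:20, because the server-side value has changed by then.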
  • After receiving an HTTP request from a crawler, the server likewise saves the picture URL path to the requested page. The server then determines whether the page currently being accessed is a directly accessible page, and if so, returns the requested page to the crawler. This is because, in practical applications, crawlers are generally allowed to access a limited number of pages, which in one embodiment may be pages 1-10 of the same category.
  • If the server determines that the page is not directly accessible, for example because the crawler wants to access page 11, it further determines whether the HTTP request contains a valid cookie value. On finding that the crawler's HTTP request carries no cookie value, the server intercepts the request and returns the first page of the current category to the crawler. In this way, the crawler always gets the first page of the current category and cannot get more pages.
  • After receiving an HTTP request from a browser, the server likewise saves the picture URL path to the requested page. The server then determines whether the page currently being accessed is a directly accessible page, and if so, returns the requested page to the browser. The browser then downloads the picture according to the picture URL path included in the returned page, parses the picture with JavaScript, extracts the cookie value, and saves it so that the cookie value is carried in the HTTP request when the browser accesses other pages.
  • Suppose the browser accesses page 11 and carries the parsed cookie value in the HTTP request. After receiving the HTTP request, the server determines whether the cookie value is valid. If it is valid, access to page 11 is allowed; if it is invalid, the first page of the current category is returned to the browser.
  • The directly accessible pages are cached on a CDN (Content Delivery Network) server, and when the client requests one of the directly accessible pages, the CDN server returns the requested page to the client.
  • CDN technology forms a layer of intelligent virtual network on top of the existing Internet by placing CDN servers throughout the network. Usually, a large amount of data can be cached on the CDN server. When a user accesses the stored content, the CDN server can provide the data directly, giving the user a fast response. In this way, crawler traffic is directed to the CDN servers of each province and city, thereby protecting the server and ensuring normal access by users.
  • the cookie value generated by the server side changes every 10 minutes, that is, the cookie value is valid for 10 minutes.
  • The server takes the first 11 digits of the current timestamp, for example 20160101081, which denotes the 10 minutes from 8:10 to 8:19 on January 1, 2016. The string formed by merging 20160101081 with the current first key is then subjected to an md5 message digest operation to obtain the current cookie value.
  • Likewise, the string formed by merging 20160101081 with the current second key is subjected to the md5 message digest operation to obtain the picture name.
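  • The truncation arithmetic in this example can be checked directly. The sketch assumes the timestamp is written as yyyyMMddHHmm, so dropping the last digit groups minutes into blocks of ten; `window_prefix` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def window_prefix(moment):
    """First 11 digits of the yyyyMMddHHmm timestamp: date, hour, tens of minutes."""
    return moment.strftime("%Y%m%d%H%M")[:11]

start = datetime(2016, 1, 1, 8, 10)
# Every minute from 08:10 to 08:19 falls into the same window.
in_window = {window_prefix(start + timedelta(minutes=m)) for m in range(10)}
# 08:20 starts the next window.
next_window = window_prefix(start + timedelta(minutes=10))
```

  • Here `in_window` contains the single value `"20160101081"`, while `next_window` is `"20160101082"`.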
  • The server puts the obtained cookie value into the description information of the picture, generates a new picture, and saves it under the obtained picture name; the server then saves the picture URL path containing the picture name to the requested page.
  • the description information of the picture includes, but is not limited to, the time of photographing, the resolution of the photo, the type of the camera, and the like.
  • the new image named after the image name contains the cookie value.
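  • One way to carry a value in a picture's descriptive metadata is a PNG `tEXt` chunk. The sketch below is an assumption-laden stand-in: the source does not name an image format, and it describes the browser parsing the picture with a JS method, whereas this example uses Python for both writing and reading. The function names and the `Description` keyword are choices made for the example.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def _chunk(kind, data):
    """Build one PNG chunk: length, type, data, CRC-32 over type and data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)

def make_png_with_text(keyword, text):
    """Return a 1x1 grayscale PNG carrying `text` in a tEXt chunk."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)  # 1x1, 8-bit grayscale
    pixels = zlib.compress(b"\x00\x00")                  # filter byte + one pixel
    body = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return (PNG_SIGNATURE + _chunk(b"IHDR", ihdr) + _chunk(b"tEXt", body)
            + _chunk(b"IDAT", pixels) + _chunk(b"IEND", b""))

def read_png_text(png_bytes, keyword):
    """Walk the chunks and return the tEXt value stored under `keyword`."""
    pos = len(PNG_SIGNATURE)
    while pos < len(png_bytes):
        (length,) = struct.unpack(">I", png_bytes[pos:pos + 4])
        kind = png_bytes[pos + 4:pos + 8]
        data = png_bytes[pos + 8:pos + 8 + length]
        if kind == b"tEXt":
            key, _, value = data.partition(b"\x00")
            if key.decode("latin-1") == keyword:
                return value.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return None
```

  • A client that downloads the picture can recover the value with `read_png_text(png_bytes, "Description")`; a client that never fetches the picture has no way to obtain it from the page alone.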
  • Embodiment 1: in one embodiment,
  • the browser sends an HTTP request to the server to request the first page of the current classification;
  • the server generates the URL path of the picture that contains the cookie value and saves it to the first page;
  • the server is preset with a directly accessible page range of pages 1-10; the server determines that the first page falls within that range and therefore returns the first page, containing the picture URL path, to the browser;
  • the browser automatically downloads the picture according to the picture URL path included in the first page of the current classification, parses the picture with the JS method, extracts the cookie value, and saves it; the cookie value is carried when turning pages.
  • the browser sends an HTTP request carrying the cookie value to the server, requesting page 10 of the current classification;
  • the server generates the URL path of the picture that contains the cookie value and saves it to page 10; because the request falls within the 10-minute valid time, the cookie value generated by the server is the same as the cookie value carried in the HTTP request;
  • the server is preset with a directly accessible page range of pages 1-10, and the server determines that page 10 falls within that range. It is therefore not necessary to determine whether the cookie value is valid at this point, and page 10, containing the picture URL path, is returned directly to the browser.
  • the browser automatically downloads the picture according to the picture URL path included in page 10 of the current classification, parses the picture with the JS method, extracts the cookie value, and saves it; the cookie value is carried when turning pages.
  • the browser sends an HTTP request carrying the cookie value to the server, requesting page 11 of the current classification;
  • the server generates the URL path of the picture that contains the cookie value and saves it to page 11; because the request falls within the 10-minute valid time, the cookie value generated by the server is the same as the cookie value carried in the HTTP request;
  • the server is preset with a directly accessible page range of pages 1-10, and the server determines that page 11 does not fall within that range. It therefore further determines whether the cookie value is valid.
  • As explained above, the request falls within the 10-minute valid time, so the cookie value generated by the server is the same as the cookie value carried in the HTTP request. The cookie value is therefore determined to be valid, and page 11, containing the picture URL path, is returned to the browser.
  • the browser automatically downloads the picture according to the picture URL path included in page 11 of the current classification, parses the picture with the JS method, extracts the cookie value, and saves it; the cookie value is carried when turning pages.
  • Embodiment 2: in another embodiment,
  • the browser sends an HTTP request to the server to request page 10 of the current classification;
  • the server generates the URL path of the picture that contains the cookie value and saves it to page 10;
  • the server is preset with a directly accessible page range of pages 1-10, and the server determines that page 10 falls within that range. Therefore, although the HTTP request carries no cookie value at this point, page 10, containing the picture URL path, is returned directly to the browser.
  • the browser automatically downloads the picture according to the picture URL path included in page 10 of the current classification, parses the picture with the JS method, extracts the cookie value, and saves it; the cookie value is carried when turning pages.
  • Embodiment 3: in another embodiment,
  • the browser sends an HTTP request to the server to request page 11 of the current classification;
  • the server generates the URL path of the picture that contains the cookie value and saves it to page 11;
  • the server determines that page 11 is not within the directly accessible range and therefore further determines whether the HTTP request carries a cookie value. Because the browser opened the link directly, the HTTP request carries no cookie value, so the server returns the first page of the current classification to the browser.
  • the crawler sends an HTTP request to the server to request the first page of the current classification;
  • the server generates the URL path of the picture that contains the cookie value and saves it to the first page;
  • the server is preset with a directly accessible page range of pages 1-10; the server determines that the first page falls within that range and therefore returns the first page, containing the picture URL path, to the crawler;
  • In the prior art, a crawler does not download pictures, nor does it use the JS method to parse them, because doing so would greatly increase the cost of crawling, including CPU and bandwidth costs. The crawler therefore does not extract the cookie value from the picture as a browser does and does not carry it when accessing other pages, so it is intercepted by the server.
  • the crawler sends an HTTP request to the server to request page 11 of the current classification;
  • the server generates the URL path of the picture that contains the cookie value and saves it to page 11;
  • the server determines that page 11 is not within the directly accessible range and therefore further determines whether the HTTP request carries a cookie value. Because the HTTP request sent by the crawler cannot carry a cookie value, the server returns the first page of the current classification to the crawler.
  • the web crawler can only capture a limited number of pages, ensuring normal access of the browser.
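  • The browser and crawler walkthroughs above can be replayed with a small simulation. It is a toy model under explicit assumptions: the cookie value is fixed for the whole run (all requests fall in one valid time), the picture download and JS parsing are collapsed into reading one field of the response, and the `Server` and `Client` classes are hypothetical.

```python
class Server:
    """Toy server: one fixed cookie value, pages 1-10 directly accessible."""

    def __init__(self, cookie):
        self.cookie = cookie

    def handle(self, page, request_cookie=None):
        allowed = 1 <= page <= 10 or request_cookie == self.cookie
        served = page if allowed else 1
        # Every returned page references the picture that carries the cookie value.
        return {"page": served, "picture_cookie": self.cookie}


class Client:
    def __init__(self, server, loads_pictures):
        self.server = server
        self.loads_pictures = loads_pictures  # True for a browser, False for a crawler
        self.cookie = None

    def visit(self, page):
        response = self.server.handle(page, self.cookie)
        if self.loads_pictures:
            # Stands in for downloading the picture and parsing it with JS.
            self.cookie = response["picture_cookie"]
        return response["page"]
```

  • A browser that visits pages 1, 10, and 11 in turn receives each of them, a crawler receives page 1 whenever it asks for page 11 or beyond, and a browser that opens page 11 directly with no prior visit also receives page 1, matching Embodiment 3.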
  • An embodiment of the present invention also provides a device for intercepting a crawler, which is applied to a server, as shown in FIG. 3.
  • The device includes:
  • a generating and saving unit 301 that, after receiving an access request for a page sent by the client, generates a current field value for identifying a crawler, generates a picture attribute value for the picture in which the field value is saved, and saves the picture uniform resource locator (URL) path containing the picture attribute value to the requested page;
  • a processing unit 302 that determines whether the page currently being accessed is a directly accessible page, and if so, returns the requested page to the client; if not, it further determines whether the access request includes a valid field value for identifying a crawler. If the field value is valid, the requested page is returned to the client; if the request contains no such field value, or the included field value is invalid, the client is confirmed to be a crawler, and the first page of the category being accessed is returned to the client.
  • The invention also proposes a device for intercepting a crawler, which is applied to a client that is a browser, comprising:
  • a download unit that downloads the picture to the browser according to the picture URL path included in the page returned by the server;
  • an extracting unit that parses the picture, extracts the field value for identifying a crawler, and carries that field value in the access request when the browser accesses other pages.
  • Referring now to FIG. 4, there is shown a block diagram of a computer system 400 suitable for implementing a terminal device in accordance with an embodiment of the present invention.
  • The terminal device shown in FIG. 4 is just an example and does not limit the function or scope of use of the embodiments of the present invention.
  • The computer system 400 includes a central processing unit (CPU) 401 that can perform various appropriate actions and processes according to a program stored in read only memory (ROM) 402 or a program loaded from storage portion 408 into random access memory (RAM) 403.
  • In the RAM 403, various programs and data required for the operation of the system 400 are also stored.
  • the CPU 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also coupled to bus 404.
  • The following components are connected to the I/O interface 405: an input portion 406 including a keyboard, a mouse, and the like; an output portion 407 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like; a storage portion 408 including a hard disk or the like; and a communication portion 409 including a network interface card such as a LAN card, a modem, or the like. The communication portion 409 performs communication processing via a network such as the Internet.
  • A drive 410 is also coupled to the I/O interface 405 as needed.
  • a removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like is mounted on the drive 410 as needed so that a computer program read therefrom is installed into the storage portion 408 as needed.
  • embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for executing the method illustrated in the flowchart.
  • the computer program can be downloaded and installed from the network via the communication portion 409, and/or installed from the removable medium 411.
  • the computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium or any combination of the two.
  • the computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • More specific examples of computer readable storage media may include, but are not limited to, electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain or store a program, which can be used by or in connection with an instruction execution system, apparatus or device.
  • a computer readable signal medium may include a data signal that is propagated in the baseband or as part of a carrier, in which computer readable program code is carried. Such propagated data signals can take a variety of forms including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer readable signal medium can also be any computer readable medium other than a computer readable storage medium, which can transmit, propagate, or transport a program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium can be transmitted by any suitable medium, including but not limited to wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
  • Each block of the flowchart or block diagrams can represent a module, a program segment, or a portion of code that includes one or more executable instructions.
  • the functions noted in the blocks may also occur in a different order than that illustrated in the drawings. For example, two successively represented blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments of the present invention may be implemented by software or by hardware.
  • the described units can also be provided in a processor; for example, it can be described that a processor includes a generation and saving unit and a processing unit.
  • the name of these units does not constitute a limitation on the unit itself in some cases.
  • the generation and storage unit may also be described as “generating the current identification for the access request after receiving the access page sent by the client.
  • the unit of the crawler's field value may be described as “generating the current identification for the access request after receiving the access page sent by the client.
  • the present invention also provides a computer readable medium, which may be included in the apparatus described in the above embodiments, or may be separately present and not incorporated in the apparatus.
  • the computer readable medium carries one or more programs.
  • the device includes: after the server receives the access request of the access page sent by the client, generating the current use.
  • the crawler traffic is directed to the CDN server of each province and city, thereby further protecting the server and ensuring that users can access normally.

Abstract

Proposed are a crawler interception method and device, a server and a medium. The method comprises: after receiving an access request, sent by a client, for accessing a page, the server generates a current field value for recognizing a crawler and generates a picture attribute value for saving the field value in a picture; the server saves a picture uniform resource locator (URL) path that contains the picture attribute value in the requested page; the server determines whether the page to be accessed is a page for which direct access is allowed; if so, it returns the requested page to the client; if not, it further determines whether the access request contains a valid field value for recognizing the crawler; if there is a valid field value, it returns the requested page to the client; and if the request contains no field value for recognizing the crawler, or the contained field value is invalid, the server treats the requester as a crawler and returns the first page of the category of the page to be accessed to the client. By means of the present invention, crawler access can be effectively intercepted.

Description

Method, device, server terminal and computer readable medium for intercepting a crawler
Technical field
The present invention relates to network technologies, and in particular to a method, a device, a server terminal, and a computer readable medium for intercepting a crawler.
Background
Web crawlers are a fundamental component of search engine technology. A web crawler starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs on those pages. While crawling web page information, it continuously extracts new URLs from the current page according to its crawling strategy and puts them in a queue, until some stop condition is met. The crawled web page information is then stored on the search engine's servers.
In the prior art, in order to protect access by normal users, some websites intercept access from web crawlers by filtering client IP addresses or by filtering on a specific User-Agent header of the HTTP request. However, when traffic is very heavy and many normal users share one IP address, these normal users are mistaken for web crawlers and filtered out. On the other hand, according to the HTTP protocol specification, the value of the User-Agent header can be set arbitrarily, so many web crawlers set their User-Agent header to match that of an ordinary browser in order to evade filtering. As a result, web crawlers are not intercepted efficiently.
Summary of the invention
An object of the embodiments of the present invention is to provide a method, a device, a server terminal, and a computer readable medium for intercepting a crawler, which can effectively intercept crawler access.
To achieve the above object, an embodiment of the present invention provides a method for intercepting a crawler, the method comprising:
after receiving an access request, sent by a client, for accessing a page, the server generates the current field value for identifying a crawler and generates a picture attribute value for saving the field value into a picture; the server saves a picture uniform resource locator (URL) path containing the picture attribute value into the requested page;
the server determines whether the page to be accessed is a page for which direct access is allowed; if so, it returns the requested page to the client; if not, it further determines whether the access request contains a valid field value for identifying the crawler; if the field value is valid, it returns the requested page to the client; if the request contains no field value for identifying the crawler, or the contained field value is invalid, the requester is confirmed to be a crawler and the first page of the category of the page to be accessed is returned to the client.
To achieve the above object, an embodiment of the present invention further provides a device for intercepting a crawler. The device is applied on the server side and comprises:
a generation and saving unit which, after receiving an access request, sent by a client, for accessing a page, generates the current field value for identifying a crawler and generates a picture attribute value for saving the field value into a picture, and saves a picture uniform resource locator (URL) path containing the picture attribute value into the requested page;
a processing unit which determines whether the page to be accessed is a page for which direct access is allowed; if so, returns the requested page to the client; if not, further determines whether the access request contains a valid field value for identifying the crawler; if the field value is valid, returns the requested page to the client; and if the request contains no field value for identifying the crawler, or the contained field value is invalid, confirms that the requester is a crawler and returns the first page of the category of the page to be accessed to the client.
To achieve the above object, an embodiment of the present invention further provides a device for intercepting a crawler. The device is applied on a client that is a browser and comprises:
a download unit which downloads the picture to the browser according to the picture URL path contained in the page returned by the server;
an extraction unit which parses the picture, extracts the field value for identifying a crawler from it, and saves that value so that the browser can carry it in the access request when accessing other pages.
To achieve the above object, an embodiment of the present invention further provides a server terminal, the server terminal comprising:
one or more processors;
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for intercepting a crawler of the embodiments of the present invention.
To achieve the above object, an embodiment of the present invention further provides a computer readable medium on which a computer program is stored, the program implementing the method for intercepting a crawler of the embodiments of the present invention when executed by a processor.
In summary, in the method, device, server terminal, and computer readable medium for intercepting a crawler provided by the embodiments of the present invention, after receiving an access request, sent by a client, for accessing a page, the server generates the current field value for identifying a crawler and generates a picture attribute value for saving the field value into a picture; it saves a picture uniform resource locator (URL) path containing the picture attribute value into the requested page; the server determines whether the page to be accessed is a page for which direct access is allowed; if so, it returns the requested page to the client; if not, it further determines whether the access request contains a valid field value for identifying the crawler; if the field value is valid, it returns the requested page to the client; if the request contains no field value for identifying the crawler, or the contained field value is invalid, the requester is confirmed to be a crawler and the first page of the category of the page to be accessed is returned to the client. The present invention thus relies on two characteristics of crawlers: they do not execute JavaScript (JS) methods, and they do not download the pictures in a web page. The server saves the generated field value for identifying a crawler, a cookie value, into a picture, and a crawler does not download that picture. Applying the present invention therefore effectively raises the crawler interception rate, reduces the load on the server, and keeps the website stable under high concurrency, while access by normal users is not intercepted.
Brief description of the drawings
FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present invention may be applied.
FIG. 2 is a schematic flowchart of a method for intercepting a crawler according to an embodiment of the present invention.
FIG. 3 is a schematic structural diagram of a device for intercepting a crawler that applies the above method, according to a specific embodiment of the present invention.
FIG. 4 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server of an embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
FIG. 1 shows an exemplary system architecture 100 to which the crawler interception method or crawler interception device of the present application can be applied.
As shown in FIG. 1, the system architecture 100 can include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, fiber optic cables, and the like.
A user can use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104, for example to receive or send messages. Various communication client applications can be installed on the terminal devices 101, 102, 103, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients, and social platform software (examples only).
The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablets, laptop computers, and desktop computers.
The server 105 may be a server that provides various services, such as a background management server (example only) that supports a shopping website browsed by users on the terminal devices 101, 102, 103. The background management server may analyze and otherwise process received data such as product information query requests, and feed the processing results (for example, targeted push information or product information; examples only) back to the terminal device.
It should be noted that the crawler interception method provided by the embodiments of the present invention is generally executed by the server 105; accordingly, the crawler interception device is generally provided in the server 105.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers, depending on implementation needs.
To ensure normal access by browsers while effectively intercepting crawlers, the present invention relies on two characteristics of crawlers: they do not execute JS methods, and they do not download the pictures in a web page. The server saves the generated field value for identifying a crawler, a cookie value, into a picture. Because a crawler does not download the picture, the access requests it sends to the server do not carry the cookie value. Whether an access request carries the cookie value therefore distinguishes crawler requests from browser requests, which ultimately allows crawlers to be intercepted effectively.
An embodiment of the present invention discloses a method for intercepting a crawler, comprising the following steps. A schematic flowchart is shown in FIG. 2.
Step 21: After receiving an access request, sent by a client, for accessing a page, the server generates the current field value for identifying a crawler and generates a picture attribute value for saving the field value into a picture; it saves a picture URL path containing the picture attribute value into the requested page.
Here, the field value for identifying a crawler may be a cookie value, and the picture attribute value may be a picture name. Put simply, after receiving an access request for a page from the client, for example an HTTP request, the server generates a cookie value and a picture name, and then saves a picture URL path containing that picture name into the requested page. Specifically:
The method by which the server generates the current cookie value for identifying a crawler includes: the server selects part of the current timestamp according to the validity period of the cookie value; it then applies an encryption operation, for example an md5 message digest operation, to the string formed by concatenating the selected timestamp value with the currently configured first key, obtaining the current cookie value.
The method by which the server generates the picture name includes: the server selects part of the current timestamp according to the validity period of the cookie value; it then applies an encryption operation, for example an md5 message digest operation, to the string formed by concatenating the selected timestamp value with the currently configured second key, obtaining the name of the picture.
It should be noted that there are many ways to generate the cookie value and the picture name, including but not limited to the above. Because the cookie value in the present invention is time-limited, its generation depends on the timestamp; other methods that derive the cookie value and picture name from a timestamp also fall within the protection scope of the present invention.
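The two generation methods above can be sketched as follows. This is only an illustrative sketch: the key strings, the function names, and the choice to truncate a `YYYYMMDDHHMM` timestamp to its first 11 digits (a 10-minute window, as in the specific scenario described later) are assumptions for the example, not details fixed by the method.

```python
import hashlib
from datetime import datetime

# Hypothetical secrets: the description only says that a "first key" and a
# "second key" are configured on the server.
FIRST_KEY = "first-secret"
SECOND_KEY = "second-secret"


def time_bucket(now: datetime) -> str:
    """Select part of the timestamp: the first 11 digits of YYYYMMDDHHMM,
    so all instants in the same 10-minute window map to the same string."""
    return now.strftime("%Y%m%d%H%M")[:11]


def md5_hex(text: str) -> str:
    """md5 message digest of a string, as a 32-character hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def make_cookie_value(now: datetime) -> str:
    """Digest of (selected timestamp + first key)."""
    return md5_hex(time_bucket(now) + FIRST_KEY)


def make_image_name(now: datetime) -> str:
    """Digest of (selected timestamp + second key)."""
    return md5_hex(time_bucket(now) + SECOND_KEY)
```

Because the two keys differ, the picture name that appears in the page does not reveal the cookie value stored inside the picture. md5 is used here only because the description names it as an example; it is a digest rather than an encryption algorithm.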
A URL is an identification method used to fully describe the addresses of web pages and other resources on the Internet; correspondingly, every web page on the Internet has a unique URL. When a client needs to access a web page on the server, it must first obtain the URL of that page.
In this embodiment, the HTTP request sent by the client to access a page carries the URL path information of that page. It should be noted that the picture URL path is in turn saved in the page; where exactly it is saved can be decided by the specific implementation. In one embodiment, the picture URL path may be saved in an image tag of the page.
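As a rough illustration of saving the picture URL path in an image tag, the sketch below inserts an `<img>` element just before `</body>`. The `/img/` base path, the `.png` extension, and the function name are hypothetical; the description leaves the exact location and format to the implementation.

```python
from html import escape


def inject_image_tag(page_html: str, image_name: str, base_url: str = "/img/") -> str:
    """Return page_html with an <img> tag whose src contains image_name.

    base_url and the ".png" suffix are assumptions made for this example.
    """
    src = escape(base_url + image_name + ".png", quote=True)
    tag = f'<img src="{src}" alt="">'
    marker = "</body>"
    if marker in page_html:
        # Put the tag at the end of the body so it does not disturb the layout.
        return page_html.replace(marker, tag + marker, 1)
    # No closing body tag: append the image tag at the end of the page.
    return page_html + tag
```

A browser rendering the returned page fetches the picture automatically, whereas a crawler that only extracts links from the HTML never requests it.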
Step 22: The server determines whether the page to be accessed is a page for which direct access is allowed; if so, it returns the requested page to the client; if not, it further determines whether the access request contains a valid field value for identifying the crawler; if the field value is valid, it returns the requested page to the client; if the request contains no field value for identifying the crawler, or the contained field value is invalid, the requester is confirmed to be a crawler and the first page of the category of the page to be accessed is returned to the client.
Here, the method by which the server determines whether the page to be accessed is a page for which direct access is allowed includes: the server is preconfigured with a range of pages for which direct access is allowed; the server checks whether the page to be accessed falls within that range, and if it does, direct access to the page is allowed.
The method by which the server determines whether the HTTP request contains a valid cookie value includes: the server compares the cookie value it generated itself with the cookie value carried in the HTTP request; if the two are equal, the cookie value carried in the HTTP request is judged to be valid. Clearly, if the two are not equal, the cookie value is invalid.
It should be noted that, in the present invention, to prevent imitation by crawlers, the cookie value generated by the server changes every predetermined period. Put the other way round, if the predetermined period is 10 minutes, then within each 10-minute window the cookie value generated by the server stays the same. The server returns the page carrying that cookie value to the client, so as long as the client is a browser, it can parse out the cookie value and carry it in the next HTTP request sent to the server. As long as that request falls within the same 10-minute window, the cookie value received by the server matches the cookie value the server generates itself, which shows that the cookie value is valid. If, in the next 10-minute window, the client still sends an HTTP request carrying the earlier cookie value, the server has by then generated a new cookie value, so the received cookie value no longer matches the one the server generates itself, which shows that the cookie value is invalid.
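The validity check over successive time windows can be illustrated with the sketch below. The key, the 10-minute window, and the function names are assumptions for the example; the comparison itself follows the description, in that the server regenerates the current value and compares it with the one presented.

```python
import hashlib
from datetime import datetime
from typing import Optional

KEY = "first-secret"  # hypothetical server-side key


def current_cookie(now: datetime) -> str:
    """Cookie value the server would generate at time `now` (10-minute window)."""
    bucket = now.strftime("%Y%m%d%H%M")[:11]
    return hashlib.md5((bucket + KEY).encode("utf-8")).hexdigest()


def is_valid(presented: Optional[str], now: datetime) -> bool:
    """A presented value is valid only if it equals the value for the current window."""
    return presented is not None and presented == current_cookie(now)


# A value issued at 08:12 is still accepted at 08:18, in the same window,
# but no longer at 08:21, when the server has moved on to a new value.
issued = current_cookie(datetime(2016, 1, 1, 8, 12))
same_window = is_valid(issued, datetime(2016, 1, 1, 8, 18))
next_window = is_valid(issued, datetime(2016, 1, 1, 8, 21))
```

One consequence of comparing only against the current window, at least in this sketch, is that a value issued shortly before a window boundary stops matching as soon as the window rolls over; the browser then receives a fresh value with the next page it loads.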
If the client is a crawler, the server likewise saves the picture URL path into the requested page after receiving the crawler's HTTP request. The server then determines whether the page to be accessed is a page for which direct access is allowed, and if so, returns the requested page to the crawler. This is because, in practice, crawlers are generally allowed to access a limited number of pages; in one embodiment these may be pages 1-10 of the same category. If the server determines that the page to be accessed is not a page for which direct access is allowed, for example because the crawler requests page 11, it further determines whether the HTTP request contains a valid cookie value. The check finds that the crawler's HTTP request carries no cookie value, so the crawler's request is intercepted and page 1 of the current category is returned to the crawler. In this way, the crawler always obtains page 1 of the current category and cannot obtain further pages.
If the client is a browser, the server saves the picture URL path into the requested page after receiving the browser's HTTP request. The server then determines whether the page to be accessed is a page for which direct access is allowed, and if so, returns the requested page to the browser. The browser then downloads the picture according to the picture URL path contained in the page returned by the server, parses the picture with a JavaScript method, extracts the cookie value from it, and saves it so that it can be carried in the HTTP request when the browser accesses other pages. Suppose the browser accesses page 11 and its HTTP request carries the parsed cookie value. After receiving the HTTP request, the server determines whether the cookie value is valid; if it is valid, access to page 11 is allowed, and if it is invalid, page 1 of the current category is returned to the browser.
In addition, to further relieve the load on the server, the present invention caches the pages for which direct access is allowed on CDN (Content Delivery Network) servers. When a client requests one of the pages for which direct access is allowed, the CDN server returns the requested page to the client. CDN technology places CDN servers throughout the network to form a layer of intelligent virtual network on top of the existing Internet. A CDN server can usually cache a large amount of data, and when a user accesses content that is already stored, the CDN server can provide the data to the user directly and respond quickly. In this way, crawler traffic is directed to the CDN servers in the various provinces and cities, which protects the server and ensures that users can access the site normally.
To illustrate the present invention clearly, specific scenarios are described below.
In this embodiment, assume that the cookie value generated by the server changes every 10 minutes, that is, the validity period of the cookie value is 10 minutes. After receiving an HTTP request, sent by the client, for accessing a page, the server takes the first 11 digits of the current timestamp, for example 20160101081, which denotes the 10 minutes from 8:10 to 8:19 on January 1, 2016. The server therefore applies the md5 message digest operation to the string formed by concatenating 20160101081 with the current first key to obtain the current cookie value, and applies the md5 message digest operation to the string formed by concatenating 20160101081 with the current second key to obtain the name of the picture. The server puts the resulting cookie value into the description information of a picture, generates a new picture, and saves the new picture under the resulting picture name. The server then saves the picture URL path containing the picture name into the requested page. Here, the description information of a picture includes, but is not limited to, the time the photo was taken, the resolution of the photo, the type of camera, and so on. The new picture named with that picture name thus contains the cookie value.
Embodiment 1. In one implementation:
1) The browser sends an HTTP request to the server, requesting page 1 of the current category;
the server generates a picture URL path for the picture containing the cookie value and saves it into page 1;
the server is preconfigured with pages 1-10 as the range of pages for which direct access is allowed; the server determines that page 1 is within the direct access range and therefore returns page 1, containing the picture URL path, to the browser;
the browser automatically downloads the picture according to the picture URL path contained in the returned page 1 of the current category, parses the picture with a JS method, extracts the cookie value from it, and saves it; the cookie value is carried on subsequent page turns.
2) The browser sends an HTTP request carrying the cookie value to the server, requesting page 10 of the current category;
the server generates a picture URL path for the picture containing the cookie value and saves it into page 10; because this happens within the 10-minute validity period, the cookie value generated by the server at this point is the same as the cookie value carried in the HTTP request;
the server is preconfigured with pages 1-10 as the range of pages for which direct access is allowed; the server determines that page 10 is within the direct access range, so it does not need to check whether the cookie value is valid and directly returns page 10, containing the picture URL path, to the browser;
the browser automatically downloads the picture according to the picture URL path contained in the returned page 10 of the current category, parses the picture with a JS method, extracts the cookie value from it, and saves it; the cookie value is carried on subsequent page turns.
3) The browser sends an HTTP request carrying the cookie value to the server, requesting page 11 of the current category;
the server generates a picture URL path for the picture containing the cookie value and saves it into page 11; because this happens within the 10-minute validity period, the cookie value generated by the server at this point is the same as the cookie value carried in the HTTP request;
the server is preconfigured with pages 1-10 as the range of pages for which direct access is allowed; the server determines that page 11 is not within the direct access range and therefore further checks whether the cookie value is valid; as explained above, because the request falls within the 10-minute validity period, the cookie value generated by the server is the same as the cookie value carried in the HTTP request, so the cookie value is judged valid and page 11, containing the picture URL path, is returned to the browser;
the browser automatically downloads the picture according to the picture URL path contained in the returned page 11 of the current category, parses the picture with a JS method, extracts the cookie value from it, and saves it; the cookie value is carried on subsequent page turns.
Normal access by the browser is thereby achieved.
Embodiment 2. In another implementation:
If the browser receives a link pointing to page 10 of a category, then:
the browser sends an HTTP request to the server, requesting page 10 of the current category;
the server generates a picture URL path for the picture containing the cookie value and saves it into page 10;
the server is preconfigured with pages 1-10 as the range of pages for which direct access is allowed; the server determines that page 10 is within the direct access range, so although the HTTP request carries no cookie value at this point, the server still directly returns page 10, containing the picture URL path, to the browser;
the browser automatically downloads the picture according to the picture URL path contained in the returned page 10 of the current category, parses the picture with a JS method, extracts the cookie value from it, and saves it; the cookie value is carried on subsequent page turns.
实施例三,在另一个实施方式中,Embodiment 3, in another embodiment,
如果浏览器接收到指向分类第11页的链接,则,If the browser receives a link to the 11th page of the category,
浏览器向服务器端发送HTTP请求,请求当前分类第11页;The browser sends an HTTP request to the server to request the current classification page 11;
服务器端生成包含cookie值的图片URL路径,保存到第11页中;The server generates a picture URL path containing the cookie value and saves it to page 11;
服务器端判断第11页不属于直接访问范围,因此,进一步判断HTTP请求中是否带有cookie值,由于是浏览器直接接收到的链接,所以HTTP请求中并不带有cookie值,因此,向浏览器返回当前分类第一页。The server side judges that page 11 is not a direct access range. Therefore, it is further determined whether the HTTP request has a cookie value. Since the link is directly received by the browser, the HTTP request does not have a cookie value, so the browsing is performed. Returns the first page of the current classification.
接下来,如果要继续访问其他页面,可以重复实施例一中的操作,实现页面的正常访问。Next, if you want to continue to access other pages, you can repeat the operation in the first example to achieve normal access to the page.
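The server-side decision that Embodiments 1 to 3 walk through can be sketched as a single function. This is a minimal illustration, not the patented implementation: the function name `page_to_return` and the representation of a missing cookie as `None` are assumptions, and the directly accessible range of pages 1-10 is taken from the examples above.

```python
from typing import Optional

DIRECT_ACCESS_PAGES = range(1, 11)  # pages 1-10 may be requested without a cookie


def page_to_return(requested_page: int,
                   request_cookie: Optional[str],
                   server_cookie: str) -> int:
    """Return the number of the page the server sends back for this request."""
    # Pages in the directly accessible range are returned unconditionally.
    if requested_page in DIRECT_ACCESS_PAGES:
        return requested_page
    # Outside that range, the request must carry a cookie value equal to
    # the one the server generates for the current validity period.
    if request_cookie is not None and request_cookie == server_cookie:
        return requested_page
    # Missing or invalid cookie value: fall back to the first page.
    return 1
```

For example, `page_to_return(10, None, "abc")` corresponds to Embodiment 2 and `page_to_return(11, None, "abc")` to Embodiment 3.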
Embodiment 4
In another embodiment,
The crawler sends an HTTP request to the server, requesting the first page of the current category;
The server generates an image URL path containing the cookie value and saves it into the first page;
The server is preconfigured with a range of directly accessible pages, pages 1-10. The server determines that page 1 is within the directly accessible range and therefore returns page 1, which contains the image URL path, to the crawler;
Note that crawlers in the prior art do not download images and do not parse them with a JS method, because doing so would greatly increase the crawler's cost, including CPU and bandwidth. A crawler therefore does not extract the cookie value from the image and carry it when accessing other pages, as a browser does, and so it is intercepted by the server.
Embodiment 5
In another embodiment,
The crawler sends an HTTP request to the server, requesting page 11 of the current category;
The server generates an image URL path containing the cookie value and saves it into page 11;
The server determines that page 11 is outside the directly accessible range and therefore goes on to check whether the HTTP request carries a cookie value. Because an HTTP request sent by the crawler cannot carry the cookie value, the server returns the first page of the current category to the crawler.
It can be seen that, with the solution of the present invention, a web crawler can fetch only a limited number of pages, while normal browser access is ensured.
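The claim that a crawler can fetch only a limited number of pages can be checked with a small simulation. This is a sketch under stated assumptions: the names `serve` and `pages_obtained` are hypothetical, cookie validity is reduced to a boolean, and the directly accessible limit of 10 pages comes from the embodiments above.

```python
from typing import Set


def serve(requested_page: int, has_valid_cookie: bool, direct_limit: int = 10) -> int:
    """Return the page actually served for a request."""
    if requested_page <= direct_limit or has_valid_cookie:
        return requested_page
    return 1  # intercepted: the first page of the category is returned instead


def pages_obtained(total_pages: int, has_valid_cookie: bool) -> Set[int]:
    """Collect the distinct pages a client receives when it requests every page."""
    return {serve(p, has_valid_cookie) for p in range(1, total_pages + 1)}
```

A client that never carries a valid cookie value, as in Embodiments 4 and 5, ends up with only the directly accessible pages no matter how many pages it requests, while a client that carries the value receives all of them.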
Based on the same inventive concept, an embodiment of the present invention further provides an apparatus for intercepting crawlers, applied on the server side, as shown in FIG. 3. The apparatus includes:
a generating and saving unit 301, which, after receiving an access request for a page sent by a client, generates the current field value used to identify crawlers, generates an image attribute value under which the field value is saved in an image, and saves the image Uniform Resource Locator (URL) path containing the image attribute value into the requested page;
a processing unit 302, which determines whether the page currently requested is a directly accessible page; if so, returns the requested page to the client; if not, further determines whether the access request contains a valid field value used to identify crawlers; if it contains a valid field value, returns the requested page to the client; if it contains no such field value, or the field value it contains is invalid, identifies the client as a crawler and returns the first page of the requested page's category to the client.
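The last step performed by the generating and saving unit, saving an image URL path into the requested page, might look like the following sketch. It assumes the variant in the claims where the image attribute value is the image name derived from a selected timestamp and a second key. Everything else is an assumption for illustration: SHA-256 stands in for the unspecified encryption operation, `https://img.example.com/` is a placeholder host, the `.png` extension is arbitrary, and the names `make_image_name` and `embed_image_url` are hypothetical.

```python
import hashlib
from html import escape


def make_image_name(selected_timestamp: int, second_key: str) -> str:
    """Derive an image name from the selected timestamp and the second key."""
    digest = hashlib.sha256(
        f"{selected_timestamp}{second_key}".encode("utf-8")
    ).hexdigest()
    return f"{digest}.png"


def embed_image_url(page_html: str, image_name: str,
                    base_url: str = "https://img.example.com/") -> str:
    """Insert an <img> tag that points at the image into the page markup."""
    tag = f'<img src="{escape(base_url + image_name, quote=True)}" alt="">'
    if "</body>" in page_html:
        # Place the tag just before the closing body tag.
        return page_html.replace("</body>", tag + "</body>", 1)
    return page_html + tag
```

A browser that renders the returned page fetches the image automatically, whereas a crawler that only reads the markup does not, which is the asymmetry the method relies on.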
The present invention further provides an apparatus for intercepting crawlers, applied to a client that is a browser, including:
a download unit, which downloads the image to the browser according to the image URL path contained in the page returned by the server;
an extraction unit, which parses the image, extracts from it the field value used to identify crawlers, and saves that value so that the browser can carry it in the access request when accessing other pages.
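The source does not say how the field value is encoded inside the image, so the extraction step can only be sketched under an explicit assumption. The sketch below assumes, purely for illustration, that the value sits in a PNG `tEXt` chunk under the keyword `token`; in a real browser this step would be JavaScript, and Python is used here only to show the parsing logic. The names `make_chunk` and `extract_field_value` are hypothetical.

```python
import struct
import zlib
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type and data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def extract_field_value(png_bytes: bytes, keyword: str = "token") -> Optional[str]:
    """Return the text stored under `keyword` in a tEXt chunk, or None if absent."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        return None
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png_bytes):
        (length,) = struct.unpack(">I", png_bytes[pos:pos + 4])
        chunk_type = png_bytes[pos + 4:pos + 8]
        data = png_bytes[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            # A tEXt chunk holds a Latin-1 keyword, a null byte, then the text.
            key, _, text = data.partition(b"\x00")
            if key.decode("latin-1") == keyword:
                return text.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return None
```

The extracted value would then be saved and sent with later page requests, for instance in a `Cookie` request header.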
Referring now to FIG. 4, it shows a schematic structural diagram of a computer system 400 suitable for implementing a terminal device of an embodiment of the present invention. The terminal device shown in FIG. 4 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in FIG. 4, the computer system 400 includes a central processing unit (CPU) 401, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage portion 408 into a random access memory (RAM) 403. The RAM 403 also stores the various programs and data required for the operation of the system 400. The CPU 401, the ROM 402, and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
The following components are connected to the I/O interface 405: an input portion 406 including a keyboard, a mouse, and the like; an output portion 407 including a cathode ray tube (CRT), a liquid crystal display (LCD), or the like, and a speaker or the like; a storage portion 408 including a hard disk or the like; and a communication portion 409 including a network interface card such as a LAN card or a modem. The communication portion 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as needed, so that a computer program read from it can be installed into the storage portion 408 as needed.
In particular, according to the embodiments disclosed in the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments disclosed in the present invention include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 409, and/or installed from the removable medium 411. When the computer program is executed by the central processing unit (CPU) 401, the above functions defined in the system of the present invention are performed.
It should be noted that the computer-readable medium of the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program, where the program can be used by or in connection with an instruction execution system, apparatus, or device. Also in the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented in software or in hardware. The described units may also be provided in a processor; for example, a processor may be described as including a generating and saving unit and a processing unit. The names of these units do not, in certain cases, limit the units themselves; for example, the generating and saving unit may also be described as "a unit that generates the current field value used to identify crawlers after receiving an access request for a page sent by a client".
In another aspect, the present invention further provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to perform the following: after receiving an access request for a page sent by a client, the server generates the current field value used to identify crawlers, generates an image attribute value under which the field value is saved in an image, and saves the image Uniform Resource Locator (URL) path containing the image attribute value into the requested page; the server determines whether the page currently requested is a directly accessible page; if so, it returns the requested page to the client; if not, it further determines whether the access request contains a valid field value used to identify crawlers; if it contains a valid field value, the server returns the requested page to the client; if it contains no such field value, or the field value it contains is invalid, the server identifies the client as a crawler and returns the first page of the requested page's category to the client.
The beneficial effects of the present invention are as follows:
1. Crawlers are intercepted effectively; even when normal user traffic is very heavy, normal users are not blocked by mistake, and normal browser access is ensured.
2. During website promotion peaks, crawler access is intercepted, which reduces server load and keeps the website stable under high concurrency. Malicious attacks can also be intercepted.
3. With the CDN servers in place, crawler traffic is directed to the CDN servers in the various provinces and cities, which further protects the server and ensures that users can access the site normally.
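The effect of the CDN arrangement on where requests land can be illustrated with a short sketch. It assumes, following the claims, that the directly accessible pages are cached on CDN servers and everything else goes to the origin server for the cookie check. The names `choose_backend` and `tally_backends` and the labels `"cdn"` and `"origin"` are hypothetical.

```python
from collections import Counter
from typing import Iterable


def choose_backend(requested_page: int, direct_limit: int = 10) -> str:
    """Directly accessible pages are served from the CDN cache; others hit the origin."""
    return "cdn" if 1 <= requested_page <= direct_limit else "origin"


def tally_backends(requested_pages: Iterable[int], direct_limit: int = 10) -> Counter:
    """Count how many requests each backend would handle."""
    return Counter(choose_backend(p, direct_limit) for p in requested_pages)
```

Since a crawler without a valid cookie value can obtain only the directly accessible pages, the requests that actually yield content for it are the ones the CDN absorbs in this sketch.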
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (11)

  1. A method for intercepting crawlers, characterized in that the method includes:
    after receiving an access request for a page sent by a client, the server generates the current field value used to identify crawlers, generates an image attribute value under which the field value is saved in an image, and saves the image Uniform Resource Locator (URL) path containing the image attribute value into the requested page;
    the server determines whether the page currently requested is a directly accessible page; if so, it returns the requested page to the client; if not, it further determines whether the access request contains a valid field value used to identify crawlers; if it contains a valid field value, the server returns the requested page to the client; if it contains no such field value, or the field value it contains is invalid, the server identifies the client as a crawler and returns the first page of the requested page's category to the client.
  2. The method of claim 1, characterized in that, when the client is a browser, the method further includes:
    the browser downloads the image according to the image URL path contained in the page returned by the server, parses the image, extracts from it the field value used to identify crawlers, and saves that value so that the browser can carry it in the access request when accessing other pages.
  3. The method of claim 1, characterized in that the field value used to identify crawlers is a cookie value, and the method of generating the cookie value includes:
    the server selects a value of the current timestamp according to the validity period of the cookie value, and performs an encryption operation on the string formed by combining the selected timestamp value with the currently configured first key, to obtain the current cookie value.
  4. The method of claim 1, characterized in that the field value used to identify crawlers is a cookie value, the image attribute value is an image name, and the method of generating the image name includes:
    the server selects a value of the current timestamp according to the validity period of the cookie value, and performs an encryption operation on the string formed by combining the selected timestamp value with the currently configured second key, to obtain the name of the image.
  5. The method of claim 1, characterized in that the method by which the server determines whether the page currently requested is a directly accessible page includes:
    the server is preconfigured with a range of pages that may be accessed directly;
    the server determines whether the page currently requested is within that range, and if so, the page is a directly accessible page.
  6. The method of claim 1, characterized in that the field value used to identify crawlers is a cookie value, and the method of determining whether the access request contains a valid field value used to identify crawlers includes:
    the server compares the cookie value it generates itself with the cookie value carried in the access request, and if the two are equal, determines that the cookie value carried in the access request is a valid cookie value.
  7. The method of claim 1, characterized in that the method further includes: caching the directly accessible pages on a CDN server, and, when a client requests one of the directly accessible pages, having the CDN server return the requested page to the client.
  8. An apparatus for intercepting crawlers, characterized in that the apparatus is applied on the server side and includes:
    a generating and saving unit, which, after receiving an access request for a page sent by a client, generates the current field value used to identify crawlers, generates an image attribute value under which the field value is saved in an image, and saves the image Uniform Resource Locator (URL) path containing the image attribute value into the requested page;
    a processing unit, which determines whether the page currently requested is a directly accessible page; if so, returns the requested page to the client; if not, further determines whether the access request contains a valid field value used to identify crawlers; if it contains a valid field value, returns the requested page to the client; if it contains no such field value, or the field value it contains is invalid, identifies the client as a crawler and returns the first page of the requested page's category to the client.
  9. An apparatus for intercepting crawlers, characterized in that the apparatus is applied to a client that is a browser and includes:
    a download unit, which downloads the image to the browser according to the image URL path contained in the page returned by the server;
    an extraction unit, which parses the image, extracts from it the field value used to identify crawlers, and saves that value so that the browser can carry it in the access request when accessing other pages.
  10. A server terminal, characterized by including:
    one or more processors;
    a storage device for storing one or more programs,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method of any one of claims 1-7.
  11. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method of any one of claims 1-7.
PCT/CN2017/082707 2016-05-03 2017-05-02 Crawler interception method and device, server terminal and computer readable medium WO2017190641A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610286222.3A CN107341160B (en) 2016-05-03 2016-05-03 Crawler intercepting method and device
CN201610286222.3 2016-05-03

Publications (1)

Publication Number Publication Date
WO2017190641A1 true WO2017190641A1 (en) 2017-11-09

Family

ID=60202740

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/082707 WO2017190641A1 (en) 2016-05-03 2017-05-02 Crawler interception method and device, server terminal and computer readable medium

Country Status (2)

Country Link
CN (1) CN107341160B (en)
WO (1) WO2017190641A1 (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657176A (en) * 2018-10-16 2019-04-19 深圳壹账通智能科技有限公司 Web vector graphic state identification method, device, equipment and readable storage medium storing program for executing
CN110069688A (en) * 2019-03-16 2019-07-30 平安城市建设科技(深圳)有限公司 Page display method, server, storage medium and the device of anti-crawler
CN110209911A (en) * 2019-06-03 2019-09-06 桂林电子科技大学 A kind of self-adapting dormancy time adjustment method based on request success rate
CN111428108A (en) * 2020-03-25 2020-07-17 山东浪潮通软信息科技有限公司 Anti-crawler method, device and medium based on deep learning
CN111614652A (en) * 2020-05-15 2020-09-01 广东科徕尼智能科技有限公司 Crawler identification interception method, equipment and storage medium
CN112003819A (en) * 2020-07-07 2020-11-27 瑞数信息技术(上海)有限公司 Method, device, equipment and computer storage medium for identifying crawler
CN112073412A (en) * 2020-09-08 2020-12-11 北京天融信网络安全技术有限公司 Anti-crawler method, device, processor and computer readable medium
CN113010818A (en) * 2021-02-23 2021-06-22 腾讯科技(深圳)有限公司 Access current limiting method and device, electronic equipment and storage medium
CN113515682A (en) * 2021-05-19 2021-10-19 平安国际智慧城市科技股份有限公司 Data crawling method and device, computer equipment and storage medium
CN113704080A (en) * 2020-05-22 2021-11-26 北京沃东天骏信息技术有限公司 Automatic testing method and device
CN113806614A (en) * 2021-10-10 2021-12-17 北京亚鸿世纪科技发展有限公司 Web crawler quick recognition device based on analysis Http request
CN115037507A (en) * 2022-04-22 2022-09-09 京东科技控股股份有限公司 Method, device and system for user access management
CN115632817A (en) * 2022-09-22 2023-01-20 浪潮卓数大数据产业发展有限公司 Android terminal reverse climbing method and device
CN116455660A (en) * 2023-05-04 2023-07-18 北京数美时代科技有限公司 Page access request control method, system, storage medium and electronic equipment
CN116932854A (en) * 2023-09-14 2023-10-24 百鸟数据科技(北京)有限责任公司 Webpage information anticreeper method, device, system, equipment and storage medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784960A (en) * 2017-11-10 2019-05-21 北京奇虎科技有限公司 A kind of intention automation checking method, device and equipment
CN108763274B (en) * 2018-04-09 2021-06-11 北京三快在线科技有限公司 Access request identification method and device, electronic equipment and storage medium
CN109492146B (en) * 2018-11-09 2021-06-29 杭州安恒信息技术股份有限公司 Method and device for preventing WEB crawler
CN110958228A (en) * 2019-11-19 2020-04-03 用友网络科技股份有限公司 Crawler access interception method and device, server and computer readable storage medium
CN111683098B (en) * 2020-06-10 2022-12-23 创新奇智(成都)科技有限公司 Anti-crawler method and device, electronic equipment and storage medium
CN111783006A (en) * 2020-07-22 2020-10-16 网易(杭州)网络有限公司 Page generation method and device, electronic equipment and computer readable medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101635622A (en) * 2008-07-24 2010-01-27 阿里巴巴集团控股有限公司 Method, system and equipment for encrypting and decrypting web page
US20110208714A1 (en) * 2010-02-19 2011-08-25 c/o Microsoft Corporation Large scale search bot detection
CN102833212A (en) * 2011-06-14 2012-12-19 阿里巴巴集团控股有限公司 Webpage visitor identity identification method and system
US20140019488A1 (en) * 2012-07-16 2014-01-16 Salesforce.Com, Inc. Methods and systems for regulating database activity
CN105426415A (en) * 2015-10-30 2016-03-23 Tcl集团股份有限公司 Management method, device and system of website access request

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7130466B2 (en) * 2000-12-21 2006-10-31 Cobion Ag System and method for compiling images from a database and comparing the compiled images with known images
CN103107948B (en) * 2011-11-15 2016-02-03 阿里巴巴集团控股有限公司 A kind of flow control methods and device
CA2762544C (en) * 2011-12-20 2019-03-05 Ibm Canada Limited - Ibm Canada Limitee Identifying requests that invalidate user sessions
CN102663025B (en) * 2012-03-22 2014-04-02 浙江盘石信息技术有限公司 Illegal online commodity detection method
CN104281607A (en) * 2013-07-08 2015-01-14 上海锐英软件技术有限公司 Microblog hot topic analyzing method
CN104281626B (en) * 2013-07-12 2018-01-19 阿里巴巴集团控股有限公司 Web page display method and web page display device based on pictured processing

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657176A (en) * 2018-10-16 2019-04-19 深圳壹账通智能科技有限公司 Web vector graphic state identification method, device, equipment and readable storage medium storing program for executing
CN110069688A (en) * 2019-03-16 2019-07-30 平安城市建设科技(深圳)有限公司 Page display method, server, storage medium and the device of anti-crawler
CN110209911A (en) * 2019-06-03 2019-09-06 桂林电子科技大学 A kind of self-adapting dormancy time adjustment method based on request success rate
CN110209911B (en) * 2019-06-03 2023-03-28 桂林电子科技大学 Self-adaptive sleep time adjusting method based on request success rate
CN111428108A (en) * 2020-03-25 2020-07-17 山东浪潮通软信息科技有限公司 Anti-crawler method, device and medium based on deep learning
CN111614652A (en) * 2020-05-15 2020-09-01 广东科徕尼智能科技有限公司 Crawler identification interception method, equipment and storage medium
CN113704080A (en) * 2020-05-22 2021-11-26 北京沃东天骏信息技术有限公司 Automatic testing method and device
CN112003819B (en) * 2020-07-07 2022-07-01 瑞数信息技术(上海)有限公司 Method, device, equipment and computer storage medium for identifying crawler
CN112003819A (en) * 2020-07-07 2020-11-27 瑞数信息技术(上海)有限公司 Method, device, equipment and computer storage medium for identifying crawler
CN112073412A (en) * 2020-09-08 2020-12-11 北京天融信网络安全技术有限公司 Anti-crawler method, device, processor and computer readable medium
CN113010818B (en) * 2021-02-23 2023-06-30 腾讯科技(深圳)有限公司 Access current limiting method, device, electronic equipment and storage medium
CN113010818A (en) * 2021-02-23 2021-06-22 腾讯科技(深圳)有限公司 Access current limiting method and device, electronic equipment and storage medium
CN113515682A (en) * 2021-05-19 2021-10-19 平安国际智慧城市科技股份有限公司 Data crawling method and device, computer equipment and storage medium
CN113806614A (en) * 2021-10-10 2021-12-17 北京亚鸿世纪科技发展有限公司 Web crawler quick recognition device based on analysis Http request
CN115037507A (en) * 2022-04-22 2022-09-09 京东科技控股股份有限公司 Method, device and system for user access management
CN115037507B (en) * 2022-04-22 2024-04-05 京东科技控股股份有限公司 User access management method, device and system
CN115632817A (en) * 2022-09-22 2023-01-20 浪潮卓数大数据产业发展有限公司 Android terminal reverse climbing method and device
CN115632817B (en) * 2022-09-22 2023-09-05 浪潮卓数大数据产业发展有限公司 Method and device for preventing climbing of An Zhuo Duan
CN116455660A (en) * 2023-05-04 2023-07-18 北京数美时代科技有限公司 Page access request control method, system, storage medium and electronic equipment
CN116455660B (en) * 2023-05-04 2023-10-17 北京数美时代科技有限公司 Page access request control method, system, storage medium and electronic equipment
CN116932854A (en) * 2023-09-14 2023-10-24 百鸟数据科技(北京)有限责任公司 Webpage information anticreeper method, device, system, equipment and storage medium

Also Published As

Publication number Publication date
CN107341160A (en) 2017-11-10
CN107341160B (en) 2020-09-01

Similar Documents

Publication Publication Date Title
WO2017190641A1 (en) Crawler interception method and device, server terminal and computer readable medium
US9684628B2 (en) Mechanism for inserting trustworthy parameters into AJAX via server-side proxy
US8484373B2 (en) System and method for redirecting a request for a non-canonical web page
US9088462B2 (en) Common web accessible data store for client side page processing
CN107547548B (en) Data processing method and system
WO2010051766A1 (en) Method and device for acquiring target resource information
US11416564B1 (en) Web scraper history management across multiple data centers
WO2017088369A1 (en) Data cross-domain request method, device and system
WO2015123990A1 (en) Page push method, device, server and system
US20190370293A1 (en) Method and apparatus for processing information
WO2017020597A1 (en) Resource cache method and apparatus
CN113452733A (en) File downloading method and device
CN108810070B (en) Resource sharing method and device, intelligent equipment and storage medium
EP4227829A1 (en) Web scraping through use of proxies, and applications thereof
US9191392B2 (en) Security configuration
Wu et al. Lightweight, low-rate denial-of-service attack prevention and control program for IoT devices
CN116150513A (en) Data processing method, device, electronic equipment and computer readable storage medium
US20230018983A1 (en) Traffic counting for proxy web scraping
US20160164751A1 (en) Brokering data access requests and responses
CN113765972A (en) Data request response method, device, system, server and storage medium
CN106899652A (en) A kind of method and device of transmission service result
CN112448931B (en) Network hijacking monitoring method and device
CN116186723A (en) Authority control system, method, equipment, medium and product
US20150149582A1 (en) Sending mobile applications to mobile devices from personal computers
CN112948727A (en) WebView-based data injection method, device, equipment and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 17792467

Country of ref document: EP

Kind code of ref document: A1

32PN EP: public notification in the EP bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21.02.19)

122 EP: PCT application non-entry in European phase

Ref document number: 17792467

Country of ref document: EP

Kind code of ref document: A1