WO2019136987A1 - Web crawler identification method, apparatus, computer device, and storage medium - Google Patents

Web crawler identification method, apparatus, computer device, and storage medium

Info

Publication number
WO2019136987A1
WO2019136987A1 PCT/CN2018/099876 CN2018099876W WO2019136987A1 WO 2019136987 A1 WO2019136987 A1 WO 2019136987A1 CN 2018099876 W CN2018099876 W CN 2018099876W WO 2019136987 A1 WO2019136987 A1 WO 2019136987A1
Authority
WO
WIPO (PCT)
Prior art keywords
resource
node
access
request
identifier
Prior art date
Application number
PCT/CN2018/099876
Other languages
English (en)
French (fr)
Inventor
李武奇
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2019136987A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Definitions

  • the present application relates to a web crawler identification method, apparatus, computer device and storage medium.
  • Web crawlers, also known as web spiders or web robots, are programs or scripts that automatically fetch web information according to certain rules. Web crawlers can put excessive load on a server and can also cause large-scale data leakage, so many websites use anti-crawler measures to block web crawlers from accessing them.
  • The traditional anti-crawling method usually determines whether the requesting end is a crawler or a normal terminal by monitoring the IP address of the requesting end and the corresponding request frequency. When the number of requests from the same IP address within a certain time length is higher than a preset number, the requesting end corresponding to that IP address may be determined to be a crawler. Against this traditional method, however, a crawler only needs to build a proxy IP pool and rotate through the proxy IP addresses when accessing the target to avoid being identified. The traditional anti-crawling method is therefore inefficient at identifying web crawlers.
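  • The frequency-based check described above can be sketched as follows. This is a minimal illustration, not code from the application: the class name `FrequencyDetector` and the parameters `max_requests` and `window_seconds` are hypothetical names chosen for the example.

```python
from collections import defaultdict, deque


class FrequencyDetector:
    """Flags an IP whose request count inside a sliding window exceeds a preset number.

    Hypothetical sketch of the traditional approach; all names are assumptions.
    """

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._times = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_crawler(self, ip, timestamp):
        times = self._times[ip]
        times.append(timestamp)
        # Drop requests that have fallen out of the time window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_requests
```

  • A crawler that sends every request from a different proxy IP never accumulates more than one timestamp per address, so a check of this kind never fires for it, which is the weakness the application sets out to address.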
  • a web crawler identification method, apparatus, computer device, and storage medium capable of improving network crawler recognition efficiency are provided.
  • A web crawler identification method includes: receiving a plurality of resource access requests, each resource access request including a requesting end identifier and a resource identifier; extracting the resource identifiers corresponding to the same requesting end identifier to form a resource identifier set for each requesting end; matching all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and using each resource node that matches a resource identifier as an access node; and, when an isolated access node exists, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler. An isolated access node is a child node that is an access node but whose parent node is not an access node.
  • A web crawler identification device includes: a resource access request receiving module, configured to receive a plurality of resource access requests, each resource access request including a requesting end identifier and a resource identifier; a resource identifier extraction module, configured to extract the resource identifiers corresponding to the same requesting end identifier to form a resource identifier set for each requesting end; a resource node matching module, configured to match all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and use each resource node that matches a resource identifier as an access node; and a web crawler identification module, configured to determine, when an isolated access node exists, that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler. An isolated access node is a child node that is an access node but whose parent node is not an access node.
  • A computer device includes a memory and one or more processors, the memory storing computer readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps: receiving a plurality of resource access requests, each resource access request including a requesting end identifier and a resource identifier; extracting the resource identifiers corresponding to the same requesting end identifier to form a resource identifier set for each requesting end; matching all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and using each resource node that matches a resource identifier as an access node; and, when an isolated access node exists, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler. An isolated access node is a child node that is an access node but whose parent node is not an access node.
  • One or more non-transitory computer readable storage media store computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps: receiving a plurality of resource access requests, each resource access request including a requesting end identifier and a resource identifier; extracting the resource identifiers corresponding to the same requesting end identifier to form a resource identifier set for each requesting end; matching all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and using each resource node that matches a resource identifier as an access node; and, when an isolated access node exists, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler. An isolated access node is a child node that is an access node but whose parent node is not an access node.
  • FIG. 1 is an application scenario diagram of a web crawler identification method according to one or more embodiments.
  • FIG. 2 is a flow diagram of a web crawler identification method in accordance with one or more embodiments.
  • FIG. 3A is a schematic diagram of a resource structure tree of a first set of resource identifiers in accordance with one or more embodiments.
  • FIG. 3B is a schematic diagram of a resource structure tree of a second set of resource identifiers in accordance with one or more embodiments.
  • FIG. 4 is a flow diagram of the steps of identifying a web crawler when there are no orphaned access nodes in accordance with one or more embodiments.
  • FIG. 5 is a schematic flow chart of a web crawler identification method according to one or more other embodiments.
  • FIG. 6 is a block diagram showing the structure of a web crawler identification device in accordance with one or more embodiments.
  • FIG. 7 is a diagram showing the internal structure of a computer device in accordance with one or more embodiments.
  • The terms "first", "second", and the like may be used to describe various elements, but these elements are not limited by these terms. The terms are used only to distinguish one element from another.
  • For example, the first resource identifier set may be referred to as a second resource identifier set without departing from the scope of the present application, and similarly, the second resource identifier set may be referred to as a first resource identifier set.
  • Both the first resource identifier set and the second resource identifier set are resource identifier sets, but they are not the same resource identifier set.
  • Terminal 102 communicates with server 104 over a network.
  • the server 104 can receive a resource access request sent by the terminal 102.
  • the terminal 102 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablets, and portable wearable devices, and the server 104 can be implemented with a stand-alone server or a server cluster composed of a plurality of servers.
  • a network crawler identification method is provided.
  • the method is applied to the server in FIG. 1 as an example, and includes the following steps:
  • Step 202 Receive multiple resource access requests, where each resource access request includes a request end identifier and a resource identifier.
  • a resource access request is a request sent by a requesting end to request a resource from a server.
  • The requesting end can be a terminal such as a personal computer, a laptop, a smartphone, a tablet, or a portable wearable device, or a virtual machine that is emulated in software, has complete hardware system functionality, and runs in a completely isolated environment.
  • The requesting end identifier is an identifier used to distinguish each requesting end from other requesting ends, including but not limited to one or a combination of a network interface card (NIC), a virtual NIC (vNIC), an IP address, a Domain Name System (DNS) name, or a cookie.
  • the resource identifier refers to an identifier used to distinguish each target resource from other target resources.
  • The target resource includes, but is not limited to, static resources such as web page resources, picture resources, text resources, JS script resources, and advertisement resources, and may also be dynamic resources in a back-end database.
  • Dynamic resources are data used to dynamically populate a web page. Each requesting end can send one or more resource access requests.
  • The server may extract the corresponding static resource from its disk according to the resource identifier in the resource access request, or the server may send the resource access request to a WEB container and obtain dynamic resources from the back-end database through the WEB container.
  • A resource access response may be generated from the obtained static and/or dynamic resources and sent to the requesting end corresponding to the requesting end identifier in the resource access request.
  • the resource access response may be used to instruct the requesting end to render the target page according to the acquired target resource through the browser.
  • Step 204 Extract resource identifiers corresponding to identifiers of the same requesting end, and form a resource identifier set of each requesting end.
  • a resource identifier set refers to a set of resource identifiers corresponding to the same requester identifier.
  • a resource identifier set may contain a resource identifier corresponding to all resources accessed by the requesting end.
  • the resource identifier in the resource access request may be extracted continuously, or the resource identifier corresponding to the same requester identifier may be extracted according to the resource access request received in the preset time period.
  • the received plurality of resource access requests may also be stored as a resource access record.
  • Each resource access record may include a resource identifier, a requesting end identifier, and a resource access request receiving time, and the resource access records may be grouped by requesting end identifier. After the resource access requests received within the preset time period are filtered according to their receiving time, the resource identifiers corresponding to the same requesting end identifier are extracted from the filtered requests.
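  • Step 204 can be sketched as a grouping of access records by requesting end identifier within a time window. The record layout `(requester_id, resource_id, received_at)` and the function name `build_identifier_sets` are assumptions made for this illustration; the application does not prescribe a data format.

```python
from collections import defaultdict


def build_identifier_sets(records, window_start, window_end):
    """Group resource identifiers by requesting end for requests inside a time window.

    `records` is an iterable of (requester_id, resource_id, received_at) tuples
    (a hypothetical layout). Repeated identifiers are kept, because later steps
    count how many times each node is matched.
    """
    identifier_sets = defaultdict(list)
    for requester_id, resource_id, received_at in records:
        if window_start <= received_at <= window_end:
            identifier_sets[requester_id].append(resource_id)
    return dict(identifier_sets)
```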
  • Step 206 Match all the resource identifiers in the resource identifier set with the resource nodes in the preset resource structure tree, and use the resource node that matches the resource identifier as the access node.
  • A resource structure tree is a tree constructed according to the access logic that a normal terminal follows when accessing resources.
  • the resource structure tree contains multiple resource nodes, each of which represents the node where the corresponding target resource is located in the access logic.
  • the target resource refers to the target resource that the requester can request to access.
  • For example, if a normal terminal must click button B on page A, after accessing page A, in order to jump to page C, then the page A node is the parent node of the page C node.
  • Each resource identifier has a corresponding resource node in the resource structure tree. All resource identifiers in the resource identifier set can be matched against each resource node in the resource structure tree along a preset traversal path, and each resource node for which a matching resource identifier exists is used as an access node.
  • the access node refers to the node corresponding to the target resource accessed by the requester in the resource structure tree.
  • A parameter n may be preset to mark the access node, where n is the number of resource identifiers in the resource identifier set that match the access node. For example, if the requesting end initiates three resource access requests for the target resource corresponding to resource node K, three resource identifiers in the resource identifier set match resource node K, and n is 3.
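  • The matching of Step 206 and the marker n can be sketched as below. Representing the resource structure tree as a `tree_parents` mapping from each node's resource identifier to its parent's identifier (with `None` for the root) is an assumption of this example, as is the function name.

```python
from collections import Counter


def match_access_nodes(resource_ids, tree_parents):
    """Return {access_node: n}, where n is how many identifiers matched that node.

    `tree_parents` maps node -> parent node (None for the root); this encoding of
    the resource structure tree is a hypothetical choice for the sketch.
    Identifiers with no node in the tree are ignored.
    """
    return dict(Counter(rid for rid in resource_ids if rid in tree_parents))
```

  • With this encoding, three requests for the resource at node K yield `{"K": 3}`, which corresponds to n = 3 in the example above.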
  • Step 208 When there is an isolated access node, determine that the requesting end corresponding to the resource identifier set is a requesting end that initiates a resource access request by using a web crawler.
  • An isolated access node is a child node that is an access node but whose parent node is not an access node. A normal terminal initiates resource access requests according to normal access logic, whereas requests initiated by a web crawler may not conform to that logic. Therefore, when an isolated access node exists in the resource structure tree corresponding to the resource identifier set, that is, when the parent node of a child node that is an access node is not itself an access node, it can be determined that there is a resource access request that does not conform to normal access logic, and that the requesting end corresponding to the resource identifier set initiates resource access requests through a web crawler.
  • FIG. 3A is a schematic diagram of a resource structure tree of a first resource identifier set
  • FIG. 3B is a schematic diagram of a resource structure tree of a second resource identifier set.
  • FIGS. 3A and 3B may represent a site map of a four-layer website architecture, with each resource node representing a web page in the website.
  • A filled circle represents an access node, and an open circle represents a resource node that has no matching resource identifier in the resource identifier set.
  • In FIG. 3A, apart from the root node, the parent nodes of the access nodes are all access nodes, which conforms to normal access logic.
  • In FIG. 3B, the parent nodes of the access nodes 24b, 31b, and 32b are not access nodes, so the access nodes 24b, 31b, and 32b are isolated access nodes, and the requesting end corresponding to the second resource identifier set can be determined to be a requesting end that initiates resource access requests through a web crawler.
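  • The isolated access node test of Step 208 can be sketched as follows. The small tree used in the usage note is hypothetical and is not the tree of FIG. 3A or FIG. 3B; `tree_parents` (node to parent, `None` for the root) is the same assumed encoding of the resource structure tree.

```python
def find_isolated_access_nodes(access_nodes, tree_parents):
    """Return the access nodes whose parent exists but is not itself an access node.

    The root (parent None) is never isolated. Names and the tree encoding are
    assumptions made for this sketch.
    """
    accessed = set(access_nodes)
    return sorted(
        node
        for node in accessed
        if tree_parents.get(node) is not None and tree_parents[node] not in accessed
    )


def is_crawler_by_isolation(access_nodes, tree_parents):
    # Any isolated access node means the access pattern breaks normal access logic.
    return bool(find_isolated_access_nodes(access_nodes, tree_parents))
```

  • For a hypothetical tree `{"root": None, "A": "root", "B": "root", "C": "A", "D": "B"}`, the access nodes `root, A, C` contain no isolated node, while `root, A, D` contain the isolated node `D`, because its parent `B` was never accessed.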
  • In the above web crawler identification method, the resource identifiers in resource access requests that share the same requesting end identifier are extracted to form a resource identifier set for each requesting end, and all resource identifiers in the set are matched against the resource nodes in the resource structure tree to obtain all access nodes corresponding to the set.
  • When an isolated access node exists, the requesting end corresponding to the resource identifier set is determined to be a requesting end that initiates resource access requests through a web crawler.
  • Because the web crawler is identified by its access characteristics, a crawler cannot evade identification by rotating proxy IP addresses, which improves the accuracy of web crawler identification and thereby its efficiency.
  • After the resource node matching the resource identifier is used as the access node, the method further includes a step of identifying a web crawler when there is no isolated access node, which includes:
  • Step 402 When there are no isolated access nodes, count the number of matches that each access node matches the resource identifier in the resource identifier set.
  • The number of times each resource node matches a resource identifier in the resource identifier set may be counted.
  • The number of matches is the number of resource identifiers in the resource identifier set that match the access node.
  • A resource identifier that has matched a resource node is marked, so that the marked resource identifier is not matched again against other resource nodes, which saves matching time.
  • Step 404 Obtain a node weight of each access node.
  • The node weight indicates how likely it is that a normal terminal accesses the resource corresponding to the access node.
  • The more likely it is that a normal terminal accesses the resource, and the less likely that a web crawler initiates the resource access request, the greater the node weight; conversely, the less likely normal access is, and the more likely a web crawler initiates the request, the smaller the node weight.
  • The same resource node may include multiple sub-resource nodes, such as the multiple URLs (Uniform Resource Locators) contained in the same page resource.
  • Each URL is a page resource, so each URL can correspond to one sub-resource node.
  • For example, the access nodes 21a and 22a in FIG. 3A are both child nodes of the access node 11a, but the possibility that a normal terminal accesses the target resource corresponding to the access node 21a may differ from the possibility that it accesses the target resource corresponding to the access node 22a.
  • the ratio of the irrelevant resource in each page resource can be calculated.
  • the unrelated resources include at least one of a picture resource, a JS script resource, and an advertisement resource.
  • The advertisement resources include display-type resources shown on the page after a keyword search, such as keyword advertisements, flash animations, and videos, as well as the hyperlink resources to which advertisement text links.
  • Step 406 Perform calculation according to the number of matches of each access node and the corresponding node weight, and obtain a comprehensive weight corresponding to the resource identifier set.
  • The comprehensive weight indicates the possibility that the requesting end corresponding to the resource identifier set is a normal terminal.
  • The number of matches of each access node may be multiplied by the corresponding node weight, and the products summed, to obtain the comprehensive weight corresponding to the resource identifier set.
  • Step 408 When the comprehensive weight is less than the preset weight, determine that the request end corresponding to the resource identifier set is a request end that initiates a resource access request by using a network crawler.
  • When the comprehensive weight is less than the preset weight, the resource access request is determined to be a crawler request, and the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
  • When the comprehensive weight is greater than or equal to the preset weight, the requesting end is determined to be a normal terminal.
  • Calculating the comprehensive weight from the numbers of matches and the node weights to determine whether the requesting end is a normal terminal improves the accuracy of web crawler identification.
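  • Steps 406 and 408 amount to a weighted sum followed by a threshold comparison. The sketch below assumes the match counts and node weights are already available as dictionaries keyed by access node; the function names and the example values are hypothetical.

```python
def comprehensive_weight(match_counts, node_weights):
    """Sum of (number of matches x node weight) over all access nodes."""
    return sum(count * node_weights[node] for node, count in match_counts.items())


def is_crawler_by_weight(match_counts, node_weights, preset_weight):
    # Below the preset weight -> crawler; greater than or equal -> normal terminal.
    return comprehensive_weight(match_counts, node_weights) < preset_weight
```

  • For example, with match counts `{"root": 1, "A": 2}` and node weights `{"root": 1.0, "A": 0.5}`, the comprehensive weight is 1 × 1.0 + 2 × 0.5 = 2.0, so the requesting end is treated as a crawler for a preset weight of 3.0 and as a normal terminal for a preset weight of 2.0.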
  • Obtaining the node weight of each access node includes: calculating a page similarity between the sub-page resource corresponding to the access node and the parent page resource corresponding to the parent node of the access node; counting a first quantity of unrelated resources included in the sub-page resource and a second quantity of update resources included in the sub-page resource, where an update resource is an unrelated resource not included in the parent page resource; and calculating the node weight of each access node according to the page similarity, the first quantity, and the second quantity.
  • the first number refers to the number of unrelated resources included in the subpage resource
  • the second quantity refers to the number of updated resources included in the subpage resource.
  • An update resource is an unrelated resource that exists only in the sub-page resource and not in the parent page resource.
  • the page similarity between the sub-page resource Q corresponding to the access node and the parent page resource P corresponding to the parent node of the access node may be calculated by using features such as the site feature, the path feature, and the path depth feature.
  • Corresponding feature vectors can also be generated from the features of the sub-page resource Q and the parent page resource P.
  • is a constant and ⁇ is a global statistical parameter.
  • When there are no unrelated resources in page Q, ⁇ is 1. When there are unrelated resources in the page, ⁇ is obtained by calculation. Total_Q is the first quantity of unrelated resources contained in the sub-page resource Q. New_Q is the second quantity of update resources contained in the sub-page resource Q. An update resource is an unrelated resource that exists in the sub-page resource Q but does not exist in the parent page resource P. W(U_P) is the node weight of the page resource P.
  • the node weight of the root node can be preset to one.
  • The node weights of the remaining child nodes range from 0 to 1.
  • For the resource structure tree shown in FIG. 3A, the numbers of matches of the access nodes can be counted as $N_{00}, N_{11}, N_{12}, N_{21}, N_{22}, N_{24}, N_{33}$, and the corresponding node weights can be calculated as $W_{00}, W_{11}, W_{12}, W_{21}, W_{22}, W_{24}, W_{33}$, respectively. The comprehensive weight $W_A$ of the requesting end corresponding to the resource structure tree shown in FIG. 3A can then be calculated by the following formula:
  • $W_A = N_{00} \times W_{00} + N_{11} \times W_{11} + N_{12} \times W_{12} + N_{21} \times W_{21} + N_{22} \times W_{22} + N_{24} \times W_{24} + N_{33} \times W_{33}$
  • Because the node weight of the access node is obtained from the page similarity, the first quantity, and the second quantity, the calculated node weight reflects not only the features of the access node itself but also the location of the access node in the resource structure tree, which improves the accuracy of the node weight and thereby the efficiency of web crawler identification.
  • The method may further include: when there is no isolated access node, counting the number of matches between each access node and the resource identifiers in the resource identifier set; and when there is an access node whose number of matches is greater than a preset number of times, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
  • After the resource identifiers corresponding to the same requesting end identifier are extracted to form the resource identifier set of each requesting end, the total number of matches between the resource identifier set of the requesting end and the resource nodes in the resource structure tree may also be counted. When the total number of matches is greater than a preset total number of times, the requesting end corresponding to the resource identifier set is determined to be a requesting end that initiates resource access requests through a web crawler.
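  • The two count-based rules above, a per-node limit and a total limit, can be sketched together. The function name and the parameter names `per_node_limit` and `total_limit` are assumptions for this illustration; the application only speaks of a preset number of times and a preset total number of times.

```python
def exceeds_match_limits(match_counts, per_node_limit, total_limit):
    """True if any access node was matched more than `per_node_limit` times,
    or if all matches together exceed `total_limit`."""
    too_many_on_one_node = any(count > per_node_limit for count in match_counts.values())
    too_many_overall = sum(match_counts.values()) > total_limit
    return too_many_on_one_node or too_many_overall
```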
  • The method further includes: sending an identity verification page to the requesting end corresponding to the web crawler for identity verification.
  • The resource access request is intercepted.
  • A preset identity verification page is sent to the requesting end corresponding to the web crawler for identity verification. If the verification fails, a verification-failure page is sent to the requesting end; if the verification passes, the preset identity verification page may be sent to the requesting end again every preset time period.
  • The method further includes: determining whether the requesting end identifier corresponding to the web crawler is a whitelist crawler identifier; if yes, allowing the resource access request initiated by the web crawler; if not, rejecting the resource access request initiated by the web crawler.
  • A whitelist crawler identifier is the requesting end identifier of a web crawler that is on the crawler whitelist.
  • The crawler whitelist can be preset, and the IP addresses of the allowed web crawlers are stored in the crawler whitelist.
  • When a web crawler is identified, its IP address is matched against the IP addresses of all whitelist crawler identifiers in the crawler whitelist. If the requesting end identifier corresponding to the web crawler is a whitelist crawler identifier, access is not restricted.
  • The Baidu crawler usually uses a subdomain of baidu.com or baidu.jp
  • The Google crawler usually uses a subdomain of googlebot.com
  • The Microsoft Bing search engine crawler uses a subdomain of search.msn.com
  • Sogou crawler is rawl.sogou
  • The subdomains of these search engines can be stored in the crawler whitelist, and when a corresponding subdomain is detected, the resource access request initiated by the web crawler can be allowed.
  • Setting a crawler whitelist makes it possible to filter out allowed resource access requests more quickly and reduces the error rate of rejecting normal resource access requests.
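  • A whitelist check on crawler subdomains can be sketched as a suffix match. The domains below are the ones named in the text above; how the hostname of a requesting end is obtained (for example by a reverse DNS lookup) is outside this sketch and would be an additional assumption. The function and constant names are hypothetical.

```python
# Domains taken from the examples in the text; the constant name is hypothetical.
WHITELIST_DOMAINS = ("baidu.com", "baidu.jp", "googlebot.com", "search.msn.com")


def is_whitelisted_host(hostname, domains=WHITELIST_DOMAINS):
    """True if `hostname` is a subdomain of one of the whitelisted domains.

    Matching on "." + domain avoids accepting look-alikes such as
    "evilgooglebot.com" or "googlebot.com.example.org".
    """
    host = hostname.lower().rstrip(".")
    return any(host.endswith("." + domain) for domain in domains)
```

  • A plain `endswith(domain)` test would accept a look-alike hostname that merely ends in the same characters, which is why the sketch requires the dot before the domain.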
  • As shown in FIG. 5, another web crawler identification method is provided, the method comprising the following steps:
  • Step 502 Receive multiple resource access requests, where each resource access request includes a request end identifier and a resource identifier.
  • Resource access requests sent by multiple requesting ends may be received. Each resource access request may include a requesting end identifier, such as a network interface card (NIC), a virtual NIC (vNIC), an IP address, a Domain Name System (DNS) name, or a cookie, and a resource identifier of the target resource of the request, such as a web page resource, an image resource, a text resource, a JS script resource, or an advertisement resource.
  • Step 504 Extract resource identifiers corresponding to the same requester identifier, and form a resource identifier set of each requesting end.
  • The received resource access requests may be counted every preset time period, or the receiving time of each resource access request may be recorded when the request is received, and the resource access requests received from the same requesting end within a preset duration may then be counted according to the receiving time.
  • Step 506 Match all the resource identifiers in the resource identifier set with the resource nodes in the preset resource structure tree, and use the resource node that matches the resource identifier as the access node.
  • the resource identifier set may be matched with each resource node in the resource structure tree according to a preset traversal path, and the number of matching times may be counted when the resource identifier in the resource identifier set matches each resource node.
  • Step 508 Detect whether there is an isolated access node.
  • An isolated access node is an access node whose parent node is not an access node.
  • For example, the access nodes 24b, 31b, and 32b in FIG. 3B are all isolated access nodes. If an isolated access node exists, go to step 510; if not, go to step 512.
  • Step 510 Determine that the request end corresponding to the resource identifier set is a requesting end that initiates a resource access request by using a network crawler.
  • It may further be determined whether the requesting end identifier corresponding to the resource identifier set is a whitelist crawler identifier; if not, resource access requests containing that requesting end identifier are no longer allowed through.
  • Step 512 Count the number of matches that each access node matches the resource identifier in the resource identifier set.
  • Step 514 Calculate the page similarity of the sub-page resource corresponding to the access node and the parent page resource corresponding to the parent node of the access node.
  • Corresponding feature vectors are generated from the features of the sub-page resource Q and the parent page resource P.
  • The page similarity between the sub-page resource Q and the parent page resource P is then calculated with a space vector similarity formula.
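  • The similarity formula itself is not reproduced in this text. As a stand-in only, the sketch below uses cosine similarity, a common vector-space similarity measure; whether the application uses exactly this measure is an assumption, and the feature vectors in the usage note are made-up numbers.

```python
import math


def cosine_similarity(p_features, q_features):
    """Cosine similarity between two equal-length feature vectors.

    Used here as an assumed example of a vector-space similarity measure;
    returns 0.0 if either vector is all zeros.
    """
    if len(p_features) != len(q_features):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(p_features, q_features))
    norm_p = math.sqrt(sum(a * a for a in p_features))
    norm_q = math.sqrt(sum(b * b for b in q_features))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)
```

  • Two pages whose feature vectors point in the same direction, such as `[1, 2, 3]` and `[2, 4, 6]`, score close to 1, while pages with no features in common, such as `[1, 0]` and `[0, 1]`, score 0.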
  • Step 516 Count the first quantity of unrelated resources included in the sub-page resource and the second quantity of update resources included in the sub-page resource, where an update resource is an unrelated resource not included in the parent page resource.
  • Web crawlers generally do not request unrelated resources, which are of no use to a crawler but are needed to render a web page normally, whereas normal terminals do request them. Therefore, when there is no isolated access node, a web crawler can be identified according to the unrelated resources corresponding to its resource access requests.
  • Step 518 Perform calculation according to page similarity, first quantity, and second quantity, and obtain node weights of each access node.
  • is a constant and ⁇ is a global statistical parameter.
  • When there are no unrelated resources in page Q, ⁇ is 1. When there are unrelated resources in the page, ⁇ is obtained by calculation.
  • Total_Q is the first quantity of unrelated resources contained in the sub-page resource Q.
  • New_Q is the second quantity of update resources contained in the sub-page resource Q.
  • An update resource is an unrelated resource that exists in the sub-page resource Q but does not exist in the parent page resource P.
  • W(U_P) is the node weight of the page resource P.
  • Step 520 Perform calculation according to the number of matches of each access node and the corresponding node weight, and obtain a comprehensive weight corresponding to the resource identifier set.
  • Step 522 When the comprehensive weight is less than the preset weight, determine that the request end corresponding to the resource identifier set is the request end that initiates the resource access request by using the network crawler.
  • After the resource identifier set is matched against the resource structure tree, it is determined whether an isolated access node exists.
  • When no isolated access node exists, the node weight of each access node is calculated from the similarity between the target resource corresponding to the access node and the target resource corresponding to its parent node, and from the unrelated resource content included in the access node. This reduces the possibility of missing web crawlers, thereby improving the efficiency of web crawler identification.
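  • The overall decision flow of FIG. 5 (the isolated node check first, then the comprehensive weight) can be sketched in one function. The node weights are taken as given inputs here, since the weight formula is not reproduced in this text; the tree encoding `tree_parents` (node to parent, `None` for the root), the function name, and the sample values are all hypothetical.

```python
def classify_requester(resource_ids, tree_parents, node_weights, preset_weight):
    """Return "crawler" or "normal" for one requesting end's resource identifier set."""
    # Step 506/512: match identifiers to tree nodes and count matches per access node.
    match_counts = {}
    for rid in resource_ids:
        if rid in tree_parents:
            match_counts[rid] = match_counts.get(rid, 0) + 1

    # Step 508/510: an access node whose parent was not accessed is isolated.
    for node in match_counts:
        parent = tree_parents[node]
        if parent is not None and parent not in match_counts:
            return "crawler"

    # Steps 520/522: weighted sum of match counts compared with the preset weight.
    total = sum(count * node_weights[node] for node, count in match_counts.items())
    return "crawler" if total < preset_weight else "normal"
```

  • With a hypothetical chain `root -> A -> C` and weights 1.0, 0.5, and 0.25, a requesting end that only requests `C` is classified as a crawler by the isolation check, while one that requests `root`, `A`, and `C` reaches a comprehensive weight of 1.75 and is classified as normal for a preset weight of 1.5.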
  • A web crawler identification device 600 is provided, including: a resource access request receiving module 602, configured to receive multiple resource access requests, where each resource access request includes a requesting end identifier and a resource identifier; a resource identifier extraction module 604, configured to extract the resource identifiers corresponding to the same requesting end identifier to form a resource identifier set for each requesting end; a resource node matching module 606, configured to match all resource identifiers in the resource identifier set against the resource nodes in the preset resource structure tree, and use each resource node that matches a resource identifier as an access node; and a web crawler identification module 608, configured to determine, when an isolated access node exists, that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler. An isolated access node is a child node that is an access node but whose parent node is not an access node.
  • The web crawler identification module 608 is further configured to: when no isolated access node exists, count the number of times each access node matches a resource identifier in the resource identifier set; obtain the node weight of each access node; calculate the comprehensive weight corresponding to the resource identifier set from the match count of each access node and the corresponding node weight; and, when the comprehensive weight is less than the preset weight, determine that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
  • The web crawler identification module 608 is further configured to: calculate the page similarity between the subpage resource corresponding to an access node and the parent page resource corresponding to the access node's parent node; count the first quantity of irrelevant resources contained in the subpage resource and the second quantity of update resources contained in the subpage resource, an update resource being an irrelevant resource not contained in the parent page resource; and calculate the node weight of each access node from the page similarity, the first quantity, and the second quantity.
  • The irrelevant resources include at least one of image resources, JS script resources, and advertisement resources.
  • The web crawler identification module 608 is further configured to: when no isolated access node exists, count the number of times each access node matches a resource identifier in the resource identifier set; and, when there is an access node whose match count exceeds the preset count, determine that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
  • The web crawler identification module 608 is further configured to send an authentication page to the requesting end corresponding to the web crawler for identity verification.
  • The web crawler identification module 608 is further configured to determine whether the requester identifier corresponding to the web crawler is a whitelisted crawler identifier; if so, allow the resource access request initiated by the web crawler; if not, reject the resource access request initiated by the web crawler.
  • Each module of the above web crawler identification device may be implemented in whole or in part by software, hardware, or a combination thereof.
  • Each of the above modules may be embedded in hardware form in, or be independent of, the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
  • The web crawler identification device described above can be implemented in the form of computer readable instructions that run on a computer device as shown in FIG. 7.
  • A computer device is provided, which may be a server, and whose internal structure may be as shown in FIG. 7.
  • the computer device includes a processor, memory, network interface, and database connected by a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile computer readable storage medium and an internal memory.
  • the non-volatile computer readable storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for running the operating system and the computer readable instructions held in the non-volatile computer readable storage medium.
  • the database of the computer device is used to store static resources, dynamic resources, resource structure trees, and the like.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the computer readable instructions are executed by the processor to implement a web crawler identification method.
  • FIG. 7 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • A computer device is provided, comprising a memory and one or more processors, the memory storing computer readable instructions that, when executed by the processor, implement the steps of the web crawler identification method provided in any embodiment of the present application.
  • One or more non-volatile computer readable storage media storing computer readable instructions are provided; when executed by one or more processors, the instructions cause the one or more processors to implement the steps of the web crawler identification method provided in any embodiment of the present application.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

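A web crawler identification method includes: receiving multiple resource access requests, each containing a requester identifier and a resource identifier; extracting the resource identifiers corresponding to the same requester identifier to form a resource identifier set for each requesting end; matching all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and taking each resource node that matches a resource identifier as an access node; and, when an isolated access node exists, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler. An isolated access node is an access node whose parent node is not an access node.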

Description

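Web crawler identification method, apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 2018100313502, entitled "Web crawler identification method, apparatus, computer device, and storage medium", filed with the Chinese Patent Office on January 12, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to a web crawler identification method, apparatus, computer device, and storage medium.
Background
With the development of Internet technology, web crawler technology has emerged. A web crawler, also known as a web spider or web robot, is a set of instructions or a script that automatically fetches information from the World Wide Web according to certain rules. Web crawlers can put excessive load on servers and can also cause large-scale data leakage. Many websites therefore use anti-crawler measures to block access by web crawlers.
However, traditional anti-crawler approaches usually decide whether a requesting end is a crawler or a normal terminal by monitoring the requesting end's IP address and the corresponding request frequency. When the request frequency of a single IP address within a certain period exceeds a preset number, the requesting end corresponding to that IP address can be judged to be a crawler. Against this approach, a crawler only needs to build a proxy IP pool and rotate proxy IP addresses when accessing the target URL to avoid being identified. Traditional anti-crawler approaches are therefore inefficient at identifying web crawlers.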
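Summary
According to various embodiments disclosed in the present application, a web crawler identification method, apparatus, computer device, and storage medium capable of improving the efficiency of web crawler identification are provided.
A web crawler identification method includes: receiving multiple resource access requests, each containing a requester identifier and a resource identifier; extracting the resource identifiers corresponding to the same requester identifier to form a resource identifier set for each requesting end; matching all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and taking each resource node that matches a resource identifier as an access node; and, when an isolated access node exists, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler; the parent node of an isolated access node is not an access node.
A web crawler identification apparatus includes: a resource access request receiving module, configured to receive multiple resource access requests, each containing a requester identifier and a resource identifier; a resource identifier extraction module, configured to extract the resource identifiers corresponding to the same requester identifier to form a resource identifier set for each requesting end; a resource node matching module, configured to match all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree and to take each resource node that matches a resource identifier as an access node; and a web crawler identification module, configured to determine, when an isolated access node exists, that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler; the parent node of an isolated access node is not an access node.
A computer device includes a memory and one or more processors, the memory storing computer readable instructions that, when executed by the processor, cause the one or more processors to perform the following steps: receiving multiple resource access requests, each containing a requester identifier and a resource identifier; extracting the resource identifiers corresponding to the same requester identifier to form a resource identifier set for each requesting end; matching all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and taking each resource node that matches a resource identifier as an access node; and, when an isolated access node exists, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler; the parent node of an isolated access node is not an access node.
One or more non-volatile computer readable storage media store computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps: receiving multiple resource access requests, each containing a requester identifier and a resource identifier; extracting the resource identifiers corresponding to the same requester identifier to form a resource identifier set for each requesting end; matching all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and taking each resource node that matches a resource identifier as an access node; and, when an isolated access node exists, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler; the parent node of an isolated access node is not an access node.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.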
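Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is an application scenario diagram of a web crawler identification method according to one or more embodiments.
FIG. 2 is a schematic flowchart of a web crawler identification method according to one or more embodiments.
FIG. 3A is a schematic diagram of the resource structure tree for a first resource identifier set according to one or more embodiments.
FIG. 3B is a schematic diagram of the resource structure tree for a second resource identifier set according to one or more embodiments.
FIG. 4 is a schematic flowchart of the steps for identifying a web crawler when no isolated access node exists, according to one or more embodiments.
FIG. 5 is a schematic flowchart of a web crawler identification method according to another one or more embodiments.
FIG. 6 is a structural block diagram of a web crawler identification apparatus according to one or more embodiments.
FIG. 7 is an internal structure diagram of a computer device according to one or more embodiments.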
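Detailed Description
To make the technical solutions and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present application and do not limit it.
It will be understood that the terms "first", "second", and so on may be used herein to describe various elements, but these elements are not limited by the terms. The terms only distinguish one element from another. For example, without departing from the scope of the present application, a first resource identifier set could be called a second resource identifier set, and similarly a second resource identifier set could be called a first resource identifier set. The first resource identifier set and the second resource identifier set are both resource identifier sets, but they are not the same resource identifier set.
The web crawler identification method provided by the present application can be applied in the environment shown in FIG. 1. The terminal 102 communicates with the server 104 over a network. The server 104 can receive resource access requests sent by the terminal 102. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device; the server 104 may be implemented as a standalone server or as a cluster of multiple servers.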
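In one embodiment, as shown in FIG. 2, a web crawler identification method is provided. Taking its application to the server in FIG. 1 as an example, the method includes the following steps:
Step 202: receive multiple resource access requests, each containing a requester identifier and a resource identifier.
A resource access request is a request sent by a requesting end to ask the server for a resource. The requesting end may be a terminal such as a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device, or a virtual machine that simulates a complete hardware system in software and runs in a fully isolated environment. The requester identifier distinguishes each requesting end from the others; it includes, but is not limited to, one or a combination of a network interface card (NIC), a virtual NIC (vNIC), an IP address, a Domain Name System (DNS) name, or a cookie. The resource identifier distinguishes each target resource from the others. Target resources include, but are not limited to, static resources such as web page resources, image resources, text resources, JS script resources, and advertisement resources, and may also be dynamic resources in a back-end database. A dynamic resource is data used to populate a web page dynamically. Each requesting end may send one or more resource access requests.
In one embodiment, after receiving a resource access request from a requesting end, the server may fetch the corresponding static resource from its disk according to the resource identifier in the request, or it may forward the request to a web container, which obtains the dynamic resource from the back-end database. A resource access response can be generated from the static and/or dynamic resources obtained and sent to the requesting end corresponding to the requester identifier in the request. The resource access response may instruct the requesting end to render the obtained target resources in a browser and generate the target page.
Step 204: extract the resource identifiers corresponding to the same requester identifier to form a resource identifier set for each requesting end.
A resource identifier set is the set formed by the resource identifiers corresponding to the same requester identifier. One resource identifier set may contain the resource identifiers of all resources accessed by one requesting end. Resource identifiers may be extracted from resource access requests continuously, or the resource identifiers corresponding to the same requester identifier may be extracted from the requests received within a preset time period.
In one embodiment, the received resource access requests may also be stored as resource access records. Each record may include the resource identifier, the requester identifier, and the time the request was received, and records may be classified by requester identifier. The requests received within a preset time period may first be filtered by receipt time, and the resource identifiers corresponding to the same requester identifier then extracted from them.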
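The grouping in step 204 can be illustrated with a minimal Python sketch. It assumes hypothetical record fields (`requester_id`, `resource_id`, `received_at`) and keeps repeated identifiers in a list, since later steps count how often each node is matched; none of these names come from the filing.

```python
from collections import defaultdict
from typing import NamedTuple


class AccessRecord(NamedTuple):
    """One stored resource access record (field names are illustrative)."""
    requester_id: str   # e.g. an IP address or cookie value
    resource_id: str    # e.g. a URL path
    received_at: float  # receipt time in seconds


def build_identifier_sets(records, window_start, window_end):
    """Group resource identifiers by requester for records inside the window."""
    sets = defaultdict(list)
    for rec in records:
        if window_start <= rec.received_at <= window_end:
            # Repeats are kept so match counts can be derived later.
            sets[rec.requester_id].append(rec.resource_id)
    return dict(sets)


records = [
    AccessRecord("10.0.0.1", "/", 1.0),
    AccessRecord("10.0.0.1", "/list", 2.0),
    AccessRecord("10.0.0.2", "/item/7", 3.0),
    AccessRecord("10.0.0.1", "/list", 99.0),  # falls outside the window below
]
identifier_sets = build_identifier_sets(records, 0.0, 10.0)
```

Filtering by receipt time before grouping mirrors the embodiment in which only requests from a preset time period are considered.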
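Step 206: match all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and take each resource node that matches a resource identifier as an access node.
The resource structure tree is built according to the access logic by which a normal terminal accesses resources. It contains multiple resource nodes, each representing the position of the corresponding target resource in that access logic. A target resource is a resource that a requesting end can request. For example, if a normal terminal must visit page A and click button B on page A before it can jump to page C, then the page A node is the parent node of the page C node. Every resource identifier has a corresponding resource node in the resource structure tree. All resource identifiers in the resource identifier set can be matched against each resource node in the tree along a preset traversal path, and each resource node for which a matching resource identifier exists is taken as an access node. An access node is the node in the resource structure tree corresponding to a target resource that the requesting end has accessed.
In one embodiment, a parameter n may be preset to mark access nodes, where n is the number of resource identifiers in the resource identifier set that match the access node. If the requesting end has issued 3 resource access requests for the target resource corresponding to resource node K, then 3 resource identifiers in the set match resource node K, and n is 3.
Step 208: when an isolated access node exists, determine that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
An isolated access node is an access node whose parent node is not an access node. A normal terminal initiates resource access requests according to normal access logic, whereas requests initiated through a web crawler may not follow it. Therefore, when an isolated access node is detected in the resource structure tree for a resource identifier set, that is, when there is an access node whose parent node is not an access node, it can be determined that some resource access requests do not follow normal access logic and that the requesting end corresponding to the resource identifier set initiates resource access requests through a web crawler.
In one embodiment, detecting isolated access nodes further includes determining whether a candidate isolated access node is the root node; if so, the access node corresponding to the root node is not treated as isolated. The root node may be the node corresponding to the website's home page. For example, FIG. 3A is a schematic diagram of the resource structure tree for a first resource identifier set, and FIG. 3B is a schematic diagram of the resource structure tree for a second resource identifier set. FIG. 3A and FIG. 3B may each represent the site map of a four-layer website architecture, with each resource node representing one web page of the site. Solid circles represent access nodes; hollow circles represent resource nodes with no matching resource identifier in the resource identifier set. In FIG. 3A, apart from the root node 00a, the parent of every access node is itself an access node, which conforms to normal access logic. In FIG. 3B, apart from the root node 00b, the parents of access nodes 24b, 31b, and 32b are not access nodes, so 24b, 31b, and 32b are all isolated access nodes, and it can be determined that the requesting end corresponding to the second resource identifier set initiates resource access requests through a web crawler.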
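The isolated-node check of step 208, including the root-node exemption, could be sketched as follows. The tree is represented as a hypothetical `parent_of` mapping from each node to its parent (the root maps to `None`); the three-page tree is invented for illustration and is not the tree of FIG. 3A or FIG. 3B.

```python
def find_isolated_nodes(parent_of, visited):
    """Return visited nodes whose parent was not visited; the root is exempt."""
    isolated = set()
    for node in visited:
        parent = parent_of[node]
        if parent is None:
            # Root node (e.g. the home page): never counted as isolated.
            continue
        if parent not in visited:
            isolated.add(node)
    return isolated


# Hypothetical three-level site: home -> list -> item.
PARENT_OF = {"home": None, "list": "home", "item": "list"}

normal_visit = {"home", "list", "item"}   # follows the access logic
crawler_visit = {"home", "item"}          # jumps straight to a deep page

normal_isolated = find_isolated_nodes(PARENT_OF, normal_visit)
crawler_isolated = find_isolated_nodes(PARENT_OF, crawler_visit)
```

A non-empty result corresponds to the case in which the requesting end is judged to be using a web crawler.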
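In the web crawler identification method above, after the resource structure tree is built from the access logic of normal terminals, the resource identifiers of resource access requests that share a requester identifier are extracted to form a resource identifier set for each requesting end. All resource identifiers in the set are matched against the resource nodes in the tree to obtain every access node corresponding to the set, and when an isolated access node is detected, the requesting end corresponding to the set is determined to be initiating resource access requests through a web crawler. Identifying web crawlers by their access characteristics avoids missing crawlers that rotate proxy IP addresses, which improves the accuracy and therefore the efficiency of web crawler identification.
In one embodiment, after all resource identifiers in the resource identifier set are matched against the resource nodes in the preset resource structure tree and the matching resource nodes are taken as access nodes, the method further includes steps for identifying a web crawler when no isolated access node exists. As shown in FIG. 4, these steps include:
Step 402: when no isolated access node exists, count the number of times each access node matches a resource identifier in the resource identifier set.
While all resource identifiers in the set are matched against each resource node in the tree along the preset traversal path, the match count of each access node can be recorded. The match count is the number of resource identifiers in the set that match the access node. During matching, a resource identifier that has matched a resource node can also be marked so that it is not compared against other resource nodes afterwards, which saves matching time.
Step 404: obtain the node weight of each access node.
The node weight expresses how likely a normal terminal is to access the resource corresponding to the access node. The more likely a normal terminal is to access it, the less likely the request comes from a web crawler and the larger the node weight; the less likely a normal terminal is to access it, the more likely the request comes from a web crawler and the smaller the node weight. One resource node may have multiple child resource nodes. For example, a page resource may contain multiple URLs (Uniform Resource Locators); each URL is a page resource, so each URL can correspond to one child resource node. For example, access nodes 21a and 22a in FIG. 3A are both children of access node 11a, but the likelihood that a normal terminal accesses the target resource of 21a may differ from the likelihood that it accesses the target resource of 22a.
In one embodiment, for page resources, because resource access requests initiated through a web crawler are targeted and have a low probability of accessing irrelevant resources, the weight of each child resource node can be calculated using the proportion of irrelevant resources in each page resource. Irrelevant resources include at least one of image resources, JS script resources, and advertisement resources. Advertisement resources include keyword advertisements shown on a page after a keyword search, display advertisements such as images, Flash animations, and videos, and hyperlink advertisements in which advertising text carries a hyperlink.
Step 406: calculate the comprehensive weight corresponding to the resource identifier set from the match count of each access node and the corresponding node weight.
The comprehensive weight expresses how likely the requesting end corresponding to the resource identifier set is a normal terminal. The larger the comprehensive weight, the more likely the requesting end is a normal terminal.
In one embodiment, the match count of each access node may be multiplied by the corresponding node weight and the products summed to obtain the comprehensive weight corresponding to the resource identifier set.
Step 408: when the comprehensive weight is less than the preset weight, determine that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
When the comprehensive weight is less than the preset weight, the resource access requests are judged to be crawler requests and the requesting end corresponding to the resource identifier set is judged to be initiating them through a web crawler. When the comprehensive weight is greater than or equal to the preset weight, the requesting end is judged to be a normal terminal.
In the embodiment above, once the match counts have been collected and the corresponding node weights obtained, the comprehensive weight calculated from them is used to decide whether the terminal is a normal terminal, which improves the precision of web crawler identification.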
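Steps 402 to 408 amount to a weighted sum followed by a threshold comparison. The sketch below assumes node weights are already available in a dictionary; the weights, node names, and threshold are made-up values chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def comprehensive_weight(resource_ids, node_weight):
    """Sum of (match count x node weight) over all access nodes."""
    # Step 402: count how many identifiers match each known resource node.
    match_counts = Counter(r for r in resource_ids if r in node_weight)
    # Step 406: multiply each count by the node weight and sum the products.
    return sum(n * node_weight[node] for node, n in match_counts.items())


def is_crawler(resource_ids, node_weight, preset_weight):
    """Step 408: flag the requester when the comprehensive weight is too low."""
    return comprehensive_weight(resource_ids, node_weight) < preset_weight


# Hypothetical node weights (the root is preset to 1 in one embodiment).
NODE_WEIGHT = {"home": 1.0, "list": 0.5, "item": 0.25}

visits = ["home", "list", "list", "item"]
total = comprehensive_weight(visits, NODE_WEIGHT)  # 1*1.0 + 2*0.5 + 1*0.25
```

With these values `total` is 2.25, so a preset weight of 3.0 would flag the requester and a preset weight of 2.0 would not.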
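In one embodiment, obtaining the node weight of each access node includes: calculating the page similarity between the subpage resource corresponding to the access node and the parent page resource corresponding to the access node's parent node; counting the first quantity of irrelevant resources contained in the subpage resource and the second quantity of update resources contained in the subpage resource, an update resource being an irrelevant resource not contained in the parent page resource; and calculating the node weight of each access node from the page similarity, the first quantity, and the second quantity.
The first quantity is the number of irrelevant resources contained in the subpage resource, and the second quantity is the number of update resources contained in the subpage resource. An update resource is an irrelevant resource that exists only in the subpage resource and not in the parent page resource.
The page similarity between the subpage resource Q corresponding to an access node and the parent page resource P corresponding to the access node's parent node can be calculated from features such as site features, path features, and path-depth features. Feature vectors (Figure PCTCN2018099876-appb-000001, Figure PCTCN2018099876-appb-000002) can also be generated from the features of subpage resource Q and parent page resource P, and the page similarity of Q and P (Figure PCTCN2018099876-appb-000004) calculated with the vector-space similarity formula (Figure PCTCN2018099876-appb-000003).
From the calculated page similarity, the node weight of subpage resource Q is then obtained with the formula
W(U_Q) = δ × (1 − sim(U_P, U_Q)) × W(U_P) + (1 − δ) × θ
where δ is a constant and θ is a global statistical parameter (Figure PCTCN2018099876-appb-000005). When page Q contains no irrelevant resources, θ is 1; when it does, θ is calculated with the formula in Figure PCTCN2018099876-appb-000006. total_Q is the first quantity, the number of irrelevant resources contained in subpage resource Q. new_Q is the second quantity, the number of update resources contained in subpage resource Q. An update resource is an irrelevant resource that exists in subpage resource Q but not in parent page resource P. W(U_P) is the node weight of page resource P.
In one embodiment, the node weight of the root node may be preset to 1. The node weights of the remaining child access nodes range from 0 to 1.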
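The node-weight formula W(U_Q) = δ × (1 − sim(U_P, U_Q)) × W(U_P) + (1 − δ) × θ can be sketched in Python under two explicit assumptions. First, cosine similarity is used as a stand-in for the vector-space similarity formula, which the text only gives as an image. Second, θ is passed in as an argument because its formula is likewise not reproduced in the text. The default `delta=0.5` is an arbitrary illustrative constant.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def node_weight(similarity, parent_weight, theta, delta=0.5):
    """W(U_Q) = delta * (1 - sim(U_P, U_Q)) * W(U_P) + (1 - delta) * theta."""
    return delta * (1.0 - similarity) * parent_weight + (1.0 - delta) * theta


# Hypothetical feature vectors for parent page P and subpage Q.
features_p = [1.0, 0.0]
features_q = [1.0, 0.0]
sim = cosine_similarity(features_p, features_q)

# Child of the root (root weight preset to 1), no irrelevant resources: theta = 1.
weight_q = node_weight(similarity=0.5, parent_weight=1.0, theta=1.0, delta=0.5)
```

Because the root weight is 1 and both terms are scaled by δ and 1 − δ, the result stays within 0 to 1 whenever the similarity and θ do.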
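For example, for access nodes 00a, 11a, 12a, 21a, 22a, 24a, and 33a in FIG. 3A, the match counts N_00, N_11, N_12, N_21, N_22, N_24, and N_33 can be counted and the corresponding node weights W_00, W_11, W_12, W_21, W_22, W_24, and W_33 calculated. The comprehensive weight W_A of the requesting end corresponding to the resource structure tree shown in FIG. 3A is then given by:
W_A = N_00 × W_00 + N_11 × W_11 + N_12 × W_12 + N_21 × W_21 + N_22 × W_22 + N_24 × W_24 + N_33 × W_33
In the embodiment above, the node weight of an access node is obtained by calculating the similarity between the page resources of the access node and its parent node, so the node weight reflects both the features of the access node itself and its position in the resource structure tree. This makes the node weight more precise and thereby improves the efficiency of web crawler identification.
In one embodiment, after all resource identifiers in the resource identifier set are matched against the resource nodes in the preset resource structure tree and the matching resource nodes are taken as access nodes, the method further includes: when no isolated access node exists, counting the number of times each access node matches a resource identifier in the resource identifier set; and, when there is an access node whose match count exceeds the preset count, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
Alternatively, the resource access requests received within a preset period may be taken, the resource identifiers corresponding to the same requester identifier extracted to form a resource identifier set for each requesting end, and the total number of matches between the requesting end's resource identifier set and the resource nodes in the resource structure tree counted. When the total match count exceeds a preset total, the requesting end corresponding to the resource identifier set is determined to be a requesting end that initiates resource access requests through a web crawler.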
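The two count-based rules above (a per-node limit and a limit on the total number of matches) can be combined in one small check. The limits and identifiers below are hypothetical; the filing does not give concrete values.

```python
from collections import Counter


def exceeds_match_limits(resource_ids, per_node_limit, total_limit):
    """True if any node's match count or the total match count is too high."""
    counts = Counter(resource_ids)
    over_node = any(n > per_node_limit for n in counts.values())
    over_total = sum(counts.values()) > total_limit
    return over_node or over_total


hammering = ["/item/7"] * 5 + ["/"]   # one page requested five times
browsing = ["/", "/list"]             # light, varied access
```

For instance, with a per-node limit of 4 the `hammering` list is flagged even though its total of six matches is small.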
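In one embodiment, after determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler, the method further includes: sending an authentication page to the requesting end corresponding to the web crawler for identity verification.
After that determination has been made, when a resource access request containing the same requester identifier is received again, the request may be intercepted and a preset authentication page retrieved and sent to the requesting end corresponding to the web crawler for identity verification. If verification fails, a verification-failed page is sent to the requesting end; if verification succeeds, the preset authentication page may still be sent to the requesting end again at preset intervals.
In one embodiment, after determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler, the method further includes: determining whether the requester identifier corresponding to the web crawler is a whitelisted crawler identifier; if so, allowing the resource access request initiated by the web crawler; if not, rejecting the resource access request initiated by the web crawler.
A whitelisted crawler identifier is the requester identifier of a web crawler that is on the crawler whitelist. For example, a crawler whitelist may be preset and the IP addresses of web crawlers that are allowed access stored in it. When a web crawler is identified, its IP address is matched against the IP addresses of all whitelisted crawler identifiers; if the requester identifier corresponding to the web crawler is a whitelisted crawler identifier, access is not restricted. For instance, Baidu crawlers usually use subdomains of baidu.com or baidu.jp, Google crawlers usually use subdomains of googlebot.com, Microsoft Bing crawlers use subdomains of search.msn.com, and Sogou crawlers use subdomains of rawl.sogou.com. The subdomains of these search engines can be stored in the crawler whitelist, and when such a subdomain is detected, the resource access request initiated by the web crawler can be allowed. A preset crawler whitelist makes it faster to pick out the requests that should be allowed and lowers the rate of wrongly rejecting legitimate resource access requests.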
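A whitelist check by domain suffix might look like the sketch below. It assumes the requester's hostname has already been obtained by some means outside the sketch (for example a reverse DNS lookup, which the text does not describe), and it uses only a subset of the domains named above; the function names are illustrative.

```python
# Domain suffixes taken from the examples in the text (subset).
WHITELIST_SUFFIXES = ("baidu.com", "baidu.jp", "googlebot.com", "search.msn.com")


def is_whitelisted(hostname):
    """True if hostname equals a whitelisted domain or is one of its subdomains."""
    host = hostname.lower().rstrip(".")
    # Require a dot before the suffix so "evilgooglebot.com" does not match.
    return any(host == s or host.endswith("." + s) for s in WHITELIST_SUFFIXES)


def handle_flagged_request(hostname):
    """Allow flagged requests from whitelisted crawlers, reject the rest."""
    return "allow" if is_whitelisted(hostname) else "reject"
```

Matching on `"." + suffix` rather than on the bare suffix is a deliberate choice in this sketch: it keeps look-alike domains from passing as search engine crawlers.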
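In one embodiment, as shown in FIG. 5, another web crawler identification method is provided, including the following steps:
Step 502: receive multiple resource access requests, each containing a requester identifier and a resource identifier.
For example, resource access requests sent by multiple requesting ends may be received. A request may contain a requester identifier such as a network interface card (NIC), virtual NIC (vNIC), IP address, Domain Name System (DNS) name, or cookie, together with the target resource of the request, such as a web page resource, image resource, text resource, JS script resource, or advertisement resource.
Step 504: extract the resource identifiers corresponding to the same requester identifier to form a resource identifier set for each requesting end.
For example, the received resource access requests may be tallied once every preset time period, or the receipt time of each request may be recorded on arrival and the resource identifiers corresponding to the same requester identifier within a preset period collected according to receipt time.
Step 506: match all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and take each resource node that matches a resource identifier as an access node.
For example, the resource identifier set may be matched against each resource node in the resource structure tree along a preset traversal path, and match counts may be collected while the resource identifiers in the set are matched against each resource node.
Step 508: detect whether an isolated access node exists.
An isolated access node is an access node whose parent node is not an access node; access nodes 24b, 31b, and 32b in FIG. 3B are all isolated access nodes. If an isolated access node exists, step 510 is executed; otherwise, step 512 is executed.
Step 510: determine that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
For example, when an isolated access node exists, it may be determined whether the requester identifier corresponding to the resource identifier set is a whitelisted crawler identifier; if not, resource access requests containing that requester identifier are no longer allowed.
Step 512: count the number of times each access node matches a resource identifier in the resource identifier set.
For example, a parameter n may be preset with an initial value of 0, and n = n + 1 executed each time a resource identifier matching the access node is detected.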
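Step 514: calculate the page similarity between the subpage resource corresponding to the access node and the parent page resource corresponding to the access node's parent node.
For example, after feature vectors (Figure PCTCN2018099876-appb-000007, Figure PCTCN2018099876-appb-000008) are generated from the features of subpage resource Q and parent page resource P, the page similarity of Q and P (Figure PCTCN2018099876-appb-000010) can be calculated with the vector-space similarity formula (Figure PCTCN2018099876-appb-000009).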
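Step 516: count the first quantity of irrelevant resources contained in the subpage resource and the second quantity of update resources contained in the subpage resource, an update resource being an irrelevant resource not contained in the parent page resource.
To save bandwidth and improve response speed, a web crawler usually does not request irrelevant resources that are useless to it, whereas a normal terminal does request them because they are needed to render and display the page. Therefore, when no isolated access node exists, web crawlers can also be identified from the irrelevant resources associated with the resource access requests.
Step 518: calculate the node weight of each access node from the page similarity, the first quantity, and the second quantity.
For example, once the page similarity (Figure PCTCN2018099876-appb-000011) has been calculated, the node weight W(U_Q) of subpage resource Q can be obtained from it with the formula
W(U_Q) = δ × (1 − sim(U_P, U_Q)) × W(U_P) + (1 − δ) × θ
where δ is a constant and θ is a global statistical parameter (Figure PCTCN2018099876-appb-000012). When page Q contains no irrelevant resources, θ is 1; when it does, θ is calculated with the formula in Figure PCTCN2018099876-appb-000013. total_Q is the first quantity, the number of irrelevant resources contained in subpage resource Q. new_Q is the second quantity, the number of update resources contained in subpage resource Q. An update resource is an irrelevant resource that exists in subpage resource Q but not in parent page resource P. W(U_P) is the node weight of page resource P.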
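Step 520: calculate the comprehensive weight corresponding to the resource identifier set from the match count of each access node and the corresponding node weight.
Step 522: when the comprehensive weight is less than the preset weight, determine that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
In the embodiment above, after the resource identifier set is matched against the resource structure tree, it is determined whether an isolated access node exists. When none exists, the node weight of each access node is calculated from the similarity between the target resource of the access node and the target resource of its parent node, together with the irrelevant resources contained in the access node. This reduces the chance that a web crawler goes undetected and thereby improves the efficiency of web crawler identification.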
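The decision flow of FIG. 5 (steps 506 to 522) can be condensed into a single function, shown here as a rough sketch. It assumes the node weights of step 518 have been computed beforehand and are supplied as a dictionary; the tree, weights, and preset weight are illustrative values only.

```python
from collections import Counter


def classify_requester(resource_ids, parent_of, node_weight, preset_weight):
    """Return "crawler" or "normal" for one requester's resource identifier set."""
    # Steps 506/512: match identifiers to tree nodes and count matches per node.
    counts = Counter(r for r in resource_ids if r in parent_of)
    visited = set(counts)
    # Steps 508/510: any visited non-root node with an unvisited parent is isolated.
    for node in visited:
        parent = parent_of[node]
        if parent is not None and parent not in visited:
            return "crawler"
    # Step 520: comprehensive weight = sum of match count x node weight.
    total = sum(n * node_weight[node] for node, n in counts.items())
    # Step 522: compare against the preset weight.
    return "crawler" if total < preset_weight else "normal"


# Hypothetical tree and weights: home -> list -> item.
PARENT_OF = {"home": None, "list": "home", "item": "list"}
NODE_WEIGHT = {"home": 1.0, "list": 0.5, "item": 0.25}
```

A request history of `["home", "list", "item"]` yields a comprehensive weight of 1.75 and is classified as normal for a preset weight of 1.5, while `["item"]` alone is classified as a crawler because its parent was never visited.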
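It should be understood that although the steps in the flowcharts of FIG. 2 and FIG. 5 are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 2 and FIG. 5 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; nor are these sub-steps or stages necessarily executed sequentially, as they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.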
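In one embodiment, as shown in FIG. 6, a web crawler identification apparatus 600 is provided, including: a resource access request receiving module 602, configured to receive multiple resource access requests, each containing a requester identifier and a resource identifier; a resource identifier extraction module 604, configured to extract the resource identifiers corresponding to the same requester identifier to form a resource identifier set for each requesting end; a resource node matching module 606, configured to match all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree and to take each resource node that matches a resource identifier as an access node; and a web crawler identification module 608, configured to determine, when an isolated access node exists, that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler; the parent node of an isolated access node is not an access node.
In one embodiment, the web crawler identification module 608 is further configured to: when no isolated access node exists, count the number of times each access node matches a resource identifier in the resource identifier set; obtain the node weight of each access node; calculate the comprehensive weight corresponding to the resource identifier set from the match count of each access node and the corresponding node weight; and, when the comprehensive weight is less than the preset weight, determine that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
In one embodiment, the web crawler identification module 608 is further configured to: calculate the page similarity between the subpage resource corresponding to an access node and the parent page resource corresponding to the access node's parent node; count the first quantity of irrelevant resources contained in the subpage resource and the second quantity of update resources contained in the subpage resource, an update resource being an irrelevant resource not contained in the parent page resource; and calculate the node weight of each access node from the page similarity, the first quantity, and the second quantity.
In one embodiment, the irrelevant resources include at least one of image resources, JS script resources, and advertisement resources.
In one embodiment, the web crawler identification module 608 is further configured to: when no isolated access node exists, count the number of times each access node matches a resource identifier in the resource identifier set; and, when there is an access node whose match count exceeds the preset count, determine that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
In one embodiment, the web crawler identification module 608 is further configured to send an authentication page to the requesting end corresponding to the web crawler for identity verification.
In one embodiment, the web crawler identification module 608 is further configured to determine whether the requester identifier corresponding to the web crawler is a whitelisted crawler identifier; if so, allow the resource access request initiated by the web crawler; if not, reject the resource access request initiated by the web crawler.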
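For the specific limitations of the web crawler identification apparatus, refer to the limitations of the web crawler identification method above, which are not repeated here. Each module of the apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or be independent of, the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
The web crawler identification apparatus may be implemented in the form of computer readable instructions that run on a computer device as shown in FIG. 7.
In one embodiment, a computer device is provided, which may be a server and whose internal structure may be as shown in FIG. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile computer readable storage medium and an internal memory. The non-volatile computer readable storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer readable instructions held in the non-volatile computer readable storage medium. The database stores static resources, dynamic resources, the resource structure tree, and the like. The network interface is used to communicate with external terminals over a network connection. When executed by the processor, the computer readable instructions implement a web crawler identification method.
Those skilled in the art will understand that the structure shown in FIG. 7 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and one or more processors, the memory storing computer readable instructions that, when executed by the processor, implement the steps of the web crawler identification method provided in any embodiment of the present application.
In one embodiment, one or more non-volatile computer readable storage media storing computer readable instructions are provided; when executed by one or more processors, the instructions cause the one or more processors to implement the steps of the web crawler identification method provided in any embodiment of the present application.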
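A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by computer readable instructions directing the relevant hardware. The instructions may be stored in a non-volatile computer readable storage medium and, when executed, may include the processes of the method embodiments above. Any reference to memory, storage, a database, or other media used in the embodiments provided by the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these features has been described; however, any combination that involves no contradiction should be considered within the scope of this specification.
The embodiments described above express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within its scope of protection. The scope of protection of this patent shall therefore be determined by the appended claims.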

Claims (20)

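  1. A web crawler identification method, comprising:
    receiving multiple resource access requests, each resource access request containing a requester identifier and a resource identifier;
    extracting the resource identifiers corresponding to the same requester identifier to form a resource identifier set for each requesting end;
    matching all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and taking each resource node that matches a resource identifier as an access node; and
    when an isolated access node exists, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler;
    wherein the parent node of an isolated access node is not an access node.
  2. The method according to claim 1, further comprising, after matching all resource identifiers in the resource identifier set against the resource nodes in the preset resource structure tree and taking each resource node that matches a resource identifier as an access node:
    when no isolated access node exists, counting the number of times each access node matches a resource identifier in the resource identifier set;
    obtaining the node weight of each access node;
    calculating the comprehensive weight corresponding to the resource identifier set from the match count of each access node and the corresponding node weight; and
    when the comprehensive weight is less than a preset weight, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
  3. The method according to claim 2, wherein obtaining the node weight of each access node comprises:
    calculating the page similarity between the subpage resource corresponding to the access node and the parent page resource corresponding to the parent node of the access node;
    counting a first quantity of irrelevant resources contained in the subpage resource and a second quantity of update resources contained in the subpage resource, an update resource being an irrelevant resource not contained in the parent page resource; and
    calculating the node weight of each access node from the page similarity, the first quantity, and the second quantity.
  4. The method according to claim 3, wherein the irrelevant resources include at least one of image resources, JS script resources, and advertisement resources.
  5. The method according to claim 1, further comprising, after matching all resource identifiers in the resource identifier set against the resource nodes in the preset resource structure tree and taking each resource node that matches a resource identifier as an access node:
    when no isolated access node exists, counting the number of times each access node matches a resource identifier in the resource identifier set; and
    when there is an access node whose match count exceeds a preset count, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
  6. The method according to any one of claims 1 to 5, further comprising, after determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler:
    sending an authentication page to the requesting end corresponding to the web crawler for identity verification.
  7. The method according to any one of claims 1 to 5, further comprising, after determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler:
    determining whether the requester identifier corresponding to the web crawler is a whitelisted crawler identifier;
    if so, allowing the resource access request initiated by the web crawler; and
    if not, rejecting the resource access request initiated by the web crawler.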
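  8. A web crawler identification apparatus, comprising:
    a resource access request receiving module, configured to receive multiple resource access requests, each resource access request containing a requester identifier and a resource identifier;
    a resource identifier extraction module, configured to extract the resource identifiers corresponding to the same requester identifier to form a resource identifier set for each requesting end;
    a resource node matching module, configured to match all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree and to take each resource node that matches a resource identifier as an access node; and
    a web crawler identification module, configured to determine, when an isolated access node exists, that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler; wherein the parent node of an isolated access node is not an access node.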
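  9. A computer device, comprising a memory and one or more processors, the memory storing computer readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    receiving multiple resource access requests, each resource access request containing a requester identifier and a resource identifier;
    extracting the resource identifiers corresponding to the same requester identifier to form a resource identifier set for each requesting end;
    matching all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and taking each resource node that matches a resource identifier as an access node; and
    when an isolated access node exists, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler;
    wherein the parent node of an isolated access node is not an access node.
  10. The computer device according to claim 9, wherein, after the step of matching all resource identifiers in the resource identifier set against the resource nodes in the preset resource structure tree and taking each resource node that matches a resource identifier as an access node, the following steps are further performed:
    when no isolated access node exists, counting the number of times each access node matches a resource identifier in the resource identifier set;
    obtaining the node weight of each access node;
    calculating the comprehensive weight corresponding to the resource identifier set from the match count of each access node and the corresponding node weight; and
    when the comprehensive weight is less than a preset weight, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
  11. The computer device according to claim 10, wherein the step of obtaining the node weight of each access node comprises performing the following steps:
    calculating the page similarity between the subpage resource corresponding to the access node and the parent page resource corresponding to the parent node of the access node;
    counting a first quantity of irrelevant resources contained in the subpage resource and a second quantity of update resources contained in the subpage resource, an update resource being an irrelevant resource not contained in the parent page resource; and
    calculating the node weight of each access node from the page similarity, the first quantity, and the second quantity.
  12. The computer device according to claim 11, wherein the irrelevant resources include at least one of image resources, JS script resources, and advertisement resources.
  13. The computer device according to claim 9, wherein, after the step of matching all resource identifiers in the resource identifier set against the resource nodes in the preset resource structure tree and taking each resource node that matches a resource identifier as an access node, the following steps are further performed:
    when no isolated access node exists, counting the number of times each access node matches a resource identifier in the resource identifier set; and
    when there is an access node whose match count exceeds a preset count, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
  14. The computer device according to any one of claims 9 to 13, wherein, after the step of determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler, the following steps are further performed:
    determining whether the requester identifier corresponding to the web crawler is a whitelisted crawler identifier;
    if so, allowing the resource access request initiated by the web crawler; and
    if not, rejecting the resource access request initiated by the web crawler.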
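  15. One or more non-volatile computer readable storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    receiving multiple resource access requests, each resource access request containing a requester identifier and a resource identifier;
    extracting the resource identifiers corresponding to the same requester identifier to form a resource identifier set for each requesting end;
    matching all resource identifiers in the resource identifier set against the resource nodes in a preset resource structure tree, and taking each resource node that matches a resource identifier as an access node; and
    when an isolated access node exists, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler;
    wherein the parent node of an isolated access node is not an access node.
  16. The storage media according to claim 15, wherein, after the step of matching all resource identifiers in the resource identifier set against the resource nodes in the preset resource structure tree and taking each resource node that matches a resource identifier as an access node, the following steps are further performed:
    when no isolated access node exists, counting the number of times each access node matches a resource identifier in the resource identifier set;
    obtaining the node weight of each access node;
    calculating the comprehensive weight corresponding to the resource identifier set from the match count of each access node and the corresponding node weight; and
    when the comprehensive weight is less than a preset weight, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
  17. The storage media according to claim 16, wherein the step of obtaining the node weight of each access node comprises performing the following steps:
    calculating the page similarity between the subpage resource corresponding to the access node and the parent page resource corresponding to the parent node of the access node;
    counting a first quantity of irrelevant resources contained in the subpage resource and a second quantity of update resources contained in the subpage resource, an update resource being an irrelevant resource not contained in the parent page resource; and
    calculating the node weight of each access node from the page similarity, the first quantity, and the second quantity.
  18. The storage media according to claim 17, wherein the irrelevant resources include at least one of image resources, JS script resources, and advertisement resources.
  19. The storage media according to claim 15, wherein, after the step of matching all resource identifiers in the resource identifier set against the resource nodes in the preset resource structure tree and taking each resource node that matches a resource identifier as an access node, the following steps are further performed:
    when no isolated access node exists, counting the number of times each access node matches a resource identifier in the resource identifier set; and
    when there is an access node whose match count exceeds a preset count, determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler.
  20. The storage media according to any one of claims 15 to 19, wherein, after the step of determining that the requesting end corresponding to the resource identifier set is a requesting end that initiates resource access requests through a web crawler, the following steps are further performed:
    determining whether the requester identifier corresponding to the web crawler is a whitelisted crawler identifier;
    if so, allowing the resource access request initiated by the web crawler; and
    if not, rejecting the resource access request initiated by the web crawler.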
PCT/CN2018/099876 2018-01-12 2018-08-10 网络爬虫识别方法、装置、计算机设备和存储介质 WO2019136987A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810031350.2 2018-01-12
CN201810031350.2A CN108228864B (zh) 2018-01-12 2018-01-12 网络爬虫识别方法、装置、计算机设备和存储介质

Publications (1)

Publication Number Publication Date
WO2019136987A1 true WO2019136987A1 (zh) 2019-07-18

Family

ID=62641639

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/099876 WO2019136987A1 (zh) 2018-01-12 2018-08-10 网络爬虫识别方法、装置、计算机设备和存储介质

Country Status (2)

Country Link
CN (1) CN108228864B (zh)
WO (1) WO2019136987A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228864B (zh) * 2018-01-12 2019-08-20 深圳壹账通智能科技有限公司 网络爬虫识别方法、装置、计算机设备和存储介质
CN110647672B (zh) * 2019-08-29 2020-12-11 北京三快在线科技有限公司 异常用户检测方法、装置、电子设备及可读存储介质
CN111858929A (zh) * 2020-06-22 2020-10-30 网宿科技股份有限公司 一种基于图神经网络的网络爬虫检测方法、系统及装置
CN112434208B (zh) * 2020-12-03 2024-05-07 百果园技术(新加坡)有限公司 一种孤立森林的训练及其网络爬虫的识别方法与相关装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8788263B1 (en) * 2013-03-15 2014-07-22 Steven E. Richfield Natural language processing for analyzing internet content and finding solutions to needs expressed in text
CN107092660A (zh) * 2017-03-28 2017-08-25 成都优易数据有限公司 一种网站服务器爬虫识别方法和装置
CN107196968A (zh) * 2017-07-12 2017-09-22 深圳市活力天汇科技股份有限公司 一种爬虫识别方法
CN108228864A (zh) * 2018-01-12 2018-06-29 深圳壹账通智能科技有限公司 网络爬虫识别方法、装置、计算机设备和存储介质

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105187396A (zh) * 2015-08-11 2015-12-23 小米科技有限责任公司 识别网络爬虫的方法及装置

Also Published As

Publication number Publication date
CN108228864B (zh) 2019-08-20
CN108228864A (zh) 2018-06-29

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 17/11/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18899100

Country of ref document: EP

Kind code of ref document: A1