WO2021169239A1 - Method, system and device for identifying crawler data - Google Patents

Method, system and device for identifying crawler data

Info

Publication number
WO2021169239A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
request
session
target
crawler
Prior art date
Application number
PCT/CN2020/114053
Other languages
English (en)
French (fr)
Inventor
陈志勇
王凤杰
赵志文
Original Assignee
网宿科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网宿科技股份有限公司
Priority to EP20861973.4A (patent EP3893128A4)
Priority to US17/210,487 (patent US20210263979A1)
Publication of WO2021169239A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Definitions

  • This application relates to the field of Internet technology, and in particular to a method, system and equipment for identifying crawler data.
  • crawler technology can be used to automatically obtain web content, so as to quickly filter out the required information from a large amount of information.
  • crawlers may include legitimate crawlers of search engines, and may also include malicious crawlers of illegal data collection.
  • it is necessary to screen the access data, and then filter out the crawler data for further analysis.
  • the purpose of this application is to provide a method, system and equipment for identifying crawler data, which can effectively identify crawler data.
  • this application provides a method for identifying crawler data, the method including: obtaining site map data of a target website, and generating a vector graph of the site map data; obtaining session data of the target website, and mapping the session data to a subgraph of the vector graph based on the requests contained in the session data; and adding a session label to the session data, the session label indicating whether the session data is crawler data, and training a preset classifier based on the session label and the subgraph to obtain a classifier for distinguishing crawler data from non-crawler data.
  • the system includes: a vector graph generating unit, used to obtain site map data of a target website and generate a vector graph of the site map data; a subgraph mapping unit, used to obtain session data of the target website and map the session data to a subgraph of the vector graph based on the requests contained in the session data; and a training unit, used to add a session label to the session data, the session label indicating whether the session data is crawler data, and to train a preset classifier based on the session label and the subgraph to obtain a classifier for distinguishing crawler data from non-crawler data.
  • the crawler data recognition device includes a processor and a memory.
  • the memory is used to store a computer program, and when the computer program is executed by the processor, the above-mentioned method for identifying crawler data is implemented.
  • sitemap data (sitemap) of the target website can be obtained, and the data is usually XML (eXtensible Markup Language, extensible markup language) format data.
  • site map data can be converted into vector graphics, thereby converting data in XML format into image data.
  • session data can be mapped into a sub-picture of the vector graph according to the request contained therein. This subgraph can characterize the way the session data is accessed.
  • the mapped subgraph can be used to distinguish whether the session data is crawler data. Then, using the session label of the session data and the mapped subgraph, a machine learning algorithm can be used to train the preset classifier, thereby obtaining a classifier for distinguishing crawler data from non-crawler data.
  • images can be used as training samples to train accurate classifiers. Later, the classifier can predict whether the input session data is crawler data. It can be seen that the technical solution provided by the present application can accurately and efficiently predict whether the session data is crawler data through the classifier obtained through training.
  • FIG. 1 is a step diagram of a method for identifying crawler data in an embodiment of the present application
  • Fig. 2 is a flowchart of training a classifier in an embodiment of the present application
  • Fig. 3 is a schematic diagram of a vector diagram in an embodiment of the present application.
  • Fig. 4 is a schematic diagram of sub-picture mapping in an embodiment of the present application.
  • Fig. 5 is a schematic diagram of a subgraph in an embodiment of the present application.
  • FIG. 6 is a flowchart of identifying crawler data in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of functional modules of the crawler data identification system in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the internal structure of the crawler data identification device in an embodiment of the present application.
  • This application provides a method for identifying crawler data. Please refer to FIG. 1 and FIG. 2.
  • the method may include the following multiple steps.
  • S1 Obtain site map data of the target website, and generate a vector diagram of the site map data.
  • the target website may be a website whose crawler data is to be identified.
  • the target website usually has its own site map data
  • the site map data is an XML file containing information about each access link in the target website.
  • the site map data may include each URL (Uniform Resource Locator, Uniform Resource Locator) in the target website and the jump relationship between each URL.
  • the site map data of the target website can be obtained with an existing crawler tool or a bypass monitoring tool.
  • the site map data can be converted into a visualized vector diagram.
  • referring to Figure 3, the vector graph can include multiple node positions.
  • the circle in Figure 3 can be used as the node position.
  • each node location can correspond to an access link in the target website.
  • each access link contained in the site map data can be identified, and then the node location corresponding to each access link can be determined based on the jump relationship between each access link.
  • the jump relationship between the access links can be determined by the access sequence of the access links.
  • the page with access link A contains access links A1 and A2, then a jump relationship from A to A1 and from A to A2 can be constructed.
  • multiple access links with a jump relationship can be closer together in a vector graph.
  • the corresponding node positions can be determined in the vector graph for each access link.
  • the image containing the position of each node can be used as a vector diagram of the site map data.
  • the Verlet algorithm can be used to process the acquired site map data, so that the node coordinates corresponding to each access link can be calculated.
  • node coordinates can be used as node positions in the vector graph, so that each access link in the site map data can be mapped to the node positions in the vector graph.
  • the two node positions at both ends of the dotted line may be node positions with a jump relationship.
  • XML data can be converted into image data to facilitate the subsequent machine learning process.
  • S3 Obtain the session data of the target website, and map the session data into a subgraph in the vector graph based on the request contained in the session data.
  • training samples can be constructed based on the session data of the target website.
  • the access data of the target website can be recorded in the traffic log of the target website. In this way, the traffic log of the target website can be read.
  • the access data of each session (session) may be included.
  • the access data in the traffic log can be grouped according to sessions, so as to obtain one or more sets of session data. For each set of session data, corresponding training samples can be generated.
  • each group of session data may include one or more requests, and these requests may include access links.
  • each request in the session data can be sorted according to the access time. For example, the requests in the session data can be sorted in the order of access time from first to last.
  • a certain number of requests can be selected for analysis. For example, the first W requests in the sorted order can be selected.
  • this number can be determined by weighing the accuracy requirements of machine learning against its efficiency.
  • when the number is large, the access behavior of the session data can be characterized more accurately, but the machine learning process may take longer.
  • when the number is small, the machine learning process can be shortened, but the access behavior of the session data may not be characterized accurately.
  • the node location corresponding to each request can be queried in the above-mentioned vector graph according to the access link corresponding to each request.
  • each node position can correspond to an access link.
  • the access link corresponding to each request can be known, and the node position corresponding to each request in the vector graph can be determined.
  • multiple different requests may point to the same access link. In this case, the node positions of these requests in the vector graph can be the same.
  • the access frequency of a certain access link in the target website can also be used as a feature of the session data.
  • the visit frequency of the visit link can also be taken as a part of the training sample.
  • the request information of each request in the session data can be traversed, and for any target request, the access frequency of the access link corresponding to the target request can be counted.
  • the above request information may include various parameters of the request.
  • the request information may include various parameters such as the IP address of the request initiator, the access link to be requested, the requested access port, and the duration of the request. By counting the request information of each request, the access frequency of each access link can be determined.
  • the size of the access frequency can be represented by the node radius of the requesting node in the vector graph.
  • the greater the access frequency the greater the node radius of the requesting node.
  • the node radius corresponding to the access frequency can be determined.
  • a suitable increasing function can be selected, and the access frequency can be used as the independent variable of the increasing function, and the node radius of the request node can be used as the result of the increasing function. This can ensure that the greater the access frequency, the larger the radius of the requesting node.
  • a request node with the node radius can be generated, and the request node with the node radius can be used as the request node that matches the target request.
  • corresponding request nodes can be generated for each access link accessed by the session data, and the node radius of each request node can represent the access frequency of the access link.
  • the generated request node can be filled in the corresponding node position.
  • part of the node positions in the vector graph can be filled in by request nodes generated based on the session data.
  • the black filled circle can be the request node generated based on the session data.
  • the connection relationship between request nodes can also be determined from the sorted requests. Specifically, among the generated request nodes, any two request nodes that are adjacent in access time can be identified. For example, in Figure 4, url1 and url2 can be two adjacent request nodes. Two request nodes that are adjacent in access time are also consecutive in access order, so a connection can be established between them to indicate this adjacency. Note that two requests adjacent in access time may also map to the same request node. For example, url3 and url4 are consecutive in time but point to the same request node, in which case no connection can be established.
  • therefore, after any two request nodes adjacent in access time are identified, it can be further checked whether they are different request nodes, and a connection is established between them only if they are.
  • connection established between two requesting nodes can also have directivity.
  • This directional connection can characterize the access sequence of the two requesting nodes.
  • the requesting node with the first access time points to the requesting node with the later access time.
  • the image formed by the connected request nodes can be used as the subgraph of the vector graph obtained by the mapping. The subgraph can represent the access behavior of the session data, so that the session data in XML format can be converted into an image.
  • S5: Add a session label to the session data, the session label indicating whether the session data is crawler data, and train a preset classifier based on the session label and the subgraph to obtain a classifier for distinguishing crawler data from non-crawler data.
  • the generated subgraph can be used as a training sample.
  • the session label can indicate whether the current session data is crawler data.
  • conventional methods can be used to determine whether the session data is crawler data, so as to add a corresponding session tag to the session data.
  • a variety of conventional methods can be used to analyze the session data.
  • the generated subgraph can be used as the training data, and the session label can be used as a criterion for measuring the correctness of the training result to train the preset classifier.
  • the preset classifier can be a conventional machine learning model.
  • the preset classifier may be a convolutional neural network model, a support vector machine, a recurrent neural network model, and so on.
  • the corresponding model can be flexibly selected according to the accuracy requirements and training efficiency requirements. For example, LeNet-5, AlexNet or ResNet models can be selected.
  • the generated sub-picture can be input into the preset classifier.
  • the preset classifier can have multiple levels of neurons, and each neuron can correspond to an internal parameter.
  • the input subgraph can be processed by the corresponding internal parameters when it passes through the neurons at all levels, and finally a probability array can be output.
  • the probability array may include two probability values, which respectively correspond to the probability of crawler data and the probability of non-crawler data.
  • the classification result output by the preset classifier may be a data type corresponding to a larger probability value.
  • the probability array obtained by the preset classifier for the input subgraph is (0.8, 0.2), where the data type corresponding to the probability value of 0.8 is crawler data, and the classification result output by the preset classifier is crawler data.
  • the internal parameters initialized in the preset classifier may not predict the input subgraph accurately, so the classification result output by the preset classifier needs to be compared with the actual session label. If the two are consistent, the internal parameters need not be adjusted. If they are inconsistent, an error function can be generated from the difference between the two and used as correction information to adjust the internal parameters of the preset classifier. After the internal parameters are adjusted, the subgraph can be input into the preset classifier again, and the resulting classification result can again be compared with the session label. If the two are still inconsistent, the internal parameters can be adjusted further.
  • the classification result output by the preset classifier can finally be kept consistent with the actual session label.
  • the training process can be completed and a classifier used to distinguish crawler data from non-crawler data can be obtained.
  • the classifier can be used to predict the actual session data.
  • the server of the target website can record the current session data.
  • the server of the target website may record the unique identifier of the session data, and may record the number of requests in the session data.
  • the target session data initiated by the client for the target website can be obtained, and the target session data can be mapped to a target subgraph of the vector graph according to the method described in step S3.
  • the specified number threshold may be determined when training the classifier for distinguishing crawler data from non-crawler data. For example, when constructing a training sample, after the requests in the session data are sorted by access time, W of them can be selected to construct the mapped subgraph. In this case, the specified number threshold can be W.
  • the target subgraph can be input into the trained classifier, and the output result of the classifier can be used to determine whether the target session data is crawler data.
  • the output result of the classifier may be a text-type data, and the text-type data may represent crawler data or non-crawler data.
  • the output result of the classifier can also be Boolean data, where 0 can represent non-crawler data, and 1 can represent crawler data.
  • the output result of the classifier can also be other data types, which will not be listed here.
  • the alarm information may include a unique identifier of the target session data, so as to facilitate subsequent data investigation.
  • This application also provides a system for identifying crawler data. Please refer to Figure 7.
  • the system includes:
  • the vector graph generating unit is used to obtain the site map data of the target website and generate the vector graph of the site map data;
  • a subgraph mapping unit configured to obtain session data of the target website, and based on the request contained in the session data, map the session data to a subgraph in the vector graph;
  • the training unit is configured to add a session label to the session data, the session label is used to characterize whether the session data is crawler data, and to train a preset classifier based on the session label and the sub-image to Obtain a classifier used to distinguish crawler data from non-crawler data.
  • the sub-picture mapping unit includes:
  • a node location query module used to identify the request contained in the session data, and query the node location corresponding to each request in the vector graph;
  • a node generating module configured to generate a request node matching each request according to the request information of each request, and fill the generated request node at the corresponding node position;
  • a connection relationship determination module, used to sort the requests by access time, determine the connection relationships between the request nodes according to the sorting result, and use the image composed of the connected request nodes as the subgraph obtained by the mapping.
  • system further includes:
  • An access data obtaining unit configured to obtain target session data initiated by the client for the target website, and map the target session data to a target sub-graph of the vector graph;
  • the prediction unit is used for inputting the target sub-image into the trained classifier, and judging whether the target session data is crawler data based on the output result of the classifier.
  • the present application also provides a crawler data recognition device.
  • the crawler data recognition device includes a memory and a processor.
  • the memory is used to store a computer program, and when the computer program is executed by the processor, the above-mentioned method for identifying crawler data can be implemented.
  • the memory may include a physical device for storing information, which is usually digitized and then stored in a medium using electrical, magnetic, or optical methods.
  • the memory may also include: a device that uses electrical energy to store information, such as RAM or ROM; a device that uses magnetic energy to store information, such as a hard disk, floppy disk, magnetic tape, magnetic core memory, bubble memory, or USB flash drive; and a device that stores information optically, such as a CD or DVD.
  • there are other types of memory as well, such as quantum memory or graphene memory.
  • the processor can be implemented in any suitable manner.
  • the processor may take the form of, for example, a microprocessor or processor together with a computer-readable medium storing computer-readable program code (for example, software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller.
  • sitemap data (sitemap) of the target website can be obtained, and the data is usually XML (eXtensible Markup Language, extensible markup language) format data.
  • site map data can be converted into vector graphics, thereby converting data in XML format into image data.
  • session data can be mapped into a sub-picture of the vector graph according to the request contained therein. This subgraph can characterize the way the session data is accessed.
  • the mapped subgraph can be used to distinguish whether the session data is crawler data. Then, using the session label of the session data and the mapped subgraph, a machine learning algorithm can be used to train the preset classifier, thereby obtaining a classifier for distinguishing crawler data from non-crawler data.
  • the technical solution of the present application converts data into images, so that images can be used as training samples to train accurate classifiers. Later, the classifier can predict whether the input session data is crawler data. It can be seen that the technical solution provided by the present application can accurately and efficiently predict whether the session data is crawler data through the classifier obtained through training.
  • this application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware implementation, a complete software implementation, or a combination of software and hardware implementations. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device. The instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps is executed on the computer or other programmable equipment to produce computer-implemented processing. The instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer readable media.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

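A method, system and device for identifying crawler data, wherein the method includes: obtaining site map data of a target website, and generating a vector graph of the site map data (S1); obtaining session data of the target website, and mapping the session data to a subgraph of the vector graph based on the requests contained in the session data (S3); and adding a session label to the session data, the session label indicating whether the session data is crawler data, and training a preset classifier based on the session label and the subgraph to obtain a classifier for distinguishing crawler data from non-crawler data (S5).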

Description

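Method, system and device for identifying crawler data
Cross-reference
This application claims priority to Chinese patent application No. 202010112134.8, entitled "Method, system and device for identifying crawler data" and filed on February 24, 2020, which is incorporated herein by reference in its entirety.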
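Technical field
This application relates to the field of Internet technology, and in particular to a method, system and device for identifying crawler data.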
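Background
With the continuous development of Internet technology, the amount of information on the network has grown explosively. Crawler technology can now be used to fetch web content automatically, so that the required information can be quickly filtered out of a large volume of information. In practice, crawlers may include legitimate crawlers such as those of search engines, and may also include malicious crawlers that collect data illegally. To protect servers from attacks by malicious crawlers, access data needs to be screened so that crawler data can be filtered out for further analysis.
At present, crawler data can be identified or restricted by adding a UserAgent blacklist, limiting the access frequency of IP addresses, identifying device fingerprints, and similar means. However, maintaining a UserAgent blacklist and an IP address database takes considerable effort, and crawlers can bypass these checks by using proxy IP addresses or modifying the UserAgent, so existing methods for identifying crawler data are not very effective.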
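Summary
The purpose of this application is to provide a method, system and device for identifying crawler data that can effectively identify crawler data.
To achieve the above purpose, one aspect of this application provides a method for identifying crawler data, the method including: obtaining site map data of a target website, and generating a vector graph of the site map data; obtaining session data of the target website, and mapping the session data to a subgraph of the vector graph based on the requests contained in the session data; and adding a session label to the session data, the session label indicating whether the session data is crawler data, and training a preset classifier based on the session label and the subgraph to obtain a classifier for distinguishing crawler data from non-crawler data.
To achieve the above purpose, another aspect of this application provides a system for identifying crawler data, the system including: a vector graph generating unit, used to obtain site map data of a target website and generate a vector graph of the site map data; a subgraph mapping unit, used to obtain session data of the target website and map the session data to a subgraph of the vector graph based on the requests contained in the session data; and a training unit, used to add a session label to the session data, the session label indicating whether the session data is crawler data, and to train a preset classifier based on the session label and the subgraph to obtain a classifier for distinguishing crawler data from non-crawler data.
To achieve the above purpose, another aspect of this application provides a device for identifying crawler data, the device including a processor and a memory, where the memory is used to store a computer program, and when the computer program is executed by the processor, the above-mentioned method for identifying crawler data is implemented.
As can be seen from the above, the technical solutions provided by one or more embodiments of this application can identify crawler data by means of machine learning. Specifically, for a target website to be examined, the site map data (sitemap) of the target website can be obtained; this data is usually in XML (eXtensible Markup Language) format. In this application, the site map data can be converted into a vector graph, thereby converting XML data into image data. Subsequently, session data of the target website can be mapped to a subgraph of the vector graph according to the requests it contains. The subgraph can characterize the access pattern of the session data. Because crawler data and non-crawler data usually have different access patterns, the mapped subgraph can be used to distinguish whether the session data is crawler data. Then, using the session label of the session data and the mapped subgraph, a machine learning algorithm can be used to train the preset classifier, yielding a classifier for distinguishing crawler data from non-crawler data. By converting data into images, the technical solution of this application can use images as training samples to train an accurate classifier. The classifier can subsequently predict whether input session data is crawler data. The technical solution provided by this application can therefore predict accurately and efficiently, through the trained classifier, whether session data is crawler data.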
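Brief description of the drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a step diagram of the method for identifying crawler data in an embodiment of this application;
FIG. 2 is a flowchart of training a classifier in an embodiment of this application;
FIG. 3 is a schematic diagram of a vector graph in an embodiment of this application;
FIG. 4 is a schematic diagram of subgraph mapping in an embodiment of this application;
FIG. 5 is a schematic diagram of a subgraph in an embodiment of this application;
FIG. 6 is a flowchart of identifying crawler data in an embodiment of this application;
FIG. 7 is a schematic diagram of the functional modules of the system for identifying crawler data in an embodiment of this application;
FIG. 8 is a schematic diagram of the internal structure of the device for identifying crawler data in an embodiment of this application.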
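Detailed description
To make the purpose, technical solutions and advantages of this application clearer, the technical solutions of this application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
This application provides a method for identifying crawler data. Referring to FIG. 1 and FIG. 2, the method may include the following steps.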
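S1: Obtain site map data of the target website, and generate a vector graph of the site map data.
In this embodiment, the target website may be a website whose crawler data is to be identified. In practice, the target website usually has its own site map data, which is an XML file containing information about each access link in the target website. For example, the site map data may include each URL (Uniform Resource Locator) in the target website and the jump relationships between the URLs. The site map data of the target website can be obtained with an existing crawler tool or a bypass monitoring tool.
In this embodiment, after the site map data in XML format is obtained, it can be converted into a visual vector graph to facilitate the subsequent machine learning process. Referring to FIG. 3, the vector graph may include multiple node positions; the circles in FIG. 3 can serve as node positions. Each node position can correspond to one access link in the target website. Specifically, the access links contained in the site map data can be identified, and the node position of each access link can then be determined based on the jump relationships between the access links. The jump relationship between access links can be determined from the order in which the links are accessed. For example, if the page at access link A contains access links A1 and A2, jump relationships from A to A1 and from A to A2 can be constructed. Generally, access links that have a jump relationship can be placed close to each other in the vector graph. Based on the jump relationships, a node position can be determined in the vector graph for each access link. Finally, the image containing all node positions can be used as the vector graph of the site map data. In practice, the Verlet algorithm can be applied to the obtained site map data to compute the node coordinates of each access link. These node coordinates can serve as the node positions in the vector graph, so that each access link in the site map data is mapped to a node position in the vector graph. In addition, node positions that have a jump relationship can be connected by a line in the vector graph. For example, in FIG. 3, the two node positions at the ends of a dotted line can be node positions that have a jump relationship.
In this way, by processing the site map data, the XML data can be converted into image data to facilitate the subsequent machine learning process.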
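The layout part of step S1 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the jump relationships have already been extracted as a list of URL pairs, and it uses a basic spring-and-repulsion model advanced with damped position Verlet steps. The function name `verlet_layout` and all of its parameters (`steps`, `dt`, `damping`, `spring`, `rest`, `repulse`) are hypothetical.

```python
import math


def verlet_layout(nodes, edges, steps=200, dt=0.05, damping=0.9,
                  spring=1.0, rest=1.0, repulse=0.5):
    """Return {url: (x, y)} for a list of URLs and (src, dst) jump pairs."""
    n = len(nodes)
    index = {url: i for i, url in enumerate(nodes)}
    # Deterministic start: spread the nodes evenly on a circle of radius 2.
    pos = [(2 * math.cos(2 * math.pi * i / n), 2 * math.sin(2 * math.pi * i / n))
           for i in range(n)]
    prev = list(pos)  # prev == pos means zero initial velocity
    links = [(index[a], index[b]) for a, b in edges]

    for _ in range(steps):
        force = [[0.0, 0.0] for _ in range(n)]
        # Every pair of nodes repels, which keeps unrelated links apart.
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d2 = dx * dx + dy * dy + 1e-9
                d = math.sqrt(d2)
                f = repulse / d2
                force[i][0] += f * dx / d
                force[i][1] += f * dy / d
                force[j][0] -= f * dx / d
                force[j][1] -= f * dy / d
        # Links with a jump relationship attract like springs of length `rest`.
        for i, j in links:
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            d = math.sqrt(dx * dx + dy * dy) + 1e-9
            f = spring * (d - rest)
            force[i][0] += f * dx / d
            force[i][1] += f * dy / d
            force[j][0] -= f * dx / d
            force[j][1] -= f * dy / d
        # Damped position Verlet: x_new = x + (x - x_prev) * damping + a * dt^2.
        nxt = [(pos[i][0] + (pos[i][0] - prev[i][0]) * damping + force[i][0] * dt * dt,
                pos[i][1] + (pos[i][1] - prev[i][1]) * damping + force[i][1] * dt * dt)
               for i in range(n)]
        prev, pos = pos, nxt
    return {url: pos[index[url]] for url in nodes}
```

With this model, links joined by a jump relationship settle near each other while unrelated links drift apart, which matches the placement described for the vector graph. Any other force-directed layout could be substituted.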
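S3: Obtain session data of the target website, and map the session data to a subgraph of the vector graph based on the requests contained in the session data.
In this embodiment, training samples need to be constructed before machine learning is performed. Specifically, the training samples can be constructed from the session data of the target website. In practice, access data of the target website can be recorded in the website's traffic log, so the traffic log can be read to obtain it. The traffic log may contain the access data of each session. In this embodiment, the access data in the traffic log can be grouped by session to obtain one or more groups of session data. A corresponding training sample can be generated for each group of session data.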
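Grouping the traffic log by session can be sketched as below. The record layout is an assumption made for illustration: each log entry is taken to be a dict with hypothetical `session_id`, `url` and `ts` (access time) fields, which the source does not specify.

```python
from collections import defaultdict


def group_by_session(records):
    """Group traffic-log records into {session_id: [record, ...]}."""
    sessions = defaultdict(list)
    for record in records:
        sessions[record["session_id"]].append(record)
    return dict(sessions)
```

Each value in the returned dict is one group of session data from which a training sample can be built.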
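In this embodiment, each group of session data may include one or more requests, and these requests may contain access links. To identify how each group of session data accesses the target website, the requests in the session data can be sorted by access time, for example from earliest to latest.
In one embodiment, after the requests in the session data are sorted, a certain number of requests can be selected for analysis so that the access behavior of the session data is characterized well. For example, the first W requests in the sorted order can be selected. In practice, this number can be determined by weighing the accuracy requirements of machine learning against its efficiency. When the number is large, the access behavior of the session data can be characterized more accurately, but the machine learning process may take longer. When the number is small, the machine learning process can be shortened, but the access behavior of the session data may not be characterized accurately.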
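Sorting by access time and keeping the first W requests can be sketched in a few lines. The `ts` field and the function name `first_w_requests` are hypothetical; the source only states that requests are ordered by access time and that W of them are selected.

```python
def first_w_requests(session_requests, w):
    """Sort one session's requests by access time and keep the earliest w."""
    ordered = sorted(session_requests, key=lambda request: request["ts"])
    return ordered[:w]
```

A larger `w` keeps more of the session's behavior at the cost of larger training inputs, which is the trade-off described above.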
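In this embodiment, after the requests contained in the session data are identified, the node position of each request can be looked up in the vector graph according to the access link of that request. Each node position in the vector graph corresponds to one access link. Identifying the requests in the session data therefore yields the access link of each request, and from it the node position of each request in the vector graph. Note that several different requests may point to the same access link, in which case those requests share the same node position in the vector graph.
In this embodiment, the frequency with which an access link of the target website is accessed may also serve as a feature of the session data. Accordingly, the access frequency of each access link may be made part of the training sample. Specifically, the request information of each request in the session data may be traversed, and for any target request, the access frequency of the access link of that target request may be counted. The request information may contain the parameters of the request, such as the IP address of the requester, the requested access link, the requested port, and the duration of the request. Counting over the request information of all requests gives the access frequency of each access link. The access frequency may be represented by the node radius of the request node in the vector graph: the higher the access frequency, the larger the node radius. The node radius corresponding to an access frequency can thus be determined. In practice, a suitable increasing function may be chosen, with the access frequency as its argument and the node radius of the request node as its value, which guarantees that a higher access frequency gives a larger radius. Once the node radius for an access frequency is determined, a request node with that radius can be generated and used as the request node matching the target request. In this manner, a request node can be generated for each access link accessed by the session data, and the radius of each request node characterizes the access frequency of its access link.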
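A possible frequency-to-radius mapping is sketched below. The text only requires an increasing function; the logarithmic form, the constants `base_radius` and `scale`, and the `url` field name are assumptions chosen for illustration.

```python
import math
from collections import Counter


def node_radii(requests, base_radius=1.0, scale=1.0):
    """Map each accessed link to a radius that grows with its frequency.

    radius = base_radius + scale * log(1 + frequency) is one increasing
    function among many; the choice is an assumption.
    """
    frequency = Counter(r["url"] for r in requests)
    return {url: base_radius + scale * math.log1p(count)
            for url, count in frequency.items()}


# Hypothetical session: "/list" is requested three times, "/home" once.
reqs = [{"url": "/list"}, {"url": "/home"}, {"url": "/list"}, {"url": "/list"}]
radii = node_radii(reqs)
```

A logarithm keeps very frequent links from producing nodes that swamp the image, while still preserving the ordering by frequency.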
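Referring to FIG. 4, after the node position of each request is determined in the vector graph and a request node matching each request is generated, the generated request nodes may be filled in at the corresponding node positions. Some of the node positions in the vector graph are thus filled by request nodes generated from the session data. In FIG. 4, the circles filled in black may be request nodes generated from the session data.
In this embodiment, to indicate the order in which the access links are accessed, the connection relationships between request nodes may also be determined from the sorted requests. Specifically, any two request nodes that are adjacent in access time may be identified among the generated request nodes. For example, in FIG. 4, url1 and url2 may be two adjacent request nodes. Two request nodes adjacent in access time are also consecutive in access order, so a line may be drawn between them to show that they are adjacent in access time. Note that two requests adjacent in access time may map to the same request node. For example, url3 and url4 are two consecutive requests but point to the same request node, in which case no line can be drawn. Therefore, after any two request nodes adjacent in access time are identified, a further check may be made: if the two request nodes are different, a line may be drawn between them.
In some scenarios, the line drawn between two request nodes may also be directed. A directed line characterizes the access order of the two request nodes. Generally, as shown in FIG. 4, the line points from the request node accessed earlier to the request node accessed later.
Referring to FIG. 5, after the request nodes are generated and lines are drawn between them, the image formed by the connected request nodes may be taken as the subgraph of the vector graph obtained by the mapping. The subgraph characterizes the access behavior of the session data, so the session data is thereby converted into an image.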
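The mapping from sorted requests to a subgraph can be sketched as follows. This is a hedged illustration: the function name `build_subgraph`, the `url` field, and the decision to skip links that have no node position in the vector graph are assumptions not stated in the text.

```python
def build_subgraph(ordered_requests, node_positions):
    """Return (nodes, edges) for a session already sorted by access time.

    nodes: {url: (x, y)} for the links the session visited.
    edges: directed (earlier_url, later_url) pairs for requests adjacent
           in time that map to different request nodes.
    """
    nodes = {}
    edges = []
    previous = None
    for request in ordered_requests:
        url = request["url"]
        if url not in node_positions:
            continue  # assumption: ignore links absent from the sitemap
        nodes[url] = node_positions[url]
        # Adjacent requests on the same node get no line, as in the text.
        if previous is not None and previous != url:
            edges.append((previous, url))
        previous = url
    return nodes, edges


# Hypothetical layout and session; the last two requests hit the same node.
layout = {"/1": (0.0, 0.0), "/2": (1.0, 0.0), "/3": (1.0, 1.0)}
visits = [{"url": "/1"}, {"url": "/2"}, {"url": "/3"}, {"url": "/3"}]
sub_nodes, sub_edges = build_subgraph(visits, layout)
```

The returned nodes and edges would still need to be rasterized into an image before being fed to an image classifier; that rendering step is outside this sketch.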
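S5: Add a session label to the session data, the session label indicating whether the session data is crawler data, and train a preset classifier based on the session label and the subgraph to obtain a classifier for distinguishing crawler data from non-crawler data.
In this embodiment, the generated subgraph serves as a training sample. To measure whether the training result is correct, a session label must also be added to the session data; the label indicates whether the current session data is crawler data. In practice, conventional means may be used to judge whether the session data is crawler data, and the corresponding session label is then added. To improve the accuracy of the session label, several conventional means may be used together to analyze the session data.
In this embodiment, after the session label is added, the preset classifier may be trained with the generated subgraph as training data and the session label as the standard for judging whether the training result is correct. The preset classifier may be a conventional machine learning model, such as a convolutional neural network, a support vector machine, or a recurrent neural network. In practice, the model may be chosen flexibly according to the accuracy and training efficiency required; for example, LeNet-5, AlexNet, or ResNet may be used.
In this embodiment, when the preset classifier is trained, the generated subgraph may be input into the preset classifier. The preset classifier may have multiple levels of neurons, each of which may correspond to an internal parameter. As the input subgraph passes through each level of neurons, it is processed by the corresponding internal parameters, and a probability array is finally output. The probability array may contain two probability values, corresponding respectively to the probability of crawler data and the probability of non-crawler data. The classification result output by the preset classifier may be the data type with the larger probability value. For example, if the preset classifier produces the probability array (0.8, 0.2) for an input subgraph and the value 0.8 corresponds to crawler data, the classification result output by the preset classifier is crawler data.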
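Turning the probability array into a classification result amounts to picking the entry with the larger value. A minimal sketch, assuming the first entry is the crawler probability as in the (0.8, 0.2) example; the label strings are illustrative.

```python
LABELS = ("crawler", "non-crawler")  # assumed order of the probability array


def classify(probabilities):
    """Return the label whose probability is largest."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return LABELS[best]


result = classify((0.8, 0.2))
```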
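In this embodiment, the initial internal parameters of the preset classifier may not predict the input subgraph accurately, so the classification result output by the preset classifier must be compared with the actual session label. If the two agree, the internal parameters need not be adjusted. If they disagree, an error function can be generated from the difference between them and used as correction information to adjust the internal parameters of the preset classifier. After the internal parameters are adjusted, the subgraph may be input into the preset classifier again and the new classification result compared with the session label. If the two still disagree, the internal parameters are adjusted further. Through repeated training on a large number of training samples, the classification result output by the preset classifier is eventually brought into agreement with the actual session label. The training process is then complete, yielding a classifier for distinguishing crawler data from non-crawler data.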
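The compare-and-adjust loop can be illustrated with a deliberately small stand-in. The sketch below trains a single-layer perceptron on short feature vectors rather than a convolutional network on subgraph images, so it shows only the control flow (predict, compare with the label, adjust parameters on a mismatch, repeat); the feature vectors, learning rate, and function name are assumptions.

```python
def train_perceptron(samples, labels, lr=0.1, max_epochs=100):
    """Train until every prediction matches its label or epochs run out.

    Labels: 1 = crawler data, 0 = non-crawler data. Returns a predict
    function that uses the trained parameters.
    """
    size = len(samples[0])
    weights = [0.0] * size
    bias = 0.0

    def predict(x):
        score = sum(w * v for w, v in zip(weights, x)) + bias
        return 1 if score > 0 else 0

    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            error = y - predict(x)
            if error != 0:  # result disagrees with the session label
                mistakes += 1
                for i in range(size):
                    weights[i] += lr * error * x[i]  # correction step
                bias += lr * error
        if mistakes == 0:  # all results agree with the labels
            break
    return predict


# Hypothetical two-feature summaries of four labeled sessions.
features = [(2.0, 0.1), (1.5, 0.3), (0.2, 1.8), (0.1, 1.2)]
session_labels = [1, 1, 0, 0]
predict = train_perceptron(features, session_labels)
```

A perceptron only converges when the classes are linearly separable, which is one reason an actual deployment would more plausibly use one of the network architectures named in the text.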
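In one embodiment, once the classifier has been trained, it can be used to make predictions on actual session data. Specifically, referring to FIG. 6, when a client sends requests to the server of the target website, the server may record the current session data. In particular, the server may record the unique identifier of the session data and the number of requests in the session data.
In this embodiment, when the session data of a client is to be identified, the target session data initiated by the client toward the target website may be obtained and mapped to a target subgraph of the vector graph in the manner described in step S3. When processing the target session data, it may first be determined whether the number of requests in the target session data has reached a specified threshold. The threshold may be determined when the classifier for distinguishing crawler data from non-crawler data is trained. For example, if W requests were selected, after sorting by access time, to construct the mapped subgraph when the training samples were built, the specified threshold may be W. If the number of requests in the target session data has not reached the threshold, processing may wait until the number of requests grows to the threshold. If the number of requests already exceeds the threshold, the requests may be sorted by access time and a number of requests equal to the threshold selected to generate the corresponding target subgraph.
In this embodiment, after the target subgraph is generated, it may be input into the trained classifier, and the output of the classifier may be used to judge whether the target session data is crawler data. In practice, the output of the classifier may be textual data indicating crawler data or non-crawler data. The output may also be Boolean data, where 0 indicates non-crawler data and 1 indicates crawler data. The output may also take other data types, which are not enumerated here.
In this embodiment, if the target session data is judged to be crawler data, corresponding alert information may be generated. The alert information may include the unique identifier of the target session data to facilitate subsequent investigation.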
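The prediction flow, from threshold check to alert, can be sketched as below. The function name `check_session`, the status strings, the alert fields, and the two callables passed in (`to_subgraph` and `classifier`) are hypothetical placeholders; the classifier is assumed to return 1 for crawler data and 0 otherwise, matching the Boolean output described in the text.

```python
def check_session(session_id, requests, threshold_w, to_subgraph, classifier):
    """Return (status, alert) for one target session.

    status is "pending" while fewer than `threshold_w` requests have
    arrived, otherwise "crawler" or "normal". alert is a dict only when
    the session is judged to be crawler data.
    """
    if len(requests) < threshold_w:
        return "pending", None  # wait for more requests
    ordered = sorted(requests, key=lambda r: r["timestamp"])[:threshold_w]
    is_crawler = bool(classifier(to_subgraph(ordered)))
    if is_crawler:
        # The alert carries the session's unique identifier for follow-up.
        return "crawler", {"session_id": session_id,
                           "message": "crawler data detected"}
    return "normal", None


# Stub mapping and classifier, for illustration only.
reqs = [{"timestamp": t, "url": "/p"} for t in (3, 1, 2)]
status, alert = check_session("s-1", reqs, 3,
                              to_subgraph=lambda r: r,
                              classifier=lambda g: 1)
```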
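The present application further provides a system for identifying crawler data. Referring to FIG. 7, the system includes:
a vector graph generation unit, configured to obtain the sitemap data of a target website and generate a vector graph of the sitemap data;
a subgraph mapping unit, configured to obtain the session data of the target website and, based on the requests contained in the session data, map the session data to a subgraph of the vector graph;
a training unit, configured to add a session label to the session data, the session label indicating whether the session data is crawler data, and to train a preset classifier based on the session label and the subgraph to obtain a classifier for distinguishing crawler data from non-crawler data.
In one embodiment, the subgraph mapping unit includes:
a node position query module, configured to identify the requests contained in the session data and look up the node position of each request in the vector graph;
a node generation module, configured to generate, according to the request information of each request, a request node matching each request, and to fill the generated request nodes in at the corresponding node positions;
a connection relationship determination module, configured to sort the requests by access time, determine the connection relationships between the request nodes according to the sorting result, and take the image formed by the connected request nodes as the mapped subgraph.
In one embodiment, the system further includes:
an access data acquisition unit, configured to obtain the target session data initiated by a client toward the target website and map the target session data to a target subgraph of the vector graph;
a prediction unit, configured to input the target subgraph into the trained classifier and judge, from the output of the classifier, whether the target session data is crawler data.
Referring to FIG. 8, the present application further provides a device for identifying crawler data. The device includes a memory and a processor; the memory is configured to store a computer program which, when executed by the processor, implements the method for identifying crawler data described above.
In the present application, the memory may include a physical apparatus for storing information, in which information is usually digitized and then stored in a medium using electrical, magnetic, or optical methods. The memory may include: apparatuses that store information electrically, such as RAM, ROM, or USB flash drives; apparatuses that store information magnetically, such as hard disks, floppy disks, magnetic tape, magnetic core memory, or bubble memory; and apparatuses that store information optically, such as CDs or DVDs. There are also other kinds of memory, such as quantum memory or graphene memory.
In the present application, the processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (for example, software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on.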
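The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on what distinguishes it from the others. In particular, the system and device embodiments may be understood by reference to the description of the method embodiments above.
A person skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The above are merely embodiments of the present application and are not intended to limit it. Various modifications and changes to the present application are possible for a person skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.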

Claims (13)

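  1. A method for identifying crawler data, comprising:
    obtaining sitemap data of a target website, and generating a vector graph of the sitemap data;
    obtaining session data of the target website, and mapping the session data to a subgraph of the vector graph based on requests contained in the session data;
    adding a session label to the session data, the session label indicating whether the session data is crawler data, and training a preset classifier based on the session label and the subgraph to obtain a classifier for distinguishing crawler data from non-crawler data.
  2. The method according to claim 1, wherein generating the vector graph of the sitemap data comprises:
    identifying the access links contained in the sitemap data, and determining a node position corresponding to each access link based on jump relationships between the access links;
    taking an image containing the node positions as the vector graph of the sitemap data.
  3. The method according to claim 1, wherein obtaining the session data of the target website comprises:
    reading a traffic log of the target website, and grouping the access data in the traffic log by session to obtain one or more groups of session data.
  4. The method according to claim 1, wherein mapping the session data to a subgraph of the vector graph comprises:
    identifying the requests contained in the session data, and looking up the node position corresponding to each request in the vector graph;
    generating, according to request information of each request, a request node matching each request, and filling the generated request nodes in at the corresponding node positions;
    sorting the requests by access time, determining connection relationships between the request nodes according to the sorting result, and taking an image formed by the request nodes having the connection relationships as the mapped subgraph.
  5. The method according to claim 4, wherein generating a request node matching each request comprises:
    for any target request among the requests, counting an access frequency of the access link corresponding to the target request, and determining a node radius corresponding to the access frequency;
    generating a request node having the node radius, and taking the request node having the node radius as the request node matching the target request.
  6. The method according to claim 4, wherein determining the connection relationships between the request nodes according to the sorting result comprises:
    determining, among the request nodes, any two request nodes adjacent in access time, and if the two request nodes are different request nodes, establishing a line between the two request nodes.
  7. The method according to claim 1, wherein training the preset classifier based on the session label and the subgraph comprises:
    inputting the subgraph into the preset classifier, and comparing a classification result output by the preset classifier with the session label;
    if the classification result is inconsistent with the session label, generating correction information, and adjusting internal parameters of the preset classifier using the correction information, so that after the subgraph is input into the preset classifier again, the classification result output by the preset classifier is consistent with the session label.
  8. The method according to claim 1, wherein after the classifier for distinguishing crawler data from non-crawler data is obtained, the method further comprises:
    obtaining target session data initiated by a client toward the target website, and mapping the target session data to a target subgraph of the vector graph;
    inputting the target subgraph into the trained classifier, and judging, from an output of the classifier, whether the target session data is crawler data.
  9. The method according to claim 8, wherein mapping the target session data to a target subgraph of the vector graph comprises:
    identifying whether the number of requests in the target session data has reached a specified threshold, and if the specified threshold has been reached, mapping the target session data to the target subgraph of the vector graph; wherein the specified threshold is determined when the classifier for distinguishing crawler data from non-crawler data is trained.
  10. A system for identifying crawler data, comprising:
    a vector graph generation unit, configured to obtain sitemap data of a target website and generate a vector graph of the sitemap data;
    a subgraph mapping unit, configured to obtain session data of the target website and, based on requests contained in the session data, map the session data to a subgraph of the vector graph;
    a training unit, configured to add a session label to the session data, the session label indicating whether the session data is crawler data, and to train a preset classifier based on the session label and the subgraph to obtain a classifier for distinguishing crawler data from non-crawler data.
  11. The system according to claim 10, wherein the subgraph mapping unit comprises:
    a node position query module, configured to identify the requests contained in the session data and look up the node position corresponding to each request in the vector graph;
    a node generation module, configured to generate, according to request information of each request, a request node matching each request, and to fill the generated request nodes in at the corresponding node positions;
    a connection relationship determination module, configured to sort the requests by access time, determine connection relationships between the request nodes according to the sorting result, and take an image formed by the request nodes having the connection relationships as the mapped subgraph.
  12. The system according to claim 10, wherein the system further comprises:
    an access data acquisition unit, configured to obtain target session data initiated by a client toward the target website and map the target session data to a target subgraph of the vector graph;
    a prediction unit, configured to input the target subgraph into the trained classifier and judge, from an output of the classifier, whether the target session data is crawler data.
  13. A device for identifying crawler data, comprising a memory and a processor, the memory being configured to store a computer program which, when executed by the processor, implements the method according to any one of claims 1 to 9.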
PCT/CN2020/114053 2020-02-24 2020-09-08 Method, system and device for identifying crawler data WO2021169239A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20861973.4A EP3893128A4 (en) 2020-02-24 2020-09-08 METHOD, SYSTEM AND DEVICE FOR IDENTIFYING CRAWLER DATA
US17/210,487 US20210263979A1 (en) 2020-02-24 2021-03-23 Method, system and device for identifying crawler data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010112134.8 2020-02-24
CN202010112134.8A CN111368163B (zh) 2020-02-24 2020-02-24 Method, system and device for identifying crawler data

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/210,487 Continuation US20210263979A1 (en) 2020-02-24 2021-03-23 Method, system and device for identifying crawler data

Publications (1)

Publication Number Publication Date
WO2021169239A1 true WO2021169239A1 (zh) 2021-09-02

Family

ID=71208126

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/114053 WO2021169239A1 (zh) 2020-02-24 2020-09-08 Method, system and device for identifying crawler data

Country Status (3)

Country Link
EP (1) EP3893128A4 (zh)
CN (1) CN111368163B (zh)
WO (1) WO2021169239A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
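CN111368163B (zh) * 2020-02-24 2024-03-26 网宿科技股份有限公司 Method, system and device for identifying crawler data
CN115410158B (zh) * 2022-09-13 2023-06-30 北京交通大学 Landmark extraction method based on surveillance cameras
CN117596081B (zh) * 2024-01-18 2024-03-26 北京无忧创想信息技术有限公司 Machine-learning-based method and system for identifying community crawler behavior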

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080147669A1 (en) * 2006-12-14 2008-06-19 Microsoft Corporation Detecting web spam from changes to links of web sites
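CN105930727A (zh) * 2016-04-25 2016-09-07 无锡中科富农物联科技有限公司 Web-based crawler identification algorithm
CN107147640A (zh) * 2017-05-09 2017-09-08 网宿科技股份有限公司 Method and system for identifying web crawlers
CN107818132A (zh) * 2017-09-21 2018-03-20 中国科学院信息工程研究所 Machine-learning-based web proxy discovery method
CN107888616A (zh) * 2017-12-06 2018-04-06 北京知道创宇信息技术有限公司 Method for constructing a URI-based classification model and method for detecting Webshell attacks on websites
CN108763274A (zh) * 2018-04-09 2018-11-06 北京三快在线科技有限公司 Access request identification method and apparatus, electronic device, and storage medium
CN110392032A (zh) * 2018-04-23 2019-10-29 华为技术有限公司 Method, apparatus, and storage medium for detecting abnormal URLs
CN111368163A (zh) * 2020-02-24 2020-07-03 网宿科技股份有限公司 Method, system and device for identifying crawler data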

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109600272B (zh) * 2017-09-30 2022-03-18 北京国双科技有限公司 Crawler detection method and apparatus
CN110245280B (zh) * 2019-05-06 2021-03-02 北京三快在线科技有限公司 Method, apparatus, storage medium, and electronic device for identifying web crawlers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3893128A4 *

Also Published As

Publication number Publication date
CN111368163A (zh) 2020-07-03
EP3893128A4 (en) 2021-12-22
EP3893128A1 (en) 2021-10-13
CN111368163B (zh) 2024-03-26

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2020861973

Country of ref document: EP

Effective date: 20210316

NENP Non-entry into the national phase

Ref country code: DE