WO2021114454A1 - Method and device for detecting crawler requests - Google Patents

Method and device for detecting crawler requests

Info

Publication number
WO2021114454A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
information
crawler
identification information
conversation
Prior art date
Application number
PCT/CN2020/071457
Other languages
English (en)
French (fr)
Inventor
洪镇宇
黄梅芬
王鑫渊
Original Assignee
网宿科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网宿科技股份有限公司
Publication of WO2021114454A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/40 Network security protocols
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the present invention relates to the technical field of network communication, in particular to a method and device for detecting crawler requests.
  • A web crawler is a program or script that automatically crawls data resources from a website according to preset rules. Starting from one or several web pages, it continuously sends access requests (which can be called crawler requests) to the web server to crawl the resources and links in those pages, and then continues to visit and crawl subsequent pages through the crawled links until all the required pages have been crawled.
  • the website operator will adopt a certain crawler detection scheme to filter out crawler requests from all the access requests of the webpage.
  • The received access requests can be checked against preset crawler request detection rules. For example, when the source IP address of an access request is on a blacklist, or when a client visits all the pages linked from a certain webpage within a single session, it can be determined that the corresponding access request is a crawler request, or that the corresponding client is the sender of crawler requests.
  • embodiments of the present invention provide a method and device for detecting crawler requests.
  • the technical solution is as follows:
  • a method for detecting crawler requests includes:
  • a crawler detection model corresponding to the target website is trained, and crawler detection is performed on the target website through the crawler detection model.
  • a device for detecting crawler requests includes:
  • the information acquisition module is used to acquire the historical visit information of the target website in the target historical period, and divide the historical visit information containing the same main dimension identifier into the same information group;
  • a sequence creation module configured to create, based on the historical access information under the target information group, multiple single-dimensional session sequences corresponding to the secondary dimension identifiers under the target information group, and a cross-dimensional session sequence corresponding to the primary dimension identifier;
  • a feature extraction module configured to generate a feature vector corresponding to the target information group according to the traffic features corresponding to the cross-dimensional session sequence and the traffic features corresponding to the multiple single-dimensional session sequences;
  • the crawler detection module is used to train a crawler detection model corresponding to the target website based on the feature vectors corresponding to all information groups under the target website, and perform crawler detection on the target website through the crawler detection model.
  • In a third aspect, a network device is provided. The network device includes a processor and a memory.
  • The memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the method for detecting crawler requests as described in the first aspect.
  • A computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method for detecting crawler requests as described in the first aspect.
  • The historical visit information of the target website in the target historical period is obtained, and the historical visit information containing the same primary dimension identifier is divided into the same information group; based on the historical visit information under the target information group, multiple single-dimensional session sequences corresponding to the secondary dimension identifiers under the target information group and a cross-dimensional session sequence corresponding to the primary dimension identifier are created; a feature vector corresponding to the target information group is generated according to the traffic characteristics of the cross-dimensional session sequence and of the multiple single-dimensional session sequences; and, based on the feature vectors corresponding to all information groups under the target website, a crawler detection model corresponding to the target website is trained and used to perform crawler detection on the target website.
  • FIG. 1 is a flowchart of a method for detecting crawler requests provided by an embodiment of the present invention;
  • FIG. 2 is a flowchart of a method for detecting crawler requests provided by an embodiment of the present invention;
  • FIG. 3 is a flowchart of a method for detecting crawler requests provided by an embodiment of the present invention;
  • FIG. 4 is a schematic structural diagram of an apparatus for detecting crawler requests provided by an embodiment of the present invention;
  • FIG. 5 is a schematic structural diagram of a network device provided by an embodiment of the present invention.
  • the embodiment of the present invention provides a method for detecting crawler requests.
  • the execution subject of the method can be any network device with data processing function, and an intelligent model can be created and trained based on machine learning technology.
  • the network device can be the back-end server of any website, which can create an intelligent model for crawler detection based on the visit history of the website, and then use the intelligent model to detect subsequent crawler requests.
  • The network device can also be a network node (such as a node server in a CDN cluster) that receives and forwards website access requests. It can train a crawler detection model for each website based on the access requests it has historically received, and then use the crawler detection model to detect crawler requests.
  • The above-mentioned network device may include a processor, a memory, and a transceiver.
  • The processor may be used to perform the processing of detecting crawler requests in the following procedures.
  • The memory may be used to store the data required and generated during the following processing.
  • The transceiver may be used to receive and send related data in the following processing.
  • Step 101 Obtain historical visit information of a target website in a target historical period, and divide historical visit information containing the same primary dimension identifier into the same information group.
  • The primary dimension identifier can be carried in the user's access request to the website, and can be used to identify different users and distinguish between different access requests. Specifically, it can be any one of the source IP address, user ID, or device fingerprint in the access request, selected according to actual needs. Of course, this embodiment also supports the selection of other feasible identifiers as the primary dimension identifier.
  • the user can send an access request for the target website to the network device to access any webpage in the target website.
  • the network device can record the seven-tuple information of the access request, the pointed URL, access time, data packet size and other parameter information to generate historical access information of the target website. After that, the network device can use the historical visit information to train and generate a crawler detection model corresponding to the target website.
  • The network device can obtain the historical visit information of the target website in the target historical period, for example the previous 7 days, and then group all the historical visit information according to the pre-selected primary dimension identifier, so that the historical visit information containing the same primary dimension identifier is divided into the same information group.
  • For example, if the primary dimension identifier is the source IP address, the network device can group the historical visit information of the target website in the previous 7 days by source IP address. Each information group is composed of historical visit information containing the same source IP address, and different information groups correspond to different source IP addresses.
  • Until a crawler detection model has been established from the historical visit information, the network device can first use traditional crawler detection technology to perform crawler detection on the target website.
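  • The grouping in step 101 can be sketched as follows. This is a minimal illustration that assumes each piece of historical visit information is a dictionary; the field names (`src_ip`, `user_id`, `url`, `time`) and the function name are hypothetical and do not come from the patent text.

```python
from collections import defaultdict


def group_by_primary_dimension(records, primary_key="src_ip"):
    """Split historical visit records into information groups.

    Every record that carries the same value for the primary dimension
    identifier (assumed here to be the source IP) lands in the same group.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[primary_key]].append(record)
    return dict(groups)


# Hypothetical visit records for illustration only.
records = [
    {"src_ip": "10.0.0.1", "user_id": "ID1", "url": "/a", "time": 3},
    {"src_ip": "10.0.0.2", "user_id": "ID2", "url": "/b", "time": 1},
    {"src_ip": "10.0.0.1", "user_id": "ID3", "url": "/c", "time": 2},
]
groups = group_by_primary_dimension(records)
```

  • Passing a different `primary_key` (for example a user ID or device fingerprint field) would regroup the same records under another primary dimension identifier.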
  • Step 102 Based on the historical access information under the target information group, create multiple single-dimensional session sequences corresponding to the secondary dimension identifiers under the target information group, and a cross-dimensional session sequence corresponding to the primary dimension identifier.
  • The secondary dimension identifier can be carried in the client's access request to the website. It can be further used to identify different users and distinguish different access requests. It can be one or more of the source IP address, user ID, device fingerprint, or browser identifier, as long as it differs from the primary dimension identifier, and can be selected according to actual needs.
  • Of course, this embodiment also supports the selection of other feasible identifiers as secondary dimension identifiers.
  • the target information group is any information group obtained by dividing historical access information in step 101.
  • After the network device groups the historical access information according to the primary dimension identifier, it can create session sequences in each information group.
  • A session sequence can be an ordered set composed of multiple sets of session information, and each set of session information can contain at least one piece of specific access information in an HTTP session.
  • The network device may create, based on the historical access information under the target information group, a cross-dimensional session sequence corresponding to the primary dimension identifier and multiple single-dimensional session sequences corresponding to the secondary dimension identifiers.
  • the network device may use the HTTP session as the granularity to divide the historical access information under the target information group into multiple groups of session information.
  • On the one hand, the multiple sets of session information can be arranged in order to generate the cross-dimensional session sequence corresponding to the primary dimension identifier. On the other hand, the multiple sets of session information can be divided according to secondary dimension identifier A, and the session information sharing the same value of secondary dimension identifier A can be arranged in order to generate multiple single-dimensional session sequences corresponding to secondary dimension identifier A. Further, secondary dimension identifier B can be selected to re-divide the multiple sets of session information, and the session information sharing the same value of secondary dimension identifier B can then be arranged in order to generate multiple single-dimensional session sequences corresponding to secondary dimension identifier B. In this way, the single-dimensional session sequences corresponding to all secondary dimension identifiers can be generated in turn.
  • For example, suppose the primary dimension identifier is the source IP address, the secondary dimension identifiers are the user ID, browser identifier, and device fingerprint, and there are 30 sets of session information under the information group IP1.
  • First, the 30 sets of session information can be arranged in order to generate the cross-dimensional session sequence under the information group IP1.
  • Then, if the 30 sets of session information include 3 user IDs "ID1, ID2, ID3", they can be divided and arranged by user ID to generate single-dimensional session sequences corresponding to the 3 user IDs. If the 30 sets include 2 device fingerprints "MAC1, MAC2", they can be divided and arranged by device fingerprint to generate single-dimensional session sequences corresponding to the 2 device fingerprints. Likewise, if the 30 sets include 4 browser identifiers "UA1, UA2, UA3, UA4", they can be divided and arranged by browser identifier to generate single-dimensional session sequences corresponding to the 4 browser identifiers.
  • The creation process of the above-mentioned single-dimensional session sequences may be as follows: for a target secondary dimension identifier under the target information group, obtain the webpage access information contained in the sessions corresponding to each piece of target secondary dimension identification information; arrange the webpage access information by access time to generate a single-dimensional session sequence corresponding to each piece of target secondary dimension identification information.
  • The target secondary dimension identifier can be any pre-selected secondary dimension identifier, and the target secondary dimension identification information can be the specific identification information under that identifier.
  • For example, if the target secondary dimension identifier is the user ID, the target secondary dimension identification information may be "ID1: 893***221", "ID2: 668***203", and so on.
  • When the network device creates the multiple single-dimensional session sequences corresponding to the target secondary dimension identifier, it can first extract all the target secondary dimension identification information from the historical access information under the target information group, then filter out the sessions corresponding to each piece of target secondary dimension identification information, and then obtain all the webpage access information contained in these sessions.
  • the webpage access information may at least include the page identification and access time of the webpage. After that, the network device can arrange the above-mentioned webpage access information in the order of access time, thereby generating a single-dimensional conversation sequence corresponding to each target secondary dimension identification information.
  • the creation process of the above-mentioned cross-dimensional conversation sequence may be as follows: arrange all the webpage access information included in all the sessions of the target information group according to the access time, and generate the cross-dimensional conversation sequence corresponding to the main dimension identifier.
  • the webpage access information may at least include the page identification and access time of the webpage.
  • When the network device creates the cross-dimensional session sequence corresponding to the primary dimension identifier, it can arrange all the webpage access information contained in all sessions of the target information group in order of access time, thereby generating the cross-dimensional session sequence corresponding to the primary dimension identifier.
  • For example, if the above processing yields three single-dimensional session sequences "SP1, SP2, SP3" for the target secondary dimension identification information under the target information group, the webpage access information under each SPi can be sorted together by access time to obtain the cross-dimensional session sequence corresponding to the primary dimension identifier under the target information group.
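  • The sequence creation in step 102 can be sketched as follows. This is a simplified illustration that treats each record as one piece of webpage access information and orders it by access time; the field names (`user_id`, `ua`, `url`, `time`) and function names are hypothetical, and the grouping of accesses into HTTP sessions is omitted for brevity.

```python
from collections import defaultdict


def build_cross_dimension_sequence(group):
    """Order every access in one information group by access time."""
    return sorted(group, key=lambda record: record["time"])


def build_single_dimension_sequences(group, secondary_key):
    """Build one time-ordered sequence per value of a secondary dimension."""
    buckets = defaultdict(list)
    for record in group:
        buckets[record[secondary_key]].append(record)
    return {
        value: sorted(items, key=lambda record: record["time"])
        for value, items in buckets.items()
    }


# Hypothetical information group for one source IP.
group = [
    {"user_id": "ID1", "ua": "UA1", "url": "/a", "time": 5},
    {"user_id": "ID2", "ua": "UA1", "url": "/b", "time": 2},
    {"user_id": "ID1", "ua": "UA2", "url": "/c", "time": 1},
]
cross = build_cross_dimension_sequence(group)
by_user = build_single_dimension_sequences(group, "user_id")
by_ua = build_single_dimension_sequences(group, "ua")
```

  • Calling `build_single_dimension_sequences` once per secondary dimension identifier mirrors the repeated re-division described above for identifiers A and B.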
  • Step 103 Generate a feature vector corresponding to the target information group according to the traffic characteristics corresponding to the cross-dimensional session sequence and the traffic characteristics corresponding to the multiple single-dimensional session sequences.
  • the traffic characteristics of each session sequence can be calculated.
  • the traffic characteristics here may be preset, and are used to characterize the characteristics of multiple sets of session information contained in each session sequence, such as session interval time, request packet size, request type, request packet content and other characteristics. It is worth mentioning that, for a cross-dimension conversation sequence, the similarity between conversation information corresponding to different secondary dimension identifiers can also be detected.
  • the network device can integrate and splice the calculated traffic characteristics of each session sequence to generate a feature vector corresponding to the target information group.
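  • The splicing of traffic characteristics into one feature vector can be sketched as follows. The two features used here (mean interval between accesses and mean request size) and the averaging over single-dimensional sequences to keep the vector length fixed are assumptions chosen for illustration; the patent text does not prescribe a specific feature set or aggregation.

```python
from statistics import mean


def sequence_features(sequence):
    """Return [mean interval between accesses, mean request size]."""
    times = [record["time"] for record in sequence]
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_interval = mean(intervals) if intervals else 0.0
    mean_size = mean(record["size"] for record in sequence) if sequence else 0.0
    return [mean_interval, mean_size]


def group_feature_vector(cross_sequence, single_sequences):
    """Concatenate cross-dimensional features with averaged single-dimensional ones."""
    vector = list(sequence_features(cross_sequence))
    per_sequence = [sequence_features(seq) for seq in single_sequences]
    if per_sequence:
        # Average each feature column so the vector length does not depend
        # on how many single-dimensional sequences the group contains.
        vector += [mean(column) for column in zip(*per_sequence)]
    else:
        vector += [0.0, 0.0]
    return vector


# Hypothetical time-ordered accesses (time in seconds, size in bytes).
cross = [
    {"time": 0, "size": 100},
    {"time": 10, "size": 300},
    {"time": 20, "size": 200},
]
singles = [[cross[0], cross[2]], [cross[1]]]
vector = group_feature_vector(cross, singles)
```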
  • The session sequences can be scored according to the different website attributes carried in the sessions, and the scoring result can be used as the feature vector.
  • The corresponding processing can be as follows: obtain the site map of the target website, and establish an attribute score library of the target website based on the site map; according to the attribute score library, score each single-dimensional session sequence and the cross-dimensional session sequence separately, and set the scoring result as the feature vector of the target information group.
  • the attribute score database records the score values of different website attributes of the target website, and the website attributes include at least a web page URL, a web page referer, and all supported browser identifiers (ie, UA).
  • the network device can obtain the site map of the target website, and then record the access degree and jump relationship of each webpage in the target website according to the site map, as well as all the browser identifiers supported by the target website, and build an attribute score database.
  • the site map here can be actively provided by the target website, or it can be created by the network device based on the historical visit information of the target website; since the content of the website will be continuously updated, the network device can also update the site map and attribute score library regularly .
  • Different scoring mechanisms can be used to score the same website attribute, and the scores obtained by the different mechanisms can then be weighted to obtain a comprehensive score for that attribute. For example, there may be three scoring mechanisms:
  • scoring mechanism A is based on how frequently a website attribute appears across the website, scoring mechanism B is based on how frequently a website attribute appears across all dynamic pages, and scoring mechanism C relies on a manual evaluation of the attribute's value based on business experience.
  • The network device can score each session according to the attribute score database, and then score each single-dimensional session sequence and the cross-dimensional session sequence based on the session scores. It can calculate the maximum score, minimum score, average score, weighted score, and so on for each session sequence, and the scoring result can be set as the feature vector of the target information group.
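  • The attribute scoring described above can be sketched as follows. The weights, the per-mechanism scores, the rule that a session's score is the sum of its attribute scores, and the choice of (max, min, mean) as the summary are all assumptions made for illustration; the patent text leaves these details open.

```python
from statistics import mean


def composite_score(mechanism_scores, weights):
    """Weighted average of the scores several mechanisms give one attribute."""
    weighted = sum(score * weight for score, weight in zip(mechanism_scores, weights))
    return weighted / sum(weights)


# Hypothetical weights for scoring mechanisms A, B, and C.
WEIGHTS = (2, 1, 1)

# Hypothetical attribute score library keyed by (attribute name, value).
attribute_scores = {
    ("url", "/index"): composite_score((4, 2, 2), WEIGHTS),  # 3.0
    ("url", "/admin"): composite_score((0, 0, 4), WEIGHTS),  # 1.0
    ("ua", "UA1"): composite_score((2, 2, 2), WEIGHTS),      # 2.0
}


def score_session(session, library):
    """Sum the library scores of the attributes seen in one session.

    Attributes missing from the library contribute 0.
    """
    return sum(library.get((name, value), 0.0) for name, value in session.items())


def score_sequence(sequence, library):
    """Summarize a session sequence as [max, min, mean] of its session scores."""
    scores = [score_session(session, library) for session in sequence]
    return [max(scores), min(scores), mean(scores)]


sequence = [{"url": "/index", "ua": "UA1"}, {"url": "/admin", "ua": "curl"}]
summary = score_sequence(sequence, attribute_scores)
```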
  • In this way, the patterns of crawler requests can be captured with greater probability, and the accuracy of crawler detection can be effectively improved.
  • The human-computer interaction information generated when the user visits the target website can also be used for the feature vector, and the corresponding processing can be as follows: based on the human-computer interaction information in the single-dimensional session sequences and the cross-dimensional session sequence, generate the feature vector corresponding to the target information group.
  • After receiving an access request for the target website sent by the user terminal, the network device can embed a human-computer interaction detection program in the response message.
  • The program can be used to monitor whether behaviors such as mouse movements, clicks, and key presses occur on the user terminal, and can report the detection result to the network device.
  • The network device can record the human-computer interaction information generated in each session and, after creating the single-dimensional session sequences and the cross-dimensional session sequence, generate the feature vector corresponding to the target information group based on the human-computer interaction information contained in those session sequences.
  • the network device can combine device fingerprints to further determine whether the detected human-computer interaction information contains forged information. In this way, by detecting human-computer interaction information and determining crawler requests from the perspective of human-computer interaction, normal requests and crawler requests can be more effectively identified, and the accuracy of crawler detection can be improved.
  • Step 104 Train a crawler detection model corresponding to the target website based on the feature vectors corresponding to all information groups under the target website, and perform crawler detection on the target website through the crawler detection model.
  • the network device can generate feature vectors corresponding to all information groups under the target website according to the processing of step 102 and step 103. In this way, the network device can use machine learning technology to train the crawler detection model corresponding to the target website based on these feature vectors. Specifically, the network device can use a supervised algorithm or an unsupervised algorithm to train the crawler detection model. After the crawler detection model training is completed, the network device can perform crawler detection on the target website through the crawler detection model, that is, identify the crawler request among all the received access requests of the target website. Furthermore, since the content of the website often changes and the web crawler is constantly updated, it is necessary to set a certain validity period for the crawler detection model.
  • When the network device uses the crawler detection model to perform crawler detection, it can first determine whether the current moment is within the validity period of the crawler detection model. If it is, detection can continue; otherwise, the processing from step 101 to step 104 can be re-executed to update the crawler detection model.
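  • The validity-period check can be sketched as follows. The model is represented as a plain dictionary, the 7-day validity period is an assumed value, and `retrain_stub` is a hypothetical stand-in for re-running steps 101 to 104.

```python
from datetime import datetime, timedelta


def model_is_valid(trained_at, validity, now):
    """Return True while `now` falls inside the model's validity period."""
    return trained_at <= now < trained_at + validity


def get_usable_model(model, now, retrain):
    """Return the current model if still valid, otherwise a retrained one."""
    if model_is_valid(model["trained_at"], model["validity"], now):
        return model
    return retrain(now)


def retrain_stub(now):
    """Hypothetical placeholder for re-executing steps 101 to 104."""
    return {"name": "retrained", "trained_at": now, "validity": timedelta(days=7)}


model = {
    "name": "initial",
    "trained_at": datetime(2020, 1, 1),
    "validity": timedelta(days=7),  # assumed validity period
}
fresh = get_usable_model(model, datetime(2020, 1, 5), retrain_stub)
stale = get_usable_model(model, datetime(2020, 1, 9), retrain_stub)
```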
  • Step 201 Periodically count the main dimension identification information corresponding to the target website that appears in the current period.
  • The primary dimension identification information may be the specific identification information under the primary dimension identifier. For example, if the primary dimension identifier is the source IP address, the primary dimension identification information may be "IP address 1: 192.***.***.200", "IP address 2: 255.***.***.101", and so on.
  • the network device can periodically analyze the historical visit information for the target website in the current period, and count all the main dimension identification information that appears therein. For example, the network device may perform statistics every 10 minutes to obtain all source IP addresses that appear in the historical visit information of the target website in the previous 10 minutes.
  • Step 202 For each piece of primary dimension identification information, based on all historical access information containing that primary dimension identification information within a preset time period, create multiple single-dimensional session sequences and a cross-dimensional session sequence corresponding to the primary dimension identification information.
  • For the processing of this step, refer to step 102. The preset duration may be set in advance by a technician and adjusted as required; for example, it may be 6 hours or 12 hours.
  • Step 203 Generate a feature vector corresponding to the primary dimension identification information according to the traffic characteristics of the multiple single-dimensional session sequences and the cross-dimensional session sequence corresponding to the primary dimension identification information.
  • For the processing of this step, refer to step 103 for details.
  • Step 204 Input the feature vector corresponding to the main dimension identification information into the crawler detection model, and determine whether the main dimension identification information belongs to the crawler request according to the output content of the model.
  • batch detection of access requests periodically can ensure the timeliness of feedback for normal access requests, and there is no need to frequently perform crawler detection processing, which can reduce the resource consumption of the device to a certain extent.
  • Step 301 when a request to access the target website is received, the main dimension identification information of the access request is obtained.
  • the network device can detect the access request after receiving the access request of the target website, that is, it can first obtain the main dimension identification information of the access request. In this way, if it is detected that the access request is a normal request, the access request can be responded to according to the conventional response mechanism; if the access request is detected as a crawler request, the crawler request can be marked, and the crawler request can be discarded.
  • Step 302 Create multiple single-dimensional session sequences and a cross-dimensional session sequence corresponding to the primary dimension identification information based on all historical access information containing the primary dimension identification information within a preset time period.
  • For the processing of this step, refer to step 102. The preset duration may be set in advance by a technician and adjusted as required; for example, it may be 6 hours or 12 hours.
  • Step 303 Generate a feature vector corresponding to the primary dimension identification information according to the traffic characteristics of the multiple single-dimensional session sequences and the cross-dimensional session sequence corresponding to the primary dimension identification information.
  • For the processing of this step, refer to step 103 for details.
  • Step 304 Input the feature vector corresponding to the main dimension identification information into the crawler detection model, and determine whether the access request belongs to the crawler request according to the output content of the model.
  • the target access request is marked as a crawler request.
  • each time a network device detects a crawler request it can record all the dimensional identification information of the crawler request. Therefore, after receiving the target access request, the network device may first obtain all the dimension identification information of the target access request, and then compare it with the overall dimension identification information of the detected crawler request. If it is found that the overall similarity of the identification information of all dimensions between the target access request and a crawler request is greater than the preset threshold, the target access request can be directly identified as a crawler request, without the need to use the crawler detection model to detect the target access request . In this way, first use the method of comparing all the dimensional identification information to perform preliminary detection on the access request, which can quickly and simply identify part of the crawler request, thereby reducing the workload of crawler detection and saving equipment resources.
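  • The preliminary comparison against previously detected crawler requests can be sketched as follows. Overall similarity is assumed here to be the fraction of dimension identifiers with identical values, and the 0.7 threshold, field names, and function names are hypothetical; the patent text does not define the similarity measure.

```python
def dimension_similarity(request_ids, crawler_ids, keys):
    """Fraction of dimension identifiers whose values match exactly."""
    matches = sum(
        1
        for key in keys
        if request_ids.get(key) is not None and request_ids.get(key) == crawler_ids.get(key)
    )
    return matches / len(keys)


def matches_known_crawler(request_ids, known_crawlers, keys, threshold):
    """True if the request resembles any recorded crawler request closely enough."""
    return any(
        dimension_similarity(request_ids, crawler, keys) > threshold
        for crawler in known_crawlers
    )


KEYS = ("src_ip", "user_id", "device_fp", "user_agent")
known_crawlers = [
    {"src_ip": "10.0.0.9", "user_id": "ID7", "device_fp": "MAC3", "user_agent": "UA9"},
]
similar = {"src_ip": "10.0.0.10", "user_id": "ID7", "device_fp": "MAC3", "user_agent": "UA9"}
different = {"src_ip": "10.0.0.1", "user_id": "ID1", "device_fp": "MAC1", "user_agent": "UA9"}

flag_similar = matches_known_crawler(similar, known_crawlers, KEYS, threshold=0.7)
flag_different = matches_known_crawler(different, known_crawlers, KEYS, threshold=0.7)
```

  • Only requests that fail this quick check would need to go through the crawler detection model.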
  • The detection accuracy of the crawler detection model can be verified in advance, and the corresponding processing can be as follows: verify the detection accuracy of the crawler detection model based on a preset crawler feature material library; if the detection accuracy is lower than a preset threshold, supplementally obtain newly added historical visit information of the target website; based on the newly added historical visit information and the existing historical visit information, retrain the crawler detection model corresponding to the target website.
  • a crawler feature material library may be preset at the network device, and the crawler feature material library may contain a large number of determined feature vectors of the crawler request.
  • the network device can verify the detection accuracy of the crawler detection model corresponding to the target website based on the crawler feature material library. If the detection accuracy rate is lower than the preset threshold, the network device can supplementally obtain new historical visit information of the target website.
  • The new historical visit information may be historical visit information of the target website beyond that obtained during model training in step 101. That is, if the historical visit information of the past 7 days was obtained when the crawler detection model was trained, the new historical visit information may be the historical visit information from the 8th previous day.
  • the network device can retrain the crawler detection model corresponding to the target website based on the newly added historical visit information and historical visit information in the process from step 101 to step 104 until the detection accuracy of the crawler detection model reaches a preset threshold.
  • On the one hand, verifying the crawler detection model against the crawler feature material library can effectively ensure the detection accuracy of the model; on the other hand, supplementing the historical access information and retraining the crawler detection model can bring the model closer to the crawler detection requirements of the target website.
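  • The verify-and-retrain loop can be sketched as follows. The model is represented as a callable that returns True for a crawler feature vector, accuracy is taken as the share of library vectors the model flags, and `train_stub`, the 0.9 threshold, and the 14-day cap are hypothetical values used only to make the sketch runnable.

```python
def detection_accuracy(model, crawler_vectors):
    """Share of known crawler feature vectors that the model flags."""
    hits = sum(1 for vector in crawler_vectors if model(vector))
    return hits / len(crawler_vectors)


def train_until_accurate(train, crawler_vectors, threshold, start_days=7, max_days=14):
    """Retrain on a longer history window until accuracy reaches the threshold."""
    days = start_days
    model = train(days)
    while detection_accuracy(model, crawler_vectors) < threshold and days < max_days:
        days += 1  # supplement one more day of historical visit information
        model = train(days)
    return model, days


def train_stub(days):
    """Hypothetical trainer: more history lowers the decision cutoff."""
    cutoff = 12 - days
    return lambda vector: vector[0] > cutoff


# Hypothetical crawler feature material library.
crawler_vectors = [[4.5], [6.0]]
model, days_used = train_until_accurate(train_stub, crawler_vectors, threshold=0.9)
```

  • The `max_days` cap is a design choice of this sketch so that the loop terminates even if the threshold is never reached.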
  • network devices can change the method of model training or set the form of crawler request whitelists to avoid preventing the crawler detection model from detecting part of the crawler requests.
  • In the embodiments of the present invention, historical access information of the target website in the target historical period is acquired, and historical access information containing the same primary dimension identifier is divided into the same information group; based on the historical access information in the target information group, multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier, are created; a feature vector corresponding to the target information group is generated according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences; and a crawler detection model corresponding to the target website is trained based on the feature vectors of all information groups of the target website and used to perform crawler detection on the target website.
  • Based on the same technical concept, an embodiment of the present invention also provides a device for detecting crawler requests. As shown in FIG. 4, the device includes:
  • an information acquisition module 401, configured to acquire historical access information of the target website in the target historical period, and divide historical access information containing the same primary dimension identifier into the same information group;
  • a sequence creation module 402, configured to create, based on the historical access information in the target information group, multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier;
  • a feature extraction module 403, configured to generate a feature vector corresponding to the target information group according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences;
  • a crawler detection module 404, configured to train a crawler detection model corresponding to the target website based on the feature vectors of all information groups of the target website, and perform crawler detection on the target website through the crawler detection model.
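  • Optionally, the crawler detection module 404 is specifically configured to:
  • periodically collect the primary dimension identification information of the target website that appears in the current period;
  • for each piece of primary dimension identification information, create the multiple single-dimension session sequences and the cross-dimension session sequence corresponding to it, based on all historical access information within a preset duration that contains that primary dimension identification information;
  • generate a feature vector corresponding to the primary dimension identification information according to the traffic features of its multiple single-dimension session sequences and its cross-dimension session sequence;
  • input the feature vector corresponding to the primary dimension identification information into the crawler detection model, and determine from the model output whether the primary dimension identification information belongs to a crawler request.
  • Optionally, the crawler detection module 404 is specifically configured to:
  • when an access request for the target website is received, obtain the primary dimension identification information of the access request;
  • create the multiple single-dimension session sequences and the cross-dimension session sequence corresponding to the primary dimension identification information, based on all historical access information within a preset duration that contains that primary dimension identification information;
  • generate a feature vector corresponding to the primary dimension identification information according to the traffic features of its multiple single-dimension session sequences and its cross-dimension session sequence;
  • input the feature vector corresponding to the primary dimension identification information into the crawler detection model, and determine from the model output whether the access request is a crawler request.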
  • FIG. 5 is a schematic structural diagram of a network device provided by an embodiment of the present invention.
  • The network device 500 may vary considerably with configuration or performance, and may include one or more central processing units 522 (for example, one or more processors), a memory 532, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 542 or data 544.
  • The memory 532 and the storage medium 530 may be transient storage or persistent storage.
  • The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the network device 500.
  • Further, the central processing unit 522 may be configured to communicate with the storage medium 530 and execute, on the network device 500, the series of instruction operations in the storage medium 530.
  • The network device 500 may also include one or more power supplies 529, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, one or more keyboards 556, and/or one or more operating systems 541, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
  • The network device 500 may include a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for performing the crawler request detection described above.
  • All or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium.
  • The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Abstract

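The present invention discloses a method and device for detecting crawler requests, and belongs to the technical field of network communication. The method includes: acquiring historical access information of a target website in a target historical period, and dividing historical access information containing the same primary dimension identifier into the same information group; based on the historical access information in a target information group, creating multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier; generating a feature vector corresponding to the target information group according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences; and training a crawler detection model corresponding to the target website based on the feature vectors of all information groups of the target website, and performing crawler detection on the target website through the crawler detection model. With the present invention, both traditional crawler requests and new types of crawler requests can be detected more accurately and effectively.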

Description

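A method and device for detecting crawler requests
Technical Field
The present invention relates to the technical field of network communication, and in particular to a method and device for detecting crawler requests.
Background
A web crawler is a program or script that automatically fetches data resources from a website according to preset rules. By continuously sending access requests (which may be called crawler requests) to the website server, it starts from one or several web pages, fetches the resources and links in them, and then follows the fetched links to visit and fetch subsequent pages until all required pages have been fetched.
According to statistics, crawler requests currently account for more than half of all access requests for web pages, and for website operators a large number of crawler requests places an extremely high load on the website servers. Website operators therefore adopt some crawler detection scheme to filter crawler requests out of all access requests for their pages. Specifically, received access requests can be checked against preset crawler request detection rules. For example, when the source IP address of an access request is on a blacklist, or a client visits every page under a certain web link within one session, the corresponding access request can be judged to be a crawler request, or the corresponding client can be judged to be the sender of crawler requests.
In the process of implementing the present invention, the inventors found that the prior art has at least the following problem:
With the development of computer technology, new types of web crawlers have emerged that work across sessions, crawl at random frequencies, and use IP proxy pools. Traditional crawler request detection rules can no longer effectively detect the new crawler requests generated by these crawlers, so there is an urgent need for a crawler detection scheme that can both identify traditional crawler requests and effectively detect new ones.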
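Summary of the Invention
To solve the problems of the prior art, embodiments of the present invention provide a method and device for detecting crawler requests. The technical solution is as follows:
In a first aspect, a method for detecting crawler requests is provided, the method including:
acquiring historical access information of a target website in a target historical period, and dividing historical access information containing the same primary dimension identifier into the same information group;
based on the historical access information in a target information group, creating multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier;
generating a feature vector corresponding to the target information group according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences;
training a crawler detection model corresponding to the target website based on the feature vectors of all information groups of the target website, and performing crawler detection on the target website through the crawler detection model.
In a second aspect, a device for detecting crawler requests is provided, the device including:
an information acquisition module, configured to acquire historical access information of a target website in a target historical period, and divide historical access information containing the same primary dimension identifier into the same information group;
a sequence creation module, configured to create, based on the historical access information in a target information group, multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier;
a feature extraction module, configured to generate a feature vector corresponding to the target information group according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences;
a crawler detection module, configured to train a crawler detection model corresponding to the target website based on the feature vectors of all information groups of the target website, and perform crawler detection on the target website through the crawler detection model.
In a third aspect, a network device is provided. The network device includes a processor and a memory, the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method for detecting crawler requests according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the method for detecting crawler requests according to the first aspect.
The beneficial effects of the technical solutions provided by the embodiments of the present invention are as follows:
In the embodiments of the present invention, historical access information of the target website in the target historical period is acquired, and historical access information containing the same primary dimension identifier is divided into the same information group; based on the historical access information in the target information group, multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier, are created; a feature vector corresponding to the target information group is generated according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences; and a crawler detection model corresponding to the target website is trained based on the feature vectors of all information groups of the target website and used to perform crawler detection on the target website. In this way, by constructing session sequences of different dimensions, analyzing access requests as a whole across sessions, and then using machine learning to build a crawler detection model for each website, the overall characteristics and sending patterns of crawler requests can be discovered more intuitively and conveniently, so that both traditional and new types of crawler requests can be detected more accurately and effectively.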
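Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for detecting crawler requests provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a method for detecting crawler requests provided by an embodiment of the present invention;
FIG. 3 is a flowchart of a method for detecting crawler requests provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a device for detecting crawler requests provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a network device provided by an embodiment of the present invention.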
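Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
An embodiment of the present invention provides a method for detecting crawler requests. The method may be executed by any network device that has data processing capability and can create and train an intelligent model based on machine learning. The network device may be the back-end server of any website, which can create an intelligent model for crawler detection from the website's access history and then use the model to detect subsequent crawler requests. The network device may also be a network node that receives and forwards website access requests (such as a node server in a CDN cluster), which can train a crawler detection model for each website from the history of received access requests and then use that model to detect crawler requests. The network device may include a processor, a memory, and a transceiver: the processor may perform the crawler request detection processing in the flow below, the memory may store the data required and generated during that processing, and the transceiver may receive and send the related data.
The processing flow shown in FIG. 1 is described in detail below with reference to specific embodiments, and may be as follows: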
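Step 101: acquire historical access information of the target website in a target historical period, and divide historical access information containing the same primary dimension identifier into the same information group.
The primary dimension identifier may be an identifier carried in a client's access request to the website that can be used to identify different clients and distinguish different access requests. Specifically, it may be any one of the source IP address, user ID, or device fingerprint in the access request, selected according to actual needs. This embodiment also supports selecting other feasible identifiers as the primary dimension identifier.
In implementation, after the target website goes online, users can send access requests for the target website to the network device in order to visit any page of the website. After receiving an access request for the target website, the network device can record parameters such as the request's seven-tuple information, the URL it points to, the access time, and the packet size, generating the historical access information of the target website. The network device can then use this historical access information to train the crawler detection model for the target website. Specifically, the network device can acquire the historical access information of the target website in the target historical period, for example the previous 7 days, and group all of it by the preselected primary dimension identifier, so that historical access information containing the same primary dimension identifier falls into the same information group. For example, if the primary dimension identifier is the source IP address, the network device groups the previous 7 days of historical access information by source IP address; each information group consists of historical access information containing the same source IP address, and different information groups correspond to different source IP addresses. It is worth noting that in the early period after the target website goes online, when there is not yet enough historical access information to build a crawler detection model, the network device can first use traditional crawler detection techniques on the target website until a crawler detection model has been built from historical access information.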
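The grouping in step 101 can be illustrated with a minimal Python sketch. The record layout (`src_ip`, `user_id`, `url`, `ts`) and the function name `group_by_primary` are assumptions made for illustration only; the source does not prescribe a log format.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_primary(records: List[dict], primary_key: str = "src_ip") -> Dict[str, List[dict]]:
    """Put every access record that shares a primary identifier value into one group."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        groups[record[primary_key]].append(record)
    return dict(groups)


# Hypothetical access records; a real log would also carry packet size, referer, etc.
records = [
    {"src_ip": "IP1", "user_id": "ID1", "url": "/a", "ts": 10},
    {"src_ip": "IP2", "user_id": "ID3", "url": "/a", "ts": 11},
    {"src_ip": "IP1", "user_id": "ID2", "url": "/b", "ts": 12},
]
groups = group_by_primary(records)
print({ip: len(recs) for ip, recs in groups.items()})
```

Passing `primary_key="user_id"` would group by user ID instead, which matches the text's point that the primary dimension identifier is a configurable choice.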
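Step 102: based on the historical access information in the target information group, create multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier.
A secondary dimension identifier may be an identifier, other than the primary dimension identifier, that is carried in a client's access request to the website and can further be used to identify different clients and distinguish different access requests. It may be one or more of the source IP address, user ID, device fingerprint, or browser identifier that differ from the primary dimension identifier, selected according to actual needs. This embodiment also supports selecting other feasible identifiers as secondary dimension identifiers. The target information group is any information group obtained by dividing the historical access information in step 101.
In implementation, after grouping the historical access information by primary dimension identifier, the network device can create session sequences within each information group. A session sequence may be an ordered collection of multiple sets of session information, and each set of session information may contain at least the specific access information of one HTTP session. Taking the target information group as an example, the network device can, based on the historical access information in the target information group, create the cross-dimension session sequence corresponding to the primary dimension identifier and the multiple single-dimension session sequences corresponding to the secondary dimension identifiers. Specifically, the network device can divide the historical access information in the target information group into multiple sets of session information at the granularity of HTTP sessions. Then, on the one hand, it can arrange the sets of session information in order to generate the cross-dimension session sequence corresponding to the primary dimension identifier; on the other hand, it can partition the sets of session information by secondary dimension identifier A and arrange the session information sharing each value of A in order, generating the multiple single-dimension session sequences corresponding to secondary dimension identifier A. Further, it can select secondary dimension identifier B, partition the sets of session information again, and arrange the session information sharing each value of B in order, generating the multiple single-dimension session sequences corresponding to secondary dimension identifier B; in this manner the single-dimension session sequences for all secondary dimension identifiers can be generated in turn. Two secondary dimension identifiers can also be selected at the same time to partition and arrange the session information, generating multiple single-dimension session sequences that correspond to the two identifiers jointly.
For example, suppose the primary dimension identifier is the source IP address, the secondary dimension identifiers are the user ID, browser identifier, and device fingerprint, and there are 30 sets of session information in information group IP1. The 30 sets of session information can be arranged in order to generate the cross-dimension session sequence of information group IP1. If the 30 sets of session information include the 3 user IDs "ID1, ID2, ID3", they can be partitioned and arranged by user ID to generate the single-dimension session sequences corresponding to the 3 user IDs; next, if they include the 2 device fingerprints "MAC1, MAC2", they can be partitioned and arranged by device fingerprint to generate the single-dimension session sequences corresponding to the 2 device fingerprints; then, if they include the 4 browser identifiers "UA1, UA2, UA3, UA4", they can be partitioned and arranged by browser identifier to generate the single-dimension session sequences corresponding to the 4 browser identifiers.
Specifically, the single-dimension session sequences may be created as follows: for a target secondary dimension identifier in the target information group, obtain the web page access information contained in the sessions corresponding to each piece of target secondary dimension identification information; arrange the web page access information by access time to generate the single-dimension session sequence corresponding to each piece of target secondary dimension identification information.
The target secondary dimension identifier may be any preselected secondary dimension identifier, and the target secondary dimension identification information is the specific identification value under that identifier. For example, if the target secondary dimension identifier is the user ID, the target secondary dimension identification information may be "ID1: 893***221", "ID2: 668***203", and so on.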
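In implementation, taking a target secondary dimension identifier in the target information group as an example, when creating the multiple single-dimension session sequences for that identifier, the network device can first extract all target secondary dimension identification information from the historical access information in the target information group, then select the sessions corresponding to each piece of target secondary dimension identification information, and obtain all web page access information contained in those sessions. The web page access information may include at least the page identifier and the access time of the web page. The network device can then arrange this web page access information in order of access time, generating the single-dimension session sequence corresponding to each piece of target secondary dimension identification information. Let S = {s_1, s_2, s_3, …, s_n} be the set of sessions in the target information group, where element s_i represents one session,
Figure PCTCN2020071457-appb-000001
denotes the j-th visit within the i-th session,
Figure PCTCN2020071457-appb-000002
h is the page identifier, and t is the access time. Suppose the sessions corresponding to the target secondary dimension identification information are:
Figure PCTCN2020071457-appb-000003
where
Figure PCTCN2020071457-appb-000004
Figure PCTCN2020071457-appb-000005
then the single-dimension session sequence may be
Figure PCTCN2020071457-appb-000006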
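Specifically, the cross-dimension session sequence may be created as follows: arrange all web page access information contained in all sessions of the target information group by access time to generate the cross-dimension session sequence corresponding to the primary dimension identifier.
The web page access information may include at least the page identifier and the access time of the web page.
In implementation, taking the target information group as an example, when creating the cross-dimension session sequence corresponding to the primary dimension identifier, the network device can arrange all web page access information contained in all sessions of the target information group in order of access time. For example, following the single-dimension processing above, if the target information group has a total of 3 single-dimension session sequences "SP_1, SP_2, SP_3" for the target secondary dimension identifier, then the
Figure PCTCN2020071457-appb-000007
under each SP_i can be sorted by access time, yielding the cross-dimension session sequence corresponding to the primary dimension identifier in the target information group.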
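The two kinds of sequences in step 102 can be sketched as follows. This is an illustrative sketch only: each visit is modelled as a `(page identifier h, access time t)` pair as in the text, while the session dictionaries, the key name `user_id`, and both function names are assumptions introduced here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Visit = Tuple[str, float]  # (page identifier h, access time t)


def single_dimension_sequences(sessions: List[dict], secondary_key: str) -> Dict[str, List[Visit]]:
    """One time-ordered visit sequence per value of the chosen secondary identifier."""
    by_value: Dict[str, List[Visit]] = defaultdict(list)
    for session in sessions:
        by_value[session[secondary_key]].extend(session["visits"])
    return {value: sorted(visits, key=lambda v: v[1]) for value, visits in by_value.items()}


def cross_dimension_sequence(sessions: List[dict]) -> List[Visit]:
    """All visits of the information group in time order, regardless of secondary identifier."""
    all_visits = [visit for session in sessions for visit in session["visits"]]
    return sorted(all_visits, key=lambda v: v[1])


# Hypothetical sessions of information group IP1.
sessions_ip1 = [
    {"user_id": "ID1", "visits": [("/home", 1.0), ("/list", 4.0)]},
    {"user_id": "ID2", "visits": [("/home", 2.0)]},
    {"user_id": "ID1", "visits": [("/item", 3.0)]},
]
single = single_dimension_sequences(sessions_ip1, "user_id")
cross = cross_dimension_sequence(sessions_ip1)
```

Here the two sessions of `ID1` merge into one single-dimension sequence, while the cross-dimension sequence interleaves the visits of `ID1` and `ID2`, which is the view that lets cross-session behaviour show up.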
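Step 103: generate a feature vector corresponding to the target information group according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences.
In implementation, after creating the multiple single-dimension session sequences and the cross-dimension session sequence for the target information group, the network device can compute the traffic features of each session sequence. The traffic features here may be preset features that characterize the sets of session information contained in each session sequence, such as session interval time, request packet size, request type, and request packet content. It is worth noting that, for the cross-dimension session sequence, the similarity between the session information belonging to different secondary dimension identifiers can also be measured. The network device can then integrate and concatenate the computed traffic features of all session sequences to generate the feature vector corresponding to the target information group.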
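As a rough illustration of step 103, the sketch below derives three simple traffic features per sequence (visit count, mean gap, and gap spread) and concatenates them. The specific features, the averaging over single-dimension sequences (used here only so that every information group yields a vector of the same length), and the function names are all assumptions; the source only lists interval time, packet size, request type, and packet content as possible features.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Visit = Tuple[str, float]  # (page identifier, access time)


def sequence_features(seq: List[Visit]) -> List[float]:
    """Visit count, mean gap between visits, and population std-dev of the gaps."""
    times = [t for _, t in seq]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if not gaps:
        return [float(len(seq)), 0.0, 0.0]
    return [float(len(seq)), mean(gaps), pstdev(gaps)]


def group_feature_vector(cross_seq: List[Visit], single_seqs: List[List[Visit]]) -> List[float]:
    """Concatenate cross-dimension features with averaged single-dimension features."""
    per_seq = [sequence_features(s) for s in single_seqs]
    averaged = [mean(column) for column in zip(*per_seq)] if per_seq else [0.0, 0.0, 0.0]
    return sequence_features(cross_seq) + averaged


cross_seq = [("/a", 0.0), ("/b", 2.0), ("/c", 4.0)]
single_seqs = [[("/a", 0.0), ("/c", 4.0)], [("/b", 2.0)]]
vector = group_feature_vector(cross_seq, single_seqs)
```

In this toy data the cross-dimension sequence shows perfectly regular 2-second gaps even though each single-dimension sequence looks sparse, which is the kind of signal the combined vector is meant to expose.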
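Optionally, the session sequences can be scored according to the different website attributes carried in the sessions, and the scoring results used as the feature vector. The corresponding processing may be as follows: obtain the site map of the target website, and build an attribute score library for the target website based on the site map; score each single-dimension session sequence and the cross-dimension session sequence according to the attribute score library, and use the scoring results as the feature vector of the target information group.
The attribute score library records the score values of the different website attributes of the target website, and the website attributes include at least the page URL, the page referer, and all supported browser identifiers (that is, UAs).
In implementation, the network device can obtain the site map of the target website, and then build the attribute score library by recording, from the site map, the in-degree and out-degree of each page and the jump relationships between pages, as well as all browser identifiers supported by the target website. The site map may be provided by the target website itself, or built by the network device from the historical access information of the target website; since website content is continuously updated, the network device can also update the site map and the attribute score library periodically. Specifically, different scoring mechanisms can be used to score the same website attribute, and the scores from the different mechanisms can then be weighted to obtain a combined score for that attribute. For example, suppose there are three scoring mechanisms: mechanism A is based on how often a website attribute appears on the website, mechanism B is based on how often a website attribute appears across all dynamic pages, and mechanism C is a manual assessment of attribute value based on business experience. Mechanism A may have scores for the three website attributes "URL, UA, referer": a_url = {(url_1, a_score_url1), (url_2, a_score_url2), (url_3, a_score_url3)}, a_ua = {(ua_1, a_score_ua1), (ua_2, a_score_ua2)}, a_referer = {(referer_1, a_score_referer1), (referer_2, a_score_referer2)}; mechanism B may have scores for the two website attributes "URL, UA": b_url = {(url_1, b_score_url1), (url_2, b_score_url2)}, b_ua = {(ua_1, b_score_ua1), (ua_3, b_score_ua3)}; mechanism C and any other mechanisms work in the same way. The weighted score for each website attribute may be: score_target_i = (a_score_target_i + b_score_target_i + … + n_score_target_i) / n. The network device can then score each session according to the attribute score library, score each single-dimension session sequence and the cross-dimension session sequence based on the session scores, and compute the maximum, minimum, average, and weighted scores of each session sequence, so that the scoring results can be used as the feature vector of the target information group. In this way, evaluating the relationship between sessions and the website from multiple angles through scoring mechanisms, and using session scores as a perspective for detecting crawler requests, makes it more likely that patterns in crawler requests will be identified and effectively improves the accuracy of crawler detection.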
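The score combination and sequence scoring described above might look like the following sketch. It is a simplification under stated assumptions: only URL scores are shown, an attribute missing from a mechanism is simply left out of that attribute's average (the source does not say how such gaps are handled), unknown pages score 0.0, and the names `combine_mechanisms` and `score_sequence` are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def combine_mechanisms(mechanisms: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each attribute's score over the mechanisms that provide one."""
    attributes = {attr for mechanism in mechanisms for attr in mechanism}
    return {
        attr: mean(m[attr] for m in mechanisms if attr in m)
        for attr in attributes
    }


def score_sequence(pages: List[str], library: Dict[str, float]) -> List[float]:
    """Maximum, minimum and mean attribute score over the pages of one sequence."""
    scores = [library.get(page, 0.0) for page in pages]  # unknown pages score 0.0 (assumption)
    if not scores:
        return [0.0, 0.0, 0.0]
    return [max(scores), min(scores), mean(scores)]


a_url = {"url1": 0.5, "url2": 1.0}   # mechanism A: frequency on the site (made-up values)
b_url = {"url1": 1.0}                # mechanism B: frequency on dynamic pages (made-up values)
library = combine_mechanisms([a_url, b_url])
features = score_sequence(["url1", "url2"], library)
```

The resulting `[max, min, mean]` triple per sequence can be appended to the group's feature vector in the same way as the traffic features.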
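Optionally, the human-computer interaction information produced when users visit the target website can also be used as the feature vector. The corresponding processing may be as follows: generate the feature vector corresponding to the target information group based on the human-computer interaction information in the single-dimension session sequences and the cross-dimension session sequence.
In implementation, after receiving an access request for the target website from a client, the network device can embed a human-computer interaction detection program in the response message. The program can monitor whether behaviors such as mouse movement, clicks, and key presses occur on the client, and can report the detection results to the network device. The network device can thus record the human-computer interaction information produced in each session and, after creating the single-dimension session sequences and the cross-dimension session sequence, generate the feature vector corresponding to the target information group from the human-computer interaction information contained in those sequences. In addition, the network device can use the device fingerprint to further judge whether the detected human-computer interaction information contains forged information. In this way, by detecting human-computer interaction information and judging crawler requests from the perspective of human-computer interaction, normal requests and crawler requests can be distinguished more effectively, improving the accuracy of crawler detection.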
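One possible way to turn reported interaction events into features is sketched below. The event names, the "share of sessions with no interaction" feature, and the function name are assumptions chosen for illustration; the source only says that mouse movement, clicks, and key presses are monitored and reported.

```python
from collections import Counter
from typing import List

EVENT_TYPES = ("mouse_move", "click", "key_press")  # hypothetical event names


def interaction_features(sessions: List[List[str]]) -> List[float]:
    """Per-type event counts plus the share of sessions with no interaction at all."""
    counts: Counter = Counter()
    silent_sessions = 0
    for events in sessions:
        if not events:
            silent_sessions += 1
        counts.update(e for e in events if e in EVENT_TYPES)
    silent_share = silent_sessions / len(sessions) if sessions else 0.0
    return [float(counts[t]) for t in EVENT_TYPES] + [silent_share]


# Each inner list holds the events reported for one session of a sequence.
reported = [["mouse_move", "click"], [], ["key_press", "mouse_move"], []]
hci_vector = interaction_features(reported)
```

A sequence whose sessions are almost all silent would stand out in the last component, which is the intuition behind using interaction data to separate scripted traffic from human traffic.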
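Step 104: train a crawler detection model corresponding to the target website based on the feature vectors of all information groups of the target website, and perform crawler detection on the target website through the crawler detection model.
In implementation, the network device can generate the feature vectors of all information groups of the target website following the processing of steps 102 and 103, and then use machine learning on these feature vectors to train the crawler detection model corresponding to the target website. Specifically, the network device may use a supervised or an unsupervised algorithm to train the crawler detection model. Once the model has been trained, the network device can use it to perform crawler detection on the target website, that is, to identify the crawler requests among all received access requests for the target website. Further, because website content changes frequently and web crawlers are constantly updated, the crawler detection model needs a validity period. When using the crawler detection model, the network device can first check whether the current time is within the model's validity period; if so, it continues detecting, and otherwise it can perform steps 101 to 104 again to update the model.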
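The source leaves the learning algorithm open ("supervised or unsupervised"). Purely as a stand-in, the sketch below uses a nearest-centroid classifier, which is a choice made here for brevity and not the patent's method, together with the validity-period check described above. The class name, the `ttl_seconds` parameter, and the toy vectors are assumptions.

```python
import time
from statistics import mean
from typing import List, Optional, Sequence


class CentroidCrawlerModel:
    """Toy supervised model: label a vector by the nearer class centroid."""

    def __init__(self, vectors: List[Sequence[float]], labels: List[bool],
                 ttl_seconds: float, now: Optional[float] = None) -> None:
        crawler = [v for v, is_crawler in zip(vectors, labels) if is_crawler]
        normal = [v for v, is_crawler in zip(vectors, labels) if not is_crawler]
        if not crawler or not normal:
            raise ValueError("need at least one crawler and one normal example")
        self.crawler_centroid = [mean(col) for col in zip(*crawler)]
        self.normal_centroid = [mean(col) for col in zip(*normal)]
        # The model is only trusted until this moment; afterwards it should be retrained.
        self.expires_at = (time.time() if now is None else now) + ttl_seconds

    def is_valid(self, now: Optional[float] = None) -> bool:
        return (time.time() if now is None else now) < self.expires_at

    def predict(self, vector: Sequence[float]) -> bool:
        d_crawler = sum((a - b) ** 2 for a, b in zip(vector, self.crawler_centroid))
        d_normal = sum((a - b) ** 2 for a, b in zip(vector, self.normal_centroid))
        return d_crawler < d_normal


# Made-up feature vectors: [requests per hour, mean gap in seconds].
train_x = [[100.0, 0.1], [120.0, 0.2], [5.0, 30.0], [8.0, 25.0]]
train_y = [True, True, False, False]
model = CentroidCrawlerModel(train_x, train_y, ttl_seconds=3600, now=0.0)
```

A caller would check `model.is_valid()` before each use and rerun steps 101 to 104 when it returns `False`.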
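It is worth noting that the processing of steps 101 to 104 can be triggered when a newly launched website appears, when the crawler detection model of a website needs to be updated, or when the accuracy of the model's detection results is found to be too low.
Optionally, there are many possible mechanisms for performing crawler detection with the crawler detection model. Two feasible mechanisms, shown in FIG. 2 and FIG. 3, are given below:
Mechanism 1: Step 201: periodically collect the primary dimension identification information of the target website that appears in the current period.
The primary dimension identification information may be the specific identification value under the primary dimension identifier. For example, if the primary dimension identifier is the source IP address, the primary dimension identification information may be "IP address 1: 192.***.***.200", "IP address 2: 255.***.***.101", and so on.
In implementation, the network device can periodically analyze the historical access information for the target website in the current period and collect all primary dimension identification information that appears in it. For example, the network device can run the collection every 10 minutes, obtaining all source IP addresses that appear in the historical access information of the target website over the previous 10 minutes.
Step 202: for each piece of primary dimension identification information, create the multiple single-dimension session sequences and the cross-dimension session sequence corresponding to it, based on all historical access information within a preset duration that contains that primary dimension identification information.
For the processing of this step, refer to step 102. The preset duration may be set in advance by technical staff and adjusted as needed, for example 6 hours or 12 hours.
Step 203: generate a feature vector corresponding to the primary dimension identification information according to the traffic features of its multiple single-dimension session sequences and its cross-dimension session sequence.
For the processing of this step, refer to step 103.
Step 204: input the feature vector corresponding to the primary dimension identification information into the crawler detection model, and determine from the model output whether the primary dimension identification information belongs to a crawler request.
In this way, detecting access requests periodically in batches preserves timely responses to normal access requests and avoids running crawler detection frequently, which reduces device resource consumption to some extent.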
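Mechanism 1 can be outlined with the sketch below, which assumes the source IP address is the primary dimension identifier. The function `periodic_scan`, its parameters, and the stand-in callables for feature extraction and the model are hypothetical; in a full system `to_vector` would perform steps 202 and 203 and `is_crawler` would call the trained model.

```python
from typing import Callable, List, Sequence, Set


def periodic_scan(history: List[dict], now: float, period_s: float, window_s: float,
                  to_vector: Callable[[List[dict]], Sequence[float]],
                  is_crawler: Callable[[Sequence[float]], bool]) -> Set[str]:
    """Check every source IP seen in the current period against the model."""
    # Step 201: identifiers that appeared during the current period.
    active = {r["src_ip"] for r in history if now - period_s <= r["ts"] <= now}
    flagged: Set[str] = set()
    for ip in active:
        # Step 202: all records of this identifier within the longer preset window.
        recent = [r for r in history
                  if r["src_ip"] == ip and now - window_s <= r["ts"] <= now]
        # Steps 203 and 204: feature vector, then model decision.
        if is_crawler(to_vector(recent)):
            flagged.add(ip)
    return flagged


history = (
    [{"src_ip": "IP1", "ts": 900 + i} for i in range(50)]
    + [{"src_ip": "IP2", "ts": 950}, {"src_ip": "IP2", "ts": 960}]
    + [{"src_ip": "IP3", "ts": i} for i in range(50)]
)
flagged = periodic_scan(
    history, now=1000, period_s=600, window_s=6 * 3600,
    to_vector=lambda recs: [float(len(recs))],   # stand-in feature: request count
    is_crawler=lambda v: v[0] > 10,              # stand-in model: simple threshold
)
```

`IP3` is not evaluated at all because it did not appear in the current 10-minute period, which is where the resource saving of the batch mechanism comes from.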
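Mechanism 2: Step 301: when an access request for the target website is received, obtain the primary dimension identification information of the access request.
In implementation, the network device can check an access request for the target website as soon as it is received, first obtaining the primary dimension identification information of the request. If the access request is found to be a normal request, it can be answered through the regular response mechanism; if it is found to be a crawler request, it can be marked as such and discarded.
Step 302: create the multiple single-dimension session sequences and the cross-dimension session sequence corresponding to the primary dimension identification information, based on all historical access information within a preset duration that contains that primary dimension identification information.
For the processing of this step, refer to step 102. The preset duration may be set in advance by technical staff and adjusted as needed, for example 6 hours or 12 hours.
Step 303: generate a feature vector corresponding to the primary dimension identification information according to the traffic features of its multiple single-dimension session sequences and its cross-dimension session sequence.
For the processing of this step, refer to step 103.
Step 304: input the feature vector corresponding to the primary dimension identification information into the crawler detection model, and determine from the model output whether the access request is a crawler request.
In this way, checking every access request as it is received allows crawler requests to be identified promptly and effectively, preventing website content from being crawled frequently and maliciously.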
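Optionally, after an access request is received, it can first be checked in a simple way by comparing dimension identification information. The corresponding processing may be as follows: if the similarity of all dimension identification information between a received target access request and an already detected crawler request is greater than a preset threshold, mark the target access request as a crawler request.
In implementation, each time the network device detects a crawler request, it can record all dimension identification information of that crawler request. After receiving a target access request, the network device can therefore first obtain all dimension identification information of the target access request and compare it as a whole with the dimension identification information of the crawler requests already detected. If the overall similarity of all dimension identification information between the target access request and some crawler request is greater than the preset threshold, the target access request can be directly judged to be a crawler request, without using the crawler detection model on it. In this way, a preliminary check that compares all dimension identification information can quickly and simply identify some crawler requests, reducing the crawler detection workload and saving device resources.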
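The pre-check can be sketched as below. The source does not define the similarity measure, so this sketch assumes the simplest one, the fraction of dimension identifiers that match exactly; the dimension names, the 0.7 threshold, and the function names are likewise illustrative assumptions.

```python
from typing import Dict, List, Optional

DIMENSIONS = ("src_ip", "user_id", "device_fp", "user_agent")  # hypothetical field names


def identifier_similarity(a: Dict[str, Optional[str]], b: Dict[str, Optional[str]]) -> float:
    """Fraction of dimension identifiers that are present and identical in both requests."""
    matches = sum(1 for d in DIMENSIONS if a.get(d) is not None and a.get(d) == b.get(d))
    return matches / len(DIMENSIONS)


def precheck(request: Dict[str, Optional[str]], known_crawlers: List[dict],
             threshold: float = 0.7) -> bool:
    """True when the request is similar enough to any recorded crawler request."""
    return any(identifier_similarity(request, k) > threshold for k in known_crawlers)


known = [{"src_ip": "IP1", "user_id": "ID1", "device_fp": "MAC1", "user_agent": "UA1"}]
rotated_ip = {"src_ip": "IP9", "user_id": "ID1", "device_fp": "MAC1", "user_agent": "UA1"}
unrelated = {"src_ip": "IP7", "user_id": "ID5", "device_fp": "MAC4", "user_agent": "UA1"}
```

With these toy values a crawler that only rotates its IP address still matches on three of four dimensions and is caught by the pre-check, while a request that merely shares a common browser identifier is passed on to the model.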
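Optionally, after model training is complete, the detection accuracy of the crawler detection model can be verified in advance. The corresponding processing may be as follows: verify the detection accuracy of the crawler detection model against a preset crawler feature material library; if the detection accuracy is lower than a preset threshold, additionally acquire new historical access information of the target website; retrain the crawler detection model corresponding to the target website based on the new historical access information and the original historical access information.
In implementation, a crawler feature material library may be preset at the network device, and it may contain a large number of feature vectors of confirmed crawler requests. After training the crawler detection model for the target website, the network device can verify the model's detection accuracy against this library. If the detection accuracy is lower than the preset threshold, the network device can additionally acquire new historical access information of the target website. The new historical access information may be historical access information of the target website other than that acquired for model training in step 101; that is, if the historical access information of the past 7 days was acquired when the crawler detection model was trained, the new historical access information may be that of the 8th day back. The network device can then retrain the crawler detection model on the new and the original historical access information, following steps 101 to 104, until the detection accuracy of the model reaches the preset threshold. In this way, on the one hand, verifying the crawler detection model against the crawler feature material library helps ensure its detection accuracy; on the other hand, acquiring additional historical access information and retraining the model brings it closer to the crawler detection needs of the target website.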
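The verify-and-retrain loop might be organised as in the sketch below. The accuracy measure (share of library vectors flagged as crawler), the `max_days` safety cap, and the toy `train` and `fetch_history` stand-ins are assumptions added here; the source only states that retraining repeats until the accuracy reaches the preset threshold.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Model = Callable[[Vector], bool]  # returns True when a vector looks like a crawler


def detection_accuracy(model: Model, crawler_vectors: List[Vector]) -> float:
    """Share of confirmed crawler vectors from the material library that the model flags."""
    if not crawler_vectors:
        raise ValueError("the material library must not be empty")
    return sum(1 for v in crawler_vectors if model(v)) / len(crawler_vectors)


def train_until_accurate(train: Callable[[list], Model],
                         fetch_history: Callable[[int], list],
                         crawler_vectors: List[Vector], threshold: float,
                         start_days: int = 7, max_days: int = 30) -> Tuple[Model, int, float]:
    """Retrain on one more day of history at a time until the accuracy is acceptable."""
    days = start_days
    while True:
        model = train(fetch_history(days))
        accuracy = detection_accuracy(model, crawler_vectors)
        if accuracy >= threshold or days >= max_days:  # max_days is a safety cap (assumption)
            return model, days, accuracy
        days += 1


# Toy stand-ins: more history lowers the decision cutoff of the "model".
def fetch_history(days: int) -> list:
    return list(range(days))


def train(history: list) -> Model:
    cutoff = 10 - len(history)
    return lambda v: v[0] > cutoff


material_library = [[2.0], [5.0]]
model, days_used, accuracy = train_until_accurate(
    train, fetch_history, material_library, threshold=1.0)
```

In the toy run the 7-day and 8-day models each miss one library vector, so the loop extends the history to 9 days before the accuracy threshold is met.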
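It should be noted that, since some crawler requests help promote website content, the network device can change the way the model is trained, or set up a crawler request whitelist, so that the crawler detection model does not block certain crawler requests.
In the embodiments of the present invention, historical access information of the target website in the target historical period is acquired, and historical access information containing the same primary dimension identifier is divided into the same information group; based on the historical access information in the target information group, multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier, are created; a feature vector corresponding to the target information group is generated according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences; and a crawler detection model corresponding to the target website is trained based on the feature vectors of all information groups of the target website and used to perform crawler detection on the target website. In this way, by constructing session sequences of different dimensions, analyzing access requests as a whole across sessions, and then using machine learning to build a crawler detection model for each website, the overall characteristics and sending patterns of crawler requests can be discovered more intuitively and conveniently, so that both traditional and new types of crawler requests can be detected more accurately and effectively.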
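Based on the same technical concept, an embodiment of the present invention also provides a device for detecting crawler requests. As shown in FIG. 4, the device includes:
an information acquisition module 401, configured to acquire historical access information of a target website in a target historical period, and divide historical access information containing the same primary dimension identifier into the same information group;
a sequence creation module 402, configured to create, based on the historical access information in a target information group, multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier;
a feature extraction module 403, configured to generate a feature vector corresponding to the target information group according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences;
a crawler detection module 404, configured to train a crawler detection model corresponding to the target website based on the feature vectors of all information groups of the target website, and perform crawler detection on the target website through the crawler detection model.
Optionally, the crawler detection module 404 is specifically configured to:
periodically collect the primary dimension identification information of the target website that appears in the current period;
for each piece of primary dimension identification information, create the multiple single-dimension session sequences and the cross-dimension session sequence corresponding to it, based on all historical access information within a preset duration that contains that primary dimension identification information;
generate a feature vector corresponding to the primary dimension identification information according to the traffic features of its multiple single-dimension session sequences and its cross-dimension session sequence;
input the feature vector corresponding to the primary dimension identification information into the crawler detection model, and determine from the model output whether the primary dimension identification information belongs to a crawler request.
Optionally, the crawler detection module 404 is specifically configured to:
when an access request for the target website is received, obtain the primary dimension identification information of the access request;
create the multiple single-dimension session sequences and the cross-dimension session sequence corresponding to the primary dimension identification information, based on all historical access information within a preset duration that contains that primary dimension identification information;
generate a feature vector corresponding to the primary dimension identification information according to the traffic features of its multiple single-dimension session sequences and its cross-dimension session sequence;
input the feature vector corresponding to the primary dimension identification information into the crawler detection model, and determine from the model output whether the access request is a crawler request.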
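FIG. 5 is a schematic structural diagram of a network device provided by an embodiment of the present invention. The network device 500 may vary considerably with configuration or performance, and may include one or more central processing units 522 (for example, one or more processors), a memory 532, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 542 or data 544. The memory 532 and the storage medium 530 may be transient storage or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the network device 500. Further, the central processing unit 522 may be configured to communicate with the storage medium 530 and execute, on the network device 500, the series of instruction operations in the storage medium 530.
The network device 500 may also include one or more power supplies 529, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, one or more keyboards 556, and/or one or more operating systems 541, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
The network device 500 may include a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for performing the crawler request detection described above.
Those of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.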

Claims (15)

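  1. A method for detecting crawler requests, characterized in that the method includes:
    acquiring historical access information of a target website in a target historical period, and dividing historical access information containing the same primary dimension identifier into the same information group;
    based on the historical access information in a target information group, creating multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier;
    generating a feature vector corresponding to the target information group according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences;
    training a crawler detection model corresponding to the target website based on the feature vectors of all information groups of the target website, and performing crawler detection on the target website through the crawler detection model.
  2. The method according to claim 1, characterized in that the primary dimension identifier is a source IP address, a user ID, or a device fingerprint; and the secondary dimension identifier is a source IP address, a user ID, a device fingerprint, or a browser identifier.
  3. The method according to claim 1, characterized in that creating the multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group includes:
    for a target secondary dimension identifier in the target information group, obtaining the web page access information contained in the sessions corresponding to each piece of target secondary dimension identification information;
    arranging the web page access information by access time to generate the single-dimension session sequence corresponding to each piece of target secondary dimension identification information.
  4. The method according to claim 1, characterized in that creating the cross-dimension session sequence corresponding to the primary dimension identifier includes:
    arranging all web page access information contained in all sessions of the target information group by access time to generate the cross-dimension session sequence corresponding to the primary dimension identifier.
  5. The method according to claim 1, characterized in that the method further includes:
    obtaining the site map of the target website, and building an attribute score library for the target website based on the site map, where the attribute score library records the score values of the different website attributes of the target website, and the website attributes include at least the page URL, the page referer, and all supported browser identifiers;
    scoring each single-dimension session sequence and the cross-dimension session sequence according to the attribute score library, and using the scoring results as the feature vector of the target information group.
  6. The method according to claim 1, characterized in that the method further includes:
    generating the feature vector corresponding to the target information group based on the human-computer interaction information in the single-dimension session sequences and the cross-dimension session sequence.
  7. The method according to claim 1, characterized in that performing crawler detection on the target website through the crawler detection model includes:
    periodically collecting the primary dimension identification information of the target website that appears in the current period;
    for each piece of primary dimension identification information, creating the multiple single-dimension session sequences and the cross-dimension session sequence corresponding to it, based on all historical access information within a preset duration that contains that primary dimension identification information;
    generating a feature vector corresponding to the primary dimension identification information according to the traffic features of its multiple single-dimension session sequences and its cross-dimension session sequence;
    inputting the feature vector corresponding to the primary dimension identification information into the crawler detection model, and determining from the model output whether the primary dimension identification information belongs to a crawler request.
  8. The method according to claim 1, characterized in that performing crawler detection on the target website through the crawler detection model includes:
    when an access request for the target website is received, obtaining the primary dimension identification information of the access request;
    creating the multiple single-dimension session sequences and the cross-dimension session sequence corresponding to the primary dimension identification information, based on all historical access information within a preset duration that contains that primary dimension identification information;
    generating a feature vector corresponding to the primary dimension identification information according to the traffic features of its multiple single-dimension session sequences and its cross-dimension session sequence;
    inputting the feature vector corresponding to the primary dimension identification information into the crawler detection model, and determining from the model output whether the access request is a crawler request.
  9. The method according to claim 7 or 8, characterized in that the method further includes:
    if the similarity of all dimension identification information between a received target access request and an already detected crawler request is greater than a preset threshold, marking the target access request as a crawler request.
  10. The method according to claim 1, characterized in that, after the crawler detection model corresponding to the target website is established, the method further includes:
    verifying the detection accuracy of the crawler detection model against a preset crawler feature material library;
    if the detection accuracy is lower than a preset threshold, additionally acquiring new historical access information of the target website;
    retraining the crawler detection model corresponding to the target website based on the new historical access information and the historical access information.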
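  11. A device for detecting crawler requests, characterized in that the device includes:
    an information acquisition module, configured to acquire historical access information of a target website in a target historical period, and divide historical access information containing the same primary dimension identifier into the same information group;
    a sequence creation module, configured to create, based on the historical access information in a target information group, multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier;
    a feature extraction module, configured to generate a feature vector corresponding to the target information group according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences;
    a crawler detection module, configured to train a crawler detection model corresponding to the target website based on the feature vectors of all information groups of the target website, and perform crawler detection on the target website through the crawler detection model.
  12. The device according to claim 11, characterized in that the crawler detection module is specifically configured to:
    periodically collect the primary dimension identification information of the target website that appears in the current period;
    for each piece of primary dimension identification information, create the multiple single-dimension session sequences and the cross-dimension session sequence corresponding to it, based on all historical access information within a preset duration that contains that primary dimension identification information;
    generate a feature vector corresponding to the primary dimension identification information according to the traffic features of its multiple single-dimension session sequences and its cross-dimension session sequence;
    input the feature vector corresponding to the primary dimension identification information into the crawler detection model, and determine from the model output whether the primary dimension identification information belongs to a crawler request.
  13. The device according to claim 11, characterized in that the crawler detection module is specifically configured to:
    when an access request for the target website is received, obtain the primary dimension identification information of the access request;
    create the multiple single-dimension session sequences and the cross-dimension session sequence corresponding to the primary dimension identification information, based on all historical access information within a preset duration that contains that primary dimension identification information;
    generate a feature vector corresponding to the primary dimension identification information according to the traffic features of its multiple single-dimension session sequences and its cross-dimension session sequence;
    input the feature vector corresponding to the primary dimension identification information into the crawler detection model, and determine from the model output whether the access request is a crawler request.
  14. A network device, characterized in that the network device includes a processor and a memory, the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the following method:
    acquiring historical access information of a target website in a target historical period, and dividing historical access information containing the same primary dimension identifier into the same information group;
    based on the historical access information in a target information group, creating multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier;
    generating a feature vector corresponding to the target information group according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences;
    training a crawler detection model corresponding to the target website based on the feature vectors of all information groups of the target website, and performing crawler detection on the target website through the crawler detection model.
  15. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the following method:
    acquiring historical access information of a target website in a target historical period, and dividing historical access information containing the same primary dimension identifier into the same information group;
    based on the historical access information in a target information group, creating multiple single-dimension session sequences corresponding to the secondary dimension identifiers in the target information group, and a cross-dimension session sequence corresponding to the primary dimension identifier;
    generating a feature vector corresponding to the target information group according to the traffic features of the cross-dimension session sequence and the traffic features of the multiple single-dimension session sequences;
    training a crawler detection model corresponding to the target website based on the feature vectors of all information groups of the target website, and performing crawler detection on the target website through the crawler detection model.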
PCT/CN2020/071457 2019-12-13 2020-01-10 Method and device for detecting crawler requests WO2021114454A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911281120.2 2019-12-13
CN201911281120.2A CN112989157A (zh) 2019-12-13 2019-12-13 Method and device for detecting crawler requests

Publications (1)

Publication Number Publication Date
WO2021114454A1 true WO2021114454A1 (zh) 2021-06-17

Family

ID=76329578

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/071457 WO2021114454A1 (zh) 2019-12-13 2020-01-10 Method and device for detecting crawler requests

Country Status (2)

Country Link
CN (1) CN112989157A (zh)
WO (1) WO2021114454A1 (zh)

Also Published As

Publication number Publication date
CN112989157A (zh) 2021-06-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20898252

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20898252

Country of ref document: EP

Kind code of ref document: A1