CN113392297A - Method, system and equipment for crawling data

Info

Publication number: CN113392297A
Application number: CN202010172697.6A
Authority: CN
Inventors: 李想; 胡金涌
Original assignee: Shanghai Yundun Information Technology Co ltd
Current assignee: Shanghai Yundun Information Technology Co ltd
Priority date: 2020-03-12
Filing date: 2020-03-12
Publication date: 2021-09-14

Abstract

The application discloses a method, system and device for crawling data. The method comprises: obtaining access traffic and an access log according to a target task, and parsing the access traffic and the access log to determine all uniform resource locators (URLs); screening target access links out of all the URLs; adding the screened URLs to the configuration items related to the target task, and integrating them into source data with the same format as the actively captured data; and adding the source data to an active grab message queue for task consumption to determine the target access links. In this way, all links can be parsed efficiently, the parsing code remains generally applicable, development and maintenance costs are reduced, and the need to crawl massive numbers of web pages quickly is met.

Description

Method, system and equipment for crawling data

Technical Field

The present application relates to the field of computers, and in particular, to a method, system, and device for crawling data.

Background

With the rapid development of the internet and the explosive growth of information, web page technology changes day by day. Crawler technology that has long been in wide use is increasingly unable to meet new requirements, and its link-parsing coverage is generally unsatisfactory. These widely used crawlers rely on active crawling, i.e. tag parsing is performed on actively crawled hypertext markup language (HTML) pages. Because this technique can only obtain links that appear in tags, it often fails when a link is hidden in JavaScript (js) code or when a tag is written in a non-standard way. In recent years, approaches based on a headless browser have become increasingly common, but they must load and dynamically render each page in the browser, which is inefficient and cannot meet the need to parse and crawl massive numbers of URLs quickly. Currently adopted crawler methods therefore have the following problems. Coverage is low: the crawling process depends entirely on parsing HTML content, so when tags with link properties appear inside js code in a page or in other types of files (such as js files), some links cannot be parsed. Generality is poor: the parsing code is usually fixed and cannot be guaranteed to apply to different websites, so the captured page results end up incomplete. Development and maintenance costs are high: developers must write specific parsing code for each website to cope with sites that cannot otherwise be parsed. Efficiency is low: a large number of web pages must be parsed, and massive numbers of web pages cannot be crawled quickly.

Disclosure of Invention

An object of the present application is to provide a method, a system, and a device for crawling data, which solve the problems of low coverage rate of a crawler program on website link analysis, low integrity and efficiency of acquiring website content data, and high development cost in the prior art.

According to one aspect of the application, there is provided a method of crawling data, the method comprising:

acquiring access flow and an access log according to a target task, and analyzing the access flow and the access log to determine all uniform resource locators;

screening target access links from all uniform resource locators;

adding the screened uniform resource locators into configuration items related to the target tasks, and integrating the configuration items into source data with the same format as the active capture data;

and adding the source data into an active grab message queue for task consumption to determine a target access link.

Further, the acquiring access flow and access log according to the target task includes:

performing port mirroring on the main switch at the access entry of the target task to obtain the access traffic, and obtaining the access log corresponding to the target task.

Further, the parsing the access traffic and the access log to determine all uniform resource locators includes:

performing recombination analysis on the access flow to determine access request data, and determining a uniform resource locator corresponding to the access flow according to the access request data;

and analyzing the access log according to a preset log analysis rule to obtain a uniform resource locator corresponding to the access log.

Further, the configuration items related to the target task include: grab depth, grab quantity, source Internet Protocol (IP) address, and grab limit.

Further, the performing reassembly analysis on the access traffic to determine access request data includes:

reconstructing the access traffic according to a transmission control protocol stream to determine reconstructed data;

and carrying out hypertext transfer protocol analysis on the recombined data to determine access request data, wherein the access request data comprises an access mode, an access address, an access domain name and an access request parameter.

Further, the screening out the target access link from all the uniform resource locators includes:

and screening all the uniform resource locators to determine the target access link, wherein the screening process comprises any one or any combination of a deduplication process, a static resource removing process, a similar uniform resource locator removing process and a sensitive domain name removing process.

Further, the method comprises:

and executing a production consumption cycle for grabbing the target task according to the configuration items related to the target task.

Further, the adding the source data with the same format as the actively grabbed data into a message queue to determine a new message queue includes:

and adding the source data with the same format as the actively captured data into a message queue according to a quantitative flow data mode, and determining a new message queue.

According to another aspect of the present application, there is also provided a system for crawling data, wherein the system comprises: a passive grabbing module, a crawler identifying module, a data integration module and an active grabbing module, wherein,

the passive grabbing module is used for acquiring access flow and an access log according to a target task, and analyzing the access flow and the access log to determine all uniform resource locators;

the crawler identification module is used for screening target access links from all uniform resource locators;

the data integration module is used for adding the screened uniform resource locators into configuration items related to the target tasks and integrating the configuration items into source data with the same format as the active capture data;

and the active grab module is used for adding the source data into an active grab message queue for task consumption so as to determine a target access link.

According to yet another aspect of the present application, there is also provided a computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement a method of crawling data as described in any of the preceding.

According to still another aspect of the present application, there is also provided an apparatus for crawling data, wherein the apparatus comprises:

one or more processors; and

memory storing computer readable instructions that, when executed, cause the processor to perform operations of a method of crawling data as claimed in any preceding claim.

Compared with the prior art, the method and the device have the advantages that access flow and access logs are obtained according to the target task, and the access flow and the access logs are analyzed to determine all uniform resource locators; screening target access links from all uniform resource locators; adding the screened uniform resource locators into configuration items related to the target tasks, and integrating the configuration items into source data with the same format as the active capture data; and adding the source data into an active grab message queue for task consumption to determine a target access link. Therefore, all links can be analyzed efficiently, the universal applicability of the analyzed codes is guaranteed, the development and maintenance cost is reduced, and the requirement for quickly crawling massive webpages is met.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 illustrates a flow diagram of a method of crawling data provided in accordance with an aspect of the present application;

FIG. 2 illustrates a system architecture framework diagram of crawling data provided in accordance with an aspect of the present application;

FIG. 3 illustrates a flow diagram of a method of crawling data in a preferred embodiment of the present application.

The same or similar reference numbers in the drawings identify the same or similar elements.

Detailed Description

The present application is described in further detail below with reference to the attached figures.

In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transient media), such as modulated data signals and carrier waves.

Fig. 1 shows a schematic flow chart of a method for crawling data provided according to an aspect of the present application, wherein the method includes: steps S11 to S14, wherein,

step S11, obtaining access flow and access log according to the target task, and analyzing the access flow and the access log to determine all uniform resource locators;

step S12, screening out target access links from all uniform resource locators;

step S13, adding the screened uniform resource locator into the configuration item related to the target task, and integrating the uniform resource locator into source data with the same format as the active capture data;

and step S14, adding the source data into an active grab message queue for task consumption to determine a target access link. Therefore, all links can be analyzed efficiently, the universal applicability of the analyzed codes is guaranteed, the development and maintenance cost is reduced, and the requirement for quickly crawling massive webpages is met.

Specifically, in step S11, access traffic and an access log are obtained according to the target task, and the access traffic and the access log are parsed to determine all uniform resource locators. Here, when a user accesses a website, a corresponding access link record (i.e., an access URL record) is generated in real time; the record may be written to a log or carried in data traffic. The target task is a task that requires data crawling under the current business demand. Obtaining access traffic and obtaining access logs are both passive ways of crawling data, and their coverage is relatively high: traffic mirroring can capture the fullest possible set of interface requests and can uncover, to the greatest extent, URLs that active capture and parsing cannot reach, such as standalone pages and some interfaces that are difficult to obtain by active capture. In a preferred embodiment of the present application, the data task to be crawled for the service requirement is configured according to the user's task configuration, including the crawling depth, the total number of crawls, whether to crawl sub-domain names, whether to crawl pictures, whether to execute js events, the list of URLs that are not to be crawled, the source IP, the proxy IP, the execution level and the like. The data sources for passive crawling are traffic mirroring and World Wide Web (web) access logs. Parsing the access log can reveal many interfaces that are hard to find by parsing web pages, such as requests hidden in js and css and dynamically generated requests, and the access log also makes it possible to obtain requests carried in encrypted traffic efficiently. Preferably, once a decryption position has been preset, the encrypted traffic itself can be parsed to determine the URLs.

In step S12, the target access links are screened out of all the uniform resource locators. Here, a target access link is a real and valid access link visited by a user. The full set of obtained uniform resource locators includes false proxy network addresses, so all the uniform resource locators are screened according to a preset rule and the target access links are then selected, yielding real and valid access links. In a preferred embodiment of the present application, hacker injection attempts cause the obtained URLs to include requests that target web security vulnerabilities, along with some invalid links. All the uniform resource locators are therefore screened according to the preset rule, i.e. the rule configured after the task is serialized, and the target access links are then selected.

In step S13, the screened uniform resource locators are added to the configuration items related to the target task and integrated into source data with the same format as the actively captured data. Here, configuration items related to the target task are attached to each screened URL, for example items that restrict the categories of data to capture, such as whether to parse js or capture pictures, audio, video and the like. Integrating the URLs into source data with the same format as the actively captured data efficiently merges the actively and passively captured data, which facilitates subsequent processing of both, such as unified task consumption of the active and passive capture tasks.
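As a non-limiting illustration of step S13, the sketch below attaches task configuration items to each screened URL and serializes the result as json so that passively and actively sourced tasks share one message format. The class name `CrawlTask`, its field names, the default values and the `origin` marker are assumptions introduced for this example; the application does not specify a concrete schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CrawlTask:
    """Hypothetical message layout shared by active and passive capture tasks."""
    url: str
    depth: int = 2            # capture depth
    max_pages: int = 1000     # capture quantity
    source_ip: str = ""       # source IP used for requests
    parse_js: bool = False    # capture restriction: execute/parse js
    fetch_images: bool = False  # capture restriction: download pictures
    origin: str = "active"    # where the URL came from


def integrate(urls, task_config):
    """Attach the task's configuration items to each screened URL.

    Returns json strings in the same layout an actively generated task would use.
    """
    messages = []
    for url in urls:
        task = CrawlTask(url=url, origin="passive", **task_config)
        messages.append(json.dumps(asdict(task), sort_keys=True))
    return messages


messages = integrate(
    ["http://shop.example.com/api/items"],
    {"depth": 3, "source_ip": "198.51.100.4"},
)
print(messages[0])
```

Because both kinds of task serialize to the same set of keys, a single consumer can process the queue without distinguishing where a URL originated.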

In step S14, the source data is added to the active grab message queue for task consumption to determine the target access links. The message queue of the active grab task is processed in a loop. After the source data is added to the active grab message queue, a new message queue is determined; the source data is then consumed as tasks in the new message queue, the target access links are determined, and the data is captured.

Steps S11 and S12 constitute the passive data crawling part, and steps S13 and S14 constitute the active data crawling part. Through steps S11 to S14, the present application combines active and passive crawling into a single method of crawling data, which can greatly improve parsing coverage, greatly reduce development workload, and keep capture efficiency high.

Preferably, in step S11, port mirroring is performed on the main switch at the access entry of the target task to obtain the access traffic, and the access log corresponding to the target task is obtained. Here, to obtain a passive capture data source by traffic mirroring, a copy of the traffic is mirrored on the main switch at the access entry, the traffic is reassembled and parsed, the access requests are extracted, and the Uniform Resource Locator (URL) of each access request is sent to the crawler server. The web access log compensates for a limitation of traffic mirroring: most https requests use key exchange algorithms with higher security, so traffic mirroring alone can only yield http traffic, whereas the web access log allows the URLs of https requests to be extracted. In a preferred embodiment of the present application, port mirroring is performed at the core switch after the access traffic has passed through the web server (e.g., Nginx), so as to obtain the access traffic of the web service; in addition, for https decryption, the mirroring position is uniformly specified to be after decryption, so that decrypted access traffic is obtained.

Preferably, in step S11, reassembly analysis is performed on the access traffic to determine access request data, and the uniform resource locator corresponding to the access traffic is determined according to the access request data; the access log is parsed according to a preset log parsing rule to obtain the uniform resource locator corresponding to the access log. Here, reassembly analysis of the access traffic means that the acquired traffic is reassembled into transmission control protocol streams (TCP streams) and then parsed as http to determine the access request data, such as the method, address, domain name and parameters of each http request. The access request data is then integrated into a preset format, such as json, and sent to the crawler server, which determines the uniform resource locator corresponding to the access traffic. The access log is parsed using a preset log parsing rule; a general-purpose rule may be used, or the rule may be customized for the scenario, to obtain the uniform resource locator corresponding to the access log. The preset log parsing rule differs by log type, since different types of web access logs require different parsing rules, but every rule includes deduplication, static resource removal, similarity removal and the like. Part of the rule set is built in, and the rest is set from statistics on the scenario and the logs; together these form the preset log parsing rule.
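As a non-limiting illustration of a log parsing rule, the sketch below extracts a URL record from one line of a web access log. It assumes a log layout in which the common "combined" fields are followed by the request host as a final quoted field; the pattern, the function name `url_from_log_line` and the output keys are assumptions of this example, and a real deployment would adapt the pattern to its own log format.

```python
import re

# Assumed layout: ip - user [time] "METHOD path PROTO" status bytes "referer" "agent" "host"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "[^"]*" "(?P<host>[^"]+)"$'
)


def url_from_log_line(line, scheme="https"):
    """Return a URL record for one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "method": match["method"],
        "url": f"{scheme}://{match['host']}{match['path']}",
        "status": int(match["status"]),
        "source_ip": match["ip"],
    }


sample = (
    '203.0.113.7 - - [12/Mar/2020:10:00:00 +0800] '
    '"GET /api/items?id=3 HTTP/1.1" 200 512 "-" "Mozilla/5.0" "shop.example.com"'
)
record = url_from_log_line(sample)
print(record)
```

The status code and request method are kept in the record because later screening may use them to judge whether the access was real and valid.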

Preferably, the configuration items related to the target task include: grab depth, grab quantity, source Internet Protocol (IP) address, and grab limit. In the new message queue, the grab task is actively processed according to these configuration items, which improves data capture efficiency and system performance.

Preferably, in step S11, the access traffic is reassembled according to transmission control protocol streams to determine reassembled data, and hypertext transfer protocol parsing is performed on the reassembled data to determine the access request data, wherein the access request data comprises an access method, an access address, an access domain name and access request parameters. Here, reassembly analysis of the access traffic means that the acquired traffic is first reassembled into transmission control protocol streams (TCP streams) to determine the reassembled data, and http parsing is then performed on the reassembled data to determine the access request data, such as the method, address, domain name and parameters of each http request. The access request data is then integrated into a preset format, such as json, and sent to the crawler server, which determines the uniform resource locator corresponding to the access traffic.
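As a non-limiting illustration of the two stages just described, the sketch below first joins the payloads of one client-to-server TCP flow in sequence-number order and then parses the head of the resulting http request into a json record. It is a simplification under stated assumptions: segments are given as hypothetical `(sequence_number, payload)` pairs already grouped by flow, sequence-number wraparound is ignored, and only a single request without pipelining is handled. The function names are introduced for this example only.

```python
import json
from urllib.parse import parse_qs


def reassemble(segments):
    """Join (seq, payload) pairs of one flow in order, dropping retransmitted bytes."""
    stream = bytearray()
    next_seq = None
    for seq, payload in sorted(segments, key=lambda item: item[0]):
        if next_seq is None:
            next_seq = seq
        if seq > next_seq:
            break  # a segment is missing: stop at the gap
        overlap = next_seq - seq
        if overlap < len(payload):
            stream += payload[overlap:]
            next_seq = seq + len(payload)
    return bytes(stream)


def parse_http_request(raw):
    """Extract method, host, path and parameters from a raw http request head."""
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None
    method, target, _version = parts
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    path, _, query = target.partition("?")
    return {
        "method": method,
        "host": headers.get("host", ""),
        "path": path,
        "params": parse_qs(query),
    }


raw = b"GET /search?q=crawl HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
# Out-of-order delivery plus one retransmitted segment.
segments = [(1018, raw[18:]), (1000, raw[:18]), (1000, raw[:18])]
request = parse_http_request(reassemble(segments))
print(json.dumps(request))
```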

Preferably, in step S12, all the uniform resource locators are screened to determine the target access links, wherein the screening comprises any one or any combination of deduplication, static resource removal, similar uniform resource locator removal, and sensitive domain name removal. Applying these screening processes in combination yields real and valid access links. An identification strategy is used to verify the authenticity of each access link; it mainly comprises checking the response status code, checking whether the parameters in the URL carry attack payloads or other harmful information, checking the file type the URL accesses, checking the access request method, and checking for repetition.
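As a non-limiting illustration, the four screening processes can be combined in a single pass as sketched below. The suffix list, the sensitive domain set and the notion of similarity used here (same host, same path once numeric segments are replaced by a placeholder, same set of parameter names) are assumptions chosen for this example; the application leaves the concrete rules to configuration.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Hypothetical rule data for the example.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico", ".svg")
SENSITIVE_DOMAINS = {"admin.example.com"}


def similarity_key(url):
    """Key under which URLs differing only in ids or parameter values collide."""
    parts = urlsplit(url)
    path = re.sub(r"/\d+(?=/|$)", "/{n}", parts.path)
    names = tuple(sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}))
    return (parts.netloc.lower(), path, names)


def screen(urls):
    """Apply deduplication, sensitive-domain, static-resource and similarity filters."""
    seen_urls, seen_keys, kept = set(), set(), []
    for url in urls:
        if url in seen_urls:
            continue  # exact duplicate
        seen_urls.add(url)
        parts = urlsplit(url)
        if parts.hostname in SENSITIVE_DOMAINS:
            continue  # sensitive domain name
        if parts.path.lower().endswith(STATIC_SUFFIXES):
            continue  # static resource
        key = similarity_key(url)
        if key in seen_keys:
            continue  # similar to a URL already kept
        seen_keys.add(key)
        kept.append(url)
    return kept


candidates = [
    "http://shop.example.com/item/1?ref=a",
    "http://shop.example.com/item/1?ref=a",
    "http://shop.example.com/item/2?ref=b",
    "http://shop.example.com/static/app.js",
    "http://admin.example.com/login",
    "http://shop.example.com/cart",
]
print(screen(candidates))
```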

Preferably, a production consumption cycle for grabbing the target task is executed according to the configuration items related to the target task. The production consumption cycle runs in multi-process mode. The producer is responsible for crawling data; the consumer cleans the crawled data, filters out useless data, and adds the data that meets the task configuration requirements to the producer queue. The producer is mainly responsible for the network requests, which are executed as asynchronous coroutines to ensure performance: when the amount of data in the queue reaches a preset value, a coroutine event loop is started to complete the network requests efficiently.
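As a non-limiting illustration of the production consumption cycle, the sketch below runs the producer's requests as asynchronous coroutines one batch at a time and lets the consumer feed newly found URLs back into the producer queue. It is deliberately simplified: it runs in a single process rather than in multi-process mode, the network request is replaced by a lookup in a hypothetical in-memory `FAKE_SITE` table, and `BATCH_SIZE` and `max_pages` stand in for the preset value and the grab quantity.

```python
import asyncio

BATCH_SIZE = 3  # hypothetical preset value that triggers one event-loop run

# Stand-in for the network: page URL -> links found on that page.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
}


async def fetch(url):
    """Placeholder for a real asynchronous http request."""
    await asyncio.sleep(0)
    return FAKE_SITE.get(url, [])


async def produce(batch):
    """Producer: issue all requests of one batch concurrently."""
    pages = await asyncio.gather(*(fetch(url) for url in batch))
    return list(zip(batch, pages))


def consume(results, seen, max_pages):
    """Consumer: drop already-seen links and respect the grab-quantity limit."""
    fresh = []
    for _url, links in results:
        for link in links:
            if link not in seen and len(seen) < max_pages:
                seen.add(link)
                fresh.append(link)
    return fresh


def run_cycle(start_urls, max_pages=100):
    seen = set(start_urls)
    queue = list(start_urls)
    while queue:
        batch, queue = queue[:BATCH_SIZE], queue[BATCH_SIZE:]
        results = asyncio.run(produce(batch))  # one coroutine event loop per batch
        queue.extend(consume(results, seen, max_pages))
    return seen


crawled = run_cycle(["http://example.com/"])
print(sorted(crawled))
```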

Preferably, in step S14, the source data with the same format as the actively captured data is added to the active grab message queue in a quantitative flow data manner, and a new message queue is determined; the source data in the new message queue is then actively processed and captured according to the configuration items related to the target task to determine the target access links. Adding data to the message queue in a quantitative flow data manner improves system performance, reduces system load, and helps the active and passive capture tasks complete efficiently. The message queue of the active grab task thus contains the source data of the passive grab task; the grab information is then actively processed according to the configuration items related to the target task, the target access links are determined, and the data is captured, so that the active and passive capture tasks are completed efficiently.
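The application does not define the quantitative flow data manner in detail. One possible reading, sketched below purely as an assumption, is that source data is streamed into the message queue in batches of a fixed size with an optional pause between batches, so that the queue is never flooded at once. The function name, the `batch_size` and `pause_s` parameters and the use of an in-process `queue.Queue` in place of a real message broker are all choices made for this example.

```python
import queue
import time
from itertools import islice


def enqueue_in_batches(messages, mq, batch_size, pause_s=0.0, on_batch=None):
    """Stream messages into mq a fixed quantity at a time; return the total sent."""
    iterator = iter(messages)
    sent = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return sent
        for message in batch:
            mq.put(message)  # blocks if mq is bounded and currently full
        sent += len(batch)
        if on_batch is not None:
            on_batch(len(batch))
        if pause_s:
            time.sleep(pause_s)


mq = queue.Queue()
batch_sizes = []
total = enqueue_in_batches(
    (f"http://example.com/page/{i}" for i in range(7)),
    mq,
    batch_size=3,
    on_batch=batch_sizes.append,
)
print(total, batch_sizes)
```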

FIG. 2 illustrates a system architecture framework diagram for crawling data provided in accordance with an aspect of the subject application, wherein the system comprises: the system comprises a passive grabbing module 11, a crawler identifying module 12, a data integration module 13 and an active grabbing module 14, wherein the passive grabbing module 11 is used for acquiring access flow and an access log according to a target task, and analyzing the access flow and the access log to determine all uniform resource locators; the crawler identifying module 12 is configured to screen out target access links from all uniform resource locators; the data integration module 13 is configured to add the filtered uniform resource locator to the configuration item related to the target task, and integrate the uniform resource locator into source data having the same format as the active capture data; the active grab module 14 is configured to add the source data to an active grab message queue for task consumption to determine a target access link. Therefore, all links can be analyzed efficiently, the universal applicability of the analyzed codes is guaranteed, the development and maintenance cost is reduced, and the requirement for quickly crawling massive webpages is met.

Specifically, the passive grabbing module 11 is configured to obtain access traffic and an access log according to a target task, and to parse the access traffic and the access log to determine all uniform resource locators. Here, the target task is a task that requires data crawling under the current business demand, and obtaining access traffic and obtaining access logs are both passive ways of crawling data. Parsing the access log can reveal many interfaces that are hard to find by parsing web pages, such as requests hidden in js and css and dynamically generated requests, and the access log also makes it possible to obtain requests carried in encrypted traffic efficiently. Preferably, once a decryption position has been preset, the encrypted traffic itself can be parsed to determine the URLs.

The crawler identification module 12 is configured to screen out target access links from all the uniform resource locators. Here, the uniform resource locators acquired by the passive grabbing module 11 include false proxy network addresses; the crawler identification module 12 screens all the uniform resource locators according to a preset rule and then selects the target access links, so as to obtain real and valid access links.

The data integration module 13 is configured to add the filtered uniform resource locator to the configuration item related to the target task, and integrate the uniform resource locator into source data having the same format as the active capture data. Here, the data integration module 13 adds the url filtered by the crawler identification module 12 to the configuration items related to the target task, such as the data category for limiting capturing, for example, parsing js, capturing pictures, audio, video, and the like. The integration of the active and passive data capturing is efficiently completed by integrating the source data with the same format as the active data capturing format, so that the subsequent processing of the active and passive captured data is facilitated.

The active grab module 14 is configured to add the source data to an active grab message queue for task consumption to determine a target access link. Here, the message queue of the active grab task is executed in a cycle, and after receiving the message data of the task queue, the active grab module 14 in the cycle actively processes the grab task according to the task configuration item and executes the production consumption cycle, so that the passive grab task and the active grab task are effectively combined.

FIG. 3 is a schematic flow chart of a method for crawling data in a preferred embodiment of the present application. Preferably, the active and passive crawler system further includes a task processing module, and the crawler identification module 12 is a real-access identification module. When a visitor accesses a web page, the task processing module generates a data crawling task and writes it to the message queue, and the active grabbing module 14 executes the active crawling tasks in the message queue in a loop. Meanwhile, the passive grabbing module 11 obtains the access log records of the visitor's web page accesses and obtains the access traffic by mirroring; it then parses the access log records according to the preset rule and performs reassembly analysis on the access traffic to obtain all URLs. The crawler identification module 12 screens all the URLs to obtain the website links visited by the visitor and passes the screened URLs to the data integration module 13. The data integration module 13 then adds the screened URLs to the configuration items related to the data crawling task, integrates them into source data with the same format as the actively captured data, and adds that source data to the message queue to determine a new message queue. Finally, after receiving the message data of the task queue, the active grabbing module 14, which runs in a loop, actively processes the grab task according to the task configuration items and executes the production consumption cycle, so that the passive grab task and the active grab task are effectively combined to actively capture the data of the target network links.

Preferably, the active grabbing process executed in the loop is as follows: after acquiring a crawling task, the crawler program starts from the received Uniform Resource Locators (URLs) of one or more initial web pages, obtains the web page content of each initial URL by issuing an http request, performs hypertext markup language (HTML) parsing on the content, obtains the tag content for a set of tags with link properties (such as <a> and <img> tags), and then extracts links from the tag content (usually by regular expression matching). Throughout the capture process, page content is repeatedly fetched from the URLs obtained so far, and new URLs are extracted from the fetched HTML pages, until a stopping condition is met, for example when all URLs have been obtained.
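As a non-limiting illustration of the link extraction performed during active grabbing, the sketch below collects links from tags with link properties. It uses the Python standard library HTML parser in place of the regular expression matching mentioned above, which is a substitution made for this example; the tag-to-attribute table `LINK_ATTRS` and the function names are likewise assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags with link properties and the attribute that carries the link (illustrative list).
LINK_ATTRS = {"a": "href", "img": "src", "link": "href", "script": "src", "iframe": "src"}


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from link-bearing tags of one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


page = (
    '<a href="/about">About</a>'
    '<img src="logo.png">'
    '<script>location.href = "/hidden";</script>'
)
links = extract_links(page, "http://example.com/index.html")
print(links)
```

In this example the address "/hidden" that appears only inside the script body is not returned, which is the kind of link that tag-based active grabbing misses and that the passively captured traffic and logs are intended to supply.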

In addition, the embodiment of the present application also provides a computer readable medium, on which computer readable instructions are stored, the computer readable instructions being executable by a processor to implement the foregoing method for crawling data.

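An embodiment of the present application also provides a device for crawling data, wherein the device comprises:

one or more processors; and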

a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of one of the aforementioned methods of crawling data.

For example, the computer readable instructions, when executed, cause the one or more processors to:

acquiring access flow and an access log according to a target task, and analyzing the access flow and the access log to determine all uniform resource locators; screening target access links from all uniform resource locators; adding the screened uniform resource locators into configuration items related to the target tasks, and integrating the configuration items into source data with the same format as the active capture data; and adding the source data into an active grab message queue for task consumption to determine a target access link.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.

In addition, a portion of the present application may be implemented as a computer program product, such as computer program instructions which, when executed by a computer, may invoke or provide the methods and/or technical solutions of the present application through the operation of the computer. Program instructions that invoke the methods of the present application may be stored on a fixed or removable recording medium, transmitted via a data stream on a broadcast or other signal-bearing medium, and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, cause the apparatus to perform the methods and/or technical solutions according to the aforementioned embodiments of the present application.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims

1. A method of crawling data, wherein the method comprises:

screening target access links from all uniform resource locators;

2. The method of claim 1, wherein the obtaining access traffic and an access log according to the target task comprises:

3. The method of claim 1, wherein the parsing the access traffic and the access log to determine all uniform resource locators comprises:

4. The method of claim 1, wherein the configuration items related to the target task comprise: a grab depth, a number of grabs, a source Internet Protocol (IP) address, and a grab limit.

5. The method of claim 3, wherein the performing reassembly analysis on the access traffic to determine access request data comprises:

6. The method of claim 1, wherein said screening out target access links from all uniform resource locators comprises:

7. The method of claim 1, wherein the method comprises:

8. The method of claim 1, wherein the adding the source data to an active grab message queue for task consumption to determine a target access link comprises:

adding the source data with the same format as the actively captured data into the actively captured message queue in a quantitative flow data mode, and determining a new message queue;

and actively processing and capturing source data in the new message queue according to the configuration items related to the target task to determine a target access link.

9. A system for crawling data, wherein the system comprises: a passive grabbing module, a crawler identification module, a data integration module, and an active grabbing module, wherein,

10. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of any one of claims 1 to 8.

11. An apparatus for crawling data, wherein the apparatus comprises:

one or more processors; and

a memory storing computer readable instructions that, when executed, cause the one or more processors to perform the operations of the method of any one of claims 1 to 8.