CN113609411A - Method for crawling page information through web crawler

Info

Publication number: CN113609411A
Application number: CN202110710973.4A
Authority: CN
Inventors: 王嵩森; 杜邦豪; 刘加勇
Original assignee: Beijing Huayuan Information Technology Co Ltd
Current assignee: Beijing Huayuan Information Technology Co Ltd
Priority date: 2021-06-25
Filing date: 2021-06-25
Publication date: 2021-11-05
Anticipated expiration: 2041-06-25
Also published as: CN113609411B

Abstract

Embodiments of the present disclosure provide a method, apparatus, device, and computer-readable storage medium for crawling page information by a web crawler. The method comprises: obtaining corresponding webpage information according to a webpage loading request; analyzing the webpage information and determining the URL information contained in the webpage; and crawling the page information corresponding to the URLs contained in the webpage. In this way, the method is suitable for crawling information from all current websites, and anti-crawling mechanisms are effectively avoided.

Description

Method for crawling page information through web crawler

Technical Field

Embodiments of the present disclosure relate generally to the field of data processing, and more particularly, to a method, apparatus, device, and computer-readable storage medium for crawling page information by a web crawler.

Background

There are many kinds of crawlers based on the Python language, and the most widely used at present are the following:

requests + Beautiful Soup (synchronous);

using current.

aiohttp + asyncio + requests + Beautiful Soup (asynchronous);

the Scrapy framework.

The above approaches could effectively acquire the relevant information of web pages in the past. With the rapid development of the network in recent years, however, the format of web pages has changed: various web frameworks (such as Vue) have emerged and matured, and the identification of crawlers and automated access by anti-crawling mechanisms has gradually strengthened, so that quickly and accurately acquiring the relevant information has become more difficult. The above approaches have serious shortcomings in information acquisition, deep crawling, evading anti-crawling mechanisms, and the like, and cannot adapt to the implementation and development of subsequent crawlers.

Disclosure of Invention

According to an embodiment of the present disclosure, a scheme for crawling page information by a web crawler is provided.

In a first aspect of the disclosure, a method of crawling page information by a web crawler is provided.

The method comprises the following steps:

acquiring corresponding webpage information according to the webpage loading request;

analyzing the webpage information and determining url information contained in the webpage;

and crawling page information corresponding to url contained in the webpage.

Further, the acquiring the corresponding webpage information according to the webpage loading request includes:

capturing a webpage loading request through Hook, and acquiring a url corresponding to the webpage loading request;

and acquiring webpage information through the url.

Further, the analyzing the web page information and determining url information included in the web page includes:

if the webpage comprises click events, triggering all the click events one by one, refreshing the webpage, capturing the webpage loading requests through Hook, and acquiring the URL information corresponding to the click events.

Further, the crawling of the page information corresponding to the url contained in the webpage includes:

crawling the page information through Ajax rendering;

the page information comprises the requests and tags of the page and/or the requests and tags of deeper-level pages.

Further, the method also includes:

if the click event is triggered, monitoring the time for crawling page information;

if the crawling time exceeds a preset threshold, judging that the current page is a special page;

analyzing the special page to determine the type of the special page; the types of the special pages comprise verifiable pages and non-verifiable pages;

if the special page is a verifiable page, simulating manual input to complete verification and/or login, and crawling the page information; the verifiable page comprises account password verification, verification code verification and/or character verification;

and if the special page is a non-verifiable page, performing bug processing.

Further, the method also includes:

if the URL contained in the current page triggers jumps to a plurality of pages, concurrently crawling the page information of the plurality of pages according to a preset concurrency number and crawling depth.

In a second aspect of the disclosure, an apparatus for crawling page information by a web crawler is provided.

The device includes:

the acquisition module is used for acquiring corresponding webpage information according to the webpage loading request;

the analysis module is used for analyzing the webpage information and determining url information contained in the webpage;

and the crawling module is used for crawling the page information corresponding to the url contained in the webpage.

In a third aspect of the disclosure, an electronic device is provided. The electronic device includes: a memory having a computer program stored thereon and a processor implementing the method as described above when executing the program.

In a fourth aspect of the present disclosure, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the method according to the first aspect of the present disclosure.

According to the method for crawling page information by a web crawler provided by the present disclosure, the corresponding webpage information is acquired according to the webpage loading request; the webpage information is analyzed and the URLs contained in the webpage are determined; and the page information corresponding to the URLs contained in the webpage is crawled. Anti-crawling mechanisms can be effectively avoided, the required information can be crawled quickly and accurately, information can be crawled from any website, and the requirement of adapting to multiple web frameworks is met. Meanwhile, in the course of continuous crawling, the request frequency can be continuously analyzed and adjusted according to the processing of the URLs, which avoids risks such as stalls, crashes, and IP blocking caused by website monitoring.

It should be understood that the content described in this section is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:

FIG. 1 illustrates a schematic diagram of an exemplary operating environment in which embodiments of the present disclosure can be implemented;

FIG. 2 illustrates a flow diagram of a method of crawling page information by a web crawler according to an embodiment of the present disclosure;

FIG. 3 illustrates a block diagram of an apparatus for crawling page information by a web crawler according to an embodiment of the present disclosure;

FIG. 4 illustrates a block diagram of an exemplary electronic device capable of implementing embodiments of the present disclosure.

Detailed Description

To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments that a person skilled in the art can derive from the embodiments disclosed herein without creative effort fall within the protection scope of the present disclosure.

In addition, the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.

FIG. 1 illustrates a schematic diagram of an exemplary operating environment 100 in which embodiments of the present disclosure can be implemented. The operating environment 100 includes a client 101, a network 102, and a server 103.

It should be understood that the numbers of clients, networks, and servers in FIG. 1 are merely illustrative. There may be any number of clients, networks, and servers, as required by the implementation. In particular, where the target data does not need to be acquired from a remote location, the above system architecture may include only a client or a server, without a network.

FIG. 2 shows a flowchart 200 of a method for crawling page information by a web crawler according to an embodiment of the present disclosure. As shown in FIG. 2, the method for crawling page information by a web crawler includes:

S210, acquiring corresponding webpage information according to the webpage loading request.

In this embodiment, an execution subject (for example, a server shown in fig. 1) of the method for crawling page information by a web crawler may acquire the page information in a wired manner or a wireless connection manner.

Optionally, a conventional crawler cannot accurately and effectively acquire comprehensive data from a web page rendered by a newer web framework (such as Vue); therefore, in the present disclosure, a headless browser is used to crawl the required page information.

Optionally, in the present disclosure, a plurality of new URLs may be formed based on the initial URL in the crawler crawling task, that is, the present disclosure may decompose the initial URL into a plurality of URLs, and may form one crawler crawling task object for each new URL.

Optionally, the maximum number of new URLs decomposed from the initial URL may be limited, that is, the maximum number of crawler crawling task objects corresponding to one crawler crawling task may be limited (a concurrency limit). For example, when it is preset that one crawler crawling task can form 50 crawler crawling task objects at a time, the initial URL in the crawler crawling task is decomposed into 50 new URLs, URL_0, URL_1, ..., URL_49, and one crawler crawling task object is formed for each of them, giving 50 crawler crawling task objects: crawler crawling task object 0, crawler crawling task object 1, ..., crawler crawling task object 49.
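The cap on task objects can be illustrated with a minimal sketch. The disclosure does not specify any data structure, so the `CrawlTaskObject` class, the `build_task_objects` function, and the example.com URLs below are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List

MAX_TASK_OBJECTS = 50  # per-task cap taken from the example above


@dataclass(frozen=True)
class CrawlTaskObject:
    """One crawl unit derived from the initial URL (hypothetical structure)."""
    index: int
    url: str


def build_task_objects(new_urls: Iterable[str],
                       limit: int = MAX_TASK_OBJECTS) -> List[CrawlTaskObject]:
    """Create at most `limit` task objects, one per new URL, in input order."""
    tasks = []
    for index, url in enumerate(new_urls):
        if index >= limit:
            break  # cap reached; the remaining URLs are not turned into tasks here
        tasks.append(CrawlTaskObject(index=index, url=url))
    return tasks


# Usage: 60 candidate URLs are capped at 50 task objects (index 0 to 49).
candidates = [f"https://example.com/page/{i}" for i in range(60)]
task_objects = build_task_objects(candidates)
print(len(task_objects), task_objects[0].url, task_objects[-1].url)
```

How the URLs beyond the cap are handled (queued for a later batch or dropped) is not stated in the disclosure; the sketch simply leaves them out.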

Optionally, the new URL formed based on the initial URL may be the URL of a primary page or the URL of a secondary page. That is, the current page can be crawled, and the sub-links of the current page can also be crawled in depth.

Optionally, when the webpage information is acquired by inputting a target URL, acquiring the webpage information through the URL includes: capturing the webpage loading request through Hook, acquiring the URL corresponding to the webpage, and acquiring the webpage information through the URL.
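The disclosure does not say how the Hook is implemented. One plausible reading is a callback that the headless browser invokes for every outgoing request. The sketch below models only that callback side; the `RequestHook` class, its `accept` filter, and the simulated events are assumptions, and no real browser is driven.

```python
from typing import Callable, List


class RequestHook:
    """Collects the URLs of page-load requests reported by a browser driver."""

    def __init__(self, accept: Callable[[str], bool] = lambda url: True) -> None:
        self._accept = accept
        self.captured: List[str] = []

    def on_request(self, url: str) -> None:
        """Callback the driver would invoke for every outgoing request."""
        if self._accept(url) and url not in self.captured:
            self.captured.append(url)


# Usage with simulated events; a real driver would call hook.on_request itself.
hook = RequestHook(accept=lambda url: url.startswith(("http://", "https://")))
for event_url in ["https://example.com/", "data:image/png;base64,AAAA",
                  "https://example.com/api/items", "https://example.com/"]:
    hook.on_request(event_url)
print(hook.captured)
```

Headless-browser libraries generally expose some form of request event to which such a callback can be attached; that wiring depends on the library and is omitted here.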

Optionally, when the webpage information is obtained through the URL, the URL is preprocessed through big data analysis; for example, the URL or the URL queue (URLs) undergoes deduplication, wrong-URL filtering, website deduplication filtering, and/or abnormal-URL filtering.
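The preprocessing step can be sketched with the Python standard library. The rules used here (only http and https URLs are kept, fragments are ignored, and host names are compared case-insensitively) are assumptions made for illustration; the disclosure does not define what counts as a wrong or abnormal URL.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urlsplit


def preprocess_urls(urls: Iterable[str]) -> List[str]:
    """Return the valid http(s) URLs from `urls`, deduplicated in first-seen order."""
    seen = set()
    cleaned = []
    for raw in urls:
        url, _fragment = urldefrag(raw.strip())  # "#section" does not change the page
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # wrong or abnormal URL: filter it out
        key = (parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query)
        if key in seen:
            continue  # duplicate URL
        seen.add(key)
        cleaned.append(url)
    return cleaned


queue = [
    "https://example.com/a",
    "https://EXAMPLE.com/a#top",   # same page as the first entry
    "javascript:void(0)",          # not a fetchable URL
    "https://example.com/b?x=1",
    "htp:/broken",                 # malformed
]
print(preprocess_urls(queue))
```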

S220, analyzing the webpage information and determining url information contained in the webpage.

Optionally, the acquired web page information is analyzed, an event type included in the current web page is determined, and url information is acquired according to the event type.

Specifically, if the webpage comprises click events, manual clicks are simulated to trigger all click events one by one, clicking as many of the clickable elements on the interface as possible; the webpage is refreshed, the webpage loading requests are captured through Hook, and the URLs corresponding to the click events are acquired.
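As a rough illustration of the analysis step, the sketch below scans static HTML for anchors and for elements that declare an inline `onclick` handler. The `PageAnalyzer` class and the sample markup are hypothetical. Handlers attached by scripts at run time are not visible to this kind of static scan, which is presumably why the disclosure triggers the events inside a browser instead.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class PageAnalyzer(HTMLParser):
    """Collects static links and elements that declare an inline click handler."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []
        self.click_targets: List[Tuple[str, str]] = []  # (tag, onclick source)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        href = attributes.get("href")
        if tag == "a" and href:
            self.links.append(urljoin(self.base_url, href))
        onclick = attributes.get("onclick")
        if onclick:
            self.click_targets.append((tag, onclick))


html = """
<a href="/news">News</a>
<button onclick="loadMore()">More</button>
<div onclick="openDetail(7)">Detail</div>
"""
analyzer = PageAnalyzer("https://example.com/")
analyzer.feed(html)
print(analyzer.links)
print(analyzer.click_targets)
```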

S230, crawling the page information corresponding to the URLs contained in the webpage.

Optionally, the page information is crawled through Ajax rendering according to a preset concurrency number (for example, 50), a crawling depth, and the acquired URL tags; the page information includes the requests and tags of the page and/or the requests and tags of deeper-level pages (primary and secondary pages).
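Crawling under a preset concurrency number and crawling depth can be sketched with `asyncio`. The in-memory `SITE` mapping and the `fetch_links` stand-in replace real page rendering and are assumptions; only the bounding of concurrency and depth is illustrated.

```python
import asyncio
from typing import Dict, List, Set

# Hypothetical in-memory "site": each URL maps to the URLs found on that page.
SITE: Dict[str, List[str]] = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/a"],
    "/a/1": ["/deep"],
    "/a/2": [],
    "/deep": [],
}


async def fetch_links(url: str) -> List[str]:
    """Stand-in for rendering a page and extracting its URLs."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return SITE.get(url, [])


async def crawl(start: str, max_depth: int, concurrency: int) -> Set[str]:
    """Breadth-first crawl with at most `concurrency` fetches in flight."""
    limiter = asyncio.Semaphore(concurrency)
    crawled: Set[str] = set()
    seen: Set[str] = {start}
    frontier = [start]

    async def guarded_fetch(url: str) -> List[str]:
        async with limiter:
            return await fetch_links(url)

    for _depth in range(max_depth + 1):
        if not frontier:
            break
        results = await asyncio.gather(*(guarded_fetch(u) for u in frontier))
        crawled.update(frontier)
        next_frontier = []
        for links in results:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return crawled


pages = asyncio.run(crawl("/", max_depth=1, concurrency=50))
print(sorted(pages))
```

With `max_depth=1` the start page and its direct links are crawled, while pages two levels down are discovered but not fetched.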

Optionally, the crawling task is monitored by a monitoring program such as a task manager, and the concurrency number can be modified and controlled in real time according to the monitoring information returned (displayed) by the monitoring program, so as to prevent network paralysis caused by excessive requests.

Optionally, data such as the URL of the current request is analyzed through methods such as big data analysis to identify the anti-crawling response to the current crawling task and any special handling required by the current task (advertisements and the like); the concurrent acquisition speed is updated and adjusted in real time, so as to prevent the service IP from being banned because of an excessive request rate.
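The disclosure does not state the rule by which the concurrency is adjusted. The sketch below assumes one simple policy: halve the limit when a response suggests throttling or blocking, and raise it by one after a healthy response. The `AdaptiveThrottle` class and the choice of HTTP status codes 403, 429, and 503 as signals are assumptions.

```python
class AdaptiveThrottle:
    """Adjusts the allowed concurrency from observed response status codes."""

    def __init__(self, initial: int = 50, minimum: int = 1, maximum: int = 50) -> None:
        self.minimum = minimum
        self.maximum = maximum
        self.concurrency = initial

    def record(self, status_code: int) -> int:
        """Update the limit after one response and return the new value."""
        if status_code in (403, 429, 503):
            # Likely throttling or blocking: back off sharply.
            self.concurrency = max(self.minimum, self.concurrency // 2)
        elif 200 <= status_code < 300:
            # Healthy response: recover slowly.
            self.concurrency = min(self.maximum, self.concurrency + 1)
        return self.concurrency


throttle = AdaptiveThrottle()
history = [throttle.record(code) for code in (200, 429, 429, 200, 200)]
print(history)
```

Backing off quickly and recovering slowly keeps the request rate low for a while after a warning sign, which is one way to reduce load on the target site.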

Optionally, the time spent crawling the page information is monitored while the click events are triggered. If the crawling time exceeds a preset threshold (for example, 5 minutes), the current page is judged to be a special page, and the special page is analyzed to determine its type; the types of special pages comprise verifiable pages and non-verifiable pages.

If the special page is a verifiable page, manual input is simulated to complete verification and/or login, and the page information is crawled; for example:

login is performed through a user name and a password pre-stored in the system;

login is performed through means such as brute-forcing the account password;

characters (images) are recognized through image recognition technology to complete character (picture) verification, and the like.

If the special page is a non-verifiable page, bug processing is performed.
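The timing check and the split into verifiable and non-verifiable pages can be sketched as follows. Only the classification is shown. The `classify_page` function and the two text patterns used to recognise a verifiable page are hypothetical; the disclosure does not describe how the page type is determined.

```python
import re

CRAWL_TIMEOUT_SECONDS = 5 * 60  # the example threshold given above

# Hypothetical markers; a real classifier would need site-specific rules.
VERIFIABLE_PATTERNS = (
    re.compile(r'type=["\']password["\']', re.IGNORECASE),  # account/password form
    re.compile(r"captcha", re.IGNORECASE),                  # verification code
)


def classify_page(elapsed_seconds: float, html: str) -> str:
    """Return 'normal', 'verifiable' or 'non-verifiable' for a crawled page."""
    if elapsed_seconds <= CRAWL_TIMEOUT_SECONDS:
        return "normal"
    if any(pattern.search(html) for pattern in VERIFIABLE_PATTERNS):
        return "verifiable"
    return "non-verifiable"


print(classify_page(12.0, "<p>ok</p>"))
print(classify_page(400.0, '<input type="password" name="pw">'))
print(classify_page(400.0, "<p>still loading</p>"))
```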

Optionally, the crawling and/or bug-processing process is analyzed by a big data analysis method; if the analysis result is good, that is, the preset target information has been crawled, the crawling process and the corresponding web framework are stored. In this way, the requirement of adapting to multiple web frameworks is gradually met.

According to the embodiments of the present disclosure, effective information is acquired by a headless browser that automatically recognizes verification codes and simulates login. Detection by anti-crawling mechanisms can be avoided to the greatest extent, and the corresponding effective information can be acquired in a timely, effective, and accurate manner. Compared with the traditional crawler approaches, the headless browser has the following major advantages:

1. The code is simple and convenient, concurrency is fully supported, and information is acquired quickly.

2. Operations are performed by simulating manual actions, so anti-crawler measures can be effectively avoided and most manual verification and detection steps can be completed.

3. Machine learning: data is accumulated during continuous crawling, and the crawl records for each framework are effectively tallied, which makes it easy to quickly identify the optimal crawling strategy.

4. Information such as verification codes, login codes, and scan codes is effectively recognized and quickly entered and processed to complete brute forcing.

It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.

The above is a description of embodiments of the method, and the embodiments of the apparatus are further described below.

FIG. 3 illustrates a block diagram of an apparatus 300 for crawling page information by a web crawler according to an embodiment of the present disclosure. As shown in FIG. 3, the apparatus 300 includes:

an obtaining module 310, configured to obtain corresponding web page information according to the web page loading request;

the analysis module 320 is configured to analyze the webpage information and determine url information included in the webpage;

and the crawling module 330 is configured to crawl page information corresponding to urls included in the web page.
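The division into three modules can be sketched as three small classes wired in sequence. This is an illustration only: the class names mirror modules 310, 320, and 330, while the `PAGES` mapping, the `loader` function, and the use of `str.split` as a URL extractor are hypothetical stand-ins for a browser and a parser.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AcquisitionModule:
    """Module 310: returns page content for a page-load request."""
    load: Callable[[str], str]

    def acquire(self, url: str) -> str:
        return self.load(url)


@dataclass
class AnalysisModule:
    """Module 320: determines the URLs contained in a page."""
    extract: Callable[[str], List[str]]

    def analyze(self, page: str) -> List[str]:
        return self.extract(page)


@dataclass
class CrawlingModule:
    """Module 330: crawls the page information behind each URL."""
    load: Callable[[str], str]
    results: Dict[str, str] = field(default_factory=dict)

    def crawl(self, urls: List[str]) -> Dict[str, str]:
        for url in urls:
            self.results[url] = self.load(url)
        return self.results


# Hypothetical stand-ins for a browser and a parser.
PAGES = {"/": "/a /b", "/a": "page A", "/b": "page B"}
loader = PAGES.__getitem__
acquisition = AcquisitionModule(load=loader)
analysis = AnalysisModule(extract=str.split)
crawling = CrawlingModule(load=loader)

page = acquisition.acquire("/")
print(crawling.crawl(analysis.analyze(page)))
```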

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the described module may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.

FIG. 4 shows a schematic block diagram of an electronic device 400 that may be used to implement embodiments of the present disclosure. As shown in FIG. 4, the device 400 includes a central processing unit (CPU) 401 that may perform various appropriate actions and processes in accordance with computer program instructions stored in a read-only memory (ROM) 402 or loaded from a storage unit 408 into a random access memory (RAM) 403. The RAM 403 can also store various programs and data required for the operation of the device 400. The CPU 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.

A number of components in the device 400 are connected to the I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, or the like; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408 such as a magnetic disk, optical disk, or the like; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the device 400 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks.

The CPU 401 executes the various methods and processes described above. For example, in some embodiments, the method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the CPU 401, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the CPU 401 may be configured to perform the method by any other suitable means (e.g., by way of firmware).

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims, and the scope of the invention is not limited thereto, as modifications and substitutions may be readily made by those skilled in the art without departing from the spirit and scope of the invention as disclosed herein.

Claims

1. A method for crawling page information by a web crawler, comprising:

and crawling page information corresponding to url contained in the webpage.

2. The method of claim 1, wherein the obtaining the corresponding web page information according to the web page loading request comprises:

and acquiring webpage information through the url.

3. The method of claim 2, wherein analyzing the web page information and determining url information contained in the web page comprises:

4. The method of claim 3, wherein crawling page information corresponding to urls contained in the web page comprises:

crawling is performed through Ajax rendering, and page information is crawled;

5. The method of claim 4, further comprising:

and if the special page is a non-verifiable page, performing bug processing.

6. The method of claim 1, further comprising:

7. An apparatus for crawling page information by a web crawler, comprising:

8. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, wherein the processor, when executing the program, implements the method of any of claims 1-6.

9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method of any one of claims 1 to 6.