CN108804657A - A WebMagic-based zero-code crawler system with configurable crawling rules

Info

Publication number: CN108804657A
Application number: CN201810585918.5A
Authority: CN
Inventors: 王浩
Original assignee: Shenzhen Dingfeng Cattle Technology Co Ltd
Current assignee: Shenzhen Dingfeng Cattle Technology Co Ltd
Priority date: 2018-06-08
Filing date: 2018-06-08
Publication date: 2018-11-13

Abstract

The present invention provides a WebMagic-based zero-code crawler system with configurable crawling rules. Without writing code, information is extracted by arranging a download scheduling module, a parsing and processing module, a crawl comparison module, and an extraction and storage module in the Spider component of WebMagic. The download scheduling module downloads target pages from the Internet to obtain target information; the parsing and processing module parses the target pages, extracts the target information, and discovers new URL links; the crawl comparison module manages the URL links to be crawled and removes duplicate URL links; and the extraction and storage module processes the target information. The invention abstracts the process of crawling web page information and persists it in a database, using expressions to describe how each piece of information is extracted, so that the whole process is converted into one or more database records. Writing crawler code for each requirement, as is done traditionally, thereby becomes a matter of database configuration, which reduces the number of releases and speeds up iteration.

Description

A WebMagic-based zero-code crawler system with configurable crawling rules

Technical field

The present invention relates to the technical field of intelligent crawling, and in particular to a WebMagic-based zero-code crawler system with configurable crawling rules.

Background art

A crawler is a program that automatically retrieves web page content and is an important component of a search engine. The name is a metaphor: starting from configured entry addresses, the program keeps extracting the links found on web pages, follows those links to extract further, previously unknown links at a deeper level, and continues in this way. This behavior resembles the movement of a crawling creature, so such a program is called a crawler.

A crawling system that extracts information with manually generated wrappers produces fairly accurate results. However, wrappers would have to be generated, updated, and maintained for thousands of websites on the Internet. An ordinary vertical crawler cannot handle this work well and can only rely on a large amount of manual effort.

Real-time crawling needs to be both safe and efficient. When high real-time performance is required, the crawler has to open connections and issue download requests to the target web server frequently. This puts heavy load on that server and may cause it to adopt defensive measures, such as denying access, in order to keep working normally, which in turn causes the crawl to fail. At the same time, highly real-time crawling consumes a great deal of hardware resources such as network bandwidth and servers, which increases cost.

With the growing popularity of AJAX and the emergence of single-page application frameworks such as AngularJS, more and more pages are rendered by JavaScript. For a crawler, the information presented on such pages is more difficult to handle: extracting only the HTML content often fails to yield the useful information.

Summary of the invention

In view of the shortcomings of existing approaches, the present invention proposes a WebMagic-based zero-code crawler system with configurable crawling rules, in order to solve the above problems of the prior art.

According to one aspect of the invention, a WebMagic-based zero-code crawler system with configurable crawling rules is provided. Without writing code, information is extracted by arranging a download scheduling module, a parsing and processing module, a crawl comparison module, and an extraction and storage module in the Spider component of WebMagic:

The download scheduling module downloads target pages from the Internet in order to obtain target information according to user requirements;

The parsing and processing module parses the target pages, extracts the target information, and discovers new URL links;

The crawl comparison module manages the URL links to be crawled and removes duplicate URL links;

The extraction and storage module processes and classifies the target information and stores it in a database and a server.

Further, the download scheduling module is preset with the time interval, network proxy, and network request header information used for obtaining information.

Further, the download scheduling module uses Apache HttpClient as the download tool for downloading target pages from the Internet.

Further, the download scheduling module is preset with a maximum number of concurrent downloads to avoid system congestion.

Further, the parsing and processing module uses Jsoup as the tool for parsing HTML, and uses Xsoup, which is based on Jsoup, as the tool for evaluating XPath.

Further, the crawl comparison module uses an in-memory queue of the JDK to manage URL links.

Further, the crawl comparison module uses a set to remove duplicate URL links.

Further, the crawl comparison module uses Redis for distributed management of the database and the server.

Compared with the prior art, the beneficial effects of the invention are as follows:

The invention abstracts the process of crawling web page information and persists it in a database, using expressions to describe how each piece of information is extracted, so that the whole process is converted into one or more database records. Writing crawler code for each requirement, as is done traditionally, thereby becomes a matter of database configuration, which reduces the number of releases and speeds up iteration.

Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.

Description of the drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a structural block diagram of a WebMagic-based zero-code crawler system with configurable crawling rules in an embodiment of the invention;

Fig. 2 is a flow chart of web page crawling in a vertical domain in an embodiment of the invention;

Fig. 3 is an abstracted flow chart of web page crawling in an embodiment of the invention;

Fig. 4 is a schematic diagram of the business object structure for web page crawling in an embodiment of the invention;

Fig. 5 is a flow chart of web page crawling as actually implemented in an embodiment of the invention.

Detailed description of the embodiments

To enable those skilled in the art to better understand the solution of the invention, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings of the embodiments.

Some of the flows described in the specification, the claims, and the above drawings contain multiple operations that appear in a particular order. It should be clearly understood, however, that these operations may be executed out of the order in which they appear herein, or in parallel. Operation numbers such as 101 and 102 serve only to distinguish the operations from one another and do not in themselves imply any execution order. In addition, these flows may include more or fewer operations, and the operations may be executed sequentially or in parallel. It should be noted that terms such as "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they do not indicate an order, nor do they require that the "first" and the "second" be of different types.

Obviously, the described embodiments are only some, rather than all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the invention without creative effort shall fall within the scope of protection of the invention.

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the field to which the invention belongs. It should also be understood that terms such as those defined in general dictionaries are to be interpreted as having a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as they are here, are not to be interpreted in an idealized or overly formal sense.

Embodiment

As shown in Fig. 1, an embodiment of the invention provides a WebMagic-based zero-code crawler system with configurable crawling rules. Without writing code, information is extracted by arranging a download scheduling module A101, a parsing and processing module A102, a crawl comparison module A103, and an extraction and storage module A104 in the Spider component of WebMagic:

The download scheduling module A101 downloads target pages from the Internet in order to obtain target information according to user requirements;

The download scheduling module A101 uses Apache HttpClient as the download tool for downloading target pages from the Internet.

The download scheduling module A101 is preset with a maximum number of concurrent downloads to avoid system congestion.

The download scheduling module A101 is preset with the time interval, network proxy, and network request header information used for obtaining information.
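As a non-limiting illustration, the preset values of the download scheduling module (time interval, network proxy, request headers, maximum concurrency) might be held in a small settings object, with a semaphore enforcing the concurrency limit. The sketch below uses only the JDK. The class names `DownloadSettings` and `BoundedDownloader` are hypothetical and do not reproduce any WebMagic or Apache HttpClient class, and the actual HTTP request is replaced by a caller-supplied function so that the example runs offline.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical holder for the preset download settings named in the text. */
final class DownloadSettings {
    final long intervalMillis;          // pause after each request
    final String proxy;                 // "host:port", or null for a direct connection
    final Map<String, String> headers;  // request headers, e.g. User-Agent
    final int maxConcurrency;           // upper bound on simultaneous downloads

    DownloadSettings(long intervalMillis, String proxy,
                     Map<String, String> headers, int maxConcurrency) {
        if (maxConcurrency < 1) {
            throw new IllegalArgumentException("maxConcurrency must be at least 1");
        }
        this.intervalMillis = intervalMillis;
        this.proxy = proxy;
        this.headers = Collections.unmodifiableMap(new LinkedHashMap<>(headers));
        this.maxConcurrency = maxConcurrency;
    }
}

/** Wraps a fetch function so that at most maxConcurrency calls run at the same time. */
final class BoundedDownloader {
    private final DownloadSettings settings;
    private final Function<String, String> fetch;  // stand-in for the real HTTP client
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    BoundedDownloader(DownloadSettings settings, Function<String, String> fetch) {
        this.settings = settings;
        this.fetch = fetch;
        this.permits = new Semaphore(settings.maxConcurrency);
    }

    String download(String url) {
        permits.acquireUninterruptibly();          // blocks while the limit is reached
        try {
            peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
            String body = fetch.apply(url);
            pause(settings.intervalMillis);        // preset interval between requests
            return body;
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    int peakConcurrency() {
        return peak.get();
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Runs the given number of downloads in parallel and reports the highest concurrency seen. */
    static int observedPeak(int threads, int maxConcurrency) {
        DownloadSettings settings = new DownloadSettings(
                5, null, Collections.<String, String>emptyMap(), maxConcurrency);
        BoundedDownloader downloader =
                new BoundedDownloader(settings, url -> "body of " + url);
        CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            String url = "http://example.com/page/" + i;
            new Thread(() -> {
                downloader.download(url);
                done.countDown();
            }).start();
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return downloader.peakConcurrency();
    }
}
```

Calling `BoundedDownloader.observedPeak(8, 2)` starts eight downloads but never lets more than two proceed at once. In a real deployment the fetch function would issue the HTTP request using the configured proxy and headers.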

The parsing and processing module A102 parses the target pages, extracts the target information, and discovers new URL links;

The parsing and processing module A102 uses Jsoup as the tool for parsing HTML, and uses Xsoup, which is based on Jsoup, as the tool for evaluating XPath.
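The idea of describing an extraction by an XPath expression can be illustrated without third-party libraries. The following sketch is an assumption-laden stand-in: it uses the JDK's built-in `javax.xml.xpath` package rather than Jsoup or Xsoup, so it only accepts well-formed XHTML, whereas the tools named in the embodiment tolerate real-world HTML. The sample page, the class name `XPathExtractor`, and the expressions are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathExtractor {
    /** Made-up, well-formed XHTML list page used only for this illustration. */
    static final String SAMPLE_LIST_PAGE =
            "<html><body><ul>"
          + "<li><a href=\"/teacher/1\">Alice</a></li>"
          + "<li><a href=\"/teacher/2\">Bob</a></li>"
          + "</ul></body></html>";

    /** Evaluates an XPath expression and returns the text of every matching node. */
    static List<String> extract(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                    (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // For attribute nodes this is the attribute value, for text nodes the text.
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot evaluate " + expression, e);
        }
    }
}
```

With the sample page, the expression `//li/a/@href` yields the detail-page links that would be handed to the crawl comparison module, and `//li/a/text()` yields the names shown in the list.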

The crawl comparison module A103 manages the URL links to be crawled and removes duplicate URL links;

The crawl comparison module A103 uses an in-memory queue of the JDK to manage URL links.

The crawl comparison module A103 uses a set to remove duplicate URL links.

The crawl comparison module A103 uses Redis for distributed management of the database and the server.
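A minimal sketch of the queue-plus-set arrangement described for the crawl comparison module is given below, assuming a single JVM. The class name `UrlFrontier` is hypothetical; WebMagic provides its own scheduler implementations, which this sketch does not reproduce. In a distributed deployment the two in-memory structures would be replaced by shared ones, for example in Redis, which is not shown here.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Pending-URL queue with duplicate removal, safe for use from several threads. */
final class UrlFrontier {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Enqueues the URL unless it was offered before; returns true if it was newly added. */
    boolean offer(String url) {
        if (seen.add(url)) {      // Set.add returns false for a duplicate
            pending.add(url);
            return true;
        }
        return false;
    }

    /** Returns the next URL to crawl, or null when nothing is pending. */
    String poll() {
        return pending.poll();
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Because a URL stays in the `seen` set after it has been polled, a link that is rediscovered on a later page is not crawled a second time.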

The extraction and storage module A104 processes and classifies the target information and stores it in a database and a server.

The invention abstracts the process of crawling web page information and persists it in a database, using expressions to describe how each piece of information is extracted, so that the whole process is converted into one or more database records. Writing crawler code for each requirement, as is done traditionally, thereby becomes a matter of database configuration, which reduces the number of releases and speeds up iteration.

When crawling web page information in a particular field, a crawler engineer typically issues a network request, extracts the required content from the returned HTML or JSON, and then saves the extracted information. In this process, the crawler may also issue requests to certain links on the page in order to extract further information. Moreover, websites in a vertical domain often present their information in the form of a list, as described below:

For example, suppose the target is the website of a training institution. One page of the site lists the names of all teachers and provides a link for each; clicking a link shows the details of that teacher's courses, such as course content and teaching hours, and the detail page may in turn contain links to still more detailed content. Crawling this information can be abstracted as follows: first obtain the information shown in the list, the most important part of which is the URL of each detail page; then loop over these URLs, issue a request to each, and extract information from the detail page; if there is a second level of detail, issue requests to the next-level URLs, and so on. The rough flow of this kind of vertical-domain web page crawling is shown in Fig. 2.
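The list-then-detail traversal described above can be sketched as a breadth-first walk. In the following illustration the website is replaced by a made-up in-memory map from each URL to the links found on that page, so no network access is involved; the class name `ListDetailCrawl` and all URLs are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ListDetailCrawl {
    /** Fake site: each URL maps to the links found on that page (made-up data). */
    static Map<String, List<String>> sampleSite() {
        Map<String, List<String>> site = new HashMap<>();
        site.put("/teachers", Arrays.asList("/teacher/1", "/teacher/2"));
        site.put("/teacher/1", Arrays.asList("/teacher/1/courses"));
        site.put("/teacher/2", Collections.<String>emptyList());
        site.put("/teacher/1/courses", Collections.<String>emptyList());
        return site;
    }

    /** Visits the list page first, then each level of detail pages, each page once. */
    static List<String> crawl(Map<String, List<String>> site, String listPage) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(listPage);
        seen.add(listPage);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            visited.add(url);  // a real crawler would download and extract here
            for (String link : site.getOrDefault(url, Collections.<String>emptyList())) {
                if (seen.add(link)) {  // skip links that were already queued
                    queue.add(link);
                }
            }
        }
        return visited;
    }
}
```

Starting from `/teachers`, the walk visits the list page, then both teacher detail pages, then the second-level course page, which matches the order described in the example.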

The web page crawling flow of Fig. 2 can be abstracted as shown in Fig. 3:

The abstracted crawling flow of Fig. 3 is as follows. First, a result set is created for a category of information; it defines which pieces of information are to be extracted from the web pages. Next, the addresses of the web pages to be crawled are specified; once a page address is available, WebMagic can be used to crawl it. When crawling web page information, the list page is usually crawled first, and the detail pages are crawled after their addresses have been obtained. In this repeated loop, an HTML information extraction tool assembles result records according to the previously defined result set, and these records are finally persisted.
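As a rough illustration of "define a result set, then assemble a record from each page", the sketch below represents a result set as a map from field name to extraction expression. Regular expressions are used here purely as an assumed stand-in for the XPath and similar expressions that the invention stores in the database; the class name `ResultAssembler`, the field names, and the sample markup are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ResultAssembler {
    /** Made-up result set for a teacher detail page: field name to extraction pattern. */
    static Map<String, Pattern> teacherResultSet() {
        Map<String, Pattern> resultSet = new LinkedHashMap<>();
        resultSet.put("name", Pattern.compile("<h1>(.*?)</h1>"));
        resultSet.put("hours", Pattern.compile("<span class=\"hours\">(.*?)</span>"));
        return resultSet;
    }

    /**
     * Applies each field's pattern to the page. The first capture group of the first
     * match becomes the field value; a field without a match is stored as null.
     */
    static Map<String, String> assemble(Map<String, Pattern> resultSet, String page) {
        Map<String, String> record = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> field : resultSet.entrySet()) {
            Matcher matcher = field.getValue().matcher(page);
            record.put(field.getKey(), matcher.find() ? matcher.group(1).trim() : null);
        }
        return record;
    }
}
```

Because the result set is plain data, it can be loaded from database rows instead of being hard-coded, which is the point of the configuration-driven approach.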

The invention abstracts the above process into five business objects: crawl set, crawl object, crawl channel, crawl page, and binding setting, as shown in Fig. 4.

A crawl set represents a collection of crawls and mainly serves as a grouping. A crawl object describes one category of information to be crawled, such as news or job postings, and also defines the attributes of that information and the type of each attribute. A crawl channel specifies how that category of information is crawled, for example the crawl interval, the proxy to use, and the network request headers. Crawl pages and binding settings occur in pairs and specify the concrete page addresses and the way information is extracted. A crawl page can specify the addresses of the list page and the detail pages and their parent-child relationship; correspondingly, the binding setting specifies which expressions are used to extract information from the list page and how information is extracted from the detail page. Because a network request may return either an HTML page or JSON, the binding setting also carries a corresponding attribute that tells the program which tool to use for extraction.

The invention stores this information in a database. After the program starts, it reads all of the information into memory and, according to this configuration, starts one WebMagic thread per crawl channel to crawl the information; the resulting structured information is finally stored at the designated location. During actual coding, the handling can be adapted flexibly to the actual circumstances of the pages being crawled, which is a matter of the specific implementation. Fig. 5 shows one such form: the Dev-Cloud cloud data center provides a configuration management interface and stores the corresponding rules in the database. The Crawler Task Dispatcher Center reads the configuration and, using task-based processing, handles each crawl task with multiple threads. The final results are all stored in a Kafka message queue for later use.
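The startup behavior described above (read the configuration, start one worker per crawl channel, collect the structured results) might be sketched as follows. This is an illustration under stated assumptions only: the configuration is passed in as objects rather than read from a database, an in-memory `BlockingQueue` stands in for the Kafka message queue, the download and extraction step is reduced to a placeholder, and the names `CrawlChannel` and `ChannelRunner` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical configuration record for one crawl channel. */
final class CrawlChannel {
    final String name;
    final long intervalMillis;     // crawl interval configured for this channel
    final List<String> pageUrls;   // crawl pages bound to this channel

    CrawlChannel(String name, long intervalMillis, List<String> pageUrls) {
        this.name = name;
        this.intervalMillis = intervalMillis;
        this.pageUrls = pageUrls;
    }
}

final class ChannelRunner {
    /** Starts one worker per channel and returns everything the workers produced. */
    static List<String> runAll(List<CrawlChannel> channels) {
        BlockingQueue<String> results = new LinkedBlockingQueue<>();  // Kafka stand-in
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, channels.size()));
        for (CrawlChannel channel : channels) {
            pool.execute(() -> {
                for (String url : channel.pageUrls) {
                    // Placeholder for download plus extraction driven by binding settings.
                    results.add(channel.name + ":" + url);
                    pause(channel.intervalMillis);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        List<String> drained = new ArrayList<>();
        results.drainTo(drained);
        return drained;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Adding a new crawl target then amounts to adding another `CrawlChannel` entry to the configuration rather than writing and releasing new crawler code. The order of results across channels is not deterministic, because the channels run in parallel.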

In the embodiments provided herein, it should be understood that the disclosed methods, systems, devices, modules, and/or units may be implemented in other ways. For example, the embodiments described above are merely illustrative. The division into modules is only a division by logical function, and other divisions are possible in an actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

The above are only some embodiments of the invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A WebMagic-based zero-code crawler system with configurable crawling rules, characterized in that, without writing code, information is extracted by arranging a download scheduling module, a parsing and processing module, a crawl comparison module, and an extraction and storage module in the Spider component of WebMagic:

2. The crawler system according to claim 1, characterized in that the download scheduling module is preset with the time interval, network proxy, and network request header information used for obtaining information.

3. The crawler system according to claim 1, characterized in that the download scheduling module uses Apache HttpClient as the download tool for downloading target pages from the Internet.

4. The crawler system according to claim 1, characterized in that the download scheduling module is preset with a maximum number of concurrent downloads to avoid system congestion.

5. The crawler system according to claim 1, characterized in that the parsing and processing module uses Jsoup as the tool for parsing HTML, and uses Xsoup, which is based on Jsoup, as the tool for evaluating XPath.

6. The crawler system according to claim 1, characterized in that the crawl comparison module uses an in-memory queue of the JDK to manage URL links.

7. The crawler system according to claim 1, characterized in that the crawl comparison module uses a set to remove duplicate URL links.

8. The crawler system according to claim 1, characterized in that the crawl comparison module uses Redis for distributed management of the database and the server.