CN108804657A - A WebMagic-based zero-code crawler system with configurable crawl rules - Google Patents

A WebMagic-based zero-code crawler system with configurable crawl rules

Info

Publication number
CN108804657A
Authority
CN
China
Prior art keywords
module
information
crawl
crawler system
download
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810585918.5A
Other languages
Chinese (zh)
Inventor
王浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dingfeng Cattle Technology Co Ltd
Original Assignee
Shenzhen Dingfeng Cattle Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dingfeng Cattle Technology Co Ltd
Priority to CN201810585918.5A
Publication of CN108804657A
Legal status: Withdrawn

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a WebMagic-based zero-code crawler system with configurable crawl rules. Without writing any code, information is extracted by arranging a download scheduling module, a parsing module, a crawl comparison module, and an extraction and storage module within the Spider component of WebMagic. The download scheduling module downloads target pages from the Internet to obtain target information; the parsing module parses the target pages, extracts the target information, and discovers new URL links; the crawl comparison module manages the URL links waiting to be crawled and removes duplicate URL links; the extraction and storage module processes the target information. The invention abstracts the process of crawling web page information and persists it in a database, using expressions to describe how each piece of information is extracted, so that the whole process becomes one or more database records. Writing crawler code for each requirement, as is done traditionally, is thereby replaced by database configuration, which reduces the number of releases and speeds up iteration.

Description

A WebMagic-based zero-code crawler system with configurable crawl rules
Technical field
The invention relates to the technical field of intelligent crawling, and in particular to a WebMagic-based zero-code crawler system with configurable crawl rules.
Background
A crawler is a program that automatically fetches web page content and is an important component of a search engine. The name is a metaphor: starting from a configured entry address, the program keeps extracting links from web pages and then follows those links to fetch deeper, previously unknown links, and so on, much as a crawling creature moves; such programs are therefore called crawlers.
Crawling systems that extract information with manually written crawl wrappers produce fairly accurate results, but generating and maintaining wrappers for the thousands of websites on the Internet is work that an ordinary vertical crawler cannot handle well; it can only be done with a large amount of manual effort.
Safe and efficient real-time crawling is also difficult. When crawling with high real-time requirements, connections and download requests must be sent frequently to the target web server. This puts heavy load on that server, which may respond with defensive measures such as denying access in order to keep working normally, causing the crawl to fail. High real-time crawling also consumes a great deal of hardware resources such as network bandwidth and servers, which raises cost.
With the spread of AJAX and the emergence of single-page application frameworks such as AngularJS, more and more pages are rendered by JavaScript. For a crawler, the information shown on such pages is harder to obtain: extracting only the HTML content often yields no useful information.
Summary of the invention
In view of the shortcomings of existing approaches, the invention proposes a WebMagic-based zero-code crawler system with configurable crawl rules, to solve the above problems of the prior art.
According to one aspect of the invention, a WebMagic-based zero-code crawler system with configurable crawl rules is provided. Without writing any code, information is extracted by arranging a download scheduling module, a parsing module, a crawl comparison module, and an extraction and storage module within the Spider component of WebMagic:
The download scheduling module downloads target pages from the Internet in order to obtain target information according to user requirements;
The parsing module parses the target pages, extracts the target information, and discovers new URL links;
The crawl comparison module manages the URL links waiting to be crawled and removes duplicate URL links;
The extraction and storage module classifies the processed target information and stores it in the database and server.
Further, the download scheduling module presets the time interval for fetching information, the network proxy, and the network request header information.
Further, the download scheduling module uses Apache Http Client as the download tool to download target pages from the Internet.
Further, the download scheduling module presets a maximum number of concurrent requests to avoid system congestion.
Further, the parsing module uses Jsoup as the tool for parsing HTML, and uses Xsoup, which is built on Jsoup, as the tool for evaluating XPath.
Further, the crawl comparison module manages URL links using an in-memory queue from the JDK.
Further, the crawl comparison module removes duplicate URL links using a set.
Further, the crawl comparison module uses Redis for distributed management of the database and server.
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention abstracts the process of crawling web page information and persists it in a database, using expressions to describe how each piece of information is extracted, so that the whole process becomes one or more database records. Writing crawler code for each requirement, as is done traditionally, is thereby replaced by database configuration, which reduces the number of releases and speeds up iteration.
Additional aspects and advantages of the invention are set forth in part in the description below; they will become apparent from that description or be learned through practice of the invention.
Description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and easy to understand from the following description of the embodiments with reference to the drawings, in which:
Fig. 1 is a structural block diagram of a WebMagic-based zero-code crawler system with configurable crawl rules in an embodiment of the invention;
Fig. 2 is a flowchart of web page crawling in a vertical domain in an embodiment of the invention;
Fig. 3 is a flowchart of the abstracted web page crawling process in an embodiment of the invention;
Fig. 4 is a schematic diagram of the business object structure for web page crawling in an embodiment of the invention;
Fig. 5 is a flowchart of the actual web page crawling process in an embodiment of the invention.
Detailed description
To enable those skilled in the art to better understand the solution of the invention, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings of those embodiments.
Some of the flows described in the specification, the claims, and the drawings above contain multiple operations that appear in a particular order. It should be clearly understood, however, that these operations may be executed out of the order in which they appear here, or in parallel. Operation numbers such as 101 and 102 serve only to distinguish the operations from one another and do not themselves imply any execution order. In addition, these flows may include more or fewer operations, and the operations may be executed sequentially or in parallel. Note that terms such as "first" and "second" are used here to distinguish different messages, devices, modules, and so on; they do not imply an order, nor do they require that the "first" and "second" items be of different types.
The described embodiments are obviously only some of the embodiments of the invention, not all of them. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.
Those skilled in the art will understand that, unless otherwise defined, all terms used here (including technical and scientific terms) have the same meaning as commonly understood by those of ordinary skill in the field to which the invention belongs. It should also be understood that terms such as those defined in general dictionaries are to be read with a meaning consistent with their meaning in the context of the prior art and, unless specifically defined here, are not to be interpreted in an idealized or overly formal sense.
Embodiment
As shown in Fig. 1, an embodiment of the invention provides a WebMagic-based zero-code crawler system with configurable crawl rules. Without writing any code, information is extracted by arranging a download scheduling module A101, a parsing module A102, a crawl comparison module A103, and an extraction and storage module A104 within the Spider component of WebMagic:
The download scheduling module A101 downloads target pages from the Internet in order to obtain target information according to user requirements;
The download scheduling module A101 uses Apache Http Client as the download tool to download target pages from the Internet.
The download scheduling module A101 presets a maximum number of concurrent requests to avoid system congestion.
The download scheduling module A101 presets the time interval for fetching information, the network proxy, and the network request header information.
The parsing module A102 parses the target pages, extracts the target information, and discovers new URL links;
The parsing module A102 uses Jsoup as the tool for parsing HTML, and uses Xsoup, which is built on Jsoup, as the tool for evaluating XPath.
The crawl comparison module A103 manages the URL links waiting to be crawled and removes duplicate URL links;
The crawl comparison module A103 manages URL links using an in-memory queue from the JDK.
The crawl comparison module A103 removes duplicate URL links using a set.
The crawl comparison module A103 uses Redis for distributed management of the database and server.
The extraction and storage module A104 classifies the processed target information and stores it in the database and server.
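The URL management described for module A103 (a JDK in-memory queue for pending links plus a set for removing duplicates) can be illustrated with a minimal single-process sketch. The class name `UrlFrontier` and its methods are hypothetical and do not come from the patent or from WebMagic; the Redis-based distributed variant is not covered here.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical helper, shown only to illustrate "queue + set" URL management.
class UrlFrontier {
    // JDK in-memory queue holding the URLs that still have to be fetched.
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    // Every URL ever accepted; used to reject duplicates.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Enqueues the URL only if it was never seen before; returns true when it was added. */
    boolean push(String url) {
        if (url == null || url.isEmpty()) {
            return false;
        }
        if (!seen.add(url)) {
            return false; // duplicate link, drop it
        }
        return pending.offer(url);
    }

    /** Returns the next URL to fetch, or null when nothing is pending. */
    String poll() {
        return pending.poll();
    }

    /** Number of URLs currently waiting to be fetched. */
    int pendingCount() {
        return pending.size();
    }
}
```

A URL stays in the `seen` set after it has been polled, so a link rediscovered on a later page is not fetched twice.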
The invention abstracts the process of crawling web page information and persists it in a database, using expressions to describe how each piece of information is extracted, so that the whole process becomes one or more database records. Writing crawler code for each requirement, as is done traditionally, is thereby replaced by database configuration, which reduces the number of releases and speeds up iteration.
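As a rough illustration of extraction rules held as data rather than code, the sketch below applies a list of rule records to one response body. Everything in it is an assumption made for illustration: the names `BindingRule` and `RuleInterpreter`, the two expression kinds, and the use of the JDK's built-in XPath and regex engines in place of the Jsoup and Xsoup tools named in the patent (the JDK XPath engine only accepts well-formed XML).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical shape of one rule row as it might be loaded from the database.
class BindingRule {
    enum Kind { XPATH, REGEX }

    final String field;       // attribute of the result record to fill
    final Kind kind;          // which extraction tool to use
    final String expression;  // the expression stored in the database

    BindingRule(String field, Kind kind, String expression) {
        this.field = field;
        this.kind = kind;
        this.expression = expression;
    }
}

class RuleInterpreter {
    /** Applies every rule to one response body and returns field name -> extracted value. */
    static Map<String, String> extract(String body, List<BindingRule> rules) {
        Map<String, String> record = new LinkedHashMap<>();
        for (BindingRule rule : rules) {
            String value = rule.kind == BindingRule.Kind.XPATH
                    ? byXPath(body, rule.expression)
                    : byRegex(body, rule.expression);
            record.put(rule.field, value);
        }
        return record;
    }

    // Returns the first capturing group if the pattern has one, else the whole match.
    private static String byRegex(String body, String expression) {
        Matcher m = Pattern.compile(expression).matcher(body);
        if (!m.find()) {
            return "";
        }
        return m.groupCount() >= 1 ? m.group(1) : m.group();
    }

    // Evaluates the XPath expression against the body parsed as well-formed XML.
    private static String byXPath(String body, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(body)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("cannot evaluate " + expression, e);
        }
    }
}
```

With this shape, supporting a new site means adding rows such as `("name", XPATH, "//h1/text()")` rather than releasing new crawler code, which is the effect the paragraph above describes.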
When crawling web page information in a particular domain, a crawler engineer typically issues a network request, extracts data from the HTML or JSON that is returned, and then saves the result. During this process, requests may also be issued for certain links on the page in order to extract further information. On vertical-domain websites, moreover, information is often presented as a list, as described below:
For example, suppose the target is the website of a training organization. One page of the site lists the names of all teachers, each with a link; clicking a link shows that teacher's teaching details, such as course content and class hours, and the details may in turn contain links to still finer content. Crawling this information can be abstracted as follows. First, the information shown in the list is obtained, the most important part of which is the URL of each detail page. Requests are then issued to these URLs in a loop and information is extracted from each detail page. If there is a second level of detail, a request is issued to the next-level URL, and so on. The rough flow for this kind of vertical-domain web page crawling is shown in Fig. 2.
The web page crawling flow in Fig. 2 can be abstracted as shown in Fig. 3:
The abstracted flow in Fig. 3 is as follows. First, a result set for one category of information is created, which defines which information is to be extracted from the web pages. Next, the web page addresses to crawl are configured; once a web page address is available, WebMagic can be used to crawl it. When crawling, the list page is usually crawled first, and once the addresses of the detail pages have been obtained, they are crawled in turn. Through repeated cycles of this processing, an HTML information extraction tool assembles result records according to the previously defined result set, and finally these records are persisted.
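The list-then-detail loop described above can be sketched as a level-by-level traversal. This is only an illustration under stated assumptions: the class name `ListDetailCrawl` is hypothetical, and the `linksOf` function stands in for "download the page and extract its next-level URLs", which in the patent is done by WebMagic and the configured expressions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration of the list page -> detail page -> next level loop.
class ListDetailCrawl {
    /**
     * Visits the start URL, then every URL discovered from it, level by level.
     * Returns the URLs in the order in which they were visited.
     */
    static List<String> crawl(String startUrl, Function<String, List<String>> linksOf) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();

        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            visited.add(url); // a real crawler would extract and store fields here
            for (String next : linksOf.apply(url)) {
                if (seen.add(next)) { // skip links that were already queued
                    queue.add(next);
                }
            }
        }
        return visited;
    }
}
```

For the training-organization example, the start URL would be the teacher list page, its links the teacher detail pages, and their links any second-level detail pages.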
The invention abstracts the above process into five business objects: crawl set, crawl object, crawl channel, crawl page, and binding setting, as shown in Fig. 4.
A crawl set represents a collection of crawls and mainly serves as a grouping. A crawl object describes one category of captured information, such as news or job postings, and also defines the attributes of that information and the type of each attribute. A crawl channel specifies how this category of information is crawled, for example the crawl time interval, the proxy, and the network request headers. Crawl pages and binding settings appear in pairs and specify the page addresses to crawl and how information is extracted. A crawl page can specify the addresses of list pages and detail pages and their parent-child relationship; correspondingly, the binding setting specifies which expressions extract information from the list page and how information is extracted from the detail page. Because a network request may return either an HTML page or JSON, the binding setting also carries an attribute that tells the program which tool to use for extraction.
The invention stores this information in a database. After the program starts, it reads all of this information into memory and, according to the configuration, starts one WebMagic thread for each crawl channel to capture information; the resulting structured information is finally saved to a designated location. During actual coding, the handling can be adapted to the actual situation of the crawled pages; this is a matter of specific implementation.
Fig. 5 shows one such implementation. The Dev-Cloud cloud data center provides the configuration management interface and stores the corresponding rules in the database. The crawl task center (Crawler Task Dispatcher Center) reads the configuration and, working task by task, processes each crawl task with multiple threads. The final results are all written to a Kafka message queue for later use.
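The step of running one crawl task per configured channel on multiple threads can be sketched with a bounded JDK thread pool. The names `CrawlChannel` and `ChannelDispatcher`, the field set, and the `crawlTask` callback are all hypothetical; the sketch does not start WebMagic, does not enforce the interval, and does not write to Kafka.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical configuration row for one crawl channel.
class CrawlChannel {
    final String name;
    final String startUrl;
    final long intervalMillis; // pause between requests; carried as data, not enforced here

    CrawlChannel(String name, String startUrl, long intervalMillis) {
        this.name = name;
        this.startUrl = startUrl;
        this.intervalMillis = intervalMillis;
    }
}

class ChannelDispatcher {
    /**
     * Submits one task per channel to a pool of at most maxConcurrency threads
     * and returns the task results in the same order as the channels.
     */
    static List<String> runAll(List<CrawlChannel> channels, int maxConcurrency,
                               Function<CrawlChannel, String> crawlTask) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrency);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (CrawlChannel channel : channels) {
                Callable<String> job = () -> crawlTask.apply(channel);
                futures.add(pool.submit(job));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that channel's task is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for crawl tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a crawl task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Capping the pool size plays the role of the preset maximum concurrency mentioned for the download scheduling module; in a fuller version each result would be handed to a message queue producer instead of being collected in a list.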
In the embodiments provided in this application, it should be understood that the disclosed methods, systems, devices, modules, and/or units may be implemented in other ways. The embodiments described above are merely illustrative. For example, the division into modules is only a division by logical function; in an actual implementation other divisions are possible, such as combining multiple modules or components, integrating them into another system, or omitting or not executing some features. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above are only some embodiments of the invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the invention.

Claims (8)

1. A WebMagic-based zero-code crawler system with configurable crawl rules, characterized in that, without writing any code, information is extracted by arranging a download scheduling module, a parsing module, a crawl comparison module, and an extraction and storage module within the Spider component of WebMagic:
The download scheduling module downloads target pages from the Internet in order to obtain target information according to user requirements;
The parsing module parses the target pages, extracts the target information, and discovers new URL links;
The crawl comparison module manages the URL links waiting to be crawled and removes duplicate URL links;
The extraction and storage module classifies the processed target information and stores it in the database and server.
2. The crawler system according to claim 1, characterized in that the download scheduling module presets the time interval for fetching information, the network proxy, and the network request header information.
3. The crawler system according to claim 1, characterized in that the download scheduling module uses Apache Http Client as the download tool to download target pages from the Internet.
4. The crawler system according to claim 1, characterized in that the download scheduling module presets a maximum number of concurrent requests to avoid system congestion.
5. The crawler system according to claim 1, characterized in that the parsing module uses Jsoup as the tool for parsing HTML, and uses Xsoup, which is built on Jsoup, as the tool for evaluating XPath.
6. The crawler system according to claim 1, characterized in that the crawl comparison module manages URL links using an in-memory queue from the JDK.
7. The crawler system according to claim 1, characterized in that the crawl comparison module removes duplicate URL links using a set.
8. The crawler system according to claim 1, characterized in that the crawl comparison module uses Redis for distributed management of the database and server.
CN201810585918.5A 2018-06-08 2018-06-08 A WebMagic-based zero-code crawler system with configurable crawl rules Withdrawn CN108804657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810585918.5A CN108804657A (en) 2018-06-08 2018-06-08 A WebMagic-based zero-code crawler system with configurable crawl rules

Publications (1)

Publication Number Publication Date
CN108804657A true CN108804657A (en) 2018-11-13

Family

ID=64087893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810585918.5A Withdrawn CN108804657A (en) 2018-06-08 2018-06-08 A WebMagic-based zero-code crawler system with configurable crawl rules

Country Status (1)

Country Link
CN (1) CN108804657A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528119A (en) * 2020-12-21 2021-03-19 北京中安智达科技有限公司 Distributed webpage information crawling system based on Pulsar

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20181113
