CN108804657A - Zero-code crawler system with configurable crawl rules based on WebMagic - Google Patents
- Publication number
- CN108804657A (application CN201810585918.5A)
- Authority
- CN
- China
- Prior art keywords
- module
- information
- crawl
- crawler system
- download
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention provides a zero-code crawler system with configurable crawl rules based on WebMagic. Without writing code, information is extracted by providing a download scheduling module, a parsing module, a crawl comparison module, and an extraction and storage module in the Spider component of WebMagic. The download scheduling module downloads target pages from the Internet to obtain target information; the parsing module parses the target pages, extracts target information, and discovers new URL links; the crawl comparison module manages the URL links to be crawled and removes duplicate URL links; the extraction and storage module processes the target information. By abstracting the process of crawling web page information and persisting it in a database, with expressions describing how specific information is extracted, the invention converts the whole process into one or more database records. The traditional practice of writing crawler code for each requirement thus becomes a matter of database configuration, which reduces the number of releases and improves iteration speed.
Description
Technical field
The present invention relates to the technical field of intelligent crawling, and in particular to a zero-code crawler system with configurable crawl rules based on WebMagic.
Background art
A crawler is a program that automatically fetches web page content and is an important component of a search engine. Starting from specified entry addresses, such a program repeatedly extracts the links on a web page and then follows those links to extract further, previously unknown links, and so on. Because this fetching behavior resembles the movement of a crawling animal, the program is vividly called a crawler.
A crawling system that extracts information using manually generated wrappers produces more accurate results. However, wrappers would have to be generated, updated, and maintained for the thousands of websites on the Internet. An ordinary vertical crawler cannot handle this work well and has to rely on a large amount of manual labor.
Safe and efficient real-time crawling is also difficult. When highly real-time crawling is required, link and download requests must be sent frequently to the target website's server. This puts heavy pressure on that server, which may respond with blocking strategies such as denying access in order to keep working normally, causing the crawl to fail. In addition, a highly real-time crawling requirement consumes a great deal of hardware resources such as network bandwidth and servers, which increases cost.
With the growing popularity of AJAX and the emergence of single-page application frameworks such as AngularJS, more and more pages are rendered by JavaScript. For a crawler, extracting information from such pages is more complicated: extracting only the HTML content often fails to yield any useful information.
Summary of the invention
In view of the shortcomings of existing approaches, the present invention proposes a zero-code crawler system with configurable crawl rules based on WebMagic, to solve the above problems in the prior art.
According to one aspect of the present invention, there is provided a zero-code crawler system with configurable crawl rules based on WebMagic. Without writing code, information is extracted by providing a download scheduling module, a parsing module, a crawl comparison module, and an extraction and storage module in the Spider component of WebMagic:
The download scheduling module is used to download target pages from the Internet in order to obtain target information according to user requirements;
The parsing module is used to parse target pages, extract target information, and discover new URL links;
The crawl comparison module is used to manage the URL links to be crawled and to remove duplicate URL links;
The extraction and storage module is used to process and classify the target information and store it in the database and on the server.
Further, the download scheduling module is preset with the time interval for obtaining information, a network proxy, and network request header information.
Further, the download scheduling module uses Apache HttpClient as the download tool to download target pages from the Internet.
Further, the download scheduling module is preset with a maximum concurrency to avoid system congestion.
Further, the parsing module uses Jsoup as the tool for parsing HTML, and uses Xsoup, which is based on Jsoup, as the tool for evaluating XPath.
Further, the crawl comparison module manages URL links using an in-memory queue from the JDK.
Further, the crawl comparison module removes duplicate URL links using a set.
Further, the crawl comparison module uses Redis for distributed management of the database and the server.
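The preset download settings listed above (time interval, proxy, request headers, maximum concurrency) can be sketched as follows. This is a minimal illustration that uses only the JDK and is not the WebMagic or Apache HttpClient API; the class name `ThrottledDownloader`, the nested `Settings` type, and all field names are assumptions made for this example, and the download itself is simulated by an empty task.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ThrottledDownloader {
    // Hypothetical settings holder; the field names are illustrative only.
    static final class Settings {
        final long intervalMillis;          // pause after each request
        final String proxyHost;             // network proxy, may be null
        final Map<String, String> headers;  // request header information
        final int maxConcurrency;           // preset maximum concurrency

        Settings(long intervalMillis, String proxyHost,
                 Map<String, String> headers, int maxConcurrency) {
            this.intervalMillis = intervalMillis;
            this.proxyHost = proxyHost;
            this.headers = headers;
            this.maxConcurrency = maxConcurrency;
        }
    }

    private final Settings settings;
    private final Semaphore permits;
    private final AtomicInteger active = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    ThrottledDownloader(Settings settings) {
        this.settings = settings;
        this.permits = new Semaphore(settings.maxConcurrency);
    }

    // Runs one download while holding a permit, so that at most
    // maxConcurrency downloads are in flight at any moment.
    void download(Runnable fetch) {
        try {
            permits.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        try {
            peak.accumulateAndGet(active.incrementAndGet(), Math::max);
            fetch.run();
            Thread.sleep(settings.intervalMillis); // crude politeness interval
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            active.decrementAndGet();
            permits.release();
        }
    }

    // Submits `count` simulated downloads from a pool that is larger than the
    // limit and returns the highest concurrency that was actually observed.
    static int simulate(int count, Settings settings) {
        ThrottledDownloader downloader = new ThrottledDownloader(settings);
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < count; i++) {
            pool.submit(() -> downloader.download(() -> { }));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return downloader.peak.get();
    }

    public static void main(String[] args) {
        Settings s = new Settings(10, null, Map.of("User-Agent", "demo"), 3);
        System.out.println("peak concurrency = " + simulate(20, s));
    }
}
```

The semaphore is one common way to cap concurrency; the source does not say which mechanism the system actually uses.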
Compared with the prior art, the beneficial effects of the present invention are as follows:
The present invention abstracts the process of crawling web page information and persists it in a database, using expressions to describe how specific information is extracted. The whole process is thereby converted into one or more database records, so that the traditional practice of writing crawler code for each requirement becomes a matter of database configuration, which reduces the number of releases and improves iteration speed.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a structural block diagram of a zero-code crawler system with configurable crawl rules based on WebMagic in an embodiment of the present invention;
Fig. 2 is a flow chart of web page crawling in a vertical domain in an embodiment of the present invention;
Fig. 3 is an abstract flow chart of crawling web pages in an embodiment of the present invention;
Fig. 4 is a schematic diagram of the business object structure for crawling web pages in an embodiment of the present invention;
Fig. 5 is a flow chart of the actual crawling of web pages in an embodiment of the present invention.
Detailed description of the embodiments
In order to enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings of the embodiments.
Some of the flows described in the specification, the claims, and the above drawings contain multiple operations that appear in a particular order. It should be clearly understood, however, that these operations may be executed out of the order in which they appear herein or executed in parallel. Operation numbers such as 101 and 102 are used only to distinguish the different operations; the numbers themselves do not represent any execution order. In addition, these flows may include more or fewer operations, and the operations may be executed sequentially or in parallel. It should be noted that terms such as "first" and "second" herein are used to distinguish different messages, devices, modules, and so on; they do not represent a sequence, nor do they require that the "first" and the "second" be of different types.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Those skilled in the art will appreciate that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be interpreted as having meanings consistent with their meanings in the context of the prior art and, unless specifically defined as herein, will not be interpreted in an idealized or overly formal sense.
Embodiment
As shown in Fig. 1, an embodiment of the present invention provides a zero-code crawler system with configurable crawl rules based on WebMagic. Without writing code, information is extracted by providing a download scheduling module A101, a parsing module A102, a crawl comparison module A103, and an extraction and storage module A104 in the Spider component of WebMagic:
The download scheduling module A101 is used to download target pages from the Internet in order to obtain target information according to user requirements;
The download scheduling module A101 uses Apache HttpClient as the download tool to download target pages from the Internet.
The download scheduling module A101 is preset with a maximum concurrency to avoid system congestion.
The download scheduling module A101 is preset with the time interval for obtaining information, a network proxy, and network request header information.
The parsing module A102 is used to parse target pages, extract target information, and discover new URL links;
The parsing module A102 uses Jsoup as the tool for parsing HTML, and uses Xsoup, which is based on Jsoup, as the tool for evaluating XPath.
The crawl comparison module A103 is used to manage the URL links to be crawled and to remove duplicate URL links;
The crawl comparison module A103 manages URL links using an in-memory queue from the JDK.
The crawl comparison module A103 removes duplicate URL links using a set.
The crawl comparison module A103 uses Redis for distributed management of the database and the server.
The extraction and storage module A104 is used to process and classify the target information and store it in the database and on the server.
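The queue-plus-set behavior described for module A103 can be sketched with JDK classes alone. The class name `UrlFrontier` and its method names are hypothetical, and the choice of `LinkedBlockingQueue` and a concurrent set is an assumption; the source only states that a JDK in-memory queue and a set are used.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical class: an in-memory queue of pending URLs plus a set of seen URLs.
final class UrlFrontier {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Enqueues the URL only if it has not been seen before.
    // Returns true if the URL was added, false if it was rejected.
    boolean push(String url) {
        if (url == null || url.isEmpty()) {
            return false;
        }
        if (!seen.add(url)) {
            return false; // duplicate link, dropped
        }
        return pending.offer(url);
    }

    // Returns the next URL to crawl, or null when nothing is pending.
    String poll() {
        return pending.poll();
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        UrlFrontier frontier = new UrlFrontier();
        frontier.push("http://example.com/list");
        frontier.push("http://example.com/list"); // ignored as a duplicate
        frontier.push("http://example.com/detail/1");
        System.out.println("pending = " + frontier.pendingCount());
    }
}
```

Because the seen-set is kept separately from the queue, a URL that has already been polled and crawled is still rejected if it is discovered again later.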
The present invention abstracts the process of crawling web page information and persists it in a database, using expressions to describe how specific information is extracted. The whole process is thereby converted into one or more database records, so that the traditional practice of writing crawler code for each requirement becomes a matter of database configuration, which reduces the number of releases and improves iteration speed.
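To illustrate how extraction can be driven by stored expressions rather than by per-site code, the sketch below applies a map of field names to XPath expressions against a page. It uses the JDK's `javax.xml.xpath` on a well-formed XHTML string as a stand-in for the Jsoup and Xsoup tools named in the text; the class name `RuleDrivenExtractor`, the rule map, and the sample page are all assumptions for this example. In the described system the rules would be read from database records.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class RuleDrivenExtractor {
    // Applies each rule (field name -> XPath expression) to a well-formed
    // XHTML string and returns the extracted text per field.
    static Map<String, String> extract(String xhtml, Map<String, String> rules) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> result = new LinkedHashMap<>();
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                result.put(rule.getKey(), xpath.evaluate(rule.getValue(), doc).trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not apply extraction rules", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for rows loaded from a configuration table.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("name", "//h1[@class='teacher']");
        rules.put("hours", "//span[@id='hours']");
        String page = "<html><body><h1 class='teacher'>Ms. Li</h1>"
                + "<span id='hours'>40</span></body></html>";
        System.out.println(extract(page, rules));
    }
}
```

Supporting a new site in this model means adding rule entries, not compiling and releasing new code, which is the effect the paragraph above describes. Real-world HTML is often not well-formed XML, so a tolerant HTML parser would be needed in practice.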
When crawling web page information in a particular domain, a crawler engineer usually initiates a network request, extracts information from the returned HTML or JSON, and then saves the extracted information. During this process, requests may also be initiated for some of the links on the page in order to extract further information. Moreover, websites in vertical domains often present their information in the form of lists, as described below:
For example, suppose the target is the website of a training organization. One page of the site lists the names of all the teachers, each with a link. Clicking a link shows the details of that teacher's courses, such as the course content and teaching hours, and the details page may in turn link to even more detailed content. Crawling this information can be abstracted as follows: first obtain the information shown in the list, the most important part of which is the URL of each details page; then loop over these URLs, initiating a request for each and extracting information from the details page; if there are second-level details, initiate requests for the next-level URLs, and so on. The rough flow of web page crawling in a vertical domain of this kind is shown in Fig. 2.
The flow of crawling web pages in Fig. 2 can be abstracted as shown in Fig. 3:
The abstract flow of crawling web pages in Fig. 3 is as follows. First, a result set for a category of information is created, defining which information is to be extracted from the web pages. Next, the web page addresses to be crawled are specified; once an address is available, WebMagic can be used to crawl it. When crawling web page information, the list page is usually crawled first, and after the addresses of the details pages are obtained, those pages are crawled further. In this repeated loop, an HTML information extraction tool is used to assemble result records according to the previously defined result set, and finally these result records are persisted.
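The list-then-details loop described above can be sketched as a breadth-first traversal. In this illustration the website is an in-memory map from URL to HTML, links are found with a simple regular expression, and the only extracted field is the `<h1>` text; all of these, along with the class name `ListDetailCrawl`, are simplifying assumptions and not the WebMagic implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListDetailCrawl {
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");
    private static final Pattern TITLE = Pattern.compile("<h1>([^<]*)</h1>");

    // Starts at the list page, follows every discovered link exactly once,
    // and collects one result (the <h1> text) from each page that has one.
    static List<String> crawl(Map<String, String> site, String listUrl) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> results = new ArrayList<>();
        queue.add(listUrl);
        seen.add(listUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = site.get(url); // stands in for the network download
            if (html == null) {
                continue;
            }
            Matcher title = TITLE.matcher(html);
            if (title.find()) {
                results.add(title.group(1)); // assemble a result record
            }
            Matcher link = LINK.matcher(html);
            while (link.find()) {
                String next = link.group(1);
                if (seen.add(next)) {
                    queue.add(next); // next-level page, crawled later
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, String> site = Map.of(
                "/teachers", "<a href=\"/t/1\">Li</a><a href=\"/t/2\">Wang</a>",
                "/t/1", "<h1>Li</h1><a href=\"/t/1/course\">more</a>",
                "/t/2", "<h1>Wang</h1>",
                "/t/1/course", "<h1>Algebra, 40 hours</h1>");
        System.out.println(crawl(site, "/teachers"));
    }
}
```

The final persistence step is left out; in this sketch the returned list plays the role of the result records.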
The present invention abstracts the above process into five business objects: crawl set, crawl object, crawl channel, crawl page, and binding setting, as shown in Fig. 4.
A crawl set represents a collection of crawls and mainly serves as a grouping. A crawl object describes a category of information to be crawled, such as news or recruitment information, and also defines the attributes of that information and the type of each attribute. A crawl channel specifies how that category of information is crawled, such as the crawl time interval, the proxy settings, and the network request headers. Crawl pages and binding settings appear in pairs and specify the page addresses to crawl and the way information is extracted. A crawl page specifies the addresses of the list page and the details pages and their parent-child relationship; correspondingly, the binding setting specifies which expression extracts information on the list page and how information is extracted on the details page. Because a network request may return either an HTML page or JSON, the binding setting also has a corresponding attribute that tells the program which tool to use for extraction.
The present invention stores this information in a database. After the program starts, it reads all of the information into memory and, according to these configurations, starts one WebMagic thread per crawl channel to crawl information; finally, the structured information is stored at the designated location. During actual coding, flexible handling can also be applied according to the actual condition of the crawled pages, which is a matter of specific implementation.
Fig. 5 shows one such implementation. The Dev-Cloud cloud data center provides the configuration management interface and stores the corresponding rules in the database. The crawler task dispatcher center (the crawl task center) reads the configuration and processes each crawl task with multiple threads in a task-based manner. The final results are all stored in a Kafka message queue for later use.
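The runtime behavior described above, in which configuration is loaded into memory and one worker is started per crawl channel with results flowing to a queue, can be sketched as follows. Only two of the five business objects are modeled; the class and field names are hypothetical, the actual crawling is replaced by emitting one message per configured page, the per-channel interval is stored but not applied, and an in-memory `LinkedBlockingQueue` stands in for the Kafka message queue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class CrawlConfigDemo {
    // Hypothetical crawl page with its paired binding setting.
    static final class CrawlPage {
        final String url;
        final String bindingExpression; // how to extract information
        final String responseType;      // "html" or "json", selects the tool

        CrawlPage(String url, String bindingExpression, String responseType) {
            this.url = url;
            this.bindingExpression = bindingExpression;
            this.responseType = responseType;
        }
    }

    // Hypothetical crawl channel: how one category of information is crawled.
    static final class CrawlChannel {
        final String name;
        final long intervalMillis; // configured interval, unused in this sketch
        final List<CrawlPage> pages;

        CrawlChannel(String name, long intervalMillis, List<CrawlPage> pages) {
            this.name = name;
            this.intervalMillis = intervalMillis;
            this.pages = pages;
        }
    }

    // Starts one worker per channel; each worker emits one message per
    // configured page. Returns all messages in sorted order.
    static List<String> runAll(List<CrawlChannel> channels) {
        BlockingQueue<String> sink = new LinkedBlockingQueue<>(); // stand-in for a message queue
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, channels.size()));
        for (CrawlChannel channel : channels) {
            pool.submit(() -> {
                for (CrawlPage page : channel.pages) {
                    sink.add(channel.name + ":" + page.url + ":" + page.responseType);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        List<String> out = new ArrayList<>(sink);
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        List<CrawlChannel> channels = List.of(
                new CrawlChannel("teachers", 1000,
                        List.of(new CrawlPage("/teachers", "//a/@href", "html"))),
                new CrawlChannel("jobs", 500,
                        List.of(new CrawlPage("/api/jobs", "$.items", "json"))));
        runAll(channels).forEach(System.out::println);
    }
}
```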
In the embodiments provided herein, it should be understood that the disclosed methods, systems, devices, modules, and/or units may be implemented in other ways. For example, the embodiments described above are merely illustrative; the division into modules is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above are only some embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A zero-code crawler system with configurable crawl rules based on WebMagic, characterized in that, without writing code, information is extracted by providing a download scheduling module, a parsing module, a crawl comparison module, and an extraction and storage module in the Spider component of WebMagic:
The download scheduling module is used to download target pages from the Internet in order to obtain target information according to user requirements;
The parsing module is used to parse target pages, extract target information, and discover new URL links;
The crawl comparison module is used to manage the URL links to be crawled and to remove duplicate URL links;
The extraction and storage module is used to process and classify the target information and store it in the database and on the server.
2. The crawler system according to claim 1, characterized in that the download scheduling module is preset with the time interval for obtaining information, a network proxy, and network request header information.
3. The crawler system according to claim 1, characterized in that the download scheduling module uses Apache HttpClient as the download tool to download target pages from the Internet.
4. The crawler system according to claim 1, characterized in that the download scheduling module is preset with a maximum concurrency to avoid system congestion.
5. The crawler system according to claim 1, characterized in that the parsing module uses Jsoup as the tool for parsing HTML, and uses Xsoup, which is based on Jsoup, as the tool for evaluating XPath.
6. The crawler system according to claim 1, characterized in that the crawl comparison module manages URL links using an in-memory queue from the JDK.
7. The crawler system according to claim 1, characterized in that the crawl comparison module removes duplicate URL links using a set.
8. The crawler system according to claim 1, characterized in that the crawl comparison module uses Redis for distributed management of the database and the server.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810585918.5A CN108804657A (en) | 2018-06-08 | 2018-06-08 | Zero-code crawler system with configurable crawl rules based on WebMagic |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810585918.5A CN108804657A (en) | 2018-06-08 | 2018-06-08 | Zero-code crawler system with configurable crawl rules based on WebMagic |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108804657A true CN108804657A (en) | 2018-11-13 |
Family
ID=64087893
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810585918.5A Withdrawn CN108804657A (en) | 2018-06-08 | 2018-06-08 | Zero-code crawler system with configurable crawl rules based on WebMagic |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108804657A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112528119A (en) * | 2020-12-21 | 2021-03-19 | 北京中安智达科技有限公司 | Distributed webpage information crawling system based on Pulsar |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107025296B (en) | Based on science service information intelligent grasping system method of data capture | |
CN107895009A (en) | One kind is based on distributed internet data acquisition method and system | |
CN106528769A (en) | Data acquisition method and apparatus | |
CN104268148B (en) | A kind of forum page Information Automatic Extraction method and system based on time string | |
CN110413864A (en) | A kind of network security information collection method, apparatus, equipment and storage medium | |
CN101727486A (en) | Web forum information extraction system | |
CN105550359B (en) | Webpage sorting method and device based on vertical search and server | |
CN107340954A (en) | A kind of information extracting method and device | |
CN107506389A (en) | A kind of method and apparatus for extracting position skill requirement | |
CN108090104A (en) | For obtaining the method and apparatus of webpage information | |
CN106649334A (en) | Conjunction word set processing method and device | |
CN108197030A (en) | Software interface based on deep learning tests cloud platform device and test method automatically | |
CN107766509A (en) | A kind of method and apparatus of webpage static backup | |
CN103136358A (en) | Method for automatically extracting BBS (bulletin board system) data | |
CN110417873A (en) | A kind of network information extraction system for realizing record webpage interactive operation | |
CN110134845A (en) | Project public sentiment monitoring method, device, computer equipment and storage medium | |
Prasad et al. | Coreex: content extraction from online news articles | |
CN104967698B (en) | A kind of method and apparatus crawling network data | |
CN108804657A (en) | A kind of Zero-code based on WebMagic is configurable to grab the crawler system for climbing rule | |
CN109635089B (en) | Literature work novelty evaluation system and method based on semantic network | |
CN107247789A (en) | user interest acquisition method based on internet | |
CN106649732A (en) | Information pushing method and device | |
CN107590121A (en) | Text-normalization method and system | |
CN113886204A (en) | User behavior data collection method and device, electronic equipment and readable storage medium | |
CN109243549A (en) | A kind of intelligent follow-up method, device and server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20181113 |