CN107451261A - A crawler network path tracing method based on object storage

Info

Publication number: CN107451261A
Application number: CN201710642232.0A
Authority: CN
Inventors: 陈开冉; 邓楚健
Original assignee: Guangzhou Trace Technology Co Ltd
Current assignee: Guangzhou Trace Technology Co Ltd
Priority date: 2017-07-31
Filing date: 2017-07-31
Publication date: 2017-12-08
Anticipated expiration: 2037-07-31
Also published as: CN107451261B

Abstract

The invention discloses a crawler network path tracing method based on object storage. The method establishes an object storage system and a logger. The logger generates a result path log, which records an index from the source URL of each crawl result to the corresponding crawler result file on the object storage system. When an external system needs to retrieve the data in the database, it obtains the crawler result file on the object storage system directly through the index. Introducing the object storage system improves file read/write speed. Because the result path log exists, an external system can search the log when retrieving data instead of searching the database, which avoids potential read/write conflicts.

Description

A crawler network path tracing method based on object storage

Technical field

The invention belongs to the field of path tracing in software engineering, and in particular relates to a crawler network path tracing method based on object storage.

Background art

A web crawler is a program or script that automatically crawls web information according to certain rules. Current path tracing for crawlers mostly uses the crawler task as the basic unit. For example, the default behavior of the open-source crawler framework pyspider is to store results in a database. If an external system needs to retrieve data from the database, there is no convenient retrieval method: it can only scan the database tables, and it must change the status of the result data in the database so that already-processed results are excluded the next time. As a result, the data in the database has to be maintained by two systems together, which causes considerable uncertainty and can even produce read/write conflicts.

The prior art also mentions another method, in which the crawler system stores results directly in a file system and the downstream system traverses the files of the file system when it needs to retrieve related data. Because this method reads and writes the disk frequently, it places a heavy I/O load on the disk. In addition, the same data is still maintained jointly by two systems, so read/write conflicts cannot be fundamentally avoided.

Summary of the invention

The object of the invention is to overcome the shortcomings and deficiencies of the prior art by providing a crawler network path tracing method based on object storage. The method offers high I/O efficiency and efficient, accurate retrieval, and reduces the coupling between systems.

The object of the invention is achieved by the following technical solution: a crawler network path tracing method based on object storage, in which an object storage system and a logger are established. The logger generates a result path log, which records an index from the source URL of each crawl result to the crawler result file on the object storage system. When an external system needs to retrieve the data in the database, it obtains the crawler result file on the object storage system directly through the index. Introducing the object storage system improves file read/write speed. Because the result path log exists, an external system can search the log when retrieving data instead of searching the database, which avoids potential read/write conflicts.

Specifically, the method comprises the following steps:

(1) Deploy an object storage system and a log processor. Assume that the crawl result file of a crawler S is R.

(2) Loading stage: the log processor generates a path P for the crawl result file R and uploads R to that path in the object storage system.

The log processor generates an aggregated path record containing the source URL of the crawl result and the path P, and appends the record to the end of the result path log that corresponds to the respective crawler.

(3) Retrieval stage: the external system traverses the log file, or searches the result path log by source URL. When it finds the corresponding record, it marks the line number of the current record in the result path log and then downloads the crawl result file to local storage using the path P in the record. If further retrieval is needed, it continues with the records after the marked line number. With these steps, the external system does not need to modify any result file record. It only needs to record which line of the log file it has read up to in order to persist its result-processing progress, and it can immediately obtain the file path of a pending result in the object storage system and thereby the crawler result.

Preferably, in step (2), the log processor generates a unique hash path P from the crawler number S and the content of the crawl result file R. This ensures a one-to-one correspondence between each crawl result file and its path.
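As an illustration of how such a hash path might be derived, the following minimal sketch joins the crawler number and a digest of the file content. The source does not name the hash function; MD5 is assumed here only because the example path given in the embodiment ends in 32 hexadecimal characters. The function name `make_result_path` is hypothetical.

```python
import hashlib


def make_result_path(crawler_id: int, content: bytes) -> str:
    """Return '<crawler number>/<hex digest of the content>' as the path P."""
    # Assumption: MD5. The source only says "hash"; any stable digest would do.
    digest = hashlib.md5(content).hexdigest()
    return f"{crawler_id}/{digest}"


path = make_result_path(145, b"<html>example crawl result</html>")
print(path)
```

Because the digest depends only on the content, the same result file always maps to the same path, while different content yields a different path apart from hash collisions.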

Specifically, the result path log has three columns: the first column is the logging time, i.e. the time at which the result file was written to the file system; the second column is the source URL of the crawl result; and the third column is the path P.

Preferably, the object storage system is a file storage system based on the HBASE distributed file system. This reduces cost while improving storage and download speed, and supports the storage of files on the PT order of magnitude.

Further, the object storage system implements deletion (DELETE), creation (POST) and rewriting (PUT) of files on the object storage system by calling an HTTP interface and passing the relevant parameters.
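The mapping of the three operations to HTTP methods can be sketched with Python's standard `urllib.request`. The source does not specify the interface's URL layout or parameters, so the base URL below (taken from the placeholder host and bucket used in the embodiment) and the helper `build_request` are assumptions made purely for illustration. The request is only constructed here, not sent.

```python
import urllib.request
from typing import Optional

# Placeholder host and bucket; the real endpoint is not given in the source.
BASE_URL = "http://xxx-host/xxx-bucket"


def build_request(method: str, path: str,
                  body: Optional[bytes] = None) -> urllib.request.Request:
    """Build (but do not send) a request for one of the three file operations."""
    if method not in {"DELETE", "POST", "PUT"}:
        raise ValueError(f"unsupported operation: {method}")
    return urllib.request.Request(f"{BASE_URL}/{path}", data=body, method=method)


create = build_request("POST", "145/2f6a2cc7d144f83c6e9abc5416d047e3", b"result bytes")
rewrite = build_request("PUT", "145/2f6a2cc7d144f83c6e9abc5416d047e3", b"new bytes")
delete = build_request("DELETE", "145/2f6a2cc7d144f83c6e9abc5416d047e3")
```

Passing any of these objects to `urllib.request.urlopen` would send the request to a real server.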

Preferably, in step (3), the external system downloads the crawl result file to local storage through the path P in the record as follows: from the path P, generate the absolute path P2 of P in the object storage system, where P2 is a URL; an HTTP GET request to P2 then downloads the file to local storage.
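A minimal sketch of this download step, assuming P2 is formed by joining a base URL (host and bucket) with the path P, as the embodiment's example URL suggests. The helper names `absolute_url` and `download` are hypothetical, and `download` performs a real network request when called.

```python
import shutil
import urllib.request


def absolute_url(base_url: str, path: str) -> str:
    """Join the storage base URL and the path P into the absolute URL P2."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def download(url: str, local_file: str) -> None:
    """Issue an HTTP GET for the URL and stream the response body to a local file."""
    with urllib.request.urlopen(url) as response, open(local_file, "wb") as out:
        shutil.copyfileobj(response, out)


p2 = absolute_url("http://xxx-host/xxx-bucket", "145/2f6a2cc7d144f83c6e9abc5416d047e3")
print(p2)
```

Calling `download(p2, "result.bin")` would then fetch the file, provided the placeholder host is replaced with a reachable one.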

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The invention provides an object storage system based on a distributed file system, which makes it possible to store massive numbers of result files and improves I/O efficiency.

2. For each web crawler, the invention builds an index from the URL of the crawl result to the crawler result file on the object storage system, which ensures retrieval efficiency and removes the coupling with external systems.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method of the invention.

Detailed description

The accompanying drawings are for illustrative purposes only and shall not be construed as limiting this patent. To better illustrate the embodiment, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product. Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings. The invention is described in further detail below with reference to the embodiment and the drawings, but implementations of the invention are not limited thereto.

Embodiment

As shown in Fig. 1, the crawler network path tracing method based on object storage of this embodiment comprises the following steps:

I. Deploy the object storage system and the log processor.

The object storage system is a file storage system based on the HBASE distributed file system and can support the storage of files on the PT order of magnitude. By calling an HTTP interface and passing the relevant parameters, files on the object storage system can be deleted (DELETE), created (POST) and rewritten (PUT). It should be noted that the object storage system of the invention only provides deletion, creation and rewriting of files, and does not support incremental writes to a file.

The log processor is a singleton that handles the results it receives and separates them by crawler: the records of each crawler are written to that crawler's own result path log file, which makes the index easy for downstream systems to read.

II. Loading stage, i.e. the steps for writing a result path log record.

1. Assume that the crawl result file of a crawler S is R. The log processor generates a unique hash path P from the crawler number S and the content of R.

2. The log processor uploads the crawl result file R to that path in the object storage system.

3. The log processor generates an aggregated path record and appends it to the end of the result path log of the current crawler.
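The three loading-stage steps can be sketched end to end as follows. This is an illustration under several assumptions, none of which come from the source: an in-memory dictionary stands in for the object storage system, MD5 is used as the hash, the log is a tab-separated text file named after the crawler number, and the function name `load_result` is hypothetical.

```python
import hashlib
import os
import tempfile
from datetime import datetime

# Stand-in for the object storage system: maps path P to the file bytes.
object_store = {}


def load_result(log_dir: str, crawler_id: int, source_url: str, content: bytes) -> str:
    """Store one crawl result and append its path record to the crawler's log."""
    # Step 1: derive the path P from the crawler number and the file content.
    path = f"{crawler_id}/{hashlib.md5(content).hexdigest()}"
    # Step 2: "upload" the result file R to path P.
    object_store[path] = content
    # Step 3: append one record to the end of this crawler's result path log.
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    log_file = os.path.join(log_dir, f"{crawler_id}.log")
    with open(log_file, "a", encoding="utf-8") as log:
        log.write(f"[{stamp}][{crawler_id}]\t{source_url}\t{path}\n")
    return path


log_dir = tempfile.mkdtemp()
stored_path = load_result(log_dir, 145, "http://example.com/page", b"crawl result")
print(stored_path)
```

The log is only ever appended to, which is what later allows a reader to track its progress by line number alone.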

The result path log has three columns: the first column is the logging time, i.e. the time at which the result file was written to the file system; the second column is the source URL of the crawl result; and the third column is the path P. For example, one record has the following form:

First column:

[2017-04-06 16:34:01,361][145]

Second column:

http://ln.gsxt.gov.cn/saicpub/entPublicitySC/entPublicityDC/getJbxxAction.actionType=1130&pripid=210105000022005060877866

Third column:

145/2f6a2cc7d144f83c6e9abc5416d047e3

Of course, in practice the structure of the result path log can be adjusted slightly, for example by changing the order of the three columns or by adding other search keys; those skilled in the art can make appropriate improvements on this basis.
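To make the three-column structure concrete, the sketch below parses one record into named fields. The source does not state how columns are separated, so a tab separator is assumed, and the bracketed number after the timestamp is assumed to be the crawler number because it matches the prefix of the example path. `PathRecord` and `parse_record` are hypothetical names, and the URL in the sample line is a stand-in.

```python
import re
from datetime import datetime
from typing import NamedTuple


class PathRecord(NamedTuple):
    logged_at: datetime
    crawler_id: int
    source_url: str
    path: str


# First column as in the example record: [timestamp][number]
FIRST_COLUMN = re.compile(r"\[(.+?)\]\[(\d+)\]")


def parse_record(line: str) -> PathRecord:
    """Split one tab-separated log line into its three columns."""
    first, source_url, path = line.rstrip("\n").split("\t")
    match = FIRST_COLUMN.fullmatch(first)
    if match is None:
        raise ValueError(f"unexpected first column: {first!r}")
    logged_at = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return PathRecord(logged_at, int(match.group(2)), source_url, path)


record = parse_record(
    "[2017-04-06 16:34:01,361][145]\thttp://example.com/page?id=1"
    "\t145/2f6a2cc7d144f83c6e9abc5416d047e3"
)
print(record.path)
```

If the column order were changed or extra search keys were added, only `parse_record` would need to change.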

III. Retrieval stage, i.e. the steps by which the external system obtains the corresponding crawl result file.

1. The external system searches the result path log by URL, for example:

http://ln.gsxt.gov.cn/saicpub/entPublicitySC/entPublicityDC/getJbxxAction.actionType=1130&pripid=210105000022005060877866

When it finds this URL in the second column of a record, it obtains the path P from the third column, i.e. 145/2f6a2cc7d144f83c6e9abc5416d047e3, and marks the line number N of the current record in the result path log.

2. Generate the complete URL from the path P, for example:

http://xxx-host/xxx-bucket/145/2f6a2cc7d144f83c6e9abc5416d047e3

Send an HTTP GET request to this URL to download the crawl result file to local storage.

3. If further retrieval is needed, continue with the records after the marked line number.
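The retrieval steps above, and in particular resuming from a marked line number, might look like the following sketch. It assumes a tab-separated log with the URL in the second column and the path in the third, and it represents the mark as the count of lines already consumed. The function name `find_next` and the sample log contents are hypothetical.

```python
import os
import tempfile


def find_next(log_file: str, source_url: str, start_line: int = 0):
    """Scan the log from start_line; return (path or None, line count consumed)."""
    next_line = start_line
    with open(log_file, encoding="utf-8") as log:
        for line_number, line in enumerate(log):
            if line_number < start_line:
                continue  # skip records handled in an earlier pass
            next_line = line_number + 1
            _, url, path = line.rstrip("\n").split("\t")
            if url == source_url:
                return path, next_line
    return None, next_line


# Build a small sample log so the sketch can be run on its own.
log_file = os.path.join(tempfile.mkdtemp(), "145.log")
with open(log_file, "w", encoding="utf-8") as log:
    log.write("[t1][145]\thttp://example.com/a\t145/aaa\n")
    log.write("[t2][145]\thttp://example.com/b\t145/bbb\n")
    log.write("[t3][145]\thttp://example.com/a\t145/ccc\n")

first_path, checkpoint = find_next(log_file, "http://example.com/a")
second_path, checkpoint = find_next(log_file, "http://example.com/a", checkpoint)
print(first_path, second_path, checkpoint)
```

The reader never writes to the log or to the stored files; persisting the single integer `checkpoint` is enough to resume after a restart.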

The techniques described in the invention can be implemented by various means. For example, they may be implemented in hardware, firmware, software or a combination thereof. For a hardware implementation, the processing modules may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.

For a firmware and/or software implementation, the techniques may be implemented with modules (for example, procedures, steps or flows) that perform the functions described herein. The firmware and/or software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to it.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiment can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiment. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk or an optical disc.

The above embodiment is a preferred embodiment of the invention, but embodiments of the invention are not limited by it. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the invention shall be regarded as an equivalent replacement and falls within the scope of protection of the invention.

Claims

1. A crawler network path tracing method based on object storage, characterized in that an object storage system and a logger are established; the logger generates a result path log, which records an index from the source URL of each crawl result to the crawler result file on the object storage system; and when an external system needs to retrieve the data in the database, it obtains the crawler result file on the object storage system directly through the index.

2. The crawler network path tracing method based on object storage according to claim 1, characterized by comprising the following steps:

(2) Loading stage: the log processor generates a path P for the crawl result file R and uploads R to that path in the object storage system;

the log processor generates an aggregated path record containing the source URL of the crawl result and the path P, and appends the record to the end of the result path log that corresponds to the respective crawler;

(3) Retrieval stage: the external system traverses the log file, or searches the result path log by source URL; when it finds the corresponding record, it marks the line number of the current record in the result path log and then downloads the crawl result file to local storage using the path P in the record; if further retrieval is needed, it continues with the records after the marked line number.

3. The crawler network path tracing method based on object storage according to claim 2, characterized in that in step (2), the log processor generates a unique hash path P from the crawler number S and the content of the crawl result file R.

4. The crawler network path tracing method based on object storage according to claim 2, characterized in that the result path log has three columns: the first column is the logging time, i.e. the time at which the result file was written to the file system; the second column is the source URL of the crawl result; and the third column is the path P.

5. The crawler network path tracing method based on object storage according to claim 2, characterized in that the object storage system is a file storage system based on the HBASE distributed file system.

6. The crawler network path tracing method based on object storage according to claim 5, characterized in that the object storage system implements deletion, creation and rewriting of files on the object storage system by calling an HTTP interface and passing the relevant parameters.

7. The crawler network path tracing method based on object storage according to claim 2, characterized in that in step (3), the external system downloads the crawl result file to local storage through the path P in the record as follows: from the path P, generate the absolute path P2 of P in the object storage system, where P2 is a URL; an HTTP GET request to P2 then downloads the file to local storage.