CN107451261A - Crawler network path tracking method based on object storage - Google Patents

Crawler network path tracking method based on object storage

Info

Publication number
CN107451261A
Authority
CN
China
Prior art keywords
path
object storage
crawler
file
storage system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710642232.0A
Other languages
Chinese (zh)
Other versions
CN107451261B (en)
Inventor
陈开冉
邓楚健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Trace Technology Co Ltd
Original Assignee
Guangzhou Trace Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Trace Technology Co Ltd
Priority to CN201710642232.0A
Publication of CN107451261A
Application granted
Publication of CN107451261B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a crawler network path tracking method based on object storage. The method establishes an object storage system and a logger. The logger generates a result path log, which records an index from the source URL of each crawl result to the corresponding crawler result file on the object storage system. When an external system needs to retrieve the data in the database, it obtains the crawler result file on the object storage system directly through the index. Introducing the object storage system improves file read/write speed. Because a result path log is established, an external system can search that log when retrieving data instead of searching the database, which avoids the read/write conflicts that could otherwise arise.

Description

Crawler network path tracking method based on object storage
Technical field
The invention belongs to the field of path tracking in software engineering, and in particular relates to a crawler network path tracking method based on object storage.
Background
A web crawler is a program or script that automatically fetches web information according to certain rules. In current path tracking, crawler network path tracking is mostly performed with the crawler task as the basic unit. For example, the default behavior of the open-source crawler framework pyspider is to store results in a database. If an external system needs to retrieve data from the database, there is no convenient retrieval method: it can only scan the database tables, and it must modify the status of the result data in the database so that already-processed results are excluded the next time. The data in the database therefore has to be maintained by two systems together, which causes considerable uncertainty and can even produce read/write conflicts.
The prior art also mentions another method, in which the crawler system stores results directly in a file system and the downstream system traverses the files of the file system when it needs to retrieve the relevant data. Because this method reads and writes the disk frequently, it can cause heavy disk I/O load. In addition, the same data is still maintained jointly by two systems, so read/write conflicts cannot be fundamentally avoided.
Summary of the invention
The purpose of the invention is to overcome the shortcomings and deficiencies of the prior art by providing a crawler network path tracking method based on object storage. The method offers high I/O efficiency, efficient and accurate retrieval, and reduced coupling between systems.
The purpose of the invention is achieved by the following technical solution: a crawler network path tracking method based on object storage, in which an object storage system and a logger are established. The logger generates a result path log that records an index from the source URL of each crawl result to the crawler result file on the object storage system. When an external system needs to retrieve the data in the database, it obtains the crawler result file on the object storage system directly through the index. Introducing the object storage system improves file read/write speed. Because a result path log is established, an external system can search that log when retrieving data instead of searching the database, which avoids the read/write conflicts that could otherwise arise.
Specifically, the method comprises the following steps:
(1) Deploy the object storage system and the log processor. Assume that the crawl result file of a crawler S is R;
(2) Loading stage: the log processor generates a path P for the crawl result file R and uploads R to that path in the object storage system;
The log processor generates an aggregated path record containing the source URL of the crawl result and the path P, and appends the record to the end of the result path log that corresponds to that crawler;
(3) Retrieval stage: the external system traverses the log file, or searches the result path log by source URL. When it finds the corresponding record, it marks the line number of the current record in the result path log and then downloads the crawl result file to local storage using the path P in the record. If it needs to continue searching, it resumes with the records after the marked line number. With these steps, the external system does not need to modify any result file record. It only needs to record which line of the log file it has read in order to persist its result-processing progress, and it immediately obtains the path of each pending result file in the object storage system and thereby the crawler result.
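The retrieval stage can be illustrated with a minimal sketch of a scan that resumes after a marked line number. The function name `scan_from`, the tab-separated column layout, and the `example.com` URLs are assumptions made for illustration; the patent does not fix a delimiter or an API.

```python
from typing import Iterator, List, Tuple


def scan_from(lines: List[str], start_line: int, source_url: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, path) for records whose URL column equals source_url.

    Line numbers are 1-based. Records up to and including start_line are
    skipped, so passing the previously marked line number resumes the search.
    """
    for number, line in enumerate(lines, start=1):
        if number <= start_line:
            continue
        # Assumed layout: time <TAB> source URL <TAB> path P
        _timestamp, url, path = line.rstrip("\n").split("\t")
        if url == source_url:
            yield number, path


# Hypothetical log content for one crawler.
log = [
    "[2017-04-06 16:34:01,361][145]\thttp://example.com/a\t145/aaa",
    "[2017-04-06 16:35:10,002][145]\thttp://example.com/b\t145/bbb",
    "[2017-04-06 16:36:45,518][145]\thttp://example.com/a\t145/ccc",
]

# First search starts at the top; the second resumes after the marked line.
first = next(scan_from(log, 0, "http://example.com/a"))
second = next(scan_from(log, first[0], "http://example.com/a"))
```

The log itself is never modified; the only state the reader keeps is the line number it last marked.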
Preferably, in step (2), the log processor generates a unique hash path P from the crawler number S and the content of the crawl result file R. This ensures a one-to-one correspondence between each crawl result file and its path.
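One way such a hash path could be derived is sketched below. The patent does not name a hash function; MD5 is assumed here only because the example path in the embodiment has a 32-character hexadecimal component, and the name `make_path` and the `crawler/digest` layout are likewise illustrative.

```python
import hashlib


def make_path(crawler_id: int, content: bytes) -> str:
    """Build a path of the form '<crawler_id>/<hex digest>'.

    The digest covers both the crawler number and the file content, so the
    same content fetched by two crawlers yields two different paths.
    MD5 is an assumption; any stable hash function would fit the description.
    """
    digest = hashlib.md5(str(crawler_id).encode("utf-8") + b":" + content).hexdigest()
    return f"{crawler_id}/{digest}"


# Hypothetical crawl result content.
path = make_path(145, b"<html>example</html>")
```

A hash makes collisions extremely unlikely rather than impossible, so uniqueness here is practical, not absolute.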
Specifically, the result path log has three columns: the first column is the logging time, i.e. the time at which the result file was written to the file system; the second column is the source URL of the crawl result; and the third column is the path P.
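As a rough illustration of the three-column layout, the following sketch formats one record. The tab delimiter, the bracketed timestamp style, and the name `format_record` are assumptions; the patent only specifies which three columns the log contains.

```python
from datetime import datetime


def format_record(written_at: datetime, source_url: str, path: str) -> str:
    """Format one log line: logging time, source URL, path P (tab-separated)."""
    millis = written_at.microsecond // 1000
    stamp = written_at.strftime("%Y-%m-%d %H:%M:%S") + f",{millis:03d}"
    return f"[{stamp}]\t{source_url}\t{path}\n"


# Hypothetical source URL; the path reuses the form shown in the embodiment.
line = format_record(
    datetime(2017, 4, 6, 16, 34, 1, 361000),
    "http://example.com/page",
    "145/2f6a2cc7d144f83c6e9abc5416d047e3",
)
```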
Preferably, the object storage system is a file storage system based on the HBASE distributed file system. This reduces cost while improving storage and download speed, and supports storing files on the PT order of magnitude.
Further, the object storage system implements deletion (DELETE), creation (POST), and rewriting (PUT) of files on the object storage system through calls to an HTTP interface that pass the relevant parameters.
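The three HTTP operations could be issued from a client roughly as follows. This sketch only builds the requests with Python's standard `urllib.request` and does not send them; the helper name `build_request` is hypothetical, and the URL reuses the placeholder host and bucket that appear in the embodiment.

```python
import urllib.request

ALLOWED_METHODS = {"DELETE", "POST", "PUT"}


def build_request(method: str, url: str, body: bytes = b"") -> urllib.request.Request:
    """Build (but do not send) a request for delete, create, or rewrite.

    POST creates a file and PUT rewrites it, so both carry a body;
    DELETE carries none.
    """
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unsupported method: {method}")
    data = body if method in {"POST", "PUT"} else None
    return urllib.request.Request(url, data=data, method=method)


# Placeholder object URL (host and bucket names are not real).
object_url = "http://xxx-host/xxx-bucket/145/2f6a2cc7d144f83c6e9abc5416d047e3"
create = build_request("POST", object_url, b"crawl result content")
rewrite = build_request("PUT", object_url, b"new content")
delete = build_request("DELETE", object_url)
```

Passing a built request to `urllib.request.urlopen` would send it; that step is left out because the endpoint here is a placeholder.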
Preferably, in step (3), the external system downloads the crawl result file to local storage using the path P in the record as follows: an absolute path P2 of P in the object storage system is generated from the path P, where P2 is a URL, and an HTTP GET request to P2 downloads the file to local storage.
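A minimal sketch of turning P into P2 and fetching it is shown below. The base URL layout (`http://<host>/<bucket>`) follows the placeholder used in the embodiment; the function names `absolute_url` and `download` are assumptions, and `download` is defined but not called because the host is not real.

```python
import shutil
import urllib.request


def absolute_url(base: str, path: str) -> str:
    """Join the object store base URL and the relative path P into P2."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def download(url: str, local_path: str) -> None:
    """Fetch url with an HTTP GET request and save the body to local_path."""
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)


# Placeholder host and bucket names.
p2 = absolute_url("http://xxx-host/xxx-bucket/", "145/2f6a2cc7d144f83c6e9abc5416d047e3")
```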
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention provides an object storage system based on a distributed file system, which enables the storage of massive numbers of result files and improves I/O efficiency.
2. For each web crawler, the invention establishes an index from the URL of each crawl result to the crawler result file on the object storage system, which ensures retrieval efficiency and removes the coupling with external systems.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method of the invention.
Detailed description
The drawings are for illustrative purposes only and shall not be construed as limiting this patent. To better illustrate the embodiment, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of the actual product. Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings. The invention is described in further detail below with reference to the embodiment and the drawings, but implementation of the invention is not limited thereto.
Embodiment
As shown in Fig. 1, the crawler network path tracking method based on object storage of this embodiment comprises the following steps:
I. Deploy the object storage system and the log processor.
The object storage system is a file storage system based on the HBASE distributed file system and can support storing files on the PT order of magnitude. By calling an HTTP interface and passing the relevant parameters, files on the object storage system can be deleted (DELETE), created (POST), and rewritten (PUT). Note that the object storage system of the invention only provides file deletion, creation, and rewriting, and does not support incremental (append) writes to files.
The log processor is a singleton that handles the results it receives and separates them by crawler: the records of each crawler are written to that crawler's own result path log file, which makes index reads by downstream systems easier.
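The singleton log processor with one log file per crawler might look like the following sketch. The class name `LogProcessor`, the accessor `get_log_processor`, and the `crawler-<id>.log` file naming are all hypothetical; the patent states only that the processor is a single instance and that each crawler has its own result path log.

```python
import os
import tempfile
import threading
from typing import Optional


class LogProcessor:
    """Appends path records to one result path log file per crawler."""

    def __init__(self, log_dir: str) -> None:
        self.log_dir = log_dir
        self._lock = threading.Lock()

    def log_path(self, crawler_id: int) -> str:
        # Assumed naming scheme: one file per crawler number.
        return os.path.join(self.log_dir, f"crawler-{crawler_id}.log")

    def append(self, crawler_id: int, record: str) -> None:
        # The lock keeps concurrent appends from interleaving within a line.
        with self._lock:
            with open(self.log_path(crawler_id), "a", encoding="utf-8") as handle:
                handle.write(record.rstrip("\n") + "\n")


_processor: Optional[LogProcessor] = None
_processor_lock = threading.Lock()


def get_log_processor(log_dir: str) -> LogProcessor:
    """Return the single shared LogProcessor, creating it on first use.

    log_dir is only honoured by the first call; later calls return the
    existing instance unchanged.
    """
    global _processor
    with _processor_lock:
        if _processor is None:
            _processor = LogProcessor(log_dir)
        return _processor


# Demonstration in a temporary directory with hypothetical records.
demo_dir = tempfile.mkdtemp()
processor = get_log_processor(demo_dir)
processor.append(145, "t1\thttp://example.com/a\t145/aaa")
processor.append(146, "t2\thttp://example.com/b\t146/bbb")
processor.append(145, "t3\thttp://example.com/c\t145/ccc")
with open(processor.log_path(145), encoding="utf-8") as handle:
    lines_145 = handle.read().splitlines()
```

Because the object storage system described above does not support append writes, this sketch assumes the log files live on an ordinary local file system.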
II. Loading stage, i.e. the steps for writing a result path log record.
1. Assume the crawl result file of a crawler S is R. The log processor generates a unique hash path P from the crawler number S and the content of the crawl result file R.
2. The log processor uploads the crawl result file R to that path in the object storage system.
3. The log processor generates an aggregated path record and appends it to the end of the result path log corresponding to the current crawler.
The result path log has three columns: the first column is the logging time, i.e. the time at which the result file was written to the file system; the second column is the source URL of the crawl result; and the third column is the path P. For example, one record has the following form:
First column:
[2017-04-06 16:34:01,361][145]
Second column:
http://ln.gsxt.gov.cn/saicpub/entPublicitySC/entPublicityDC/getJbxxAction.actionType=1130&pripid=210105000022005060877866
Third column:
145/2f6a2cc7d144f83c6e9abc5416d047e3
Of course, in practical applications the structure of the result path log can be adjusted slightly, for example by changing the order of the three columns or by adding other search targets; those skilled in the art can make appropriate improvements on this basis.
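A reader of the log could split a record of the form shown above into its fields with a small parser such as the one below. The whitespace delimiter between columns, the interpretation of the bracketed `[145]` as the crawler number, and the names `Record` and `parse_record` are assumptions; the sample URL is hypothetical.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Assumed line shape: [time][crawler number] <space> URL <space> path
RECORD = re.compile(
    r"^\[(?P<time>[^\]]+)\]\[(?P<crawler>\d+)\]\s+(?P<url>\S+)\s+(?P<path>\S+)$"
)


class Record(NamedTuple):
    written_at: datetime
    crawler_id: int
    source_url: str
    path: str


def parse_record(line: str) -> Record:
    """Split one result path log line into its columns."""
    match = RECORD.match(line.strip())
    if match is None:
        raise ValueError(f"malformed record: {line!r}")
    written_at = datetime.strptime(match["time"], "%Y-%m-%d %H:%M:%S,%f")
    return Record(written_at, int(match["crawler"]), match["url"], match["path"])


record = parse_record(
    "[2017-04-06 16:34:01,361][145] http://example.com/page "
    "145/2f6a2cc7d144f83c6e9abc5416d047e3"
)
```

If the column order is changed or columns are added, as the text allows, only the pattern and the `Record` fields would need to follow.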
III. Retrieval stage, i.e. the steps by which the external system obtains the corresponding crawl result file.
1. The external system searches the result path log by URL, for example:
http://ln.gsxt.gov.cn/saicpub/entPublicitySC/entPublicityDC/getJbxxAction.actionType=1130&pripid=210105000022005060877866
When the URL is found in the second column, the path P in the third column is obtained, i.e. 145/2f6a2cc7d144f83c6e9abc5416d047e3, and the line number N of the current record in the result path log is marked;
2. A complete URL is generated from the path P, for example:
http://xxx-host/xxx-bucket/145/2f6a2cc7d144f83c6e9abc5416d047e3
An HTTP GET request to this URL downloads the crawl result file to local storage;
3. If the search needs to continue, it resumes with the records after the line number marked above.
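Persisting the marked line number N between runs could be done as in this sketch, which stores the marker in a small JSON file beside the log. The checkpoint file, its JSON format, and the function names are assumptions for illustration; the patent only requires that the external system remember which line it has read.

```python
import json
import os
import tempfile
from typing import List


def load_checkpoint(path: str) -> int:
    """Return the last marked line number, or 0 if nothing was marked yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return int(json.load(handle)["line"])
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, line: int) -> None:
    """Write the marker to a temporary file, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as handle:
        json.dump({"line": line}, handle)
    os.replace(tmp, path)


def read_new_records(log_path: str, checkpoint_path: str) -> List[str]:
    """Return records after the stored line number and advance the marker."""
    start = load_checkpoint(checkpoint_path)
    records: List[str] = []
    last = start
    with open(log_path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if number > start:
                records.append(line.rstrip("\n"))
                last = number
    save_checkpoint(checkpoint_path, last)
    return records


# Demonstration with hypothetical records in a temporary directory.
demo_dir = tempfile.mkdtemp()
log_path = os.path.join(demo_dir, "crawler-145.log")
checkpoint_path = os.path.join(demo_dir, "crawler-145.checkpoint")

with open(log_path, "w", encoding="utf-8") as handle:
    handle.write("record 1\nrecord 2\n")
first_batch = read_new_records(log_path, checkpoint_path)

with open(log_path, "a", encoding="utf-8") as handle:
    handle.write("record 3\n")
second_batch = read_new_records(log_path, checkpoint_path)
```

This sketch advances the marker as soon as the records are read; a consumer that must not lose work would more likely save the marker only after each record has been processed.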
The techniques described in the invention can be implemented by various means. For example, they may be implemented in hardware, firmware, software, or a combination thereof. For a hardware implementation, the processing modules may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, electronic devices, other electronic units designed to perform the functions described in the invention, or a combination thereof.
For a firmware and/or software implementation, the techniques may be implemented with modules (for example, procedures, steps, or flows) that perform the functions described herein. Firmware and/or software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to it.
Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiment may be completed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiment. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
The above embodiment is a preferred embodiment of the invention, but the embodiments of the invention are not limited by it. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the invention shall be regarded as an equivalent replacement and falls within the scope of protection of the invention.

Claims (7)

1. A crawler network path tracking method based on object storage, characterized in that an object storage system and a logger are established; the logger generates a result path log, and the result path log records an index from the source URL of each crawl result to the crawler result file on the object storage system; when an external system needs to retrieve the data in the database, the crawler result file on the object storage system is obtained directly through the index.
2. The crawler network path tracking method based on object storage according to claim 1, characterized by comprising the following steps:
(1) deploying the object storage system and the log processor, and assuming that the crawl result file of a crawler S is R;
(2) loading stage: the log processor generates a path P for the crawl result file R and uploads the crawl result file R to that path in the object storage system;
the log processor generates an aggregated path record containing the source URL of the crawl result and the path P, and appends the record to the end of the result path log that corresponds to that crawler;
(3) retrieval stage: the external system traverses the log file or searches the result path log by source URL; when the corresponding record is found, it marks the line number of the current record in the result path log and then downloads the crawl result file to local storage using the path P in the record; if the search needs to continue, it resumes with the records after the marked line number.
3. The crawler network path tracking method based on object storage according to claim 2, characterized in that in step (2), the log processor generates a unique hash path P from the crawler number S and the content of the crawl result file R.
4. The crawler network path tracking method based on object storage according to claim 2, characterized in that the result path log has three columns: the first column is the logging time, i.e. the time at which the result file was written to the file system; the second column is the source URL of the crawl result; and the third column is the path P.
5. The crawler network path tracking method based on object storage according to claim 2, characterized in that the object storage system is a file storage system based on the HBASE distributed file system.
6. The crawler network path tracking method based on object storage according to claim 5, characterized in that the object storage system implements deletion, creation, and rewriting of files on the object storage system through calls to an HTTP interface that pass the relevant parameters.
7. The crawler network path tracking method based on object storage according to claim 2, characterized in that in step (3), the external system downloads the crawl result file to local storage using the path P in the record as follows: an absolute path P2 of P in the object storage system is generated from the path P, where P2 is a URL, and an HTTP GET request to P2 downloads the file to local storage.
CN201710642232.0A 2017-07-31 2017-07-31 Crawler network path tracking method based on object storage Active CN107451261B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710642232.0A CN107451261B (en) 2017-07-31 2017-07-31 Crawler network path tracking method based on object storage

Publications (2)

Publication Number Publication Date
CN107451261A true CN107451261A (en) 2017-12-08
CN107451261B CN107451261B (en) 2020-06-09

Family

ID=60489279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710642232.0A Active CN107451261B (en) 2017-07-31 2017-07-31 Crawler network path tracking method based on object storage

Country Status (1)

Country Link
CN (1) CN107451261B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614308A (en) * 2018-10-16 2019-04-12 深圳壹账通智能科技有限公司 Test data generating method, device and computer equipment based on crawler log
CN110334056A (en) * 2019-06-24 2019-10-15 广州探迹科技有限公司 Crawler result storage method and device based on object storage
CN111356991A (en) * 2018-06-14 2020-06-30 西部数据技术公司 Logical block addressing range conflict crawler

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080270404A1 (en) * 2001-04-16 2008-10-30 Arkady Borkovsky Using Network Traffic Logs for Search Enhancement
CN102063477A (en) * 2010-12-13 2011-05-18 百度在线网络技术(北京)有限公司 Website data extraction device and method
CN106055618A (en) * 2016-05-26 2016-10-26 优品财富管理有限公司 Data processing method based on web crawlers and structural storage
CN106648445A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Data storage method and apparatus used for crawler

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
南通大学教务处: "《学海图南 南通大学优秀毕业涉及(论文)集 2015届》", 31 March 2016, 苏州大学出版社 *
石志广: "移动互联网阅读业务用户行为分析系统的设计与实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
马费成: "《信息管理与信息系统研究进展 第2辑》", 30 June 2017, 武汉:武汉大学出版社 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111356991A (en) * 2018-06-14 2020-06-30 西部数据技术公司 Logical block addressing range conflict crawler
CN111356991B (en) * 2018-06-14 2023-09-19 西部数据技术公司 Logical block addressing range conflict crawler
CN109614308A (en) * 2018-10-16 2019-04-12 深圳壹账通智能科技有限公司 Test data generating method, device and computer equipment based on crawler log
CN110334056A (en) * 2019-06-24 2019-10-15 广州探迹科技有限公司 Crawler result storage method and device based on object storage

Also Published As

Publication number Publication date
CN107451261B (en) 2020-06-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant