CN107451261A - Crawler network path tracking method based on object storage - Google Patents
Crawler network path tracking method based on object storage
- Publication number
- CN107451261A
- Authority
- CN
- China
- Prior art keywords
- path
- object storage
- crawler
- file
- storage system
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/1805—Append-only file systems, e.g. using logs or journals to store data
- G06F16/1815—Journaling file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a crawler network path tracking method based on object storage. The method establishes an object storage system and a log processor. The log processor generates a result path log, which records an index from the source URL of each crawl result to the corresponding crawl result file on the object storage system. When an external system needs to retrieve data from the database, it obtains the crawl result file on the object storage system directly through this index. Introducing the object storage system improves file read/write speed. Because the result path log exists, the external system can search that log when retrieving data instead of searching the database, which avoids potential read/write conflicts.
Description
Technical field
The invention belongs to the field of path tracking in software engineering, and in particular relates to a crawler network path tracking method based on object storage.
Background
A web crawler is a program or script that automatically crawls web information according to certain rules. In current path tracking, crawler network path tracking is mostly carried out with the crawler task as the basic unit. For example, the open-source crawler framework pyspider stores results in a database by default. If an external system needs to retrieve data from that database, there is no convenient retrieval method: it can only scan the database tables, and it must also change the state of the result data in the database so that already-processed results are excluded the next time. As a result, the data in the database has to be maintained by two systems together, which introduces considerable uncertainty and can even produce read/write conflicts.
The prior art also mentions another method, in which the crawler system stores results directly in a file system and the downstream system traverses the files in that file system when it needs to retrieve the relevant data. Because this method reads and writes the disk frequently, it places a heavy I/O load on the disk. In addition, the same data is still maintained jointly by two systems, so read/write conflicts cannot be fundamentally avoided.
Summary of the invention
The purpose of the invention is to overcome the shortcomings and deficiencies of the prior art by providing a crawler network path tracking method based on object storage. The method offers high I/O efficiency and efficient, accurate retrieval, and it reduces coupling between systems.
The purpose of the invention is achieved by the following technical solution: a crawler network path tracking method based on object storage, in which an object storage system and a log processor are established. The log processor generates a result path log, which records an index from the source URL of each crawl result to the crawl result file on the object storage system. When an external system needs to retrieve data from the database, it obtains the crawl result file on the object storage system directly through the index. Introducing the object storage system improves file read/write speed. Because the result path log exists, the external system can search that log when retrieving data instead of searching the database, which avoids potential read/write conflicts.
Specifically, the method comprises the following steps:
(1) Deploy the object storage system and the log processor. Assume that the crawl result file of a crawler S is R.
(2) Loading stage: the log processor generates a path P for the crawl result file R and uploads R to that path in the object storage system.
The log processor then generates an aggregated path record containing the source URL of the crawl result and the path P, and appends the record to the end of the result path log corresponding to that crawler.
(3) Retrieval stage: the external system traverses the log file, or searches the result path log by source URL. When it finds the matching record, it marks the line number of that record in the result path log and then downloads the crawl result file to the local machine using the path P in the record. If it needs to continue searching, it resumes from the marked line number and searches the subsequent records.
With these steps, the external system does not need to modify any result record. It only needs to record which line of the log file it has read in order to persist its processing progress, and it immediately obtains the path of each pending result in the object storage system, and thereby the crawl result itself.
Preferably, in step (2), the log processor generates a unique hash path P from the crawler number S and the content of the crawl result file R. This ensures a one-to-one correspondence between each crawl result file and its path.
Specifically, the result path log has three columns: the first column is the logging time, i.e. the time at which the result file was written to the file system; the second column is the source URL of the crawl result; and the third column is the path P.
Preferably, the object storage system is a file storage system based on the HBASE distributed file system. This reduces cost while improving storage and download speed, and supports storing files at the PT order of magnitude.
Further, the object storage system deletes (DELETE), creates (POST), and rewrites (PUT) files on the object storage system through calls to an HTTP interface that carry the relevant parameters.
Preferably, in step (3), the external system downloads the crawl result file to the local machine using the path P in the record as follows: from path P it generates P2, the absolute path of P in the object storage system, where P2 is a URL; issuing an HTTP GET request to P2 downloads the file to the local machine.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention provides an object storage system based on a distributed file system, which makes it possible to store massive numbers of result files and improves I/O efficiency.
2. For each web crawler, the invention builds an index from the URL of each crawl result to the crawl result file on the object storage system, which ensures retrieval efficiency and removes the coupling with external systems.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the method of the invention.
Detailed description
The drawings are for illustrative purposes only and shall not be construed as limiting this patent. To better illustrate the embodiment, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of the actual product. Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings. The invention is described in further detail below with reference to an embodiment and the drawing, but implementations of the invention are not limited to it.
Embodiment
As shown in Fig. 1, the crawler network path tracking method based on object storage of this embodiment includes the following steps:
I. Deploy the object storage system and the log processor.
The object storage system is a file storage system based on the HBASE distributed file system and can support storing files at the PT order of magnitude. Files on the object storage system can be deleted (DELETE), created (POST), and rewritten (PUT) through calls to an HTTP interface that carry the relevant parameters. Note that the object storage system of the invention only supports deleting, creating, and rewriting files; it does not support incremental (append) writes to a file.
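The three HTTP operations named above can be illustrated with a minimal Python sketch. It only builds the request objects and does not contact any server. The base URL reuses the elided `xxx-host/xxx-bucket` placeholder from the example later in this description, and the function name `build_request` is hypothetical; the patent does not specify the interface's parameters.

```python
import urllib.request
from typing import Optional

# Placeholder endpoint: the source elides the real host and bucket as "xxx".
BASE_URL = "http://xxx-host/xxx-bucket"

# The only write operations the description attributes to the object store.
SUPPORTED_METHODS = {"DELETE", "POST", "PUT"}


def build_request(method: str, path: str,
                  body: Optional[bytes] = None) -> urllib.request.Request:
    """Build (but do not send) an HTTP request for one object-store operation."""
    if method not in SUPPORTED_METHODS:
        # There is deliberately no "append": the store has no incremental writes.
        raise ValueError(f"unsupported method: {method}")
    return urllib.request.Request(f"{BASE_URL}/{path}", data=body, method=method)


# Creating a file at a given path; sending it would be urllib.request.urlopen(req).
req = build_request("POST", "145/2f6a2cc7d144f83c6e9abc5416d047e3", b"crawl result")
```

Because the store cannot append to an existing object, each crawl result is written once as a whole file, which is consistent with the design in which the growing index lives in a separate log file rather than in the object store.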
The log processor is a singleton that handles the results it receives and separates them by crawler: each crawler is written to its own result path log file, which makes index reads by downstream systems convenient.
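One way to picture a single log processor instance that keeps one log file per crawler is the following Python sketch. The class name, the `log_dir` parameter, and the `crawler_<id>.log` naming scheme are assumptions made for illustration; the patent does not describe how the log files are named or where they are kept.

```python
import os
import threading


class LogProcessor:
    """Single shared instance that maps each crawler to its own log file."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls, log_dir: str = "."):
        # Create the instance once; later calls return the same object.
        with cls._lock:
            if cls._instance is None:
                instance = super().__new__(cls)
                instance.log_dir = log_dir
                cls._instance = instance
        return cls._instance

    def log_path(self, crawler_id: int) -> str:
        # Hypothetical naming scheme: one result path log per crawler number.
        return os.path.join(self.log_dir, f"crawler_{crawler_id}.log")


first = LogProcessor("logs")
second = LogProcessor("elsewhere")  # ignored: the existing instance is reused
```

Routing all writes through one instance means records for a given crawler reach its log file from a single writer, which keeps the order of records in that file well defined.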
II. Loading stage, i.e. the steps for writing a result path log record.
1. Assume that the crawl result file of a crawler S is R. The log processor generates a unique hash path P from the crawler number S and the content of the crawl result file R.
2. The log processor uploads the crawl result file R to that path in the object storage system.
3. The log processor generates an aggregated path record and appends it to the end of the result path log corresponding to the current crawler.
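The hash path of step 1 can be sketched in Python as follows. The patent does not name the hash function or say exactly how S and R are combined. MD5 is used here only because the example path given in this description ends in 32 hexadecimal digits, which matches the length of an MD5 digest; that choice, the `crawler number/digest` layout, and the function name `make_path` are all assumptions.

```python
import hashlib


def make_path(crawler_id: int, content: bytes) -> str:
    """Derive a path of the form "<crawler number>/<hex digest>"."""
    # Assumption: the crawler number is mixed into the hashed bytes so that
    # identical content from two crawlers still yields different digests.
    hashed = str(crawler_id).encode("utf-8") + b":" + content
    digest = hashlib.md5(hashed).hexdigest()
    return f"{crawler_id}/{digest}"


path = make_path(145, b"<html>example crawl result</html>")
```

The same input always yields the same path, so uploading an unchanged result twice targets the same object. Uniqueness is only as strong as the hash: distinct contents map to distinct paths unless the hash collides.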
The result path log has three columns: the first column is the logging time, i.e. the time at which the result file was written to the file system; the second column is the source URL of the crawl result; and the third column is the path P. For example, one record has the following form:
First column:
[2017-04-06 16:34:01,361][145]
Second column:
http://ln.gsxt.gov.cn/saicpub/entPublicitySC/entPublicityDC/getJbxxAction.actionType=1130&pripid=210105000022005060877866
Third column:
145/2f6a2cc7d144f83c6e9abc5416d047e3
Of course, in practical applications the structure of the result path log may be adjusted slightly, for example by changing the order of the three columns or adding other search targets, and those skilled in the art can make appropriate improvements on this basis.
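The three-column record can be written and read back with a few lines of Python. This is a sketch under stated assumptions: the patent does not say how the columns are separated, so a tab character is assumed here, and the function names are hypothetical. The first column reproduces the `[time][crawler number]` shape of the example record.

```python
from datetime import datetime


def format_record(ts: datetime, crawler_id: int, source_url: str, path: str) -> str:
    """Build one log line: "[time][crawler]<TAB>source URL<TAB>path"."""
    stamp = ts.strftime("%Y-%m-%d %H:%M:%S") + f",{ts.microsecond // 1000:03d}"
    return f"[{stamp}][{crawler_id}]\t{source_url}\t{path}"


def parse_record(line: str) -> tuple:
    """Split one log line back into its three columns."""
    first, source_url, path = line.rstrip("\n").split("\t")
    return first, source_url, path


def append_record(log_file: str, record: str) -> None:
    """Append the record to the end of the log; existing lines are never rewritten."""
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(record + "\n")


record = format_record(datetime(2017, 4, 6, 16, 34, 1, 361000), 145,
                       "http://example.com/page?id=1",
                       "145/2f6a2cc7d144f83c6e9abc5416d047e3")
```

Because records are only ever appended, a line number that a reader has already seen keeps pointing at the same record, which is what allows a reader to resume from a saved line number.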
III. Retrieval stage, i.e. the steps by which the external system obtains the corresponding crawl result file.
1. The external system searches the result path log by URL, for example:
http://ln.gsxt.gov.cn/saicpub/entPublicitySC/entPublicityDC/getJbxxAction.actionType=1130&pripid=210105000022005060877866
When it finds this URL in the second column of a record, it obtains the path P from the third column, i.e. 145/2f6a2cc7d144f83c6e9abc5416d047e3, and marks the line number N of the current file in the result path log.
2. From path P it generates the complete path URL, for example:
http://xxx-host/xxx-bucket/145/2f6a2cc7d144f83c6e9abc5416d047e3
It issues an HTTP GET request to this URL and downloads the crawl result file to the local machine.
3. If it needs to continue searching, it resumes from the marked line number and searches the subsequent records.
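The retrieval steps above can be sketched in Python as a scan that starts from a saved line number and returns both the matching path and the position to resume from. The tab-separated column layout, the function names, and the zero-based line counting are assumptions made for illustration; the base URL is the elided `xxx-host/xxx-bucket` placeholder from step 2.

```python
def find_next(log_lines: list, source_url: str, start_line: int = 0) -> tuple:
    """Return (path, next_line) for the first match at or after start_line.

    Returns (None, len(log_lines)) when no later record matches.
    """
    for n in range(start_line, len(log_lines)):
        _, url, path = log_lines[n].rstrip("\n").split("\t")
        if url == source_url:
            # The caller persists n + 1 and passes it back to resume the scan.
            return path, n + 1
    return None, len(log_lines)


def absolute_url(path: str, base: str = "http://xxx-host/xxx-bucket") -> str:
    """Turn path P into the URL P2 that an HTTP GET request would download."""
    return f"{base}/{path}"


log_lines = [
    "[t1][145]\thttp://a.example/x\t145/aaa\n",
    "[t2][145]\thttp://b.example/y\t145/bbb\n",
    "[t3][145]\thttp://a.example/x\t145/ccc\n",
]
path, next_line = find_next(log_lines, "http://a.example/x")
```

The only state the external system keeps is the integer `next_line`. Nothing in the log or the object store is modified by reading, so the crawler side and the reader side never write to the same data.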
The techniques described in the invention can be implemented by various means. For example, they may be implemented in hardware, firmware, software, or a combination thereof. For a hardware implementation, the processing modules may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, electronic devices, other electronic units designed to perform the functions described in the invention, or a combination thereof.
For a firmware and/or software implementation, the techniques may be implemented with modules (for example, procedures, steps, flows, and so on) that perform the functions described herein. Firmware and/or software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiment may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiment. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
The above embodiment is a preferred embodiment of the invention, but embodiments of the invention are not limited by it. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the invention shall be regarded as an equivalent replacement and falls within the scope of protection of the invention.
Claims (7)
1. A crawler network path tracking method based on object storage, characterized in that an object storage system and a log processor are established; the log processor generates a result path log, which records an index from the source URL of each crawl result to the crawl result file on the object storage system; and when an external system needs to retrieve data from the database, it obtains the crawl result file on the object storage system directly through the index.
2. The crawler network path tracking method based on object storage according to claim 1, characterized by comprising the following steps:
(1) deploying the object storage system and the log processor, where the crawl result file of a crawler S is assumed to be R;
(2) loading stage: the log processor generates a path P for the crawl result file R and uploads R to that path in the object storage system;
the log processor generates an aggregated path record containing the source URL of the crawl result and the path P, and appends the record to the end of the result path log corresponding to that crawler;
(3) retrieval stage: the external system traverses the log file, or searches the result path log by source URL; when it finds the matching record, it marks the line number of that record in the result path log and then downloads the crawl result file to the local machine using the path P in the record; if it needs to continue searching, it resumes from the marked line number and searches the subsequent records.
3. The crawler network path tracking method based on object storage according to claim 2, characterized in that in step (2) the log processor generates a unique hash path P from the crawler number S and the content of the crawl result file R.
4. The crawler network path tracking method based on object storage according to claim 2, characterized in that the result path log has three columns: the first column is the logging time, i.e. the time at which the result file was written to the file system; the second column is the source URL of the crawl result; and the third column is the path P.
5. The crawler network path tracking method based on object storage according to claim 2, characterized in that the object storage system is a file storage system based on the HBASE distributed file system.
6. The crawler network path tracking method based on object storage according to claim 5, characterized in that the object storage system deletes, creates, and rewrites files on the object storage system through calls to an HTTP interface that carry the relevant parameters.
7. The crawler network path tracking method based on object storage according to claim 2, characterized in that in step (3) the external system downloads the crawl result file to the local machine using the path P in the record as follows: from path P it generates P2, the absolute path of P in the object storage system, where P2 is a URL; issuing an HTTP GET request to P2 downloads the file to the local machine.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710642232.0A CN107451261B (en) | 2017-07-31 | 2017-07-31 | Crawler network path tracking method based on object storage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710642232.0A CN107451261B (en) | 2017-07-31 | 2017-07-31 | Crawler network path tracking method based on object storage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107451261A (en) | 2017-12-08 |
CN107451261B (en) | 2020-06-09 |
Family
ID=60489279
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710642232.0A Active CN107451261B (en) | 2017-07-31 | 2017-07-31 | Crawler network path tracking method based on object storage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107451261B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109614308A (en) * | 2018-10-16 | 2019-04-12 | 深圳壹账通智能科技有限公司 | Test data generating method, device and computer equipment based on crawler log |
CN110334056A (en) * | 2019-06-24 | 2019-10-15 | 广州探迹科技有限公司 | A kind of crawler result storage method and device based on object storage |
CN111356991A (en) * | 2018-06-14 | 2020-06-30 | 西部数据技术公司 | Logical block addressing range conflict crawler |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080270404A1 (en) * | 2001-04-16 | 2008-10-30 | Arkady Borkovsky | Using Network Traffic Logs for Search Enhancement |
CN102063477A (en) * | 2010-12-13 | 2011-05-18 | 百度在线网络技术(北京)有限公司 | Website data extraction device and method |
CN106055618A (en) * | 2016-05-26 | 2016-10-26 | 优品财富管理有限公司 | Data processing method based on web crawlers and structural storage |
CN106648445A (en) * | 2015-10-30 | 2017-05-10 | 北京国双科技有限公司 | Data storage method and apparatus used for crawler |
-
2017
- 2017-07-31 CN CN201710642232.0A patent/CN107451261B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080270404A1 (en) * | 2001-04-16 | 2008-10-30 | Arkady Borkovsky | Using Network Traffic Logs for Search Enhancement |
CN102063477A (en) * | 2010-12-13 | 2011-05-18 | 百度在线网络技术(北京)有限公司 | Website data extraction device and method |
CN106648445A (en) * | 2015-10-30 | 2017-05-10 | 北京国双科技有限公司 | Data storage method and apparatus used for crawler |
CN106055618A (en) * | 2016-05-26 | 2016-10-26 | 优品财富管理有限公司 | Data processing method based on web crawlers and structural storage |
Non-Patent Citations (3)
Title |
---|
南通大学教务处: "《学海图南 南通大学优秀毕业涉及(论文)集 2015届》", 31 March 2016, 苏州大学出版社 * |
石志广: "移动互联网阅读业务用户行为分析系统的设计与实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
马费成: "《信息管理与信息系统研究进展 第2辑》", 30 June 2017, 武汉:武汉大学出版社 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111356991A (en) * | 2018-06-14 | 2020-06-30 | 西部数据技术公司 | Logical block addressing range conflict crawler |
CN111356991B (en) * | 2018-06-14 | 2023-09-19 | 西部数据技术公司 | Logical block addressing range conflict crawler |
CN109614308A (en) * | 2018-10-16 | 2019-04-12 | 深圳壹账通智能科技有限公司 | Test data generating method, device and computer equipment based on crawler log |
CN110334056A (en) * | 2019-06-24 | 2019-10-15 | 广州探迹科技有限公司 | A kind of crawler result storage method and device based on object storage |
Also Published As
Publication number | Publication date |
---|---|
CN107451261B (en) | 2020-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8886700B2 (en) | Content sharing with limited cloud storage | |
CN103365996B (en) | file management and processing method, device and system | |
CN107451261A (en) | Crawler network path tracking method based on object storage | |
JP2007094449A5 (en) | ||
CN103631905A (en) | Webpage loading method and browser | |
CN103812939A (en) | Big data storage system | |
CN103714164B (en) | A kind of device and corresponding method controlling the translation of electronics map | |
KR102024998B1 (en) | Extracting similar group elements | |
CN103577552A (en) | Webpage picture processing method and device | |
CN105117499A (en) | File display method and device based on cloud disk | |
CN109213824B (en) | Data capture system, method and device | |
CN109145194A (en) | The acquisition method and device of user behavior data | |
CN103646054A (en) | Method for playing multimedia data and browser device | |
CN111159192B (en) | Big data based data warehousing method and device, storage medium and processor | |
CN103095698A (en) | Client software repairing method and repairing device and communication system | |
CN115795187A (en) | Resource access method, device and equipment | |
CN103744852A (en) | Snapshot processing method, snapshot display method, server, browser and system | |
CN116150236A (en) | Data synchronization method and device, electronic equipment and computer readable storage medium | |
US9626371B2 (en) | Attribute selectable file operation | |
US9158600B2 (en) | System and method for automating the transfer of a data from a web interface to a database or another web interface | |
CN114817176A (en) | Distributed file storage system and method based on Nginx + MinIO + Redis | |
GB2522832A (en) | A method and a system for loading data with complex relationships | |
CN102867061A (en) | System management method and system management device | |
CN103269352A (en) | Point-to-point (P2P) file downloading method and device | |
CN105426541A (en) | General data storing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||