CN106855872A - Method for fast retrieval of massive picture collections based on the Hadoop platform - Google Patents

Publication number
CN106855872A
Authority
CN
China
Prior art keywords
namenode
picture
file
datanode
hadoop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510908363.XA
Other languages
Chinese (zh)
Inventor
孙玉林
徐宝华
贾春朴
张福元
陈守森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Business Institute
Original Assignee
Shandong Business Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Business Institute
Priority to CN201510908363.XA
Publication of CN106855872A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/144 Query formulation

Abstract

The present invention relates to the field of big data processing, specifically a method for fast retrieval of massive picture collections based on the Hadoop platform. Step 1: build the Hadoop cluster platform. Step 2: set the security policy. Step 3: single-picture storage processing. Step 4: file preprocessing and merging. Step 5: build the picture index. Step 6: the client initiates an access request with the picture name and creation time as parameters; the NameNode computes the minute-level time period in which the picture falls and the Blocks information of the corresponding merged file, and returns them to the client. The invention addresses the excessive NameNode memory consumption and low retrieval efficiency that arise when Hadoop retrieves massive numbers of pictures, reduces the NameNode load during retrieval, and improves NameNode performance, thereby promoting wider application of the Hadoop platform.

Description

Method for fast retrieval of massive picture collections based on the Hadoop platform
Technical field
The present invention relates to the field of big data processing, specifically a method for fast retrieval of massive picture collections based on the Hadoop platform.
Background art
With the spread and wide use of the Internet, e-commerce platforms and social networks have continued to grow, and the number of pictures used for product display or social sharing has increased explosively. On these e-commerce and social networking sites, pictures convey far more information than text descriptions, so the sites place great emphasis on picture quality. According to an analysis of Taobao, picture access accounts for more than 91.5% of the traffic of the whole platform. Users of Tencent's photo album service upload as many as 1.1 billion pictures per week; the total number of pictures is currently close to 70 billion, and the total capacity reaches 15 PB. Because massive picture collections consume massive storage space, performance bottlenecks appear in both the storage and the retrieval of pictures. Given picture resources on this scale, how to retrieve them efficiently, and how to build an efficient, low-cost retrieval system under highly concurrent access, has become an urgent problem.
Hadoop is a software framework for the distributed processing of massive data, and it is reliable, efficient, and scalable. It is reliable because it assumes that compute elements and storage can fail, so it maintains multiple copies of the working data and can redistribute processing away from failed nodes. It is efficient because it works in parallel, which speeds up processing. It is scalable in that it can handle petabyte-scale data.
Because Hadoop was originally designed for large-scale text processing, its built-in data types are limited and it cannot process picture data directly. In HDFS, files, directories, and similar items are held in NameNode memory as objects, each occupying about 150 bytes. As the number of pictures grows, the memory consumed rises quickly, and this heavy consumption of NameNode memory seriously limits the use of Hadoop. In addition, retrieving a large number of small pictures is far slower than accessing a large file that holds the same amount of data.
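The memory argument above can be made concrete with a rough calculation. The following is a minimal sketch, not a measurement: it uses the approximate 150 bytes per object cited in the text, and it assumes (hypothetically) that each file occupies one block and that 1,000 pictures are packed into each merged file. Directory objects are ignored.

```python
# Back-of-the-envelope NameNode memory estimate.
# 150 bytes per object is the approximate figure cited in the text; everything
# else here (one block per file, 1,000 pictures per merged file) is an assumption.
BYTES_PER_OBJECT = 150


def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Memory for num_files files: one file object plus its block objects each."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


def merged_namenode_bytes(num_pictures: int, pictures_per_file: int,
                          blocks_per_file: int = 1) -> int:
    """Memory when pictures are packed pictures_per_file to a merged file."""
    merged_files = -(-num_pictures // pictures_per_file)  # ceiling division
    return namenode_bytes(merged_files, blocks_per_file)


if __name__ == "__main__":
    pictures = 100_000_000
    print(namenode_bytes(pictures))                 # every picture is its own file
    print(merged_namenode_bytes(pictures, 1_000))   # pictures merged before storage
```

Under these assumptions, 100 million separately stored pictures need about 30 GB of NameNode memory, while the merged layout needs about 30 MB.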
Summary of the invention
To address the performance bottleneck that arises when retrieving massive picture collections, the invention proposes a Hadoop-based retrieval method for massive picture collections. Small pictures are merged by means of SequenceFile, an offset is set for each individual SequenceFile during merging, and parsing the index quickly locates the DataNode that stores the picture's Block together with the FileId. This solves the problems of capacity expansion and fast retrieval for massive picture data.
To solve the above technical problem, the invention is implemented through the following technical solution:
Step 1: build the Hadoop cluster platform. Install the operating system and the Hadoop software on every computer, configure one computer as the NameNode, and configure the others as DataNodes. The machines communicate directly over SSH. The NameNode manages the whole storage layer, while the DataNodes mainly serve as storage nodes. Connectivity between a DataNode and the NameNode is verified by heartbeat detection, and each DataNode also periodically sends its own block information to the NameNode. A client first contacts the NameNode, which allocates the corresponding space; once the space has been obtained, the individual operations begin.
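The heartbeat and block-report mechanism in this step can be sketched as follows. This is an illustrative model only, not Hadoop's implementation: the class name `HeartbeatMonitor`, the 30-second timeout, and the injectable clock are all assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Toy NameNode-side view of DataNode liveness and block reports."""

    def __init__(self, timeout_s: float = 30.0, clock=time.monotonic):
        self.timeout_s = timeout_s      # hypothetical timeout, not an HDFS default
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen = {}             # DataNode id -> time of last heartbeat
        self.block_reports = {}         # DataNode id -> set of reported block ids

    def heartbeat(self, node_id, blocks=None):
        """Record a heartbeat and, if supplied, the node's block report."""
        self.last_seen[node_id] = self.clock()
        if blocks is not None:
            self.block_reports[node_id] = set(blocks)

    def live_nodes(self):
        """DataNodes whose last heartbeat is within the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout_s)
```

A DataNode that stops sending heartbeats simply drops out of `live_nodes()` once the timeout elapses, while its last block report remains available for re-replication decisions.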
Step 2: set the security policy. A new node, DataNode2, is added to the Hadoop cluster platform as the NameNode backup machine, and the data in the original NameNode is copied to it. While the NameNode is running, this backup node (NameNode2) monitors the NameNode's running state in real time and applies the NameNode's operations to its local copy in real time. If the NameNode fails, NameNode2 takes over from it so that the service continues normally.
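The backup behaviour described in this step (mirror operations locally, watch the primary, take over on failure) can be sketched as below. The class name, the operation format, and the rule of promoting after three consecutive missed probes are assumptions for illustration; the patent does not specify them.

```python
class StandbyNameNode:
    """Toy model of the backup node (NameNode2 in the text)."""

    def __init__(self, max_missed: int = 3):
        self.namespace = {}           # local copy: path -> list of block ids
        self.missed = 0               # consecutive failed probes of the primary
        self.max_missed = max_missed  # hypothetical promotion threshold
        self.active = False           # True once this node has taken over

    def apply(self, op, path, blocks=None):
        """Replay one operation of the primary into the local copy."""
        if op == "put":
            self.namespace[path] = list(blocks or [])
        elif op == "delete":
            self.namespace.pop(path, None)

    def probe(self, primary_alive: bool) -> bool:
        """Record one liveness probe; return True if this node is now active."""
        if self.active:
            return True
        self.missed = 0 if primary_alive else self.missed + 1
        if self.missed >= self.max_missed:
            self.active = True
        return self.active
```

Requiring several consecutive failures before promotion is one common way to avoid taking over on a single dropped probe.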
Step 3: single-picture storage processing. A picture first passes through the load-balancing module and enters the application server queue to wait for entry into the HDFS storage system. The NameNode assigns a DataNode to store it. During the write, the target Block is determined first and then the SequenceFile; the combination of the two IDs becomes the picture's name within the system. Picture metadata is saved in HBase and is also stored in a cache system built on Redis. This completes the write operation.
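The naming and metadata part of this write path can be sketched as follows. Plain dictionaries stand in for HBase and Redis, and the `B<block>_F<file>` name format is a hypothetical choice: the text only states that the two IDs are combined.

```python
def system_name(block_id: int, seq_file_id: int) -> str:
    """Combine the Block ID and SequenceFile ID into the in-system name.

    The exact format is an assumption made for this example.
    """
    return f"B{block_id}_F{seq_file_id}"


def record_write(block_id: int, seq_file_id: int, metadata: dict,
                 store: dict, cache: dict) -> str:
    """Save picture metadata in the store and mirror it into the cache."""
    name = system_name(block_id, seq_file_id)
    store[name] = dict(metadata)   # stand-in for the HBase table
    cache[name] = dict(metadata)   # stand-in for the Redis cache
    return name
```

Writing the metadata to both places at write time means the first read of a new picture can already be served from the cache.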
Step 4: file preprocessing and merging. The picture files under the specified directory are read into a picture array and a byte array is initialized; the pictures are read into the byte array and written, through the corresponding output file stream, to the merged file under the specified path.
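The merge step can be illustrated with an in-memory sketch that concatenates picture bytes and records where each picture starts. This is not the Hadoop SequenceFile format; it only shows the idea of one merged file plus per-picture offsets, and the function names are hypothetical.

```python
import io


def merge_pictures(pictures):
    """Concatenate (name, data) pairs; return (merged_bytes, index).

    index maps each picture name to (offset, length) within the merged bytes.
    """
    out = io.BytesIO()
    index = {}
    for name, data in pictures:
        index[name] = (out.tell(), len(data))  # start position before writing
        out.write(data)
    return out.getvalue(), index


def read_picture(merged: bytes, index: dict, name: str) -> bytes:
    """Slice one picture back out of the merged bytes using its index entry."""
    offset, length = index[name]
    return merged[offset:offset + length]
```

For example, merging `("a.jpg", b"AAAA")` and `("b.jpg", b"BB")` yields the index `{"a.jpg": (0, 4), "b.jpg": (4, 2)}`, so `b.jpg` can be read without scanning `a.jpg`.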
Step 5: build the picture index. The picture name uses a composite encoding that mainly consists of two parts, BlockId and FileId. BlockId denotes a storage unit, from which the NameNode can determine the nearest DataNode address; FileId is the Id of the SequenceFile into which the small picture was spliced; offset is the offset of the corresponding key. After receiving a client request, the HDFS front end first parses the file name and uses the information in it to locate the corresponding Block file, FileId, and offset; the client then reads the picture directly. Once the file name has been parsed, the DataNode data can be read directly, and the offset gives the start position of the picture.
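The composite name described in this step can be sketched as an encode/parse pair. The concrete layout `B<BlockId>_F<FileId>_O<offset>` is an assumption chosen for the example; the text names the three components but not how they are written.

```python
from typing import NamedTuple


class PictureRef(NamedTuple):
    block_id: int   # storage unit used to find the DataNode
    file_id: int    # Id of the SequenceFile holding the picture
    offset: int     # start position of the picture inside that file


def encode_name(ref: PictureRef) -> str:
    """Build the picture name from its three components (hypothetical format)."""
    return f"B{ref.block_id}_F{ref.file_id}_O{ref.offset}"


def parse_name(name: str) -> PictureRef:
    """Recover BlockId, FileId, and offset from a picture name."""
    parts = name.split("_")
    if len(parts) != 3 or [p[:1] for p in parts] != ["B", "F", "O"]:
        raise ValueError(f"not a picture name: {name!r}")
    return PictureRef(*(int(p[1:]) for p in parts))
```

Because everything needed to locate the picture is carried in its name, parsing replaces a per-picture lookup in NameNode memory.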
Step 6: the client initiates an access request with the picture name and creation time as parameters. The NameNode computes the minute-level time period in which the picture falls and the Blocks information of the corresponding merged file, and returns them to the client. The client sends a picture read request to the nearest DataNode, and the DataNode computes the picture's specific address.
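One way to read this step is that merged files are grouped by the minute in which their pictures were created, so the creation time alone narrows the search to one merged file. The sketch below models that reading; the class name, the one-merged-file-per-minute layout, and the key format are assumptions, not details given in the text.

```python
from datetime import datetime


def minute_key(created: datetime) -> str:
    """Truncate a creation time to its minute, e.g. '201512081030'."""
    return created.strftime("%Y%m%d%H%M")


class MinuteIndex:
    """Toy NameNode-side index: minute -> merged file and its blocks."""

    def __init__(self):
        self.buckets = {}

    def add(self, name, created, merged_file, blocks):
        """Register a picture under the minute bucket of its creation time."""
        bucket = self.buckets.setdefault(
            minute_key(created),
            {"merged_file": merged_file, "blocks": list(blocks), "names": set()},
        )
        bucket["names"].add(name)

    def lookup(self, name, created):
        """Return (merged_file, blocks) for the picture, or None if unknown."""
        bucket = self.buckets.get(minute_key(created))
        if bucket is None or name not in bucket["names"]:
            return None
        return bucket["merged_file"], bucket["blocks"]
```

With this layout the NameNode holds one entry per minute rather than one per picture, which is consistent with the stated goal of reducing its load.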
Compared with the prior art, the invention has the following advantages: it addresses the excessive NameNode memory consumption and low retrieval efficiency that arise when Hadoop retrieves massive numbers of pictures, reduces the NameNode load during retrieval, and improves NameNode performance, thereby promoting wider application of the Hadoop platform.
Brief description of the drawings
Fig. 1 is a flow chart of picture storage.
Fig. 2 is a flow chart of picture retrieval.
Detailed description of the embodiments
A specific embodiment of the invention is given below with reference to Figs. 1 and 2 to describe the invention further.
Embodiment 1:
First: deploy the Hadoop cluster. After the operating system has been deployed, check the network to make sure that every machine in the cluster can communicate with the others. Install SSH and configure passwordless SSH login. Append the IP-to-host mappings to the end of the /etc/hosts file and install the Java environment. Add export JAVA_HOME=/usr/jdk1.6.0 at the end of conf/hadoop-env.sh, add testA to the masters file, add test1, test2, and test3 to the slaves file, and modify the conf/core-site.xml file.
Second: install Redis. Download Redis, copy it to the appropriate directory, compile and install it, and start the service.
Third: install HAProxy. Download HAProxy, copy it to the appropriate directory, and compile and install it.
Fourth: the client first sends a write request to the NameNode. The request is filtered by the load-balancing module and queues at the application server to enter the HDFS storage system. When the request reaches the NameNode, the NameNode uses a weighted average of the writable blocks, capacity, and load of the DataNodes to select a writable DataNode and a writable Block, and returns this information to the client.
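The weighted selection in this step can be sketched as a scoring function over candidate DataNodes. The weights, the field names, and the assumption that every input is normalized to the range 0 to 1 are hypothetical; the text states only that writable blocks, capacity, and load are combined as a weighted average.

```python
def score(node: dict, weights=(0.4, 0.4, 0.2)) -> float:
    """Weighted average of a node's writable blocks, free capacity, and idleness.

    All three inputs are assumed to be normalized to [0, 1]; the weights are
    illustrative values, not taken from the patent.
    """
    w_blocks, w_capacity, w_load = weights
    return (w_blocks * node["writable_blocks"]
            + w_capacity * node["free_capacity"]
            + w_load * (1.0 - node["load"]))   # lower load scores higher


def pick_datanode(nodes):
    """Return the id of the best-scoring node that still has a writable block."""
    candidates = [n for n in nodes if n["writable_blocks"] > 0]
    if not candidates:
        return None
    return max(candidates, key=score)["id"]
```

Nodes without any writable block are excluded before scoring, so a lightly loaded but full node is never returned.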
Fifth: from the set of DataNodes returned by the NameNode, the client selects one as the Master. The selection value is determined by each DataNode's load and by the number of times it has already served as Master, so that every DataNode has an equal chance of becoming Master. Once selected, the Master is not changed unless it goes down; if it does go down, a new Master must be selected from the remaining DataNodes.
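The Master selection rule can be sketched as below: prefer the node that has served as Master least often, break ties by load, keep the Master until it goes down, then re-select among the survivors. The ordering of the two criteria and the class name `ReplicaGroup` are assumptions made for the example.

```python
def choose_master(nodes, times_as_master):
    """Pick the node with the fewest Master terms, then the lowest load."""
    return min(
        nodes,
        key=lambda n: (times_as_master.get(n["id"], 0), n["load"], n["id"]),
    )["id"]


class ReplicaGroup:
    """Toy model of a set of DataNodes with a sticky Master."""

    def __init__(self, nodes):
        self.nodes = {n["id"]: n for n in nodes}
        self.counts = {}      # node id -> number of times selected as Master
        self.master = None

    def ensure_master(self, alive):
        """Keep the current Master while it is alive; otherwise select a new one."""
        if self.master in alive:
            return self.master
        candidates = [self.nodes[i] for i in alive if i in self.nodes]
        if not candidates:
            raise RuntimeError("no live DataNode available")
        self.master = choose_master(candidates, self.counts)
        self.counts[self.master] = self.counts.get(self.master, 0) + 1
        return self.master
```

Counting Master terms first is one simple way to even out the chance of each DataNode being selected over time.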
Sixth: the client writes the data to the Master, and the Master then writes it to Slave A and Slave B following the HDFS concurrent write procedure. When all of the writes have finished, the Master reports the Block information to the NameNode. The NameNode receives the Block information and returns a write-complete message.
Seventh: a read request reaches the picture server through load balancing. The request first goes to the Redis cache module, which checks whether the cache holds the picture information; if it does not, the information is retrieved from HBase and the result is written to the cache.
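The lookup order in this step is the familiar cache-aside pattern. The sketch below uses plain dictionaries as stand-ins for the Redis cache and the HBase table; the function name and the returned source label are hypothetical additions for illustration.

```python
def get_picture_info(name: str, cache: dict, store: dict):
    """Return (info, source): check the cache first, then fall back to the store.

    On a store hit the result is written back to the cache, so the next
    request for the same picture is served from the cache.
    """
    info = cache.get(name)
    if info is not None:
        return info, "cache"
    info = store.get(name)
    if info is None:
        return None, "miss"
    cache[name] = info       # populate the cache for later requests
    return info, "store"
```

A typical sequence is a first request answered from the store and every later request for the same name answered from the cache.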
Eighth: the request reaches HDFS to read the picture content. The picture name is designed as the BlockId plus the FileId and offset within the Block; HBase looks up related information such as the picture's name and description by the picture file name.
Ninth: the NameNode maintains the mapping between Blocks and DataNodes, and uses the Block parsed from the request to determine on which DataNode the Block resides.
Tenth: after the client obtains the Block from the DataNode address given by the NameNode, it retrieves the picture information by FileId.
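The last two steps (the NameNode resolving a Block to DataNodes, then the client fetching by FileId) can be sketched as follows. Nested dictionaries stand in for the NameNode's mapping and for the DataNodes' contents; the function names and data layout are assumptions for the example.

```python
def locate_block(block_map: dict, block_id: int):
    """NameNode side: list the DataNodes that hold the given Block."""
    return block_map.get(block_id, [])


def fetch_picture(datanodes: dict, block_map: dict, block_id: int, file_id: int):
    """Client side: try each DataNode in order and read the picture by FileId.

    datanodes maps node id -> {block id -> {file id -> picture bytes}}.
    Returns None when no listed DataNode can serve the picture.
    """
    for node_id in locate_block(block_map, block_id):
        block = datanodes.get(node_id, {}).get(block_id)
        if block is not None and file_id in block:
            return block[file_id]
    return None
```

Trying the listed DataNodes in order lets the client fall back to another replica when the first one does not hold the Block.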
It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiment, and that it can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiment should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced by the invention.
Moreover, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written in this way only for the sake of clarity; those skilled in the art should treat it as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

Claims (6)

1. Build the Hadoop cluster platform: install the operating system and the Hadoop software on every computer, configure one computer as the NameNode, and configure the others as DataNodes; the machines communicate directly over SSH; the NameNode manages the whole storage layer, while the DataNodes mainly serve as storage nodes; connectivity between a DataNode and the NameNode is verified by heartbeat detection, and each DataNode also periodically sends its own block information to the NameNode; a client first contacts the NameNode, which allocates the corresponding space, and the individual operations begin once the space has been obtained.
2. Set the security policy: a new node, DataNode2, is added to the Hadoop cluster platform as the NameNode backup machine, and the data in the original NameNode is copied to it; while the NameNode is running, this backup node (NameNode2) monitors the NameNode's running state in real time and applies the NameNode's operations to its local copy in real time; if the NameNode fails, NameNode2 takes over from it so that the service continues normally.
3. Single-picture storage processing: a picture first passes through the load-balancing module and enters the application server queue to wait for entry into the HDFS storage system; the NameNode assigns a DataNode to store it; during the write, the target Block is determined first and then the SequenceFile, and the combination of the two IDs becomes the picture's name within the system; picture metadata is saved in HBase and is also stored in a cache system built on Redis; this completes the write operation.
4. File preprocessing and merging: the picture files under the specified directory are read into a picture array and a byte array is initialized; the pictures are read into the byte array and written, through the corresponding output file stream, to the merged file under the specified path.
5. Build the picture index: the picture name uses a composite encoding that mainly consists of two parts, BlockId and FileId; BlockId denotes a storage unit, from which the NameNode can determine the nearest DataNode address; FileId is the Id of the SequenceFile into which the small picture was spliced; offset is the offset of the corresponding key; after receiving a client request, the HDFS front end first parses the file name and uses the information in it to locate the corresponding Block file, FileId, and offset, and the client then reads the picture directly; once the file name has been parsed, the DataNode data can be read directly, and the offset gives the start position of the picture.
6. The client initiates an access request with the picture name and creation time as parameters; the NameNode computes the minute-level time period in which the picture falls and the Blocks information of the corresponding merged file, and returns them to the client; the client sends a picture read request to the nearest DataNode; the DataNode computes the picture's specific address.
CN201510908363.XA 2015-12-08 2015-12-08 Method for fast retrieval of massive picture collections based on the Hadoop platform Pending CN106855872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510908363.XA CN106855872A (en) 2015-12-08 2015-12-08 Method for fast retrieval of massive picture collections based on the Hadoop platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510908363.XA CN106855872A (en) 2015-12-08 2015-12-08 Method for fast retrieval of massive picture collections based on the Hadoop platform

Publications (1)

Publication Number Publication Date
CN106855872A true CN106855872A (en) 2017-06-16

Family

ID=59133083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510908363.XA Pending CN106855872A (en) 2015-12-08 2015-12-08 Method for fast retrieval of massive picture collections based on the Hadoop platform

Country Status (1)

Country Link
CN (1) CN106855872A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107800808A (en) * 2017-11-15 2018-03-13 广东奥飞数据科技股份有限公司 A kind of data-storage system based on Hadoop framework
CN108647290A (en) * 2018-05-06 2018-10-12 深圳市保千里电子有限公司 Internet cell phone cloud photograph album backup querying method based on HBase and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116643A (en) * 2013-02-25 2013-05-22 江苏物联网研究发展中心 Hadoop-based intelligent medical data management method
CN103500089A (en) * 2013-09-18 2014-01-08 北京航空航天大学 Small file storage system suitable for Mapreduce calculation model
CN103559229A (en) * 2013-10-22 2014-02-05 西安电子科技大学 Small file management service (SFMS) system based on MapFile and use method thereof
US20140215258A1 (en) * 2013-01-31 2014-07-31 International Business Machines Corporation Cluster management in a shared nothing cluster

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
左大鹏: "Hadoop小文件存储管理的研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
张卫东: "基于Hadoop的海量图片云存储系统研究与设计", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
李林: "基于hadoop的海量图片存储模型的分析和设计", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170616