CN106855872A

CN106855872A - Method for fast retrieval of massive pictures based on the Hadoop platform

Info

Publication number: CN106855872A
Application number: CN201510908363.XA
Authority: CN
Inventors: 孙玉林; 徐宝华; 贾春朴; 张福元; 陈守森
Original assignee: Shandong Business Institute
Current assignee: Shandong Business Institute
Priority date: 2015-12-08
Filing date: 2015-12-08
Publication date: 2017-06-16

Abstract

The present invention relates to the field of computer big data processing, and specifically to a method for fast retrieval of massive pictures based on the Hadoop platform. Step 1: build a Hadoop cluster platform. Step 2: set a security strategy. Step 3: single-picture storage processing. Step 4: file preprocessing and merging. Step 5: build the picture index. Step 6: the client initiates an access request with the picture name and creation time as parameters; the NameNode computes the minute-level time period in which the picture falls and the Blocks information of the corresponding merged file, and returns them to the client. The invention addresses the excessive NameNode memory consumption and low retrieval efficiency that arise when Hadoop retrieves massive numbers of pictures, effectively reduces the NameNode load during retrieval, and improves NameNode performance, thereby promoting wider application of the Hadoop platform.

Description

Method for fast retrieval of massive pictures based on the Hadoop platform

Technical field

The present invention relates to the field of computer big data processing, and specifically to a method for fast retrieval of massive pictures based on the Hadoop platform.

Background art

With the popularization and widespread use of the Internet, e-commerce platforms and social networks have continued to develop, and the number of pictures used for merchandise display or social sharing has grown explosively. On these e-commerce and social networking sites, pictures convey far more information than text descriptions, so the sites place greater emphasis on picture quality. According to an analysis of Taobao, picture access accounts for more than 91.5% of the traffic of the whole e-commerce platform. Users of Tencent's photo album service upload up to 1.1 billion pictures per week; the total number of pictures is currently nearly 70 billion, with a total capacity of 15 PB. Because massive numbers of pictures consume massive storage space, performance bottlenecks appear in both picture storage and picture retrieval. Faced with such picture resources, how to retrieve them efficiently, and how to build an efficient, low-cost retrieval system under highly concurrent access, has become a problem that urgently needs solving.

Hadoop is a software framework for the distributed processing of massive data, and it is reliable, efficient and scalable. Its reliability lies in the assumption that computing elements and storage can fail: it maintains multiple copies of working data so that processing can be redistributed away from failed nodes. Its efficiency lies in working in parallel, which speeds up processing. Its scalability means that it can handle petabyte-scale data.

Because Hadoop was originally designed for large-scale text data processing, its internal data types are limited and it cannot process picture data directly. In HDFS, files, directories and the like are stored in memory as objects, each occupying about 150 bytes. As the number of pictures grows, memory consumption increases rapidly, and the heavy consumption of NameNode memory seriously limits the application of Hadoop. Meanwhile, retrieving a large number of pictures is much slower than accessing a large file holding the same amount of data.

Summary of the invention

To address the performance bottleneck that arises when retrieving massive numbers of pictures, the present invention proposes a Hadoop-based retrieval method for massive pictures. Small pictures are merged using SequenceFile, and the offset of each picture within the SequenceFile is recorded during merging. By parsing the index, the DataNode storing the picture's Block and the FileId can be located quickly, which solves the problems of capacity expansion and fast retrieval for massive picture data.

To solve the above technical problem, the present invention is achieved through the following technical solution:

Step 1, build the Hadoop cluster platform. An operating system and the Hadoop software are installed on every computer; one computer is configured as the NameNode and the others as DataNodes. The machines communicate directly over SSH. The NameNode is responsible for managing the whole storage layer, while the DataNodes mainly serve as storage nodes. Connectivity between each DataNode and the NameNode is verified by heartbeat detection, and each DataNode also periodically sends its own storage block information to the NameNode. When a client accesses the system, it contacts the NameNode first; the NameNode allocates the corresponding space, and operations begin once that space has been obtained.

Step 2, set the security strategy. A new node, DataNode2, is added to the Hadoop cluster platform as the NameNode backup machine, and the data in the original NameNode is copied to DataNode2. While the NameNode is running, the backup (NameNode2) monitors the running state of the NameNode in real time and applies the NameNode's operations locally as they occur. When the NameNode fails, NameNode2 takes its place to keep the service running normally.
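The backup behaviour described in Step 2 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the class name `StandbyNameNode`, the heartbeat timeout, and the shape of the mirrored namespace are all assumptions made for the example, and time is passed in explicitly to keep the sketch deterministic.

```python
from typing import Dict, List


class StandbyNameNode:
    """Hypothetical backup node that mirrors the primary and takes over on failure."""

    def __init__(self, timeout_s: float = 10.0) -> None:
        self.timeout_s = timeout_s                 # assumed failure-detection timeout
        self.namespace: Dict[str, List[int]] = {}  # mirrored file -> block IDs
        self.last_heartbeat = 0.0
        self.active = False

    def on_heartbeat(self, now: float) -> None:
        """Record that the primary NameNode was seen alive at time `now`."""
        self.last_heartbeat = now

    def apply_operation(self, path: str, blocks: List[int]) -> None:
        """Replay one operation from the primary onto the local copy."""
        self.namespace[path] = list(blocks)

    def check(self, now: float) -> bool:
        """Become active if the primary has been silent longer than the timeout."""
        if not self.active and now - self.last_heartbeat > self.timeout_s:
            self.active = True
        return self.active
```

Because operations are replayed as they arrive, the standby already holds an up-to-date namespace at the moment it takes over.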

Step 3, single-picture storage processing. A picture is first filtered by the load balancing module and enters the application server queue to wait for entry into the HDFS storage system; the NameNode allocates a DataNode to store it. During the write, the target Block is determined first and then the SequenceFile; the combination of the two IDs is used as the picture's name within the system. Picture metadata is stored in HBase and also in a caching system built on Redis. The picture write is then complete.

Step 4, file preprocessing and merging. The picture files under a specified directory are read into a picture array and a byte array is initialized; the corresponding output file stream is then used to write the picture bytes into the merged file under the specified path.
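The merge in Step 4 can be illustrated with a small sketch. It uses an ordinary local file instead of a Hadoop SequenceFile, and the function names `merge_pictures` and `read_picture` and the in-memory index are assumptions for illustration only; the point is that each picture's offset is recorded at the moment its bytes are appended.

```python
from pathlib import Path
from typing import Dict, Tuple


def merge_pictures(src_dir: Path, merged_path: Path) -> Dict[str, Tuple[int, int]]:
    """Append every picture in src_dir to one merged file.

    Returns an index mapping picture name -> (offset, length) in the merged file.
    """
    index: Dict[str, Tuple[int, int]] = {}
    pictures = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    with open(merged_path, "wb") as out:
        for picture in pictures:
            data = picture.read_bytes()                    # picture as a byte array
            index[picture.name] = (out.tell(), len(data))  # offset before writing
            out.write(data)
    return index


def read_picture(merged_path: Path, offset: int, length: int) -> bytes:
    """Read one picture back from the merged file by offset and length."""
    with open(merged_path, "rb") as merged:
        merged.seek(offset)
        return merged.read(length)
```

Storing many small pictures in one file means the NameNode tracks a single file object rather than one object per picture, which is the memory saving the method aims for.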

Step 5, build the picture index. Picture names use a composite encoding consisting mainly of two parts, BlockId and FileId. BlockId denotes a storage unit, from which the NameNode can determine the nearest DataNode address; FileId is the ID the small picture was given when spliced into the SequenceFile; offset is the offset of the corresponding key value. After receiving a client request, the HDFS front end first parses the file name and uses the parsed information to locate the corresponding Block file, FileId and offset, after which the client reads the picture directly. Once the file name has been parsed, the DataNode data can be read directly and the start position of the picture can be located via the offset.
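A composite name of this kind might be encoded and parsed as sketched below. The patent does not give the concrete layout, so the underscore separator, the decimal fields, and the names `PictureKey`, `encode_name` and `parse_name` are hypothetical choices for illustration.

```python
from typing import NamedTuple


class PictureKey(NamedTuple):
    block_id: int  # storage unit used to find the DataNode
    file_id: int   # ID of the picture within the merged SequenceFile
    offset: int    # byte offset of the picture's record


SEP = "_"  # hypothetical separator; the source does not specify one


def encode_name(key: PictureKey) -> str:
    """Build the picture name from its three components."""
    return SEP.join(str(part) for part in key)


def parse_name(name: str) -> PictureKey:
    """Recover BlockId, FileId and offset from a picture name."""
    parts = name.split(SEP)
    if len(parts) != 3:
        raise ValueError(f"malformed picture name: {name!r}")
    return PictureKey(*(int(p) for p in parts))
```

Because the location is carried in the name itself, parsing it is enough to find the Block and the start position, with no per-picture lookup table on the NameNode.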

Step 6, the client initiates an access request with the picture name and creation time as parameters. The NameNode computes the minute-level time period in which the picture falls and the Blocks information of the corresponding merged file, and returns them to the client. The client then sends a picture read request to the nearest DataNode, and the DataNode computes the specific address of the picture.
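The lookup by creation time in Step 6 can be sketched as a mapping from a minute-level bucket to the Blocks of the merged file for that minute. The assumption that one merged file corresponds to one minute, the bucket format, and the names `minute_bucket` and `locate_blocks` are illustrative and not taken from the patent.

```python
from datetime import datetime
from typing import Dict, List


def minute_bucket(created: datetime) -> str:
    """Truncate a creation time to its minute, e.g. 201512081030."""
    return created.strftime("%Y%m%d%H%M")


def locate_blocks(created: datetime,
                  bucket_to_blocks: Dict[str, List[int]]) -> List[int]:
    """Return the Block IDs of the merged file covering the picture's minute."""
    return bucket_to_blocks.get(minute_bucket(created), [])
```

With this layout the NameNode does one dictionary lookup per request instead of searching per-picture entries.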

Compared with the prior art, the beneficial effects of the present invention are as follows: the invention addresses the excessive NameNode memory consumption and low retrieval efficiency that arise when Hadoop retrieves massive numbers of pictures, effectively reduces the NameNode load during retrieval, and improves NameNode performance, thereby promoting wider application of the Hadoop platform.

Brief description of the drawings

Fig. 1 is a flow chart of picture storage.

Fig. 2 is a flow chart of picture retrieval.

Detailed description of the embodiments

Specific embodiments of the invention are given below with reference to Fig. 1 and Fig. 2 to further describe the invention.

Embodiment 1：

First: Deploy the Hadoop cluster. After the operating systems are deployed, check the network to make sure every machine in the cluster can communicate with the others. Install SSH and configure passwordless SSH login. Append the IP-to-hostname mappings to the end of the etc/hosts file and install the Java environment. Append export JAVA_HOME=/usr/jdk1.6.0 to the end of conf/hadoop-env.sh, add testA to the master file, add test1, test2 and test3 to the slaves file, and modify the conf/core-site.xml file.

Second: Install Redis. Download Redis, copy it to the appropriate directory, compile and install it, and start the service.

Third: Install HAProxy. Download HAProxy, copy it to the appropriate directory, then compile and install it.

Fourth: The client first initiates a write request to the NameNode. The request is filtered by the load balancing module and queues at the application server to enter the HDFS storage system. When the request reaches the NameNode, the NameNode selects a writable DataNode and a writable Block according to a weighted average of the writable blocks, capacity and load of the DataNodes, and returns this information to the client.
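One way to read the weighted-average selection is sketched below. The patent does not state the weights or how the three quantities are normalized, so the 0 to 1 scaling, the default weights, and the names `DataNodeStats`, `score` and `choose_datanode` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DataNodeStats:
    name: str
    writable_blocks: float  # share of writable blocks, normalized to 0..1
    free_capacity: float    # share of free capacity, normalized to 0..1
    load: float             # current load, 0..1 (higher means busier)


def score(node: DataNodeStats, w_blocks: float = 0.4,
          w_capacity: float = 0.3, w_load: float = 0.3) -> float:
    """Weighted average in which low load counts in the node's favour."""
    return (w_blocks * node.writable_blocks
            + w_capacity * node.free_capacity
            + w_load * (1.0 - node.load))


def choose_datanode(nodes: Iterable[DataNodeStats]) -> Optional[DataNodeStats]:
    """Pick the best-scoring node that still has a writable block."""
    candidates = [n for n in nodes if n.writable_blocks > 0]
    return max(candidates, key=score, default=None)
```

Nodes with no writable block are excluded before scoring, since a high capacity or low load cannot compensate for having nowhere to write.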

Fifth: From the set of DataNodes returned by the NameNode, the client selects one as the Master. The choice is determined by each DataNode's load and the number of times it has served as Master so far, so that every DataNode has an equal chance of becoming Master. Once a Master is selected, it is not changed unless it goes down; if the Master goes down, a new Master must be selected from the remaining DataNodes.
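The Master selection rule can be sketched as follows. How load and the Master count are combined is not specified in the patent; this sketch assumes the node that has been Master least often wins, with load as the tie-breaker, and the names `Replica`, `elect_master` and `current_master` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    load: float
    times_master: int = 0
    alive: bool = True


def elect_master(replicas: List[Replica]) -> Replica:
    """Choose the live node that has been Master least often, then least loaded."""
    live = [r for r in replicas if r.alive]
    if not live:
        raise RuntimeError("no live DataNode available")
    master = min(live, key=lambda r: (r.times_master, r.load))
    master.times_master += 1
    return master


def current_master(master: Optional[Replica], replicas: List[Replica]) -> Replica:
    """Keep the existing Master unless there is none or it has gone down."""
    if master is not None and master.alive:
        return master
    return elect_master(replicas)
```

Counting how often each node has served as Master spreads the role evenly over time, while `current_master` keeps the Master stable between failures.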

Sixth: The client writes the data to the Master, and the Master then writes it to Slave A and Slave B following the HDFS concurrent data write procedure. When all data writes have finished, the Master reports the Block information to the NameNode. The NameNode receives the Block information and returns a write-complete message.

Seventh: A read request reaches the picture server through load balancing. The request first checks, via the Redis cache module, whether the cache contains the picture information; if not, the picture information is retrieved from HBase and the result is written into the cache.
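This read path is a cache-aside lookup, which can be sketched as below. Plain dictionaries stand in for Redis and HBase so that the example runs on its own; the class name `MetadataReader` and the `store_reads` counter are assumptions added for illustration.

```python
from typing import Dict, Optional


class MetadataReader:
    """Check the cache first; on a miss, read the backing store and fill the cache."""

    def __init__(self, store: Dict[str, dict]) -> None:
        self.cache: Dict[str, dict] = {}  # stand-in for the Redis cache
        self.store = store                # stand-in for the HBase table
        self.store_reads = 0              # counts reads that reached the store

    def get(self, name: str) -> Optional[dict]:
        if name in self.cache:
            return self.cache[name]
        self.store_reads += 1
        meta = self.store.get(name)
        if meta is not None:
            self.cache[name] = meta       # write the result back into the cache
        return meta
```

Repeated requests for the same picture are then served from the cache, so only the first request reaches the slower store.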

Eighth: The request reaches HDFS to read the picture content. The picture name is designed as the BlockId plus the FileId within the Block and the offset; HBase looks up related information such as the picture's name and description according to the picture file name.

Ninth: The NameNode maintains the mapping between Blocks and DataNodes, and determines from the Block parsed out of the request which DataNode holds that Block.

Tenth: After the client obtains the Block from the DataNode address given by the NameNode, it retrieves the picture information according to the FileId.

It will be obvious to those skilled in the art that the invention is not limited to the details of the exemplary embodiments above, and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive; the scope of the invention is defined by the appended claims rather than by the description above, and all changes falling within the meaning and range of equivalency of the claims are intended to be embraced by the invention.

Moreover, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity; those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may be appropriately combined to form other embodiments that those skilled in the art can understand.

Claims

1. Build the Hadoop cluster platform: an operating system and the Hadoop software are installed on every computer; one computer is configured as the NameNode and the others as DataNodes; the machines communicate directly over SSH; the NameNode is responsible for managing the whole storage layer, and the DataNodes mainly serve as storage nodes; connectivity between each DataNode and the NameNode is verified by heartbeat detection, and each DataNode also periodically sends its own storage block information to the NameNode; when a client accesses the system, it contacts the NameNode first, the NameNode allocates the corresponding space, and operations begin once that space has been obtained.

2. Set the security strategy: a new node, DataNode2, is added to the Hadoop cluster platform as the NameNode backup machine, and the data in the original NameNode is copied to DataNode2; while the NameNode is running, the backup (NameNode2) monitors the running state of the NameNode in real time and applies the NameNode's operations locally as they occur; when the NameNode fails, NameNode2 takes its place to keep the service running normally.

3. Single-picture storage processing: a picture is first filtered by the load balancing module and enters the application server queue to wait for entry into the HDFS storage system; the NameNode allocates a DataNode to store it; during the write, the target Block is determined first and then the SequenceFile, and the combination of the two IDs is used as the picture's name within the system; picture metadata is stored in HBase and also in a caching system built on Redis; the picture write is then complete.

4. File preprocessing and merging: the picture files under a specified directory are read into a picture array and a byte array is initialized; the corresponding output file stream is then used to write the picture bytes into the merged file under the specified path.

5. Build the picture index: picture names use a composite encoding consisting mainly of two parts, BlockId and FileId; BlockId denotes a storage unit, from which the NameNode can determine the nearest DataNode address; FileId is the ID the small picture was given when spliced into the SequenceFile; offset is the offset of the corresponding key value; after receiving a client request, the HDFS front end first parses the file name and uses the parsed information to locate the corresponding Block file, FileId and offset, after which the client reads the picture directly; once the file name has been parsed, the DataNode data can be read directly and the start position of the picture can be located via the offset.

6. The client initiates an access request with the picture name and creation time as parameters; the NameNode computes the minute-level time period in which the picture falls and the Blocks information of the corresponding merged file, and returns them to the client; the client sends a picture read request to the nearest DataNode; the DataNode computes the specific address of the picture.