CN103473121A - Mass image parallel processing method based on cloud computing platform - Google Patents

Mass image parallel processing method based on cloud computing platform

Info

Publication number
CN103473121A
Authority
CN
China
Prior art keywords
picture
cloud computing
node
parallel processing
computing platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013103650914A
Other languages
Chinese (zh)
Inventor
张亮
沈沛意
宋娟
董洛兵
王剑
胡正川
孙庚泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN2013103650914A priority Critical patent/CN103473121A/en
Publication of CN103473121A publication Critical patent/CN103473121A/en
Pending legal-status Critical Current

Abstract

The invention discloses a method for parallel processing of massive images based on a cloud computing platform. The method rests on a distributed parallel computing model for cloud computing comprising a single master node and multiple task nodes, wherein the master node maps the pictures to be processed to different task nodes according to an allocation strategy, and the task nodes process the pictures in parallel using the processing code. With this method, a cloud computing platform can rapidly process massive picture files, or a single large picture file, in parallel, improving the response speed and throughput of image analysis and processing.

Description

Method for parallel processing of massive images based on a cloud computing platform
Technical field
The present invention relates to the fields of cloud computing and image processing, and in particular to a method for parallel processing of massive images based on a cloud computing platform.
Background technology
Cloud computing (Cloud Computing) is the product of the convergence and evolution of traditional computing technologies such as grid computing (Grid Computing), distributed computing (Distributed Computing), parallel computing (Parallel Computing), utility computing (Utility Computing), network storage (Network Storage Technologies), virtualization (Virtualization), and load balancing (Load Balance). It is a computing paradigm that delivers dynamically scalable virtual resources as a service over the Internet; the growth, use, and delivery of its Internet-based services usually involve providing dynamically scalable and often virtualized resources through the Internet.
With cloud computing flourishing today, the processing of massive images has become a popular and worthwhile research field. It concerns how to store and process massive images effectively, where the images range by size from small pictures of a few KB to large pictures of several GB. To process massive images of such different sizes efficiently, different processing approaches are required.
In China, in the field of storing and processing massive small images, Taobao's TFS file system has been optimized specifically for storing and processing small pictures, enabling Taobao to cope with reads of the massive small pictures of its merchandise.
Abroad, Facebook is also a leading player in storing and processing massive images: its architecture stores and processes the pictures uploaded by hundreds of millions of users worldwide and serves the corresponding highly concurrent user requests.
Although Taobao and Facebook both target the handling of massive images, their main goal is efficient image data access rather than running algorithmic processing and analysis over massive images, and research on processing GB-scale large pictures is at present still relatively scarce.
Summary of the invention
The present invention proposes a method for parallel processing of massive images based on a cloud computing platform. Its main purpose is to use the parallel processing model of the cloud computing platform to analyze in parallel both massive small pictures (what counts as "small" depends on the algorithm, the processing capacity of a single node, and so on) and single large picture files (typically several GB or tens of GB), improving the response speed and throughput of image analysis.
A method for parallel processing of massive images based on a cloud computing platform, which rests on a distributed parallel computing model for cloud computing comprising a data storage server, one master node, and multiple task nodes, and which comprises the following steps:
1) the client forwards the allocation strategy, the processing code, and information about the pending pictures stored in the data storage server to the master node;
2) according to the received picture information, the master node traverses the pictures in the data storage server, obtains the location information corresponding to each picture, packs the location information into packets according to the allocation strategy, and then sends the packets and the processing code to the task nodes;
3) each task node reads the picture location information in its packet, reads the corresponding pictures from the data storage server according to that information, and processes the pictures in parallel according to the processing code;
4) each task node stores the processed pictures in the data storage server and feeds the processing status of the pictures back to the master node.
Throughout the processing, the master node acts as a control node managing the task nodes: it distributes the massive pictures to be processed to the different task nodes and then waits for each task node to finish. In this way the processing of massive pictures is efficiently parallelized; the more task nodes there are, the faster the processing.
The master node uses different distribution strategies for large pictures and small pictures. When the pictures are massive small pictures of a few MB or even tens of MB each, each packet contains the locations of the corresponding small pictures in the data storage server. Several such small pictures can be packed together and distributed to one task node for batch processing; the distribution strategy for small pictures mainly depends on the number of task nodes, the processing capacity of a single task node, and the nodes on which the pictures are stored.
When the picture is a single large picture larger than 100 MB, a single task node runs the algorithm on it slowly, or its computing resources may not meet the computational demands of the large picture at all. The master node therefore splits the GB-scale picture into slices of a certain size according to the characteristics of the picture itself, the computing power of a single task node, algorithmic constraints, and so on, and distributes the different slices to different task nodes for processing. The master node waits for the task nodes to return the results for the picture slices and then splices them back together.
The master node splits the large picture into several small pictures and orders them. Each packet contains the location of the large picture and the segmentation information of each small picture; the segmentation information comprises the slice ID, offset, and length of the corresponding small picture. The master node splits the pending large picture according to the number of task nodes and the corresponding picture segmentation strategy, then packages the picture location information and segmentation information and distributes the packets to the task nodes; each task node fetches its assigned picture segment from the data storage server and processes it.
After the large picture has been split, the task nodes process the slices in parallel; once they have finished, the master node splices the processed small pictures back into the corresponding large picture according to the ordering information.
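As a concrete illustration of the segmentation information described above (slice ID, offset, and length), the following is a minimal Java sketch; the class names, the fixed slice size, and the planning routine are assumptions for illustration, not something the patent specifies.

```java
// Illustrative sketch only: a slice descriptor with the fields named in the text
// (slice ID, offset, length) and a simple fixed-size planner. Names are assumed.
import java.util.ArrayList;
import java.util.List;

class Shard {
    final int id;        // slice ID, used later to splice the picture back together
    final long offset;   // byte offset of the slice within the large picture file
    final long length;   // number of bytes in this slice

    Shard(int id, long offset, long length) {
        this.id = id;
        this.offset = offset;
        this.length = length;
    }
}

class ShardPlanner {
    /** Split a picture of fileLength bytes into slices of at most shardSize bytes. */
    static List<Shard> plan(long fileLength, long shardSize) {
        List<Shard> shards = new ArrayList<>();
        int id = 0;
        for (long offset = 0; offset < fileLength; offset += shardSize) {
            long length = Math.min(shardSize, fileLength - offset);
            shards.add(new Shard(id++, offset, length));
        }
        return shards;
    }
}
```

In practice the slice size would also reflect the algorithmic constraints and per-node computing power mentioned above rather than a single constant.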
Different parallel processing approaches can be chosen depending on the purpose of the image processing: if there is no content dependence between pixels, a single picture file can be sliced and the slices processed in parallel; if there is content dependence between pixels, slicing must treat the whole file as the unit of processing.
If the picture is in a compressed format, the master node must first decompress it in step 2). Only files without compression coding, such as BMP, are supported directly as the pending picture format; compressed formats such as JPEG must be decompressed before processing.
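A minimal sketch of such a decompression step, assuming the compressed picture is available as an InputStream and using the standard ImageIO decoder purely for illustration; the patent does not name a specific decoder, and the class name is an assumption.

```java
// Sketch of decoding a compressed picture (e.g. JPEG) into uncompressed pixels.
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;

class PictureDecoder {
    /** Decode a compressed picture into raw pixels before algorithmic processing. */
    static BufferedImage decode(InputStream compressed) throws IOException {
        BufferedImage image = ImageIO.read(compressed);   // decompress to a pixel buffer
        if (image == null) {
            throw new IOException("unsupported or corrupt picture format");
        }
        return image;
    }
}
```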
After processing its pictures, each task node generates a corresponding log file and stores it in the data storage server.
The processing procedures of the present invention for small pictures and for a large picture are as follows:
1. Processing procedure for massive small pictures
This procedure is based on a cloud computing platform with multiple nodes, one host set as the master node and the others as task nodes. The master node is responsible for distributing the massive picture tasks: according to constraints such as the storage node of each pending picture and the processing capacity of each task node, it packs the pictures into packets, each containing the storage locations of its pictures, and transfers the packets to the task nodes for processing. A task node extracts the pending picture information from its packet, reads the picture data from the data storage server, runs the specified image analysis algorithm on each picture, and feeds the results back to the master node. After distributing the tasks, the master node waits for and collects the results from all task nodes.
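The packing step can be sketched as follows; the real strategy described above also weighs storage locality and per-node computing power, so this round-robin version, including its class and method names, is an illustrative assumption only.

```java
// Rough sketch: group picture paths into packets, one packet per task node.
import java.util.ArrayList;
import java.util.List;

class PicturePacker {
    static List<List<String>> pack(List<String> picturePaths, int taskNodeCount) {
        List<List<String>> packets = new ArrayList<>();
        for (int i = 0; i < taskNodeCount; i++) {
            packets.add(new ArrayList<>());
        }
        for (int i = 0; i < picturePaths.size(); i++) {
            // round-robin assignment; a real strategy would also consider data locality
            packets.get(i % taskNodeCount).add(picturePaths.get(i));
        }
        return packets;
    }
}
```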
2. Processing procedure for a large picture
This procedure is likewise based on a cloud platform with multiple hosts, one set as the master node and the others as task nodes. It differs from the small-picture procedure in that a single task node analyzing a large picture file is severely limited both in time efficiency and in throughput, so the large picture must be cut up. The master node designs a concrete cutting strategy according to the actual processing demand and cuts the large picture into <key, value> pairs, where the key is the offset of a picture fragment within the whole picture and the value is the length of that fragment. The master node distributes the picture's storage location and the corresponding <key, value> pairs to different task nodes, assigning each fragment a slice ID at distribution time. Each task node reads its picture fragment according to the storage location and <key, value> information it received, runs the algorithm on that fragment, and feeds the result back to the master node, which integrates the results according to the slice IDs.
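Reading one picture fragment on a task node, given the storage location and an (offset, length) pair, might look like the following sketch using the HDFS FileSystem API; the helper name and the way the configuration is obtained are assumptions.

```java
// Sketch: read one fragment of a large picture stored in HDFS at (offset, length).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class FragmentReader {
    static byte[] readFragment(String hdfsPath, long offset, int length) throws IOException {
        byte[] buffer = new byte[length];
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path(hdfsPath))) {
            in.seek(offset);       // jump to the fragment's start position (the key)
            in.readFully(buffer);  // read exactly the fragment's length (the value)
        }
        return buffer;
    }
}
```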
The present invention uses the parallel processing model of a cloud computing platform to analyze massive small pictures and single large picture files in parallel, improving the response speed and throughput of image analysis.
Brief description of the drawings
Fig. 1 is a schematic diagram of the MapReduce computation of the present invention.
Fig. 2 is a flow chart of the present invention processing massive small pictures.
Fig. 3 is a flow chart of the present invention processing a large picture.
Embodiments
The present invention provides a method for parallel processing of massive images based on a cloud computing platform. The concrete cloud platform is built with the open-source Hadoop MapReduce framework, a distributed computing framework deployed across multiple host nodes. One node in this framework serves as the master node and the remaining nodes serve as task nodes; the master node also acts as the task-assigning JobTracker, and each task node also acts as a task-executing TaskTracker.
The data of the Hadoop cloud platform are stored on HDFS, a distributed file system that serves as the data storage server referred to in the present invention; this file system is a storage solution providing reliable storage of large data sets and high-throughput reads. The Hadoop MapReduce distributed computing framework is built on top of the HDFS distributed file system, so in order to process massive pictures in parallel, the massive pictures must first be stored in distributed HDFS.
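As a hedged sketch of this preliminary step, the following helper copies a local directory of pictures into HDFS before any job runs; the directory layout and class name are assumptions for illustration.

```java
// Sketch: upload a local picture directory into HDFS ahead of the MapReduce job.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class PictureUploader {
    static void upload(String localDir, String hdfsDir) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // copy the whole local picture directory into the HDFS input directory
        fs.copyFromLocalFile(new Path(localDir), new Path(hdfsDir));
    }
}
```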
As shown in Fig. 1, the Hadoop MapReduce programming model cuts the large data set into input splits and maps each split to a different map node. Each map node processes its input data and generates <key, value> pairs, then routes the generated pairs to different reduce nodes according to their keys, most commonly by the hash of the key. Each reduce node merges the pairs with the same key output by the different map nodes and finally produces the result. In short, the Hadoop MapReduce programming model decomposes a task over large data into small tasks handled by different map nodes and integrates the results through the reduce nodes, turning what would otherwise be a serial task into one executed in parallel.
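To make the model concrete, here is a minimal map/reduce pair in the spirit of the description above: the mapper emits <key, value> pairs, Hadoop routes them to reducers by the hash of the key, and the reducer merges the values sharing a key. Counting pictures per file format is an illustrative choice of task, not something the patent prescribes.

```java
// Minimal illustrative MapReduce pair: count pictures per file format.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class FormatCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        // assume each input line is a picture path; use its extension as the key
        String path = line.toString().trim();
        int dot = path.lastIndexOf('.');
        String format = dot >= 0 ? path.substring(dot + 1).toLowerCase() : "unknown";
        context.write(new Text(format), ONE);
    }
}

class FormatCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text format, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();   // merge all values that share this key
        }
        context.write(format, new IntWritable(total));
    }
}
```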
Figures 2 and 3 show the picture processing flow. The whole processing framework comprises four components: the client, the master node, the task nodes, and HDFS file storage. The client is an important component because, for different processing tasks, it sets different parameters and a different concrete processing algorithm. The client submits a packaged pending task and its parameters to the master node; the task is a jar package containing the corresponding processing code. The master node runs the task allocation code in the jar package, distributes the pending tasks to the designated task nodes according to the task allocation strategy, and sends the jar package to those task nodes. Each task node executes the task processing code in the jar package, reads the corresponding data from HDFS, processes it, and saves the result into HDFS files. In the whole flow, the most important parts are therefore the design of the client's task allocation strategy and of the algorithm-specific processing, map, and reduce functions.
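The client-side submission described above could be sketched roughly as the following Hadoop job driver, reusing the illustrative mapper and reducer from the previous sketch; the class names and command-line arguments are assumptions.

```java
// Sketch of the client's job submission: package the processing code into a job
// and point it at HDFS input/output directories.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PictureJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "mass picture processing");
        job.setJarByClass(PictureJobDriver.class);   // the submitted jar carries the processing code
        job.setMapperClass(FormatCountMapper.class);
        job.setReducerClass(FormatCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS results directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```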
The processing of massive small pictures and the processing of a large picture are described separately below.
The processing procedure for massive small pictures is shown in Fig. 2:
1. The client submits to the master node a request to process all pictures under a specified directory; the request includes the jar package of the client project, which contains the master-node allocation strategy, the Map processing code, and the Reduce processing code, and this code can be executed on each map and reduce task node;
2. The master node queries the pending picture information in the HDFS file system and traverses the picture files in the pending directory; according to the storage node of each picture (i.e. the picture's location in the HDFS file system) and the computing power of each task node, it packs the files in the directory into packets such that the delay each task node incurs when reading picture data is reduced as far as possible, each packet storing the location of each of its pictures;
3. The master node maps the different packets, together with the jar package, to different Map task nodes; after obtaining its packet, each Map task node extracts the locations of the pictures to process, fetches the data of each picture from HDFS, and calls the corresponding Map processing function in the jar package to run the algorithm on it (a sketch of such a mapper follows this list);
4. After finishing the task submitted by the master node, each Map task node feeds its result back to the master node; the feedback may be picture information or statistical log information about the pictures, depending on the actual algorithm;
5. After receiving the feedback from all Map task nodes, the master node maps the different results to different Reduce task nodes for subsequent processing;
6. Each Reduce node saves the processed pictures or log files into the HDFS distributed file system and returns the corresponding processing status to the master node.
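A hedged sketch of the Map task node's work in step 3: read each picture path from its packet, fetch the picture from HDFS, and apply a placeholder analysis routine. The path-per-line input format and the analyze helper are assumptions standing in for whatever algorithm the submitted jar actually carries.

```java
// Sketch of a Map task node processing small pictures listed by path.
import java.awt.image.BufferedImage;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class SmallPictureMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text picturePath, Context context)
            throws IOException, InterruptedException {
        Path path = new Path(picturePath.toString().trim());
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        BufferedImage picture = ImageIO.read(fs.open(path));   // fetch the picture from HDFS
        if (picture == null) {
            return;                                             // skip unreadable pictures
        }
        String result = analyze(picture);                       // placeholder for the real algorithm
        context.write(picturePath, new Text(result));           // feed the result back
    }

    /** Placeholder analysis: report the picture's dimensions. */
    private String analyze(BufferedImage picture) {
        return picture.getWidth() + "x" + picture.getHeight();
    }
}
```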
The processing procedure for a large picture is shown in Fig. 3:
1. The client submits to the master node a request to process a specified large picture (several GB or tens of GB); the request includes the jar package of the client project, which contains the master-node allocation strategy, the Map processing code, and the Reduce processing code, and this code can be executed on each map and reduce node;
2. The master node queries the information of this large picture from the HDFS file system and, subject to restrictive conditions such as the picture size the algorithm can actually process and the processing capacity of a single node, cuts the large picture into a series of <key, value> pairs, where the key is the start position of the current file fragment relative to the source file and the value is the fragment's length; in addition, the master node generates a picture slice ID for each key-value pair, to be used by the reduce nodes when recombining the picture;
3. The master node packs the storage location of the pending picture, the corresponding <key, value> information, and the jar package, and maps them to different Map task nodes;
4. After receiving the processing request submitted by the master node, each Map task node reads the corresponding picture fragment from the HDFS file system according to the picture storage location and the fragment position and offset information it obtained, then calls the map function in the jar package to run the algorithm, and returns the result to the master node;
5. After the master node has obtained the results of the different Map task nodes, it maps the different results to different Reduce task nodes;
6. Each Reduce task node recombines the picture according to the slice IDs of the different fragments, writes the recombined picture or log information into the HDFS file system, and returns the processing status to the master node (a sketch of such a reducer follows this list).
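A hedged sketch of step 6 above: a Reduce task node collects the processed fragments of one large picture, orders them by slice ID, and writes the recombined picture back to HDFS. The key/value layout (picture name mapped to slice-ID-tagged bytes) and the output path are assumptions for illustration, not the patent's prescribed format.

```java
// Sketch of recombining processed fragments of one large picture by slice ID.
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class LargePictureReducer extends Reducer<Text, BytesWritable, Text, Text> {
    @Override
    protected void reduce(Text pictureName, Iterable<BytesWritable> fragments, Context context)
            throws IOException, InterruptedException {
        // assumption: each value carries a 4-byte slice ID header followed by fragment data
        Map<Integer, byte[]> ordered = new TreeMap<>();
        for (BytesWritable fragment : fragments) {
            byte[] raw = fragment.copyBytes();
            int sliceId = ((raw[0] & 0xff) << 24) | ((raw[1] & 0xff) << 16)
                        | ((raw[2] & 0xff) << 8) | (raw[3] & 0xff);
            byte[] data = new byte[raw.length - 4];
            System.arraycopy(raw, 4, data, 0, data.length);
            ordered.put(sliceId, data);                 // TreeMap keeps slice order
        }
        Path out = new Path("/results/" + pictureName.toString());
        FileSystem fs = out.getFileSystem(context.getConfiguration());
        try (FSDataOutputStream stream = fs.create(out, true)) {
            for (byte[] data : ordered.values()) {
                stream.write(data);                     // splice fragments back together
            }
        }
        context.write(pictureName, new Text(out.toString()));
    }
}
```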

Claims (8)

1. A method for parallel processing of massive images based on a cloud computing platform, based on a distributed parallel computing model for cloud computing, the model comprising a client, a data storage server, one master node, and multiple task nodes, the interaction mainly comprising the following steps:
1) the client forwards the allocation strategy, the processing code, and information about the pending pictures stored in the data storage server to the master node;
2) according to the received picture information, the master node traverses the pictures in the data storage server, obtains the picture location information corresponding to the picture information, packs the picture location information into packets according to the allocation strategy, and then sends the packets and said processing code to the task nodes;
3) each task node reads the picture location information in the packet it received, then reads the corresponding pictures from the data storage server according to the picture location information, and processes the pictures in parallel according to the processing code;
4) each task node stores the processed pictures in the data storage server and feeds the processing status of the pictures back to the master node.
2. The method for parallel processing of massive images based on a cloud computing platform as claimed in claim 1, characterized in that the pictures comprise massive small pictures, and the packet contains the locations of the corresponding small pictures in the data storage server.
3. The method for parallel processing of massive images based on a cloud computing platform as claimed in claim 1, characterized in that the picture is a single large picture, the large picture being larger than 100 MB.
4. The method for parallel processing of massive images based on a cloud computing platform as claimed in claim 3, characterized in that the master node splits the large picture into several small pictures and orders them, the packet contains the location of the large picture and the segmentation information of each small picture, and the segmentation information comprises the slice ID, offset, and length of the corresponding small picture.
5. The method for parallel processing of massive images based on a cloud computing platform as claimed in claim 4, characterized in that in said step 3) the master node splices the processed small pictures into the corresponding large picture according to the ordering information.
6. The method for parallel processing of massive images based on a cloud computing platform as claimed in claim 1, characterized in that the picture is a file without compression coding.
7. The method for parallel processing of massive images based on a cloud computing platform as claimed in claim 1, characterized in that the picture is in a compressed format, and in step 2) the master node must first decompress the picture.
8. The method for parallel processing of massive images based on a cloud computing platform as claimed in claim 1, characterized in that each task node generates a corresponding log file after processing its pictures and stores it in the data storage server.
CN2013103650914A 2013-08-20 2013-08-20 Mass image parallel processing method based on cloud computing platform Pending CN103473121A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013103650914A CN103473121A (en) 2013-08-20 2013-08-20 Mass image parallel processing method based on cloud computing platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013103650914A CN103473121A (en) 2013-08-20 2013-08-20 Mass image parallel processing method based on cloud computing platform

Publications (1)

Publication Number Publication Date
CN103473121A true CN103473121A (en) 2013-12-25

Family

ID=49797990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013103650914A Pending CN103473121A (en) 2013-08-20 2013-08-20 Mass image parallel processing method based on cloud computing platform

Country Status (1)

Country Link
CN (1) CN103473121A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104079915A (en) * 2014-07-03 2014-10-01 清华大学深圳研究生院 Parallel virtual view point synthetizing method
CN104581149A (en) * 2015-01-27 2015-04-29 北京正奇联讯科技有限公司 Technical review method and system for video and audio files
CN104850555A (en) * 2014-02-14 2015-08-19 阿里巴巴集团控股有限公司 Method and device for extracting standard description information
CN104933445A (en) * 2015-06-26 2015-09-23 电子科技大学 Mass image classification method based on distributed K-means
CN104993952A (en) * 2015-06-19 2015-10-21 成都艾尔普科技有限责任公司 Network user behavior audit and responsibility management system
CN105049232A (en) * 2015-06-19 2015-11-11 成都艾尔普科技有限责任公司 Network information log audit system
CN105095515A (en) * 2015-09-11 2015-11-25 北京金山安全软件有限公司 Bucket dividing method, device and equipment supporting fast query of Map-Reduce output result
CN106155798A (en) * 2016-08-02 2016-11-23 大连文森特软件科技有限公司 The online image conversion programing system calculated based on moving distributing
CN106157260A (en) * 2016-06-27 2016-11-23 彭梅 parallel image smoothing processing platform
CN106649528A (en) * 2016-10-20 2017-05-10 浙江宇视科技有限公司 Picture writing and reading methods and devices
CN107347147A (en) * 2016-06-27 2017-11-14 彭梅 parallel image smoothing processing platform
CN107968786A (en) * 2017-12-05 2018-04-27 北京奇艺世纪科技有限公司 Distributed RPC systems and data processing method and processing device
CN108924205A (en) * 2018-06-25 2018-11-30 北京旷视科技有限公司 Data transmission method, device, electronic equipment, gateway adapter
CN109358944A (en) * 2018-09-17 2019-02-19 深算科技(重庆)有限公司 Deep learning distributed arithmetic method, apparatus, computer equipment and storage medium
CN109408450A (en) * 2018-09-27 2019-03-01 中兴飞流信息科技有限公司 A kind of method of data processing, system, association's processing unit and primary processing unit
CN110378332A (en) * 2019-06-14 2019-10-25 上海咪啰信息科技有限公司 A kind of container terminal case number (CN) and Train number recognition method and system
CN110688211A (en) * 2019-09-24 2020-01-14 四川新网银行股份有限公司 Distributed job scheduling method
CN110769037A (en) * 2019-09-28 2020-02-07 西南电子技术研究所(中国电子科技集团公司第十研究所) Resource allocation method for embedded edge computing platform
CN112486646A (en) * 2020-11-27 2021-03-12 北京明朝万达科技股份有限公司 Job task processing method and device of computing system, storage medium and processor
CN116991560A (en) * 2023-09-25 2023-11-03 粤港澳大湾区数字经济研究院(福田) Parallel scheduling method, device, equipment and storage medium for language model
CN117112157A (en) * 2023-07-04 2023-11-24 中国人民解放军陆军工程大学 General distributed computing system for task based on CLTS scheduling algorithm
CN116991560B (en) * 2023-09-25 2024-04-16 粤港澳大湾区数字经济研究院(福田) Parallel scheduling method, device, equipment and storage medium for language model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070156855A1 (en) * 2005-06-17 2007-07-05 Moses Johnson Channel searching media player
US20100332454A1 (en) * 2009-06-30 2010-12-30 Anand Prahlad Performing data storage operations with a cloud environment, including containerized deduplication, data pruning, and data transfer

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070156855A1 (en) * 2005-06-17 2007-07-05 Moses Johnson Channel searching media player
US20100332454A1 (en) * 2009-06-30 2010-12-30 Anand Prahlad Performing data storage operations with a cloud environment, including containerized deduplication, data pruning, and data transfer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘钊: ""云同步中文件分割算法的研究与实现"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
张良将: ""基于Hadoop云平台的海量数字图像数据挖掘的研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104850555B (en) * 2014-02-14 2018-07-10 阿里巴巴集团控股有限公司 A kind of method and device of extraction standard description information
CN104850555A (en) * 2014-02-14 2015-08-19 阿里巴巴集团控股有限公司 Method and device for extracting standard description information
CN104079915A (en) * 2014-07-03 2014-10-01 清华大学深圳研究生院 Parallel virtual view point synthetizing method
CN104581149A (en) * 2015-01-27 2015-04-29 北京正奇联讯科技有限公司 Technical review method and system for video and audio files
CN104993952A (en) * 2015-06-19 2015-10-21 成都艾尔普科技有限责任公司 Network user behavior audit and responsibility management system
CN105049232A (en) * 2015-06-19 2015-11-11 成都艾尔普科技有限责任公司 Network information log audit system
CN104933445A (en) * 2015-06-26 2015-09-23 电子科技大学 Mass image classification method based on distributed K-means
CN104933445B (en) * 2015-06-26 2019-05-14 电子科技大学 A kind of large nuber of images classification method based on distributed K-means
CN105095515A (en) * 2015-09-11 2015-11-25 北京金山安全软件有限公司 Bucket dividing method, device and equipment supporting fast query of Map-Reduce output result
CN107347147A (en) * 2016-06-27 2017-11-14 彭梅 parallel image smoothing processing platform
CN106157260B (en) * 2016-06-27 2017-12-29 罗普特(厦门)科技集团有限公司 parallel image smoothing processing platform
CN106157260A (en) * 2016-06-27 2016-11-23 彭梅 parallel image smoothing processing platform
CN106155798A (en) * 2016-08-02 2016-11-23 大连文森特软件科技有限公司 The online image conversion programing system calculated based on moving distributing
CN106649528A (en) * 2016-10-20 2017-05-10 浙江宇视科技有限公司 Picture writing and reading methods and devices
CN107968786A (en) * 2017-12-05 2018-04-27 北京奇艺世纪科技有限公司 Distributed RPC systems and data processing method and processing device
CN108924205A (en) * 2018-06-25 2018-11-30 北京旷视科技有限公司 Data transmission method, device, electronic equipment, gateway adapter
CN108924205B (en) * 2018-06-25 2022-09-06 北京旷视科技有限公司 Data transmission method and device, electronic equipment and gatekeeper adapter
CN109358944A (en) * 2018-09-17 2019-02-19 深算科技(重庆)有限公司 Deep learning distributed arithmetic method, apparatus, computer equipment and storage medium
CN109408450A (en) * 2018-09-27 2019-03-01 中兴飞流信息科技有限公司 A kind of method of data processing, system, association's processing unit and primary processing unit
CN109408450B (en) * 2018-09-27 2021-03-30 中兴飞流信息科技有限公司 Data processing method, system, co-processing device and main processing device
CN110378332A (en) * 2019-06-14 2019-10-25 上海咪啰信息科技有限公司 A kind of container terminal case number (CN) and Train number recognition method and system
CN110688211A (en) * 2019-09-24 2020-01-14 四川新网银行股份有限公司 Distributed job scheduling method
CN110769037A (en) * 2019-09-28 2020-02-07 西南电子技术研究所(中国电子科技集团公司第十研究所) Resource allocation method for embedded edge computing platform
CN110769037B (en) * 2019-09-28 2021-12-07 西南电子技术研究所(中国电子科技集团公司第十研究所) Resource allocation method for embedded edge computing platform
CN112486646A (en) * 2020-11-27 2021-03-12 北京明朝万达科技股份有限公司 Job task processing method and device of computing system, storage medium and processor
CN117112157A (en) * 2023-07-04 2023-11-24 中国人民解放军陆军工程大学 General distributed computing system for task based on CLTS scheduling algorithm
CN116991560A (en) * 2023-09-25 2023-11-03 粤港澳大湾区数字经济研究院(福田) Parallel scheduling method, device, equipment and storage medium for language model
CN116991560B (en) * 2023-09-25 2024-04-16 粤港澳大湾区数字经济研究院(福田) Parallel scheduling method, device, equipment and storage medium for language model

Similar Documents

Publication Publication Date Title
CN103473121A (en) Mass image parallel processing method based on cloud computing platform
US10705965B2 (en) Metadata loading in storage systems
US9590917B2 (en) Optimally provisioning and merging shared resources to maximize resource availability
US9053067B2 (en) Distributed data scalable adaptive map-reduce framework
KR102225822B1 (en) Apparatus and method for generating learning data for artificial intelligence performance
CN103810061B (en) A kind of High Availabitity cloud storage method
US20120311589A1 (en) Systems and methods for processing hierarchical data in a map-reduce framework
Sampé et al. Data-driven serverless functions for object storage
KR20200029387A (en) Data aggregation method for cache optimization and efficient processing
Arfat et al. Big data for smart infrastructure design: Opportunities and challenges
Mohamed et al. Accelerating data-intensive genome analysis in the cloud
US20140331078A1 (en) Elastic Space-Based Architecture application system for a cloud computing environment
CN109416688B (en) Method and system for flexible high performance structured data processing
CN103455518A (en) Data processing method and device
CN103577604B (en) A kind of image index structure for Hadoop distributed environments
CN110109751B (en) Distribution method and device of distributed graph cutting tasks and distributed graph cutting system
CN112052011A (en) Method and device for combining small programs, electronic equipment and medium
Tiwary et al. Efficient implementation of apriori algorithm on HDFS using GPU
US11714573B1 (en) Storage optimization in a distributed object store
US11233739B2 (en) Load balancing system and method
Huang et al. Improving speculative execution performance with coworker for cloud computing
CN102750353A (en) Method for analyzing distributed data in key value library
US11023440B1 (en) Scalable distributed data processing and indexing
Dawelbeit et al. A novel cloud based elastic framework for big data preprocessing
Vidhyasagar et al. A Cost-Effective Data Node Management Scheme for Hadoop Clusters in Cloud Environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20131225