CN104579357A - Method and device for processing a compressed file

Info

Publication number: CN104579357A
Application number: CN201510016383.6A
Authority: CN
Inventors: 袁安峰; 吕信
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2015-01-13
Filing date: 2015-01-13
Publication date: 2015-04-29
Anticipated expiration: 2035-01-13
Also published as: CN104579357B

Abstract

The invention provides a method and device for processing a compressed file that enable Presto to support the LZO compression format. The method comprises: when a Presto server starts, importing the Hadoop-lzo plug-in as a third-party plug-in; and, when the Presto server reads a file in the LZO compression format, using the Hadoop-lzo plug-in to process that file.

Description

Method and device for processing a compressed file

Technical field

The present invention relates to the field of computer technology, and in particular to a method and device for processing a compressed file.

Background art

In the big data field, data compression is a very important technology. Storing data in compressed form saves server storage, improves data-processing efficiency, reduces memory and disk I/O overhead, and improves the efficiency of SQL queries over massive data sets.

LZO (Lempel-Ziv-Oberhumer) is a data compression algorithm focused on decompression speed. The algorithm is lossless, and its reference implementation is thread-safe. Its characteristics include simple and very fast decompression, decompression that requires no additional memory, and fairly fast compression.

Presto and LZO are released under different open-source licenses: Presto follows the Apache License 2.0 (a free software license issued by the Apache Software Foundation), whereas LZO follows the GPL (General Public License, a widely used free software license). Presto therefore cannot integrate the LZO source code into its own source code in order to support the LZO compression format.

Summary of the invention

In view of this, the invention provides a method and device for processing a compressed file that enable Presto to support the LZO compression format.

To achieve the above object, according to one aspect of the invention, a method for processing a compressed file is provided.

The method of the invention for processing a compressed file comprises: when the Presto server starts, importing the Hadoop-lzo plug-in as a third-party plug-in; and, when the Presto server reads a file in the LZO compression format, using the Hadoop-lzo plug-in to process that file.

Optionally, when the Presto server reads a file in the LZO compression format, the method further comprises: determining whether an index file exists for the file and, if so, splitting the file according to the index file to obtain multiple data slices. The step of using the Hadoop-lzo plug-in to process the file then comprises: using the Hadoop-lzo plug-in to process the multiple data slices in parallel.

Optionally, when the Hadoop-lzo plug-in is used to process data in the LZO compression format, an LZO decompression function is called. This LZO decompression function inherits from a general decompression function and overrides it using the interface provided by the Hadoop-lzo plug-in.

According to another aspect of the invention, a device for processing a compressed file is provided.

The device of the invention for processing a compressed file comprises: a plug-in import module, configured to import the Hadoop-lzo plug-in as a third-party plug-in when the Presto server starts; and a processing module, configured to use the Hadoop-lzo plug-in to process a file in the LZO compression format when the Presto server reads such a file.

Optionally, the processing module is further configured to: when the Presto server reads a file in the LZO compression format, determine whether an index file exists for the file and, if so, split the file according to the index file to obtain multiple data slices; and use the Hadoop-lzo plug-in to process the multiple data slices in parallel.

Optionally, the processing module is further configured to call an LZO decompression function when the Hadoop-lzo plug-in is used to process data in the LZO compression format. This LZO decompression function inherits from a general decompression function and overrides it using the interface provided by the Hadoop-lzo plug-in.

According to the technical solution of the invention, when the Presto server starts, the Hadoop-lzo plug-in is imported as a third-party plug-in and is used to process files in the LZO compression format. The Hadoop-lzo plug-in provides interfaces for the various operations on LZO-compressed files, so LZO-compressed files can be processed through the public interface of the plug-in, without the license-incompatibility problem that using the LZO source code directly would bring. Presto can thus support the LZO compression format. In addition, by using the LZO index to split an LZO file and process the slices in parallel, data-processing speed can be improved further. If another compression format needs to be supported, it is only necessary to add a new plug-in and override the general decompression function using the interface that the plug-in provides, so the system's functionality is easy to extend.

Brief description of the drawings

The accompanying drawings are provided for a better understanding of the invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic diagram of the relationship among the Presto server, the plug-in interface, and third-party plug-ins according to an embodiment of the invention;

Fig. 2 is a schematic diagram of the basic steps of the method for processing a compressed file according to an embodiment of the invention;

Fig. 3 is a schematic diagram of the main modules of the device for processing a compressed file according to an embodiment of the invention.

Detailed description of the embodiments

Exemplary embodiments of the invention are described below with reference to the accompanying drawings. The description includes various details of the embodiments to aid understanding, and these details should be regarded as merely exemplary. Those of ordinary skill in the art will therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Likewise, descriptions of well-known functions and structures are omitted below for clarity and conciseness.

The embodiments of the invention make use of Presto's support for third-party plug-ins. The basic principle is as follows: the third-party plug-in is added to the Presto server's plug-in set, and when the Presto server starts it dynamically imports the third-party plug-in. At run time, Presto dynamically looks up and binds the third-party plug-in and calls the interface that the plug-in provides, thereby integrating the plug-in's functionality into Presto. In the embodiments of the invention, the Hadoop-lzo plug-in is imported into the Presto plug-in set as a third-party plug-in, and Presto binds this plug-in dynamically. Fig. 1 is a schematic diagram of the relationship among the Presto server, the plug-in interface, and third-party plug-ins according to an embodiment of the invention.
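The register, import-at-startup, and look-up-and-bind sequence described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not Presto's actual plug-in API: `PluginRegistry`, `StandInLzoPlugin`, and all method names are hypothetical, and zlib stands in for LZO because the Python standard library has no LZO codec.

```python
import zlib
from typing import Callable, Dict


class StandInLzoPlugin:
    """Hypothetical stand-in for the Hadoop-lzo plug-in (uses zlib, not LZO)."""

    def decompress(self, data: bytes) -> bytes:
        return zlib.decompress(data)


class PluginRegistry:
    """Hypothetical plug-in set: register, import at startup, bind at run time."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}
        self._loaded: Dict[str, object] = {}

    def register(self, name: str, factory: Callable[[], object]) -> None:
        # Add a third-party plug-in to the server's plug-in set.
        self._factories[name] = factory

    def load_all(self) -> None:
        # Called once at server startup: import every registered plug-in.
        for name, factory in self._factories.items():
            self._loaded[name] = factory()

    def lookup(self, name: str) -> object:
        # Run-time lookup and binding of an already imported plug-in.
        if name not in self._loaded:
            raise LookupError(f"plug-in not loaded: {name}")
        return self._loaded[name]


registry = PluginRegistry()
registry.register("lzo", StandInLzoPlugin)
registry.load_all()                      # server startup
lzo = registry.lookup("lzo")             # dynamic binding
payload = lzo.decompress(zlib.compress(b"presto"))
```

Because the server only ever calls the plug-in through its public interface, the codec's source code never has to be compiled into the server, which is the point of the arrangement described above.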

Fig. 2 is a schematic diagram of the basic steps of the method for processing a compressed file according to an embodiment of the invention. The method is performed by the Presto server.

Step S21: The Presto server starts.

Step S22: The Hadoop-lzo plug-in is imported as a third-party plug-in.

Step S23: Data is read from the data source.

Step S24: It is determined whether the data read is a file in the LZO compression format. If so, the flow proceeds to step S25; otherwise it proceeds to step S26.

Step S25: The Hadoop-lzo plug-in is used to process the file in the LZO compression format.

Step S26: The file is processed according to its format.
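The flow of steps S21 to S26 can be sketched as a simple dispatch. This is an illustrative Python sketch only: the source does not say how the format check in S24 is made, so detecting LZO by the `.lzo` file suffix is an assumption, as are the names `start_server`, `read_and_process`, and `StandInLzoPlugin`. zlib again stands in for LZO.

```python
import zlib
from typing import Dict


class StandInLzoPlugin:
    """Hypothetical stand-in for the Hadoop-lzo plug-in (uses zlib, not LZO)."""

    def process(self, raw: bytes) -> bytes:
        return zlib.decompress(raw)


def start_server() -> Dict[str, StandInLzoPlugin]:
    # S21 and S22: the server starts and imports the plug-in.
    return {"lzo": StandInLzoPlugin()}


def is_lzo_file(path: str) -> bool:
    # S24: assumed check based on the file suffix.
    return path.endswith(".lzo")


def read_and_process(plugins: Dict[str, StandInLzoPlugin], path: str, raw: bytes) -> bytes:
    # S23: `raw` represents the bytes already read from the data source.
    if is_lzo_file(path):
        # S25: hand the file to the plug-in.
        return plugins["lzo"].process(raw)
    # S26: any other format follows its own path; plain bytes pass through here.
    return raw


plugins = start_server()
from_lzo = read_and_process(plugins, "orders.lzo", zlib.compress(b"id,price"))
from_text = read_and_process(plugins, "orders.txt", b"id,price")
```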

In the above flow, it may also be determined whether an index file exists for the LZO-compressed file that was read. If so, the file is split according to the index file to obtain multiple data slices. When the file is then processed, the Hadoop-lzo plug-in processes these data slices in parallel, which further improves data-processing efficiency.
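A minimal Python sketch of index-based splitting followed by parallel processing is given below. It assumes, for illustration only, that the index is a list of byte offsets at which independently compressed blocks begin; the source does not specify the index layout. zlib stands in for LZO, and `build_blocks`, `split_by_index`, and `process_file` are hypothetical names.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, Tuple


def build_blocks(chunks: Sequence[bytes]) -> Tuple[bytes, List[int]]:
    """Compress each chunk on its own and record where each block starts."""
    data = b""
    index: List[int] = []
    for chunk in chunks:
        index.append(len(data))
        data += zlib.compress(chunk)
    return data, index


def split_by_index(data: bytes, index: Optional[Sequence[int]]) -> List[bytes]:
    """Return one slice per indexed block, or the whole file if there is no index."""
    if not index:
        return [data]
    bounds = list(index) + [len(data)]
    return [data[start:end] for start, end in zip(bounds, bounds[1:])]


def process_file(
    data: bytes,
    index: Optional[Sequence[int]],
    decompress: Callable[[bytes], bytes] = zlib.decompress,
) -> bytes:
    slices = split_by_index(data, index)
    # Each slice is decompressed independently; map() keeps the original order.
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(decompress, slices))


data, index = build_blocks([b"alpha", b"beta", b"gamma"])
restored = process_file(data, index)
```

Without an index the file can only be handled as a single slice, which is why the presence of the index file is checked first.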

When a file in the LZO compression format is processed, LZO compression and decompression functions are called. The LZO decompression function inherits from the general decompression function and overrides it using the interface provided by the Hadoop-lzo plug-in. This approach makes the system easy to extend: when another compression format needs to be supported, it is only necessary to add a new plug-in and override the general decompression function using the interface that the plug-in provides. The description above uses decompression as the example, but it applies equally when data needs to be compressed.
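The inherit-and-override arrangement can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patent's or Presto's actual code: `GeneralDecompressor`, `LzoDecompressor`, `Bz2Decompressor`, and `StandInLzoPlugin` are hypothetical names, zlib stands in for the LZO codec behind the plug-in interface, and bz2 plays the role of "another compression format" added later.

```python
import bz2
import zlib


class GeneralDecompressor:
    """General decompression function that format-specific classes override."""

    def decompress(self, data: bytes) -> bytes:
        raise NotImplementedError("no decompressor for this format")


class StandInLzoPlugin:
    """Hypothetical stand-in for the Hadoop-lzo plug-in interface (zlib, not LZO)."""

    def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


class LzoDecompressor(GeneralDecompressor):
    """Overrides the general function by delegating to the plug-in interface."""

    def __init__(self, plugin: StandInLzoPlugin) -> None:
        self._plugin = plugin

    def decompress(self, data: bytes) -> bytes:
        return self._plugin.decode(data)


class Bz2Decompressor(GeneralDecompressor):
    """Supporting a further format only requires one more override."""

    def decompress(self, data: bytes) -> bytes:
        return bz2.decompress(data)


def read_rows(decompressor: GeneralDecompressor, raw: bytes) -> bytes:
    # The caller depends only on the general interface.
    return decompressor.decompress(raw)


lzo_rows = read_rows(LzoDecompressor(StandInLzoPlugin()), zlib.compress(b"rows"))
bz2_rows = read_rows(Bz2Decompressor(), bz2.compress(b"rows"))
```

The calling code in `read_rows` does not change when a format is added, which is the extensibility property the paragraph above describes.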

Fig. 3 is a schematic diagram of the main modules of the device for processing a compressed file according to an embodiment of the invention. As shown in Fig. 3, the device 30 for processing a compressed file mainly comprises a plug-in import module 31 and a processing module 32. The plug-in import module 31 is configured to import the Hadoop-lzo plug-in as a third-party plug-in when the Presto server starts. The processing module 32 is configured to use the Hadoop-lzo plug-in to process a file in the LZO compression format when the Presto server reads such a file.

The processing module 32 may further be configured to: when the Presto server reads a file in the LZO compression format, determine whether an index file exists for the file and, if so, split the file according to the index file to obtain multiple data slices; and use the Hadoop-lzo plug-in to process the multiple data slices in parallel.

The processing module 32 may further be configured to call an LZO decompression function when the Hadoop-lzo plug-in is used to process data in the LZO compression format. This LZO decompression function inherits from a general decompression function and overrides it using the interface provided by the Hadoop-lzo plug-in.

According to the technical solution of the embodiments of the invention, when the Presto server starts, the Hadoop-lzo plug-in is imported as a third-party plug-in and is used to process files in the LZO compression format. The Hadoop-lzo plug-in provides interfaces for the various operations on LZO-compressed files, so LZO-compressed files can be processed through the public interface of the plug-in, without the license-incompatibility problem that using the LZO source code directly would bring. Presto can thus support the LZO compression format. In addition, by using the LZO index to split an LZO file and process the slices in parallel, data-processing speed can be improved further. If another compression format needs to be supported, it is only necessary to add a new plug-in and override the general decompression function using the interface that the plug-in provides, so the system's functionality is easy to extend.

The basic principles of the invention have been described above in connection with specific embodiments. Clearly, in the device and method of the invention, the components or steps may be decomposed and/or recombined, and such decompositions and/or recombinations should be regarded as equivalents of the invention. Furthermore, the steps of the series of processes described above may naturally be performed in the chronological order in which they are described, but they need not be. Some steps may be performed in parallel or independently of one another.

The specific embodiments above do not limit the scope of protection of the invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions may occur. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A method for processing a compressed file, characterized by comprising:

when a Presto server starts, importing a Hadoop-lzo plug-in as a third-party plug-in; and

when the Presto server reads a file in the LZO compression format, using the Hadoop-lzo plug-in to process the file in the LZO compression format.

2. The method according to claim 1, characterized in that:

when the Presto server reads the file in the LZO compression format, the method further comprises: determining whether an index file exists for the file in the LZO compression format and, if so, splitting the file according to the index file to obtain multiple data slices; and

the step of using the Hadoop-lzo plug-in to process the file in the LZO compression format comprises: using the Hadoop-lzo plug-in to process the multiple data slices in parallel.

3. method according to claim 1 and 2, is characterized in that, when using the data of described Hadoop-lzo plug-in unit process lzo compressed format, calls lzo decompression function; Wherein this lzo decompression function inherits general decompression function, and the interface using Hadoop-lzo plug-in unit to provide rewrites.

4. A device for processing a compressed file, characterized by comprising:

a plug-in import module, configured to import a Hadoop-lzo plug-in as a third-party plug-in when a Presto server starts; and

a processing module, configured to use the Hadoop-lzo plug-in to process a file in the LZO compression format when the Presto server reads the file in the LZO compression format.

5. The device according to claim 4, characterized in that:

the processing module is further configured to: when the Presto server reads the file in the LZO compression format, determine whether an index file exists for the file and, if so, split the file according to the index file to obtain multiple data slices; and use the Hadoop-lzo plug-in to process the multiple data slices in parallel.

6. The device according to claim 4 or 5, characterized in that the processing module is further configured to call an LZO decompression function when the Hadoop-lzo plug-in is used to process data in the LZO compression format, wherein the LZO decompression function inherits from a general decompression function and overrides it using the interface provided by the Hadoop-lzo plug-in.