CN104579357B - Method and apparatus for processing compressed files - Google Patents

Info

Publication number: CN104579357B
Authority: CN (China)
Prior art keywords: lzo, plug, file, hadoop, units
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201510016383.6A
Other languages: Chinese (zh)
Other versions: CN104579357A (en)
Inventors: 袁安峰, 吕信
Current Assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2015-01-13 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2015-01-13
Publication date: 2018-06-22

Events
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201510016383.6A
Publication of CN104579357A
Application granted
Publication of CN104579357B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method and apparatus for processing compressed files that enable Presto to support the LZO compression format. The method of processing compressed files comprises: when a Presto server starts, importing the Hadoop-lzo plug-in as a third-party plug-in; and, when the Presto server reads a file in LZO compression format, processing that file using the Hadoop-lzo plug-in.

Description

Method and apparatus for processing compressed files
Technical field
The present invention relates to the field of computer technology, and in particular to a method and apparatus for processing compressed files.
Background
In the big data field, data compression is a very important technique. Storing massive data in compressed form saves server storage, improves data-processing efficiency, reduces memory and disk I/O overhead, and improves the efficiency of SQL queries over big data.
LZO (Lempel-Ziv-Oberhumer) is a data compression algorithm that focuses on decompression speed. The algorithm is lossless and its reference implementation is thread-safe. Decompression is simple and very fast and requires no additional memory, and compression is also fairly fast.
Presto and LZO are released under different open-source licenses. Presto follows the Apache License 2.0 (a free software license published by the Apache Software Foundation), whereas LZO follows the GPL (General Public License, a widely used free software license). Presto therefore cannot integrate the LZO source code into its own source code to support the LZO compression format.
Summary of the invention
In view of this, the present invention provides a method and apparatus for processing compressed files that enable Presto to support the LZO compression format.
To achieve the above object, according to one aspect of the present invention, a method of processing compressed files is provided.
The method of processing compressed files of the present invention comprises: when a Presto server starts, importing the Hadoop-lzo plug-in as a third-party plug-in; and, when the Presto server reads a file in LZO compression format, processing that file using the Hadoop-lzo plug-in.
Optionally, when the Presto server reads a file in LZO compression format, the method further comprises: determining whether an index file exists for the file and, if so, splitting the file according to the index file to obtain multiple data slices. The step of processing the file using the Hadoop-lzo plug-in then comprises: processing the multiple data slices in parallel using the Hadoop-lzo plug-in.
Optionally, when data in LZO compression format is processed using the Hadoop-lzo plug-in, an LZO decompression function is called, wherein the LZO decompression function inherits a generic decompression function and overrides it using the interface provided by the Hadoop-lzo plug-in.
According to another aspect of the present invention, an apparatus for processing compressed files is provided.
The apparatus for processing compressed files of the present invention comprises: a plug-in import module, configured to import the Hadoop-lzo plug-in as a third-party plug-in when a Presto server starts; and a processing module, configured to process a file in LZO compression format using the Hadoop-lzo plug-in when the Presto server reads such a file.
Optionally, the processing module is further configured to, when the Presto server reads a file in LZO compression format, determine whether an index file exists for the file and, if so, split the file according to the index file to obtain multiple data slices, and to process the multiple data slices in parallel using the Hadoop-lzo plug-in.
Optionally, the processing module is further configured to call an LZO decompression function when data in LZO compression format is processed using the Hadoop-lzo plug-in, wherein the LZO decompression function inherits a generic decompression function and overrides it using the interface provided by the Hadoop-lzo plug-in.
According to the technical solution of the present invention, when the Presto server starts, the Hadoop-lzo plug-in is imported as a third-party plug-in and is used to process files in LZO compression format. The Hadoop-lzo plug-in provides interfaces for the various operations on LZO-compressed files, so LZO-compressed files can be processed through the plug-in's public interfaces without the license incompatibility that would arise from using the LZO source code directly. Presto can thus support the LZO compression format. In addition, using the LZO index to split an LZO file and process the slices in parallel can further increase data-processing speed. If other compression formats need to be supported, it is only necessary to add a new plug-in and override the generic decompression function using the interface that plug-in provides, which makes the system easy to extend.
Brief description of the drawings
The drawings are provided for a better understanding of the present invention and do not constitute an undue limitation on it. In the drawings:
Fig. 1 is a schematic diagram of the relationship among the Presto server, the plug-in interface, and third-party plug-ins according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the basic steps of the method of processing compressed files according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the main modules of the apparatus for processing compressed files according to an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present invention are described below with reference to the drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art will therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Embodiments of the present invention use the support for third-party plug-ins that Presto provides. The basic principle is as follows: a third-party plug-in is placed in the Presto server's plug-in set, and the Presto server imports the third-party plug-ins dynamically when it starts. At run time, Presto dynamically looks up and binds the third-party plug-ins and calls the interfaces they provide, thereby integrating the plug-ins' functionality into Presto. In the embodiments of the present invention, the Hadoop-lzo plug-in is added to the Presto plug-in set as a third-party plug-in, and Presto binds the plug-in dynamically. Fig. 1 is a schematic diagram of the relationship among the Presto server, the plug-in interface, and third-party plug-ins according to an embodiment of the present invention.
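The import-then-bind principle described above can be illustrated with a minimal Python sketch. It is not Presto's plug-in mechanism: the names `PLUGIN_SET`, `load_plugins`, and `decompress_with_plugin` are hypothetical, and the standard-library `zlib` module stands in for the Hadoop-lzo plug-in because no LZO library ships with Python.

```python
import importlib
import zlib

# Hypothetical plug-in set: compression format name -> module to import.
# ASSUMPTION: zlib is only a stand-in for the Hadoop-lzo plug-in.
PLUGIN_SET = {"lzo": "zlib"}


def load_plugins(plugin_set):
    """Import every plug-in module by name, as a server might do at startup."""
    return {fmt: importlib.import_module(module_name)
            for fmt, module_name in plugin_set.items()}


def decompress_with_plugin(plugins, fmt, payload):
    """Look up the plug-in bound to a format at run time and call its interface."""
    plugin = plugins.get(fmt)
    if plugin is None:
        raise LookupError(f"no plug-in registered for format {fmt!r}")
    return plugin.decompress(payload)
```

The server code never references the stand-in module directly; it only knows the format name and the `decompress` interface, which is the property that lets the plug-in be swapped without touching the host's source.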
Fig. 2 is a schematic diagram of the basic steps of the method of processing compressed files according to an embodiment of the present invention. The method is performed by the Presto server.
Step S21: The Presto server starts.
Step S22: The Hadoop-lzo plug-in is imported as a third-party plug-in.
Step S23: Data is read from the data source.
Step S24: Determine whether the data read is a file in LZO compression format; if so, go to step S25; otherwise, go to step S26.
Step S25: Process the file in LZO compression format using the Hadoop-lzo plug-in.
Step S26: Process the file according to its format.
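Steps S21 to S26 can be sketched as a small dispatch routine. This is a sketch under stated assumptions rather than the patented implementation: the function names are hypothetical, the LZO check is assumed to be a `.lzo` file-extension test (the source does not say how the format is detected), `zlib.decompress` stands in for the Hadoop-lzo plug-in, and step S26 is reduced to passing the data through unchanged.

```python
import zlib


def is_lzo_file(name):
    # S24. ASSUMPTION: the format is recognised by the ".lzo" extension.
    return name.endswith(".lzo")


def process_file(name, payload, lzo_plugin):
    if is_lzo_file(name):
        return lzo_plugin(payload)   # S25: delegate to the plug-in
    return payload                   # S26: placeholder for format-specific handling


def run_server(files, lzo_plugin=zlib.decompress):
    # S21/S22: startup and plug-in import are represented by receiving lzo_plugin.
    # S23: "files" plays the role of the data source (name -> raw bytes).
    return {name: process_file(name, payload, lzo_plugin)
            for name, payload in files.items()}
```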
In the above flow, it is also possible to determine whether an index file exists for the LZO-compressed file that was read. If so, the file is split according to the index file to obtain multiple data slices. When the file is then processed, the Hadoop-lzo plug-in processes these data slices in parallel, which further improves data-processing efficiency.
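The index-based splitting can be sketched as follows. Several things here are assumptions made only for illustration: the index is taken to be a sequence of 8-byte big-endian block start offsets, each block is independently compressed, and `zlib` again stands in for LZO. The helper names (`build_blocks`, `split_by_index`, `process`) are hypothetical.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def build_blocks(chunks):
    """Toy writer: compress each chunk on its own and record its start offset."""
    data, index = bytearray(), bytearray()
    for chunk in chunks:
        index += struct.pack(">q", len(data))  # offset where this block starts
        data += zlib.compress(chunk)
    return bytes(data), bytes(index)


def split_by_index(data, index):
    """Cut the compressed file into data slices at the offsets from the index."""
    offsets = [off for (off,) in struct.iter_unpack(">q", index)]
    bounds = offsets + [len(data)]
    return [data[start:end] for start, end in zip(bounds, bounds[1:])]


def decompress_sequential(data):
    """Fallback without an index: walk the blocks one after another."""
    out = []
    while data:
        d = zlib.decompressobj()
        out.append(d.decompress(data) + d.flush())
        data = d.unused_data  # bytes that belong to the following blocks
    return b"".join(out)


def process(data, index=None):
    """Use parallel slices when an index exists, otherwise read sequentially."""
    if not index:
        return decompress_sequential(data)
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(zlib.decompress, split_by_index(data, index)))
```

Without the index, a reader has to decompress from the start to find where each block ends. With it, every slice boundary is known up front, so the slices can be handed to independent workers.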
When a file in LZO compression format is processed, LZO compression or decompression functions are called. The LZO compression or decompression function inherits the generic compression or decompression function and overrides it using the interface provided by the Hadoop-lzo plug-in. This approach makes the system easy to extend: when another compression format needs to be supported, it is only necessary to add a new plug-in and override the generic decompression function using the interface that plug-in provides. The description above uses decompression as the example, but it applies equally to cases where data needs to be compressed.
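The inherit-and-override pattern can be sketched with a small class hierarchy. The class names are hypothetical and do not come from Presto or Hadoop-lzo. As an assumption for the sake of a runnable example, `zlib.decompress` stands in for the interface the Hadoop-lzo plug-in would provide, and `gzip` plays the role of a second format added later.

```python
import gzip
import zlib


class GenericDecompressor:
    """Generic decompression function: treats the payload as uncompressed."""

    def decompress(self, payload: bytes) -> bytes:
        return payload


class LzoDecompressor(GenericDecompressor):
    """Inherits the generic function and overrides it via the plug-in interface."""

    def __init__(self, plugin_decompress=zlib.decompress):
        # ASSUMPTION: zlib.decompress is a stand-in for the LZO plug-in call.
        self._plugin_decompress = plugin_decompress

    def decompress(self, payload: bytes) -> bytes:
        return self._plugin_decompress(payload)


class GzipDecompressor(GenericDecompressor):
    """Supporting another format only requires one more subclass."""

    def decompress(self, payload: bytes) -> bytes:
        return gzip.decompress(payload)


def read_payload(decompressor: GenericDecompressor, payload: bytes) -> bytes:
    # The caller depends only on the generic interface.
    return decompressor.decompress(payload)
```

Because `read_payload` is written against the generic type, adding a format changes no existing caller, which matches the extensibility argument made in the text.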
Fig. 3 is a schematic diagram of the main modules of the apparatus for processing compressed files according to an embodiment of the present invention. As shown in Fig. 3, the apparatus 30 for processing compressed files mainly comprises a plug-in import module 31 and a processing module 32. The plug-in import module 31 is configured to import the Hadoop-lzo plug-in as a third-party plug-in when the Presto server starts. The processing module 32 is configured to process a file in LZO compression format using the Hadoop-lzo plug-in when the Presto server reads such a file.
The processing module 32 can further be configured to, when the Presto server reads a file in LZO compression format, determine whether an index file exists for the file and, if so, split the file according to the index file to obtain multiple data slices, and to process the multiple data slices in parallel using the Hadoop-lzo plug-in.
The processing module 32 can also be configured to call an LZO decompression function when data in LZO compression format is processed using the Hadoop-lzo plug-in, wherein the LZO decompression function inherits a generic decompression function and overrides it using the interface provided by the Hadoop-lzo plug-in.
According to the technical solution of the embodiments of the present invention, when the Presto server starts, the Hadoop-lzo plug-in is imported as a third-party plug-in and is used to process files in LZO compression format. The Hadoop-lzo plug-in provides interfaces for the various operations on LZO-compressed files, so LZO-compressed files can be processed through the plug-in's public interfaces without the license incompatibility that would arise from using the LZO source code directly. Presto can thus support the LZO compression format. In addition, using the LZO index to split an LZO file and process the slices in parallel can further increase data-processing speed. If other compression formats need to be supported, it is only necessary to add a new plug-in and override the generic decompression function using the interface that plug-in provides, which makes the system easy to extend.
The basic principle of the present invention has been described above with reference to specific embodiments. In the apparatus and method of the present invention, each component or step can obviously be decomposed and/or recombined, and such decompositions and/or recombinations should be regarded as equivalent solutions of the present invention. Moreover, the steps of the series of processes described above can naturally be performed in the chronological order in which they are described, but they need not be performed in that order; some steps can be performed in parallel or independently of one another.
The specific embodiments above do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions can occur. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (2)

  1. A method of processing compressed files, characterized by comprising:
    when a Presto server starts, importing the Hadoop-lzo plug-in as a third-party plug-in;
    when the Presto server reads a file in LZO compression format, processing the file in LZO compression format using the Hadoop-lzo plug-in;
    when data in LZO compression format is processed using the Hadoop-lzo plug-in, calling an LZO compression or decompression function, wherein the LZO compression or decompression function inherits a generic compression or decompression function and overrides it using the interface provided by the Hadoop-lzo plug-in; and, when other compression formats need to be supported, adding a new plug-in and overriding the generic compression or decompression function using the interface provided by that plug-in;
    wherein, when the Presto server reads a file in LZO compression format, the method further comprises: determining whether an index file exists for the file in LZO compression format and, if so, splitting the file in LZO compression format according to the index file to obtain multiple data slices; and
    the step of processing the file in LZO compression format using the Hadoop-lzo plug-in comprises: processing the multiple data slices in parallel using the Hadoop-lzo plug-in.
  2. An apparatus for processing compressed files, characterized by comprising:
    a plug-in import module, configured to import the Hadoop-lzo plug-in as a third-party plug-in when a Presto server starts;
    a processing module, configured to process a file in LZO compression format using the Hadoop-lzo plug-in when the Presto server reads such a file; the processing module is further configured to call an LZO compression or decompression function when data in LZO compression format is processed using the Hadoop-lzo plug-in, wherein the LZO compression or decompression function inherits a generic compression or decompression function and overrides it using the interface provided by the Hadoop-lzo plug-in; and, when other compression formats need to be supported, a new plug-in is added and the generic compression or decompression function is overridden using the interface provided by that plug-in;
    the processing module is further configured to, when the Presto server reads a file in LZO compression format, determine whether an index file exists for the file in LZO compression format and, if so, split the file in LZO compression format according to the index file to obtain multiple data slices, and to process the multiple data slices in parallel using the Hadoop-lzo plug-in.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510016383.6A CN104579357B (en) 2015-01-13 2015-01-13 Method and apparatus for processing compressed files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510016383.6A CN104579357B (en) 2015-01-13 2015-01-13 Method and apparatus for processing compressed files

Publications (2)

Publication Number Publication Date
CN104579357A CN104579357A (en) 2015-04-29
CN104579357B true CN104579357B (en) 2018-06-22

Family

ID=53094688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510016383.6A Active CN104579357B (en) 2015-01-13 2015-01-13 Method and apparatus for processing compressed files

Country Status (1)

Country Link
CN (1) CN104579357B (en)

Also Published As

Publication number Publication date
CN104579357A (en) 2015-04-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant