CN107577809A - Offline small file processing method and device - Google Patents

Offline small file processing method and device

Info

Publication number
CN107577809A
CN107577809A (application CN201710888790.5A)
Authority
CN
China
Prior art keywords
data
file
map
column storage
storage file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710888790.5A
Other languages
Chinese (zh)
Inventor
谢永恒
李鑫
火一莽
万月亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd filed Critical Beijing Ruian Technology Co Ltd
Priority to CN201710888790.5A
Publication of CN107577809A
Legal status: Pending

Abstract

An embodiment of the invention discloses an offline small file processing method and device. The method includes: reading columnar storage files from HDFS; providing a configuration that specifies how the data is to be processed; preprocessing and merging the columnar storage files; and, at a specified consolidation frequency, consolidating the preprocessed and merged columnar storage files into large files. The offline small file processing method and device provided by the embodiments of the invention can effectively reduce memory usage on the NameNode of a Hadoop system.

Description

Offline small file processing method and device
Technical field
Embodiments of the present invention relate to the field of distributed computing, and in particular to an offline small file processing method and device.
Background
For offline analysis of big data, streaming data is commonly converted into files in the parquet columnar storage format, and offline analysis is then performed with tools such as Spark SQL. Streaming data ingested in real time can be obtained from Kafka and converted into parquet files on the fly, stored in the HDFS file system, and finally used as the source files for offline analysis tools. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware; it is fault tolerant, offers high throughput, and is particularly suited to applications on large data sets. However, Hadoop was designed primarily for stream-oriented processing: when it must handle a large number of small files far below the block size, its design mechanisms cause response speed to drop sharply, severely affecting performance and even preventing normal operation. Because data is ingested in real time from various vendors, optical splitters, and other sources, the offline-analysis archiving system converts every log message on Kafka into a file in real time, producing an enormous number of small files for downstream products to analyze. These small files, each far smaller than the block size, severely degrade the read performance of Hadoop storage.
Summary of the invention
In view of the above technical problem, embodiments of the present invention provide an offline small file processing method and device, so as to improve the read performance of Hadoop storage and effectively reduce memory usage on the NameNode of a Hadoop system.
In one aspect, an embodiment of the present invention provides an offline small file processing method, run in a Hadoop distributed processing system, the method comprising:
reading columnar storage files from HDFS, where the size of each columnar storage file is below a predetermined file-size threshold;
providing a configuration that specifies how the data is to be processed;
preprocessing and merging the columnar storage files according to the configuration, based on the Map-Reduce computation model; and
based on the Map-Reduce computation model, consolidating the preprocessed and merged columnar storage files into large files at a specified consolidation frequency.
In another aspect, an embodiment of the present invention further provides an offline small file processing device, integrated in a Hadoop distributed processing system, the device comprising:
a reading module for reading columnar storage files from HDFS, where the size of each columnar storage file is below a predetermined file-size threshold;
a configuration module for providing a configuration that specifies how the data is to be processed;
a preprocessing module for preprocessing and merging the columnar storage files according to the configuration, based on the Map-Reduce computation model; and
a consolidation module for consolidating, based on the Map-Reduce computation model, the preprocessed and merged columnar storage files into large files at a specified consolidation frequency.
The offline small file processing method and device provided by the embodiments of the present invention read columnar storage files, provide a configuration for data processing, preprocess and merge the columnar storage files, and consolidate the preprocessed and merged files into large files at a specified consolidation frequency. This greatly improves the read performance of Hadoop storage and effectively reduces memory usage on the NameNode of a Hadoop system.
Brief description of the drawings
Other features, objects, and advantages of the present invention will become more apparent upon reading the following detailed description of non-limiting embodiments made with reference to the accompanying drawings:
Fig. 1 is a flow diagram of the offline small file processing method provided by the first embodiment of the present invention;
Fig. 2 is a schematic diagram of the logical structure of a system running the offline small file processing method provided by the first embodiment of the present invention;
Fig. 3 is a flow diagram of the offline small file processing method provided by the second embodiment of the present invention;
Fig. 4 is a flow diagram of the preprocessing in the offline small file processing method provided by the third embodiment of the present invention;
Fig. 5 is a flow diagram of abandoned-data processing in the offline small file processing method provided by the fourth embodiment of the present invention;
Fig. 6 is a flow diagram of file consolidation in the offline small file processing method provided by the fifth embodiment of the present invention;
Fig. 7 is a schematic diagram of the structure of the offline small file processing device provided by the sixth embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention, not to limit it. It should also be noted that, for convenience of description, the drawings show only the parts related to the present invention rather than the entire structure.
First embodiment
Present embodiments provide a kind of technical scheme of offline small documents processing method.It is offline small in the technical scheme Document handling method is performed by offline small documents processing unit, also, offline small documents processing unit is generally integrated in Hadoop Among distributed data processing system.
Referring to Fig. 1, the offline small file processing method includes:
S11: read columnar storage files from HDFS.
So-called columnar storage is a storage layout in which all values belonging to the same data column are stored together. This layout is especially suitable for online analytical processing (OLAP), because OLAP-style queries usually touch only a few columns: when there are many fields and the whole file contains millions or even billions of rows, columnar storage dramatically reduces the volume of data scanned and speeds up query response.
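As a minimal illustration of why the columnar layout reduces scanned data (the table and field names here are hypothetical, not from the patent), an OLAP-style aggregate only needs to read one column:

```python
# A toy three-column table, stored row-wise and column-wise.
rows = [
    {"user": "a", "bytes": 100, "ts": 1},
    {"user": "b", "bytes": 250, "ts": 2},
    {"user": "a", "bytes": 50,  "ts": 3},
]

# Columnar layout: one list per column.
columns = {
    "user":  [r["user"] for r in rows],
    "bytes": [r["bytes"] for r in rows],
    "ts":    [r["ts"] for r in rows],
}

# An OLAP-style query (SUM(bytes)) scans 1 of 3 columns in the
# columnar layout, but every value of every row in the row layout.
total = sum(columns["bytes"])
values_scanned_columnar = len(columns["bytes"])
values_scanned_row = sum(len(r) for r in rows)
print(total, values_scanned_columnar, values_scanned_row)  # 400 3 9
```

With many fields and billions of rows, the gap between the two scan counts is what makes columnar formats such as parquet attractive for offline analysis.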
The columnar storage files in this embodiment are preferably parquet files. Parquet provides efficient data support for services across the big-data ecosystem and is independent of any particular programming language, data model, or data-processing framework. Moreover, parquet files can be processed by Hadoop, and Hive, Impala, and Spark SQL can directly query parquet-based data warehouses. These properties allow parquet files to replace traditional text file formats.
In this embodiment, the size of each parquet file that is read does not exceed the file-size upper limit; specifically, that upper limit is 128 MB. Parquet files are typically stored on the DataNodes of the Hadoop system.
The system reads the parquet files provided by the source-data production system and adds a UniqueNum field to each record of every data set: according to the configuration, the unique value of each record is computed with the MD5 algorithm over the data set's unique fields, and is later used to judge duplicates. The system also adds FIRST_COLLECT_TIME, LAST_COLLECT_TIME, TORL_FOUND_COUNTER, and TOTAL_DAY_COUNTER fields to each record; during deduplication these hold statistics such as the earliest and latest times a duplicate record was seen, the total number of occurrences, and the cumulative number of days on which it was seen.
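A minimal sketch of the UniqueNum computation described above, assuming the configured unique-key fields are concatenated with a separator before hashing (the field order and separator are assumptions, not specified in the patent):

```python
import hashlib

def unique_num(record: dict, unique_fields: list) -> str:
    """Compute a record's UniqueNum as the MD5 hex digest of its
    configured unique-key fields, joined with a separator."""
    key = "\x1f".join(str(record[f]) for f in unique_fields)
    return hashlib.md5(key.encode("utf-8")).hexdigest()

record = {"phone": "13800000000", "url": "http://example.com/a"}
digest = unique_num(record, ["phone", "url"])
# Identical field values always yield the same UniqueNum,
# which is what makes it usable as a deduplication key.
assert digest == unique_num(dict(record), ["phone", "url"])
```

The digest is stored in the record's UniqueNum field, so the later dedup step only compares fixed-length strings rather than whole records.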
S12: provide a configuration that specifies how the data is to be processed.
Subsequent data-processing steps such as file merging and consolidation must run according to specified rules, and processing parameters are often needed as well. These rules and parameters are supplied in the form of a configuration.
In a practical deployment, the system reads a configuration file that specifies data screening and data uniqueness. Following a configuration template, the user provides settings such as the range of data sets to retain and the unique-key fields of each data set.
S13: according to the configuration, preprocess and merge the columnar storage files based on the Map-Reduce computation model.
In this embodiment, preprocessing and merging mainly refer to the screening, deduplication, and merging of data, together with optimization of the storage directory structure.
Screening means filtering, out of the massive number of small parquet files, the files that require offline analysis. In this embodiment, a small file (SF) is a file whose total size does not reach the configured file-size upper limit; preferably, that limit is 128 MB.
So-called deduplication is the operation of removing repeated data records so that only one copy of each identical record is retained.
Specifically, deduplication is driven by a data-buffer count threshold (DBC) and a data-buffer time threshold (DBT). DBC is the maximum number of records of one category that may be buffered during computation; the system default is 1,000,000. DBT is the maximum time records of one category may stay in the buffer at once; the system default is 5 minutes. Records are cached in memory as Map key-value pairs generated per data set, with the queued data held in a Queue; when the buffer reaches DBC or DBT, deduplication of that category's data begins.
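The DBC/DBT buffering rule can be sketched as follows (a simplified single-category buffer; the flush interface and time source are assumptions for illustration, not the patent's implementation):

```python
import time
from collections import deque

class DedupBuffer:
    """Buffer records of one category; flush them for deduplication
    when either the count threshold (DBC) or the time threshold (DBT)
    is reached."""
    def __init__(self, dbc=1_000_000, dbt_seconds=5 * 60, now=time.monotonic):
        self.dbc, self.dbt, self.now = dbc, dbt_seconds, now
        self.queue = deque()
        self.started = None  # time the current batch began buffering

    def add(self, record):
        if self.started is None:
            self.started = self.now()
        self.queue.append(record)
        if len(self.queue) >= self.dbc or self.now() - self.started >= self.dbt:
            return self.flush()
        return None

    def flush(self):
        batch = list(self.queue)
        self.queue.clear()
        self.started = None
        return batch  # hand the batch to the dedup step

buf = DedupBuffer(dbc=3, dbt_seconds=300)
assert buf.add("r1") is None
assert buf.add("r2") is None
assert buf.add("r3") == ["r1", "r2", "r3"]  # DBC reached -> flush
```

In the patent's scheme one such buffer would exist per data set (the Map key), with DBC and DBT taken from the configuration rather than hard-coded.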
So-called merging means reading the data of multiple parquet files and combining them into one larger parquet file. The merged result formed after the screening, per-category caching, and deduplication described above is written out and saved as a new file.
So-called storage-directory optimization makes the directory structure of the merged parquet files more orderly, which greatly eases file lookup and retrieval. Preferably, in this embodiment the merged files are stored in a two-level directory structure: a first-level directory is created from the collection date, and a second-level directory is created from the data-set name, so that files are stored by category.
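Under the two-level scheme described above, the output path for a merged file might be built like this (the root directory and the exact naming pattern are assumptions for illustration; the patent only specifies date, data-set name, and time period as the components):

```python
from datetime import date

def merged_file_path(root: str, collect_date: date, dataset: str,
                     time_range: str) -> str:
    """First level: collection date; second level: data-set name;
    file named after the data set and its processing time range."""
    return (f"{root}/{collect_date:%Y%m%d}/{dataset}/"
            f"{dataset}_{time_range}.parquet")

p = merged_file_path("/data/merged", date(2017, 9, 27), "http_log", "1000-1005")
print(p)  # /data/merged/20170927/http_log/http_log_1000-1005.parquet
```

Keeping the date at the top level is what makes the daily consolidation pass cheap: the previous day's data is exactly one first-level directory.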
Moreover, the file preprocessing and merging operations above are implemented on the Map-Reduce computation model provided by the Hadoop system. This means the preprocessing and merging work is divided into many small subtasks that are executed separately by different compute nodes.
S14: based on the Map-Reduce computation model, consolidate the preprocessed and merged columnar storage files into large files at the specified consolidation frequency.
Periodic file consolidation means that at fixed intervals the system consolidates small files into large files. The system runs a consolidation pass over each of the previous day's directories (the system treats the previous day's directory data as the data currently being processed). Single files that have not reached the threshold (default 128 MB) are candidates for consolidation; a consolidated file must not exceed the configured limit (the system sets the post-merge file-size upper limit to 1024 MB), and the small files that have been consolidated are then deleted.
In this embodiment, the threshold that a file must not exceed after consolidation is called the merge file size (MFZ). It is the maximum size of a large file after merging; the system default is 1024 MB.
The file consolidation operation above is also executed on a fixed schedule. This fixed period is called the fixed consolidation interval time (FCIT): the interval threshold between periodic consolidation runs, with a system default of 1 day.
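The consolidation rule above — gather files below the small-file threshold and pack them into outputs no larger than MFZ — can be sketched as a simple first-fit grouping over file sizes (sizes in MB; a real pass would operate on HDFS paths, and the grouping strategy here is an assumption, since the patent does not prescribe one):

```python
def plan_consolidation(file_sizes, sf_limit=128, mfz=1024):
    """Group files smaller than sf_limit into batches whose total
    size stays within mfz; files at or above sf_limit are left alone."""
    small = [s for s in file_sizes if s < sf_limit]
    batches, current, total = [], [], 0
    for size in small:
        if total + size > mfz and current:
            batches.append(current)   # this batch is full; start a new one
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        batches.append(current)
    return batches

# With a small MFZ for illustration: 900 MB is not a small file,
# so only the four small files are grouped, two per batch.
batches = plan_consolidation([100, 100, 50, 900, 60], sf_limit=128, mfz=200)
print(batches)  # [[100, 100], [50, 60]]
```

Each batch would then become one merged parquet file, after which the originals are deleted as described above.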
Fig. 2 shows the overall architecture of a Hadoop system that executes the offline small file processing method provided by this embodiment. Referring to Fig. 2, the system connects the source-data production system and the offline analysis system, further optimizing the input files of the offline analysis system and improving overall performance. The system mainly provides four functions: configuration management, data preprocessing and merging, abandoned-data processing and recovery, and periodic file consolidation. The small parquet files to be analyzed, written by the source-data production system, first pass through data screening, deduplication, and merging, and are then saved under directories organized first by data set and then by time. The data produced by the source-data production system has already undergone metadata processing, so it can be managed by category according to the data sets recorded in the metadata. The configuration file provided to the system determines whether records are screened out or abandoned, and deduplication is performed on the configured unique-key fields of each record (multiple fields may be set per record). Abandoned data is periodically compressed, reducing NameNode memory usage in Hadoop, and a time-limited data-recovery mechanism is provided. Merged data is periodically consolidated from small files, further optimizing Hadoop storage and overall system performance.
For third-party data with well-defined metadata, this embodiment provides data-level screening customization and compresses the data screened out, effectively reducing NameNode memory usage. For the offline file data to be analyzed, it performs deduplication, merging, and directory-storage optimization, and through the periodic file consolidation function it merges small files, improving the efficiency of reading the original small files and the overall performance.
Second embodiment
Based on the first embodiment of the present invention, this embodiment further provides a technical solution of the offline small file processing method. In this solution, the method further comprises: after the columnar storage files are preprocessed and merged, periodically compressing and recovering abandoned data based on the Map-Reduce computation model.
Referring to Fig. 3, the offline small file processing method includes:
S31: read columnar storage files from HDFS, where the size of each columnar storage file is below a predetermined file-size threshold.
S32: provide a configuration that specifies how the data is to be processed.
S33: according to the configuration, preprocess and merge the columnar storage files based on the Map-Reduce computation model.
S34: based on the Map-Reduce computation model, periodically compress and recover abandoned data.
Abandoned data is data that did not pass the screening operation in the preprocessing stage; that is, through screening, the Hadoop system has decided that offline analysis need not be performed on this data.
Abandoned-data processing and recovery also provides functions for periodically compressing abandoned data, periodically deleting abandoned data, and recovering data. Periodic compression runs at the abandoned-data compression interval time (ADCIT) and compresses the abandoned data. During compression, a compressed-data table is maintained, recording the post-compression file name, the compression time, the time range of the abandoned data, and the data-set range of the abandoned data. Compression of the abandoned small files is based on the har archive mechanism provided by Hadoop, reducing Hadoop NameNode memory usage and improving overall Hadoop file-read performance. After compression, the original abandoned small files are removed, along with the corresponding entries in the abandoned-data registration table. Periodic deletion of abandoned data is governed by the abandoned-data retention time (ADRT): retained abandoned data past this limit is purged, further reducing the storage pressure on Hadoop; when a compressed file is deleted, the corresponding entry in the compressed-data table is deleted as well. Recovering data means restoring the deleted data specified by the user. If the data to be recovered has not yet been compressed, the abandoned-data registration table is consulted and the corresponding data files are restored to the system's input directory, where they undergo the subsequent preprocessing and merging of offline-analysis file data. If the data to be recovered has already been compressed, the compressed-data table is consulted, the corresponding compressed file is located, and the relevant files are extracted to the system's input directory. If the data to be recovered is found in neither place, the recovery operation fails.
S35: based on the Map-Reduce computation model, consolidate the preprocessed and merged columnar storage files into large files at the specified consolidation frequency.
In this embodiment, after the columnar storage files are preprocessed and merged, abandoned data is periodically compressed and recovered using the Map-Reduce computation model, further improving the system's efficiency in reading small files.
Third embodiment
Based on the first embodiment of the present invention, this embodiment further provides a technical solution for the preprocessing in the offline small file processing method. In this solution, preprocessing and merging the columnar storage files according to the configuration, based on the Map-Reduce computation model, includes: screening the content data of the columnar storage files; deduplicating the content data of the columnar storage files; merging the columnar storage files; and optimizing the storage directory structure of the columnar storage files, each based on the Map-Reduce computation model.
Referring to Fig. 4, preprocessing and merging the columnar storage files according to the configuration, based on the Map-Reduce computation model, includes:
S41: based on the Map-Reduce computation model, screen the content data of the columnar storage files.
It will be appreciated that, before the columnar storage files are consolidated, some of the content stored in them does not require offline analysis. Therefore, in the preprocessing stage, the content data stored in the columnar storage files must first be screened. After screening, the data that passes is saved in the form of new columnar storage files, while the data that does not pass is marked as abandoned data.
At the start of the preprocessing and merging of columnar storage files, a unique value must first be computed for each record: MD5 data is generated from the unique-field information of the user-configured data set and saved in the record's UniqueNum field.
S42: based on the Map-Reduce computation model, deduplicate the content data of the columnar storage files.
Deduplication operates on the stored content of the columnar storage files. The computed MD5 values are compared; if duplicate records are found, the record inserted into the queue later is deleted, and the four fields FIRST_COLLECT_TIME, LAST_COLLECT_TIME, TORL_FOUND_COUNTER, and TOTAL_DAY_COUNTER of the earlier-queued record are updated.
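A sketch of the duplicate-handling rule above — keep the earlier record and fold later duplicates into its statistics fields (the timestamp comparison and counter semantics are assumptions inferred from the field names, not specified in the patent):

```python
def dedup(records):
    """Keep the first record per UniqueNum; fold each later duplicate
    into the kept record's collect times and occurrence counter."""
    kept = {}
    for r in records:
        key = r["UniqueNum"]
        if key not in kept:
            kept[key] = dict(r)
        else:
            k = kept[key]
            k["FIRST_COLLECT_TIME"] = min(k["FIRST_COLLECT_TIME"],
                                          r["FIRST_COLLECT_TIME"])
            k["LAST_COLLECT_TIME"] = max(k["LAST_COLLECT_TIME"],
                                         r["LAST_COLLECT_TIME"])
            k["TORL_FOUND_COUNTER"] += 1
    return list(kept.values())

out = dedup([
    {"UniqueNum": "a", "FIRST_COLLECT_TIME": 10, "LAST_COLLECT_TIME": 10,
     "TORL_FOUND_COUNTER": 1},
    {"UniqueNum": "a", "FIRST_COLLECT_TIME": 20, "LAST_COLLECT_TIME": 20,
     "TORL_FOUND_COUNTER": 1},
])
assert len(out) == 1
assert out[0]["LAST_COLLECT_TIME"] == 20 and out[0]["TORL_FOUND_COUNTER"] == 2
```

TOTAL_DAY_COUNTER would be updated analogously, incrementing only when a duplicate arrives on a day not yet counted.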
S43: based on the Map-Reduce computation model, merge the columnar storage files.
After the current batch of data is processed, the merged parquet file is written out and the queued data is cleared, so that buffering of the next batch can begin.
S44: based on the Map-Reduce computation model, optimize the storage directory structure of the columnar storage files.
A directory is first created for the processing date; then, according to the data-set information (the Key value used when the file is saved), a directory is created for the corresponding data set. The merged file is named after the data set and the data-processing time period and saved under the corresponding data-set directory. After each batch succeeds, the original offline-analysis small files of that batch are deleted.
In this embodiment, the content data of the columnar storage files is screened and deduplicated, the files are merged, and the storage directory structure is optimized, thereby implementing the preprocessing and merging of the columnar storage files.
Fourth embodiment
Based on the first embodiment of the present invention, this embodiment further provides a technical solution for the abandoned-data processing in the offline small file processing method. In this solution, periodically compressing and recovering abandoned data based on the Map-Reduce computation model includes: distinguishing abandoned data within the original data; periodically compressing the abandoned data to generate HAR files; and periodically recovering abandoned data, each based on the Map-Reduce computation model.
Referring to Fig. 5, periodically compressing and recovering abandoned data based on the Map-Reduce computation model includes:
S51: based on the Map-Reduce computation model, distinguish the abandoned data within the original data and start the abandoned-data daemon.
Distinguishing abandoned data is in fact the data-screening process. Once the abandoned data has been identified, a dedicated abandoned-data daemon is started to handle it; this daemon performs the periodic compression, periodic deletion, and periodic recovery of abandoned data.
S52: based on the Map-Reduce computation model, periodically compress the abandoned data to generate HAR files.
Compression of abandoned data is executed periodically; the interval is typically 1 day. When the deadline arrives, the Hadoop Archive facility of the Hadoop system itself is used to generate the corresponding HAR files from the identified abandoned data.
S53: based on the Map-Reduce computation model, periodically recover abandoned data.
Recovering data means restoring abandoned data so it can be used in subsequent offline analysis. If the data to be recovered was abandoned on the same day (that is, it has not yet been compressed), the abandoned-data registration table is consulted directly and the abandoned data files are placed under the corresponding data-set directory for that day, as input to preprocessing and merging for the subsequent operations. If the data to be recovered is not from the current day, the corresponding compressed-data table is consulted; if the data is beyond the abandoned-data retention time (default 30 days), recovery fails. Otherwise, the corresponding HAR file is located and the relevant data files are extracted and placed under the corresponding data-set directory for that day, as input to preprocessing and merging for the subsequent operations.
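The recovery decision above can be sketched as a small state check (the table lookups are stubbed out as in-memory sets/dicts, and the function and action names are illustrative, not from the patent):

```python
from datetime import date, timedelta

ADRT_DAYS = 30  # abandoned-data retention time, default 30 days

def recover(dataset, abandon_date, today, registry, compressed_table):
    """Decide how to recover one abandoned data set."""
    if abandon_date == today and dataset in registry:
        return "restore-from-registry"       # abandoned today, not yet compressed
    if dataset in compressed_table:
        if today - abandon_date > timedelta(days=ADRT_DAYS):
            return "fail-retention-expired"  # past ADRT: already purged
        return "extract-from-har"            # unpack the HAR archive
    return "fail-not-found"                  # found in neither place

today = date(2017, 9, 27)
assert recover("http_log", today, today, {"http_log"}, {}) == "restore-from-registry"
assert recover("http_log", today - timedelta(days=5), today, {}, {"http_log"}) == "extract-from-har"
assert recover("http_log", today - timedelta(days=40), today, {}, {"http_log"}) == "fail-retention-expired"
assert recover("http_log", today, today, {}, {}) == "fail-not-found"
```

Whichever branch succeeds, the restored files land in the system's input directory and re-enter the normal preprocessing and merging pipeline.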
Besides the distinguishing, compression, and recovery operations above, abandoned-data processing also includes the deletion of abandoned data. Moreover, the compression, deletion, and recovery of abandoned data are all performed by the abandoned-data daemon, and there is no necessary ordering among them in time.
In this embodiment, abandoned data is distinguished within the original data, periodically compressed, and periodically recovered, thereby implementing the processing of abandoned data in the columnar storage files.
Fifth embodiment
Based on the first embodiment of the present invention, this embodiment further provides a technical solution for the file consolidation in the offline small file processing method. In this solution, consolidating the preprocessed and merged columnar storage files into large files at a specified consolidation frequency, based on the Map-Reduce computation model, includes: consolidating the columnar storage files into large files according to their storage directory structure.
Specifically, referring to Fig. 6, consolidating the preprocessed and merged columnar storage files into large files at the specified consolidation frequency, based on the Map-Reduce computation model, includes:
S61: sequentially read the columnar storage files in the directory that need to be consolidated.
S62: determine whether the directory still contains columnar storage files to consolidate; if so, execute S63; if not, execute S61.
S63: consolidate the small files that were read into a large file.
S64: determine whether the consolidated large file exceeds the size threshold; if so, discard the large file; if not, retain it.
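Steps S63–S64 can be sketched as a merge-then-check step (file sizes stand in for file contents; the discard-on-overflow rule follows S64, and treating the whole directory as one merge input is a simplification):

```python
def consolidate_directory(sizes, mfz=1024):
    """Merge one directory's small files (sizes in MB) into a large
    file, discarding the result if it exceeds MFZ (step S64)."""
    merged_total = sum(sizes)     # S63: merge the files that were read
    if merged_total > mfz:        # S64: size check on the merged file
        return None               # discard the oversized large file
    return merged_total           # retain the large file

assert consolidate_directory([100, 200, 300]) == 600
assert consolidate_directory([800, 400]) is None  # 1200 MB > MFZ: discarded
```

In practice the read loop of S61/S62 would feed batches into this step so that each retained output stays under MFZ, rather than discarding work.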
It should be noted that the consolidation operation above is executed on the Map-Reduce computation model: it operates on the columnar storage files stored under several directories on the DataNodes, and the consolidation runs in parallel based on the Map-Reduce computation model.
By periodically performing the consolidation operation above, this embodiment merges small files into large files, greatly improving the read performance of the Hadoop system.
Sixth embodiment
This embodiment provides a technical solution of an offline small file processing device. In this solution, the offline small file processing device is integrated in a Hadoop distributed processing system. Referring to Fig. 7, the offline small file processing device includes: a reading module 71, a configuration module 72, a preprocessing module 73, and a consolidation module 75.
The reading module 71 reads columnar storage files from HDFS, where the size of each columnar storage file is below a predetermined file-size threshold.
The configuration module 72 provides a configuration that specifies how the data is to be processed.
The preprocessing module 73 preprocesses and merges the columnar storage files according to the configuration, based on the Map-Reduce computation model.
The consolidation module 75 consolidates, based on the Map-Reduce computation model, the preprocessed and merged columnar storage files into large files at a specified consolidation frequency.
Further, the offline small file processing device also includes an abandonment computing module 74, which, after the columnar storage files are preprocessed and merged, periodically compresses and recovers abandoned data based on the Map-Reduce computation model.
Further, the preprocessing module 73 includes: a screening unit, a deduplication unit, a merging unit, and a directory-optimization unit.
The screening unit screens the content data of the columnar storage files based on the Map-Reduce computation model.
The deduplication unit deduplicates the content data of the columnar storage files based on the Map-Reduce computation model.
The merging unit merges the columnar storage files based on the Map-Reduce computation model.
The directory-optimization unit optimizes the storage directory structure of the columnar storage files based on the Map-Reduce computation model.
Further, the abandonment computation module 74 includes: a discrimination unit, a compression unit, and a recovery unit.
The discrimination unit is configured to distinguish abandonment data from the raw data based on the Map-Reduce computation model, wherein the abandonment data is data that does not need to participate in offline analysis.
The compression unit is configured to periodically compress the abandonment data based on the Map-Reduce computation model, generating HAR files.
The recovery unit is configured to periodically recover the abandonment data based on the Map-Reduce computation model.
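A minimal sketch of the discriminate / compress / recover cycle, under stated assumptions: the `offline` flag used to discriminate records is hypothetical, and Python's in-memory `tarfile` stands in for the HAR generation step (in Hadoop this would be the `hadoop archive` tool producing a .har file on HDFS).

```python
import io
import json
import tarfile

# Hypothetical rule: records flagged offline=True participate in offline
# analysis; everything else is abandonment data.
def discriminate(rows):
    needed = [r for r in rows if r.get("offline")]
    abandoned = [r for r in rows if not r.get("offline")]
    return needed, abandoned

# "Compress": pack abandonment records into one archive, standing in for
# the periodic HAR generation described in the patent.
def compress(rows):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = json.dumps(rows).encode()
        info = tarfile.TarInfo(name="abandoned.json")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# "Recover": read the abandonment records back out of the archive.
def recover(archive_bytes):
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        member = tar.extractfile("abandoned.json")
        return json.loads(member.read())

rows = [{"id": 1, "offline": True}, {"id": 2, "offline": False}]
needed, abandoned = discriminate(rows)
restored = recover(compress(abandoned))
```

Archiving many small abandonment files into one archive is what reduces the per-file metadata held in the nameNode's memory.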
Further, the merging module is specifically configured to merge the column storage files into a large file according to the storage directory structure of the column storage files.
Further, the column storage files include: Parquet storage files.
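The merging module's behaviour, grouping small files by their storage directory and rewriting each group as one large file, can be sketched as below. Plain text files stand in for Parquet files, and the partition-style directory name and the size threshold are illustrative assumptions, not values from the patent.

```python
import os
import tempfile
from collections import defaultdict

SMALL_FILE_LIMIT = 64  # bytes; stands in for the predetermined size threshold

def merge_by_directory(root):
    """Group small files under each directory and merge each group into a
    single big_file, following the directory-based merging of claim 5."""
    groups = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) < SMALL_FILE_LIMIT:
                groups[dirpath].append(path)
    for dirpath, paths in groups.items():
        big = os.path.join(dirpath, "big_file")
        with open(big, "w") as out:
            for p in paths:
                with open(p) as f:
                    out.write(f.read())
                os.remove(p)  # each small file is replaced by the merged one

root = tempfile.mkdtemp()
day_dir = os.path.join(root, "dt=20170927")  # hypothetical partition directory
os.makedirs(day_dir)
for i in range(3):
    with open(os.path.join(day_dir, f"part-{i}"), "w") as f:
        f.write(f"row-{i}\n")
merge_by_directory(root)
```

Running the merge at a fixed frequency keeps the number of files per directory, and hence the nameNode metadata footprint, bounded.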
Those skilled in the art will appreciate that each of the above modules or steps of the present invention may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so as to be stored in a storage device and executed by the computing device; alternatively, they may each be fabricated as an individual integrated circuit module, or multiple of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (7)

1. An offline small-file processing method, running in a Hadoop distributed processing system, characterized by comprising:
reading column storage files from HDFS, wherein the size of each column storage file is smaller than a predetermined file size threshold;
providing a configuration that specifies the data processing to be performed;
preprocessing and merging the column storage files according to the configuration, based on the Map-Reduce computation model; and
based on the Map-Reduce computation model, merging the preprocessed and merged column storage files into a large file at a specified merging frequency.
2. The method according to claim 1, characterized by further comprising:
after the column storage files have been preprocessed and merged, periodically compressing and recovering abandonment data based on the Map-Reduce computation model.
3. The method according to claim 1 or 2, characterized in that preprocessing and merging the column storage files according to the configuration, based on the Map-Reduce computation model, comprises:
screening the content data of the column storage files based on the Map-Reduce computation model;
deduplicating the content data of the column storage files based on the Map-Reduce computation model;
merging the column storage files based on the Map-Reduce computation model; and
optimizing the storage directory structure of the column storage files based on the Map-Reduce computation model.
4. The method according to claim 2, characterized in that periodically compressing and recovering the abandonment data based on the Map-Reduce computation model comprises:
distinguishing abandonment data from the raw data based on the Map-Reduce computation model, wherein the abandonment data is data that does not need to participate in offline analysis;
periodically compressing the abandonment data based on the Map-Reduce computation model, generating HAR files; and
periodically recovering the abandonment data based on the Map-Reduce computation model.
5. The method according to claim 3, characterized in that, based on the Map-Reduce computation model, merging the preprocessed and merged column storage files into a large file at a specified merging frequency comprises:
merging the column storage files into a large file according to the storage directory structure of the column storage files.
6. The method according to claim 1 or 2, characterized in that the column storage files include: Parquet storage files.
7. An offline small-file processing apparatus, integrated into a Hadoop distributed processing system, characterized by comprising:
a reading module, configured to read column storage files from HDFS, wherein the size of each column storage file is smaller than a predetermined file size threshold;
a configuration module, configured to provide a configuration that specifies the data processing to be performed;
a preprocessing module, configured to preprocess and merge the column storage files according to the configuration, based on the Map-Reduce computation model; and
a merging module, configured to, based on the Map-Reduce computation model, merge the preprocessed and merged column storage files into a large file at a specified merging frequency.
CN201710888790.5A 2017-09-27 2017-09-27 Offline small documents processing method and processing device Pending CN107577809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710888790.5A CN107577809A (en) 2017-09-27 2017-09-27 Offline small documents processing method and processing device

Publications (1)

Publication Number Publication Date
CN107577809A (en) 2018-01-12

Family

ID=61038862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710888790.5A Pending CN107577809A (en) 2017-09-27 2017-09-27 Offline small documents processing method and processing device

Country Status (1)

Country Link
CN (1) CN107577809A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150163A (en) * 2013-03-01 2013-06-12 南京理工大学常熟研究院有限公司 Map/Reduce mode-based parallel relating method
CN104767795A (en) * 2015-03-17 2015-07-08 浪潮通信信息系统有限公司 LTE MRO data statistical method and system based on HADOOP
CN105677836A (en) * 2016-01-05 2016-06-15 北京汇商融通信息技术有限公司 Big data processing and solving system simultaneously supporting offline data and real-time online data
US20160291900A1 (en) * 2015-03-30 2016-10-06 International Business Machines Corporation Adaptive map-reduce pipeline with dynamic thread allocations
CN106855861A (en) * 2015-12-09 2017-06-16 北京金山安全软件有限公司 File merging method and device and electronic equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804566A (en) * 2018-05-22 2018-11-13 广东技术师范学院 A kind of mass small documents read method based on Hadoop
CN111352897A (en) * 2020-03-02 2020-06-30 广东科徕尼智能科技有限公司 Real-time data storage method, equipment and storage medium
CN111897772A (en) * 2020-08-05 2020-11-06 光大兴陇信托有限责任公司 Big file data importing method
CN111897772B (en) * 2020-08-05 2024-02-20 光大兴陇信托有限责任公司 Large file data importing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180112