CN107577809A - Offline small documents processing method and processing device - Google Patents
- Publication number
- CN107577809A CN107577809A CN201710888790.5A CN201710888790A CN107577809A CN 107577809 A CN107577809 A CN 107577809A CN 201710888790 A CN201710888790 A CN 201710888790A CN 107577809 A CN107577809 A CN 107577809A
- Authority
- CN
- China
- Prior art keywords
- data
- file
- map
- column storage
- storage file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
An embodiment of the invention discloses an offline small-file processing method and device. The method includes: reading columnar storage files from HDFS; providing a configuration that specifies the data processing; pre-processing and merging the columnar storage files; and, at a specified consolidation frequency, consolidating the pre-processed and merged columnar storage files into large files. The offline small-file processing method and device provided by embodiments of the invention effectively reduce memory usage on the NameNode of a Hadoop system.
Description
Technical field
Embodiments of the present invention relate to the field of distributed computing, and in particular to an offline small-file processing method and device.
Background
For offline analysis of big data, streaming data is commonly converted into files in the Parquet columnar storage format, and offline analysis is then performed with tools such as Spark SQL. Streaming data ingested in real time can be fetched from Kafka, converted into Parquet files on the fly, and stored in the HDFS file system, where it finally serves as the source files for offline analysis tools. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware; it offers fault tolerance and high throughput and is particularly suited to applications on large data sets. However, Hadoop was designed primarily for stream-oriented processing of large files. When it handles a large number of small files far below the block-size value, its design mechanism causes response speed to drop sharply, severely degrading performance and even preventing normal operation. Because data arrives in real time from vendors, optical splitters, and other sources, the offline-analysis archiving system converts every log message on Kafka into a file in real time, producing a huge number of small files for downstream products to analyze. These numerous files, far smaller than the block size, severely degrade the read performance of Hadoop storage.
Summary of the invention
In view of the above technical problem, embodiments of the invention provide an offline small-file processing method and device, to improve the read performance of Hadoop storage and effectively reduce memory usage on the NameNode of a Hadoop system.
In one aspect, an embodiment of the invention provides an offline small-file processing method, running in a Hadoop distributed processing system. The method includes:

reading columnar storage files from HDFS, wherein the size of each columnar storage file is below a predetermined file-size threshold;

providing a configuration that specifies the data processing;

according to the configuration, pre-processing and merging the columnar storage files based on the Map-Reduce computation model; and

based on the Map-Reduce computation model, consolidating the pre-processed and merged columnar storage files into large files at a specified consolidation frequency.
In another aspect, an embodiment of the invention further provides an offline small-file processing device, integrated into a Hadoop distributed processing system. The device includes:

a reading module, for reading columnar storage files from HDFS, wherein the size of each columnar storage file is below a predetermined file-size threshold;

a configuration module, for providing a configuration that specifies the data processing;

a pre-processing module, for pre-processing and merging the columnar storage files according to the configuration, based on the Map-Reduce computation model; and

a consolidation module, for consolidating the pre-processed and merged columnar storage files into large files at a specified consolidation frequency, based on the Map-Reduce computation model.
The offline small-file processing method and device provided by embodiments of the invention read columnar storage files, provide a configuration specifying the data processing, pre-process and merge the columnar storage files, and consolidate the pre-processed and merged columnar storage files into large files at a specified consolidation frequency. This substantially improves the read performance of Hadoop storage and effectively reduces memory usage on the NameNode of the Hadoop system.
Brief description of the drawings
Other features, objects, and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments, made with reference to the drawings:

Fig. 1 is a flow diagram of the offline small-file processing method provided by the first embodiment of the invention;

Fig. 2 is a logical structure diagram of the system running the offline small-file processing method provided by the first embodiment of the invention;

Fig. 3 is a flow diagram of the offline small-file processing method provided by the second embodiment of the invention;

Fig. 4 is a flow diagram of the pre-processing in the offline small-file processing method provided by the third embodiment of the invention;

Fig. 5 is a flow diagram of the abandoned-data processing in the offline small-file processing method provided by the fourth embodiment of the invention;

Fig. 6 is a flow diagram of the file consolidation in the offline small-file processing method provided by the fifth embodiment of the invention;

Fig. 7 is a structural diagram of the offline small-file processing device provided by the sixth embodiment of the invention.
Detailed description of embodiments
The invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
First embodiment
This embodiment provides a technical solution for an offline small-file processing method. In this solution, the offline small-file processing method is executed by an offline small-file processing device, which is generally integrated into a Hadoop distributed data processing system.
Referring to Fig. 1, the offline small-file processing method includes:
S11: reading columnar storage files from HDFS.
Columnar storage stores all values belonging to the same column together. This layout is particularly well suited to online analytical processing (OLAP), because OLAP-style queries usually touch only a few columns; when a file has many fields and millions or even billions of rows, columnar storage dramatically reduces the volume of data scanned and speeds up query response.
The columnar storage files in this embodiment are preferably Parquet files. Parquet provides efficient data support for services across the big-data ecosystem and is independent of any programming language, data model, or data-processing framework. Parquet files can be processed by Hadoop, and Hive, Impala, and Spark SQL can query data warehouses built directly on Parquet files. These properties allow Parquet files to replace traditional text file formats.
In this embodiment, the size of each Parquet file that is read does not exceed a file-size upper limit. Specifically, the upper limit is 128 MB. Parquet files are typically stored on the DataNodes of the Hadoop system.
The system reads the Parquet files supplied by the source-data production system and adds a UniqueNum field to each record of every data set: according to the configuration, the record's unique value is computed with the MD5 algorithm over the data set's unique-key fields and is used for deduplication. The system also adds FIRST_COLLECT_TIME, LAST_COLLECT_TIME, TOTAL_FOUND_COUNTER, and TOTAL_DAY_COUNTER fields to every record; during deduplication these hold the earliest time a duplicate appeared, the latest time it appeared, the total number of occurrences, the cumulative number of days on which it appeared, and similar statistics.
S12: providing a configuration that specifies the data processing.
Subsequent processing such as file merging and consolidation must run according to specified rules, and various processing parameters are used along the way. These rules and parameters are supplied in the form of a configuration.
In a practical application scenario, the system accepts configuration-file information that specifies data filtering and data uniqueness. In this step the user, following a configuration template, provides settings such as the range of data sets to retain and the unique-key identification of each data set.
S13: according to the configuration, pre-processing and merging the columnar storage files based on the Map-Reduce computation model.
In this embodiment, pre-processing and merging mainly refer to filtering, deduplicating, and merging the data, plus optimizing the storage directory structure.
Screening refers in the parquet small documents by magnanimity, filters out the file for needing to carry out off-line analysis.In this implementation
In example, small documents (Small files, SF) refer to that file total size is not reaching to the file of the file size upper limit of setting.It is excellent
Choosing, the setting value of the above-mentioned upper limit is 128MB.
Deduplication means removing repeated data records, so that of each set of identical records only one is kept.
Specifically, deduplication is driven by a data-buffer count threshold (Data buffer count, DBC) and a data-buffer time threshold (Data buffer time, DBT). DBC is the maximum amount of data of one category that may be buffered during computation; the system default is 1,000,000. DBT is the maximum period that data of one category may stay buffered; the system default is 5 minutes. Map key-value pairs generated per data set are cached in memory, with the data kept in a queue; when the buffered data reaches DBC or DBT, deduplication of that category's data begins.
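A minimal sketch of the threshold-driven buffer described above. The class and method names are hypothetical; the defaults follow the DBC and DBT values stated in the text, and deduplication keys on the UniqueNum field.

```python
import time

class DedupBuffer:
    """Per-category in-memory queue, flushed for deduplication when the
    buffered count reaches DBC or the buffer age reaches DBT."""

    def __init__(self, dbc=1_000_000, dbt_seconds=5 * 60):
        self.dbc = dbc            # data buffer count threshold
        self.dbt = dbt_seconds    # data buffer time threshold
        self.queue = []
        self.started = None

    def add(self, record):
        """Buffer one record; return deduplicated records on flush."""
        if self.started is None:
            self.started = time.monotonic()
        self.queue.append(record)
        if len(self.queue) >= self.dbc or \
                (time.monotonic() - self.started) >= self.dbt:
            return self.flush()
        return None

    def flush(self):
        """Deduplicate by UniqueNum, keeping the earliest-queued record."""
        seen, unique = set(), []
        for rec in self.queue:
            if rec["UniqueNum"] not in seen:
                seen.add(rec["UniqueNum"])
                unique.append(rec)
        self.queue, self.started = [], None
        return unique
```
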
Merging means reading the data from multiple Parquet files and combining it into one larger Parquet file. The merged result produced after the filtering, category-based caching, and deduplication above is written out as a new file.
Optimizing the storage directory makes the directory structure of the merged Parquet file information more orderly, which greatly simplifies retrieval and location of files. Preferably, this embodiment stores merged files in a two-level directory structure: a first-level directory generated from the collection date, and a second-level directory generated from the data-set name, under which files are stored by category.
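The two-level layout described above can be sketched as a path-building helper. The patent specifies only "collection date" for the first level and "data-set name" for the second; the date format and base-path convention below are assumptions.

```python
from datetime import date

def storage_path(base: str, collect_date: date, dataset: str) -> str:
    """First-level directory from the collection date, second-level
    directory from the data-set name."""
    return f"{base}/{collect_date:%Y%m%d}/{dataset}"

# e.g. merged files for a hypothetical "syslog" data set
# collected on 2017-09-27
p = storage_path("/warehouse/parquet", date(2017, 9, 27), "syslog")
```
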
Moreover, the pre-processing and merging operations above are implemented on the Map-Reduce computation model provided by the Hadoop system. This means the pre-processing and merging computation is split into many small subtasks, which are executed by different computing node devices.
S14: based on the Map-Reduce computation model, consolidating the pre-processed and merged columnar storage files into large files at a specified consolidation frequency.
Periodic file consolidation means that, at regular intervals, the system consolidates small files into large files. The system consolidates each directory of the previous day's data (by default, the directory currently processed is the previous day's). For single files that have not reached the threshold (default 128 MB), consolidation is attempted; the file after consolidation must not exceed a configured limit (the system's post-consolidation file-size limit defaults to 1024 MB); and the small files that have been consolidated are deleted.
In this embodiment, the limit that a consolidated file may not exceed is called the merge file size (Merge file size, MFZ): the maximum size of a large file after consolidation, with a system default of 1024 MB.
The consolidation operation is likewise executed on a fixed schedule. This schedule is called the fixed consolidation interval time (Fixed consolidation interval time, FCIT): the interval between periodic consolidation runs, with a system default of 1 day.
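The tunables introduced so far, with the system defaults stated in the text, can be gathered into one configuration sketch; the dataclass structure and field names are assumptions, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class MergeConfig:
    """Defaults as stated in the text; all values are tunable."""
    file_size_limit_mb: int = 128  # small-file upper limit
    dbc: int = 1_000_000           # data buffer count threshold
    dbt_minutes: int = 5           # data buffer time threshold
    mfz_mb: int = 1024             # merge file size (post-merge cap)
    fcit_days: int = 1             # fixed consolidation interval time

cfg = MergeConfig()
```
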
Fig. 2 shows the overall architecture of the Hadoop system that executes the offline small-file processing method of this embodiment. Referring to Fig. 2, the system sits between the source-data production system and the offline analysis system, further optimizing the offline analysis system's input files and improving overall performance. The system mainly comprises four functions: configuration management for data use, data pre-processing and merging, abandoned-data processing and recovery, and periodic file consolidation. The small Parquet files to be analyzed, as written by the source-data production system, are filtered, deduplicated, and merged, first by data set and then by time, and are then saved under the optimized directory layout. The data produced by the source-data production system has already been through metadata processing, so it can be classified and managed according to the data sets in the metadata. The configuration file provided to the system controls whether data is abandoned or retained, and deduplication is performed according to the configured unique-key fields (multiple fields may be set per record). Abandoned data is periodically compressed, reducing NameNode memory usage in Hadoop, and a time-limited data-recovery mechanism is provided. Merged data is periodically consolidated from small files, further optimizing Hadoop storage and overall system performance.
For third-party data with well-defined metadata, this embodiment provides data-level filtering customization and compresses the data filtered out, effectively reducing NameNode memory usage. For the offline file data to be analyzed, it performs deduplication, merging, and directory-storage optimization, and through the periodic consolidation function it consolidates small files, improving the read efficiency of the former small files and overall performance.
Second embodiment
Based on the technical solution above, this embodiment provides a further technical solution of the offline small-file processing method. In this solution, the method additionally includes: after pre-processing and merging the columnar storage files, periodically compressing and recovering abandoned data based on the Map-Reduce computation model.
Referring to Fig. 3, the offline small-file processing method includes:
S31: reading columnar storage files from HDFS, wherein the size of each columnar storage file is below a predetermined file-size threshold.
S32: providing a configuration that specifies the data processing.
S33: according to the configuration, pre-processing and merging the columnar storage files based on the Map-Reduce computation model.
S34: based on the Map-Reduce computation model, periodically compressing and recovering abandoned data.
Abandoned data is data that did not pass the filtering operation at the pre-processing stage; that is, through filtering, the Hadoop system has determined that this data does not need offline analysis.
Abandoned-data processing and recovery provides functions for periodically compressing abandoned data, periodically deleting abandoned data, and recovering data. Periodic compression compresses abandoned data at the abandoned-data compression interval time (Abandoned data compression interval time, ADCIT). During compression a compressed-data table is maintained, recording the file name after compression, the compression time, the time range of the abandoned data, and the data-set range of the abandoned data. Using the HAR compression mechanism provided by Hadoop, the abandoned small files are compressed, reducing NameNode memory usage and optimizing Hadoop's overall file-read performance. After compression, the original abandoned small files are removed, along with the corresponding abandoned-data registration entries. Periodic deletion purges retained abandoned data according to the abandoned-data retention time (Abandoned data reservation time, ADRT), further reducing the storage pressure on Hadoop; when a compressed file is deleted, the corresponding entry in the compressed-data table is deleted as well. Data recovery restores deleted data specified by the user. If the data to be recovered has not yet been compressed, the abandoned-data registration table is consulted and the relevant data files are restored to the system's input directory for the subsequent pre-processing and merging of offline-analysis file data. If the data has already been compressed, the compressed-data table is consulted, the corresponding compressed file is located, and the relevant files are extracted to the system's input directory. If the data to be recovered is found in neither place, the recovery operation fails.
S35: based on the Map-Reduce computation model, consolidating the pre-processed and merged columnar storage files into large files at a specified consolidation frequency.
By periodically compressing and recovering abandoned data on the Map-Reduce computation model after pre-processing and merging the columnar storage files, this embodiment further improves the system's small-file read efficiency.
Third embodiment
Based on the technical solution above, this embodiment further provides a technical solution for the pre-processing in the offline small-file processing method. In this solution, pre-processing and merging the columnar storage files according to the configuration, based on the Map-Reduce computation model, includes: filtering the content data of the columnar storage files; deduplicating the content data of the columnar storage files; merging the columnar storage files; and optimizing the storage directory structure of the columnar storage files, each based on the Map-Reduce computation model.
Referring to Fig. 4, pre-processing and merging the columnar storage files according to the configuration, based on the Map-Reduce computation model, includes:
S41: based on the Map-Reduce computation model, filtering the content data of the columnar storage files.
It will be understood that, before the columnar storage files are consolidated, some of the data stored in them does not need offline analysis. Therefore, at the pre-processing stage, the content data stored in the columnar storage files must first be filtered. After filtering, the data that passes is saved in the form of new columnar storage files, while the data that does not pass is marked as abandoned data.
At the start of pre-processing and merging a columnar storage file, a unique value must first be computed for each record: the unique-field information of the data set configured by the user is taken as the basis of the computation, an MD5 digest is generated, and the digest is stored in the record's UniqueNum field.
S42: based on the Map-Reduce computation model, deduplicating the content data of the columnar storage files.
Deduplication targets the stored content of the columnar storage files. Records are compared by their computed MD5 values; when a duplicate is found, the record inserted into the queue later is deleted, and the FIRST_COLLECT_TIME, LAST_COLLECT_TIME, TOTAL_FOUND_COUNTER, and TOTAL_DAY_COUNTER fields of the record queued earlier are updated.
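The update of the four statistics fields when a duplicate is absorbed can be sketched as below. The text does not define the timestamp format or exactly how the day counter is maintained, so the sortable "YYYY-MM-DD hh:mm:ss" timestamps, the COLLECT_TIME field on the duplicate, and the distinct-day bookkeeping via a helper set are all assumptions.

```python
def absorb_duplicate(kept: dict, dup: dict) -> None:
    """Fold a duplicate record's statistics into the earlier-queued copy,
    which survives deduplication."""
    kept["FIRST_COLLECT_TIME"] = min(kept["FIRST_COLLECT_TIME"],
                                     dup["COLLECT_TIME"])
    kept["LAST_COLLECT_TIME"] = max(kept["LAST_COLLECT_TIME"],
                                    dup["COLLECT_TIME"])
    kept["TOTAL_FOUND_COUNTER"] += 1
    day = dup["COLLECT_TIME"][:10]                  # "YYYY-MM-DD" prefix
    if day not in kept.setdefault("_days", set()):  # distinct collection days
        kept["_days"].add(day)
        kept["TOTAL_DAY_COUNTER"] += 1
```
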
S43: based on the Map-Reduce computation model, merging the columnar storage files.
After the current batch has been processed, the merged Parquet file is saved, and the queue is emptied to begin buffering the next batch of data.
S44: based on the Map-Reduce computation model, optimizing the storage directory structure of the columnar storage files.
When a file is saved, a directory is first created from the processing date, and a corresponding data-set directory is created from the data-set information, i.e. the key value. The merged file is named with the data set and the data-processing time window and saved into the corresponding data-set directory. After each run succeeds, the original offline-analysis small files of that batch are deleted.
By filtering the content data of the columnar storage files, deduplicating the filtered data, merging the columnar storage files, and optimizing the storage directory structure, this embodiment realizes the pre-processing and merging of the columnar storage files.
Fourth embodiment
Based on the technical solution above, this embodiment further provides a technical solution for the abandoned-data processing in the offline small-file processing method. In this solution, periodically compressing and recovering abandoned data based on the Map-Reduce computation model includes: distinguishing abandoned data within the original data; periodically compressing abandoned data to generate HAR files; and periodically recovering abandoned data, each based on the Map-Reduce computation model.
Referring to Fig. 5, periodically compressing and recovering abandoned data based on the Map-Reduce computation model includes:
S51: based on the Map-Reduce computation model, distinguishing abandoned data within the original data and starting an abandoned-data daemon.
Distinguishing abandoned data is, in fact, the filtering process. Once abandoned data has been distinguished, a dedicated abandoned-data daemon is started to handle it. This daemon performs the periodic compression, periodic deletion, periodic recovery, and other operations on abandoned data.
S52: based on the Map-Reduce computation model, periodically compressing abandoned data to generate HAR files.
Compression of abandoned data is executed periodically, typically with a period of 1 day. When the period elapses, the Hadoop Archive function of the Hadoop system itself is used to generate HAR files from the identified abandoned data.
S53: based on the Map-Reduce computation model, periodically recovering abandoned data.
Recovering data means restoring abandoned data for use in subsequent offline analysis. If the data to be recovered was abandoned the same day (i.e. has not yet been compressed), the abandonment registration table is consulted directly, the abandoned data files are placed under the day's corresponding data-set directory as input to pre-processing and merging, and processing continues. If the data to be recovered is not from the current day, the compressed-file table must be consulted: if the abandoned-data retention time (default 30 days) has been exceeded, recovery fails; otherwise, the corresponding HAR file is located, the data files are extracted and placed under the day's corresponding data-set directory as input to pre-processing and merging, and processing continues.
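The three-way recovery decision described above can be sketched as follows. The registration and compressed-file tables are stubbed with plain dicts, and every name is an assumption for illustration.

```python
from datetime import date, timedelta

ADRT_DAYS = 30  # abandoned-data retention time (system default)

def recover(dataset: str, abandon_date: date, today: date,
            registry: dict, har_index: dict) -> str:
    """Decide how to recover abandoned data: from today's uncompressed
    registry, from a HAR archive within the retention window, or fail."""
    if abandon_date == today and dataset in registry:
        # Abandoned today: not yet compressed, restore directly.
        return f"restore {registry[dataset]} to input dir"
    if (today - abandon_date).days > ADRT_DAYS:
        return "recovery failed: retention time exceeded"
    if dataset in har_index:
        # Already compressed: extract from the HAR archive.
        return f"extract {har_index[dataset]} to input dir"
    return "recovery failed: data not found"
```
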
Besides the distinguishing, compression, and recovery operations above, abandoned-data processing also includes deleting abandoned data. Moreover, the compression, deletion, and recovery of abandoned data are all performed by the abandoned-data daemon, and there is no necessary ordering among them in time.
By distinguishing abandoned data within the original data, periodically compressing it, and periodically recovering it, this embodiment realizes the processing of the data abandoned from the columnar storage files.
Fifth embodiment
Based on the technical solution above, this embodiment further provides a technical solution for the file consolidation in the offline small-file processing method. In this solution, consolidating the pre-processed and merged columnar storage files into large files at a specified consolidation frequency, based on the Map-Reduce computation model, includes: consolidating the columnar storage files into large files according to their storage directory structure.
Specifically, referring to Fig. 6, consolidating the pre-processed and merged columnar storage files into large files at a specified consolidation frequency, based on the Map-Reduce computation model, includes:
S61: sequentially reading the columnar storage files in a directory that need consolidation.
S62: judging whether the directory still contains columnar storage files that need consolidation; if so, executing S63, and if not, executing S61.
S63: consolidating the small files that were read into a large file.
S64: judging whether the size of the consolidated large file exceeds the size threshold; if so, discarding the large file, and if not, retaining it.
It should be noted that the consolidation operation above is performed on the Map-Reduce computation model. Specifically, it operates on the columnar storage files stored in several directories on the DataNodes, and the Map-Reduce computation model executes the consolidation operations in parallel.
By periodically executing the consolidation operation above and consolidating small files into large files, this embodiment greatly improves the read performance of the Hadoop system.
Sixth embodiment
This embodiment provides a technical solution for an offline small-file processing device. In this solution, the offline small-file processing device is integrated into a Hadoop distributed processing system. Referring to Fig. 7, the offline small-file processing device includes: a reading module 71, a configuration module 72, a pre-processing module 73, and a consolidation module 75.
The reading module 71 is configured to read columnar storage files from HDFS, wherein the size of each columnar storage file is below a predetermined file-size threshold.
The configuration module 72 is configured to provide a configuration that specifies the data processing.
The pre-processing module 73 is configured to pre-process and merge the columnar storage files according to the configuration, based on the Map-Reduce computation model.
The consolidation module 75 is configured to consolidate the pre-processed and merged columnar storage files into large files at a specified consolidation frequency, based on the Map-Reduce computation model.
Further, the offline small-file processing device also includes an abandonment computing module 74.
The abandonment computing module 74 is configured to periodically compress and recover abandoned data, based on the Map-Reduce computation model, after the columnar storage files have been pre-processed and merged.
Further, the pre-processing module 73 includes: a filtering unit, a deduplication unit, a merging unit, and a directory-optimization unit.
The filtering unit is configured to filter the content data of the columnar storage files based on the Map-Reduce computation model.
The deduplication unit is configured to deduplicate the content data of the columnar storage files based on the Map-Reduce computation model.
The merging unit is configured to merge the columnar storage files based on the Map-Reduce computation model.
The directory-optimization unit is configured to optimize the storage directory structure of the columnar storage files based on the Map-Reduce computation model.
Further, the abandonment computing module 74 includes: a distinguishing unit, a compression unit, and a recovery unit.
The distinguishing unit is configured to distinguish abandoned data within the original data based on the Map-Reduce computation model, where abandoned data is data that does not need to participate in offline analysis.
The compression unit is configured to periodically compress abandoned data and generate HAR files, based on the Map-Reduce computation model.
The recovery unit is configured to periodically recover abandoned data based on the Map-Reduce computation model.
Further, the consolidation module is specifically configured to consolidate the columnar storage files into large files according to their storage directory structure.
Further, the columnar storage files include Parquet storage files.
Those skilled in the art will appreciate that each module or step of the invention above can be implemented with a general-purpose computing device; the modules may be concentrated on a single computing device or distributed across a network formed by multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, stored in a storage device and executed by that device; or each may be fabricated as an individual integrated-circuit module, or multiple modules or steps among them may be fabricated as a single integrated-circuit module. The invention is thus not restricted to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (7)
1. An offline small file processing method, running in a Hadoop distributed processing system, characterized by comprising:
reading columnar storage files from HDFS, wherein the size of each columnar storage file is less than a predetermined file size setting value;
providing a configuration specifying the data processing to be performed;
preprocessing and merging the columnar storage files according to the configuration, based on the Map-Reduce computation model; and
merging the preprocessed and merged columnar storage files into large files according to a specified merge frequency, based on the Map-Reduce computation model.
2. The method according to claim 1, characterized by further comprising:
after the columnar storage files have been preprocessed and merged, periodically compressing and recovering discarded data based on the Map-Reduce computation model.
3. The method according to claim 1 or 2, characterized in that preprocessing and merging the columnar storage files according to the configuration, based on the Map-Reduce computation model, comprises:
screening the content data of the columnar storage files based on the Map-Reduce computation model;
deduplicating the content data of the columnar storage files based on the Map-Reduce computation model;
merging the columnar storage files based on the Map-Reduce computation model; and
optimizing the storage directory structure of the columnar storage files based on the Map-Reduce computation model.
4. The method according to claim 2, characterized in that periodically compressing and recovering discarded data based on the Map-Reduce computation model comprises:
identifying the discarded data within the raw data based on the Map-Reduce computation model, wherein the discarded data is data that does not need to participate in offline analysis;
periodically compressing the discarded data based on the Map-Reduce computation model to generate HAR files; and
periodically recovering the discarded data based on the Map-Reduce computation model.
5. The method according to claim 3, characterized in that merging the preprocessed and merged columnar storage files into large files according to the specified merge frequency, based on the Map-Reduce computation model, comprises:
merging the columnar storage files into large files according to the storage directory structure of the columnar storage files.
6. The method according to claim 1 or 2, characterized in that the columnar storage files include Parquet storage files.
7. An offline small file processing apparatus, integrated in a Hadoop distributed processing system, characterized by comprising:
a reading module, configured to read columnar storage files from HDFS, wherein the size of each columnar storage file is less than a predetermined file size setting value;
a configuration module, configured to provide a configuration specifying the data processing to be performed;
a preprocessing module, configured to preprocess and merge the columnar storage files according to the configuration, based on the Map-Reduce computation model; and
a merge module, configured to merge the preprocessed and merged columnar storage files into large files according to a specified merge frequency, based on the Map-Reduce computation model.
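The overall flow of claims 1 and 7 (read small files, apply the configured preprocessing, then merge into large files at the configured cadence) can be sketched as a config-driven pipeline. Everything here is illustrative: the config keys, the batch-count stand-in for the time-based merge frequency, and the callables are assumptions layered on the claim language, not the patent's implementation.

```python
def run_offline_pipeline(read_files, preprocess, merge, config):
    # 1. Read: keep only files below the predetermined small-file size.
    files = [f for f in read_files() if f["size"] < config["max_small_file_size"]]
    # 2. Preprocess: caller-supplied screening/deduplication step.
    cleaned = preprocess(files)
    # 3. Merge: group into batches (a stand-in for the specified merge
    #    frequency) and merge each batch into one large output.
    batches = [cleaned[i:i + config["merge_batch"]]
               for i in range(0, len(cleaned), config["merge_batch"])]
    return [merge(batch) for batch in batches]

config = {"max_small_file_size": 10, "merge_batch": 2}
src = [{"name": "a", "size": 2}, {"name": "b", "size": 3},
       {"name": "big", "size": 50}, {"name": "c", "size": 1}]

result = run_offline_pipeline(
    lambda: src,
    lambda fs: sorted(fs, key=lambda f: f["name"]),  # trivial "preprocess"
    lambda batch: "".join(f["name"] for f in batch),  # trivial "merge"
    config,
)
```

The file over the size threshold (`big`) bypasses the pipeline, matching the claim's restriction to files smaller than the predetermined setting value.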
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710888790.5A CN107577809A (en) | 2017-09-27 | 2017-09-27 | Offline small documents processing method and processing device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710888790.5A CN107577809A (en) | 2017-09-27 | 2017-09-27 | Offline small documents processing method and processing device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107577809A true CN107577809A (en) | 2018-01-12 |
Family
ID=61038862
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710888790.5A Pending CN107577809A (en) | 2017-09-27 | 2017-09-27 | Offline small documents processing method and processing device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107577809A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804566A (en) * | 2018-05-22 | 2018-11-13 | 广东技术师范学院 | A kind of mass small documents read method based on Hadoop |
CN111352897A (en) * | 2020-03-02 | 2020-06-30 | 广东科徕尼智能科技有限公司 | Real-time data storage method, equipment and storage medium |
CN111897772A (en) * | 2020-08-05 | 2020-11-06 | 光大兴陇信托有限责任公司 | Big file data importing method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150163A (en) * | 2013-03-01 | 2013-06-12 | 南京理工大学常熟研究院有限公司 | Map/Reduce mode-based parallel relating method |
CN104767795A (en) * | 2015-03-17 | 2015-07-08 | 浪潮通信信息系统有限公司 | LTE MRO data statistical method and system based on HADOOP |
CN105677836A (en) * | 2016-01-05 | 2016-06-15 | 北京汇商融通信息技术有限公司 | Big data processing and solving system simultaneously supporting offline data and real-time online data |
US20160291900A1 (en) * | 2015-03-30 | 2016-10-06 | International Business Machines Corporation | Adaptive map-reduce pipeline with dynamic thread allocations |
CN106855861A (en) * | 2015-12-09 | 2017-06-16 | 北京金山安全软件有限公司 | File merging method and device and electronic equipment |
- 2017-09-27 CN CN201710888790.5A patent/CN107577809A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150163A (en) * | 2013-03-01 | 2013-06-12 | 南京理工大学常熟研究院有限公司 | Map/Reduce mode-based parallel relating method |
CN104767795A (en) * | 2015-03-17 | 2015-07-08 | 浪潮通信信息系统有限公司 | LTE MRO data statistical method and system based on HADOOP |
US20160291900A1 (en) * | 2015-03-30 | 2016-10-06 | International Business Machines Corporation | Adaptive map-reduce pipeline with dynamic thread allocations |
CN106855861A (en) * | 2015-12-09 | 2017-06-16 | 北京金山安全软件有限公司 | File merging method and device and electronic equipment |
CN105677836A (en) * | 2016-01-05 | 2016-06-15 | 北京汇商融通信息技术有限公司 | Big data processing and solving system simultaneously supporting offline data and real-time online data |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804566A (en) * | 2018-05-22 | 2018-11-13 | 广东技术师范学院 | A kind of mass small documents read method based on Hadoop |
CN111352897A (en) * | 2020-03-02 | 2020-06-30 | 广东科徕尼智能科技有限公司 | Real-time data storage method, equipment and storage medium |
CN111897772A (en) * | 2020-08-05 | 2020-11-06 | 光大兴陇信托有限责任公司 | Big file data importing method |
CN111897772B (en) * | 2020-08-05 | 2024-02-20 | 光大兴陇信托有限责任公司 | Large file data importing method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11074560B2 (en) | Tracking processed machine data | |
US9922114B2 (en) | Systems and methods for distributing indexer configurations | |
CN111400408A (en) | Data synchronization method, device, equipment and storage medium | |
US10417265B2 (en) | High performance parallel indexing for forensics and electronic discovery | |
DE202012013469U1 (en) | Data Processing Service | |
CN102027457A (en) | Managing storage of individually accessible data units | |
US11429658B1 (en) | Systems and methods for content-aware image storage | |
CN107577809A (en) | Offline small documents processing method and processing device | |
US20240095170A1 (en) | Multi-cache based digital output generation | |
CN112084190A (en) | Big data based acquired data real-time storage and management system and method | |
CN114385760A (en) | Method and device for real-time synchronization of incremental data, computer equipment and storage medium | |
CN112100197B (en) | Quasi-real-time log data analysis and statistics method based on Elasticissearch | |
KR101955376B1 (en) | Processing method for a relational query in distributed stream processing engine based on shared-nothing architecture, recording medium and device for performing the method | |
CN106919566A (en) | A kind of query statistic method and system based on mass data | |
CN106326400A (en) | Multi-dimension data set-based data processing system | |
CN111274316B (en) | Method and device for executing multi-level data stream task, electronic equipment and storage medium | |
CN111125045B (en) | Lightweight ETL processing platform | |
CN113779215A (en) | Data processing platform | |
CN112965939A (en) | File merging method, device and equipment | |
CN111680072A (en) | Social information data-based partitioning system and method | |
US10037155B2 (en) | Preventing write amplification during frequent data updates | |
Singh | NoSQL: A new horizon in big data | |
CN111858480A (en) | Data processing method and device and computer storage medium | |
CN112632020B (en) | Log information type extraction method and mining method based on spark big data platform | |
US20230297878A1 (en) | Metadata-driven feature store for machine learning systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180112 |