CN109446165A - File merging method and device for a big data platform - Google Patents
- Publication number
- CN109446165A (application number CN201811182327.XA)
- Authority
- CN
- China
- Prior art keywords
- file
- directory
- change
- big data
- small files
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a file merging method and device for a big data platform. The method comprises: monitoring directory changes on the big data platform, and judging whether the number of files under a changed directory has changed; when the number of files under the changed directory has changed, grouping the files under the changed directory that have similar features; judging whether a set number of small files, each smaller than a set integer multiple of the data block size, exist in the same group of files; and, when such small files exist in the same group of files, obtaining the small files of that group and merging them. This scheme reduces small files, optimizes the memory usage of the NameNode, and allows the big data platform to hold more files.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a file merging method and device for a big data platform.
Background art
On a big data platform such as a Hadoop cluster, data directories often contain massive numbers of small files during data analysis. These small files put great pressure on the NameNode and can reduce the computing efficiency of the cluster by several times or even tens of times. In the prior art, a separate functional component has to be developed for each group of data directories or each category of target data in order to merge the files.
However, existing file merging schemes can only arrange their scheduling plan by time. This service mode has many drawbacks. First, the development work is fragmented and the development cost is high. Second, the scheduling plan cannot be arranged according to the actual state of the data: when a task starts there may be few small files, which wastes cluster computing resources, or new files may be written to the directory after the task has run, so the problem of the number of small files is not properly solved. Third, the computing resources used for each round of file processing cannot be requested dynamically according to the actual situation.
Summary of the invention
In view of this, the present invention provides a file merging method and device for a big data platform, so as to reduce small files, optimize the memory usage of the NameNode, and allow the big data platform to hold more files.
To achieve the above object, the present invention adopts the following scheme:
In an embodiment of the invention, a file merging method for a big data platform comprises:
monitoring directory changes on the big data platform, and judging whether the number of files under a changed directory has changed;
when the number of files under the changed directory has changed, grouping the files under the changed directory that have similar features;
judging whether a set number of small files, each smaller than a set integer multiple of the data block size, exist in the same group of files;
when the small files exist in the same group of files, obtaining the small files of that group and merging the small files of that group.
In an embodiment of the invention, a file merging device for a big data platform comprises:
a file monitoring unit, configured to monitor directory changes on the big data platform and judge whether the number of files under a changed directory has changed;
a file grouping unit, configured to group the files under the changed directory that have similar features when the number of files under the changed directory has changed;
a small file judging unit, configured to judge whether a set number of small files, each smaller than a set integer multiple of the data block size, exist in the same group of files;
a file merging unit, configured to obtain the small files of the same group and merge them when the small files exist in that group of files.
In an embodiment of the invention, a computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the steps of the method of the above embodiment when executing the program.
In an embodiment of the invention, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the method of the above embodiment.
The file merging method, file merging device, computer device and computer-readable storage medium of the present invention monitor directory changes on the big data platform, group the files with similar features under a changed directory, and merge the small files in the same group that are smaller than a set integer multiple of the data block size. This reduces small files, optimizes the memory usage of the NameNode, allows the big data platform (for example, a cluster) to hold more files, and helps the platform control file sizes at a fine granularity. Because merging is driven by the monitored directory changes, it can be completed within the shortest time after small files are generated, which improves the timeliness and efficiency of file merging. Because grouped merging is performed only when small files exist in a group of files, the resources of the big data platform are greatly saved and resource allocation becomes more reasonable.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
Fig. 1 is a flow diagram of a file merging method for a big data platform according to an embodiment of the invention;
Fig. 2 is a flow diagram of a method for monitoring directory changes on the big data platform and judging whether the number of files under a changed directory has changed, according to an embodiment of the invention;
Fig. 3 is a flow diagram of a method for monitoring directory changes on the big data platform and judging whether the number of files under a changed directory has changed, according to another embodiment of the invention;
Fig. 4 is a flow diagram of a method for grouping the files with similar features under a changed directory according to an embodiment of the invention;
Fig. 5 is a flow diagram of a file merging method for a big data platform according to another embodiment of the invention;
Fig. 6 is a flow diagram of a method for merging the small files of the same group according to an embodiment of the invention;
Fig. 7 is a flow diagram of a file merging method for a big data platform according to a further embodiment of the invention;
Fig. 8 is a flow diagram of a method for grouping the files with similar features under a changed directory according to another embodiment of the invention;
Fig. 9 is an interaction diagram of a file merging method for a big data platform according to an embodiment of the invention;
Fig. 10 is a structural diagram of a file merging device for a big data platform according to an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in further detail below with reference to the drawings. The illustrative embodiments and their descriptions are used to explain the present invention and do not limit it.
Fig. 1 is a flow diagram of a file merging method for a big data platform according to an embodiment of the invention. As shown in Fig. 1, the file merging method of some embodiments may include:
Step S110: monitoring directory changes on the big data platform, and judging whether the number of files under a changed directory has changed;
Step S120: when the number of files under the changed directory has changed, grouping the files under the changed directory that have similar features;
Step S130: judging whether a set number of small files, each smaller than a set integer multiple of the data block size, exist in the same group of files;
Step S140: when the small files exist in the same group of files, obtaining the small files of that group and merging the small files of that group.
In step S110, the big data platform may be, for example, a Hadoop cluster, and directory changes of the Hadoop cluster can be monitored by monitoring HDFS. Changes to the trunk directories of the big data platform can be monitored, or changes to the leaf directories of those trunk directories can be monitored at the same time; the directories actually monitored can be determined by configuration as needed. A changed directory may include a modified directory: for example, if a file under a directory has been modified, a file has been added, or a stabilization period has passed since a file was modified, that directory is a modified directory. The number of files under a directory can be learned from its file list.
In step S120, when the number of files under a changed directory changes, for example when it increases, redundant files are likely to appear under that directory. Similar features can be judged from the file name, the file content, the file schema, and so on. Files with similar features may include files whose names, file name suffixes, file formats, etc. share something in common, for example the same suffix, the same file name length, or characters in the file name that follow a specific pattern. Files with the same suffix can be placed in the same group, or files that have the same suffix and a file name of a set length can be placed in the same group.
In step S130, a data block may be the minimum storage unit configured on the big data platform. The set number and the set integer multiple of the data block size can be configured as needed. For example, the set integer multiple of the data block size is determined from the storage mode of the big data platform, and the set number is determined from the pattern of file sizes, the set integer multiple of the data block size, and the storage mode of the big data platform.
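The judgment in step S130 can be sketched as follows. This is an illustration only, under stated assumptions: the helper names `small_files` and `has_enough_small_files`, the 128 MiB block size and the sample file sizes are hypothetical and do not come from the patent text.

```python
from typing import Dict, List

MIB = 1024 * 1024
BLOCK_SIZE = 128 * MIB  # assumed storage block size of the platform


def small_files(group: Dict[str, int], size_limit: int) -> List[str]:
    """Return the names of files in one group that are smaller than size_limit."""
    return sorted(name for name, size in group.items() if size < size_limit)


def has_enough_small_files(group: Dict[str, int], size_limit: int, set_number: int) -> bool:
    """True when at least set_number files in the group are below size_limit."""
    return len(small_files(group, size_limit)) >= set_number


# One group of files: name -> size in bytes (hypothetical data).
group = {"a.orc": 40 * MIB, "b.orc": 10 * MIB, "c.orc": 300 * MIB}
```

With a size limit of one block and a set number of 2, this group would qualify for merging; with a set number of 3 it would not.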
In step S140, there may be many files under a changed directory, and the files under one changed directory may be divided into one or more groups. Each group of small files can be returned, and the big data platform merges the file data group by group. Merging can be carried out with an existing or specially designed merging program. For example, for the small files of a Hadoop cluster, the APIs (application programming interfaces) provided by Hive, Spark or Spark SQL can be used directly: the data of several small files is read in one pass and then rewritten into files according to a predefined rule, which implements the file merge operation.
The main goal of conventional file merging is to reduce the number of files and the space they occupy. The main goal of file merging on a big data platform is not to reduce the number of files or the total disk space occupied, but to control the size of each file precisely, so that each merged file is exactly an integer multiple of the platform's storage block size, for example the cluster BLOCK size (usually 64M, 128M or 256M). If the cluster BLOCK is 128M, a 128M file occupies only one BLOCK, whereas a 129M file occupies two. A big data platform does not simply pursue fewer files. Take 1G of data as an example: if it is placed in one file and the cluster replication factor is 3, the data exists on only 3 different data nodes, and computation can run concurrently only on those three nodes; otherwise the network cost of pulling the data has to be considered. If the 1G of data is instead placed in four 256M files, the data may be spread over 10 or even more nodes, and concurrency during computation can be higher. Of course, too many files in turn raise the memory load on the NameNode, so merging small files on a big data platform is comparatively complex.
Suppose the storage block size configured on a platform is 128M. If a file on the platform is only 40M, it is much smaller than the data block size yet still occupies a data block, so it needs to be merged. Another file of 150M is larger than one data block but much smaller than two data blocks, so it wastes storage in the second data block; this file also needs to be merged.
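The arithmetic behind these examples can be sketched as below. The function names are hypothetical, and "occupied" and "unused" are counted the way the text above counts them, that is, in whole blocks.

```python
MIB = 1024 * 1024


def blocks_occupied(size: int, block_size: int) -> int:
    """Number of blocks a file of `size` bytes occupies (at least one)."""
    return max(1, -(-size // block_size))  # integer ceiling division


def unused_bytes(size: int, block_size: int) -> int:
    """Capacity left unused in the blocks the file occupies."""
    return blocks_occupied(size, block_size) * block_size - size


block = 128 * MIB
for size_mib in (40, 128, 129, 150):
    size = size_mib * MIB
    print(size_mib, blocks_occupied(size, block), unused_bytes(size, block) // MIB)
```

For a 128M block, the 40M file occupies one block with 88M unused, and the 150M file occupies two blocks with 106M unused, which is why both are treated as merge candidates.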
In this embodiment, directory changes on the big data platform are monitored; when the number of files under a changed directory changes, the files with similar features under that directory are grouped, and the small files in the same group that are smaller than the set integer multiple of the data block size are merged. This makes it possible to monitor the cluster directories and to analyze and merge small files automatically, which reduces small files, optimizes the memory usage of the NameNode, allows the big data platform (for example, a cluster) to hold more files, and helps the platform control file sizes at a fine granularity. Moreover, because merging is driven by the monitored directory changes, it can be completed within the shortest time after small files are generated, which improves the timeliness and efficiency of file merging. When small files exist in a group of files, the files with similar features are merged group by group, rather than under a scheduling plan that is simply arranged by timed tasks or fixed execution intervals; this greatly saves the resources of the big data platform and makes resource allocation more reasonable.
Fig. 2 is a flow diagram of a method for monitoring directory changes on the big data platform and judging whether the number of files under a changed directory has changed, according to an embodiment of the invention. As shown in Fig. 2, step S110, that is, monitoring directory changes on the big data platform and judging whether the number of files under a changed directory has changed, may include:
Step S111: obtaining the directories to be monitored; polling the directory information of the big data platform according to the directories to be monitored, and obtaining the current file list of a changed directory;
Step S112: obtaining the history file list of the changed directory; comparing the current file list with the history file list to judge whether the number of files under the changed directory has increased, and thereby judging whether the number of files under the changed directory has changed.
In step S111, the directories to be monitored can be configured as needed; for example, they may include the trunk directories of the big data platform, or additionally the leaf directories of those or other trunk directories. With polling, the directory information of the big data platform can be queried periodically in sequence, for example by querying the directory information of a Hadoop cluster through the NameNode, an API, an HDFS client, or similar means. The directory information reflects the current state of a directory to be monitored and may include, for example, the file directory, the directory's most recent modification time, and the list of files under the directory.
In step S112, the current file list may be the list of file names in the current directory and its subdirectories. The history file list may be the list of file names in the historical directory and its subdirectories. The historical state of the directories of the big data platform, including historical modification information, history file lists, and so on, can be recorded in the record center of the big data platform. The corresponding directory can therefore be looked up in the record center according to the changed directory, and the corresponding history file list obtained from the directory found. For each changed directory, whether there are newly added files can be judged by comparing its current file list with its history file list. In other embodiments, whether new data files exist in a directory can be judged by comparing the current directory state with the historical state registered in the record center.
In this embodiment, comparing the current file list with the history file list makes it easy to judge changes in the number of files. Judging whether the number of files has increased identifies the situations in which file redundancy is most likely to occur or in which file sizes most need fine-grained control.
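The list comparison of step S112 can be sketched with plain set operations. The function names and the sample file names are hypothetical; in practice the two lists would come from the platform query and from the record center.

```python
from typing import List, Sequence


def added_files(current: Sequence[str], history: Sequence[str]) -> List[str]:
    """Files present in the current list but absent from the history list."""
    return sorted(set(current) - set(history))


def file_count_increased(current: Sequence[str], history: Sequence[str]) -> bool:
    """True when the directory now holds more distinct files than recorded."""
    return len(set(current)) > len(set(history))


history_list = ["part-00001.orc", "part-00002.orc"]
current_list = ["part-00001.orc", "part-00002.orc", "part-00003.orc"]
print(added_files(current_list, history_list))
```

Keeping both checks separate is a design choice of this sketch: a directory in which one file was deleted and another added has new files even though its file count is unchanged.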
Fig. 3 is a flow diagram of a method for monitoring directory changes on the big data platform and judging whether the number of files under a changed directory has changed, according to another embodiment of the invention. As shown in Fig. 3, the method shown in Fig. 2 may further include:
Step S113: obtaining the current modification time and the most recent historical modification time of the changed directory;
Step S114: judging whether the difference between the current modification time and the most recent historical modification time is greater than a set duration, so as to judge whether the number of files under the changed directory has changed.
In step S113, the directory information of the big data platform can be polled according to the directories to be monitored to obtain the current modification time of the changed directory. The current modification time can be obtained together with the current file list of the changed directory. After the current modification time has been obtained, the most recent historical modification time of the changed directory can be obtained. The most recent historical modification time is the time of the last modification; it can be recorded in the record center in advance and fetched from there when needed.
In step S114, the set duration can be configured as needed. Judging whether the difference between the current modification time and the most recent historical modification time is greater than the set duration makes it possible to judge whether the directory has been stable since its last modification.
In this embodiment, judging whether the difference between the current modification time and the most recent historical modification time is greater than the set duration makes it possible to judge whether the files under the changed directory have been stable for a period of time since the last modification. The resources of the big data platform can thus be used reasonably according to how files are being modified, and resources are not wasted on file merging actions performed when they are unnecessary.
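The comparison in step S114 can be sketched as follows, following the wording of the step literally. The function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def exceeds_set_duration(current_mtime: datetime,
                         last_history_mtime: datetime,
                         set_duration: timedelta) -> bool:
    """True when the two modification times differ by more than set_duration."""
    return current_mtime - last_history_mtime > set_duration


last_recorded = datetime(2018, 10, 11, 12, 0)   # fetched from the record center
current = datetime(2018, 10, 11, 12, 10)        # returned by the directory query
print(exceeds_set_duration(current, last_recorded, timedelta(minutes=5)))
```

The set duration would normally come from configuration, as the text describes for the stabilization time.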
Fig. 4 is a flow diagram of a method for grouping the files with similar features under a changed directory according to an embodiment of the invention. As shown in Fig. 4, grouping the files with similar features under the changed directory in step S120 may include:
Step S121: determining, according to a file naming rule, whether the files under the changed directory have similar features; or reading part of the data of the files under the changed directory and determining, according to the schema information contained in the data read, whether the files under the changed directory have similar features; the file naming rule includes one or more of a file name length pattern, a pattern in the characters contained in the file name, and consistency of the file name suffix;
Step S122: placing the files with similar features under the changed directory into the same group.
In step S121, the file name length pattern may be that the file names of two files have the same length or a set length. The pattern in the characters contained in the file name may be that the file names of two files contain the same letters or digits. Consistency of the file name suffix may mean that two files have the same suffix. Determining whether the files under the changed directory have similar features according to the file naming rule makes it possible to find mergeable files on the basis of their names.
Part of the data of the files under the changed directory is read, for example the schema information contained in files such as JSON and ORC files. When analysis shows that the schema information in the data read is consistent, the files under the changed directory can be considered to have similar features, which triggers the file merging action. In this way, mergeable files can be found on the basis of file content.
In this embodiment, grouping files by file name or file content makes it easy to merge them.
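Grouping by file name can be sketched as below, assuming the naming rule is "same suffix and same file name length". The helper names are hypothetical, and the length is taken without the suffix, which is an assumption of this sketch.

```python
import os
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

GroupKey = Tuple[str, int]


def group_key(file_name: str) -> GroupKey:
    """Key a file by its suffix and by the length of its name without the suffix."""
    stem, suffix = os.path.splitext(file_name)
    return suffix, len(stem)


def group_by_name(file_names: Iterable[str]) -> Dict[GroupKey, List[str]]:
    """Place files that share the same key into the same group."""
    groups: Dict[GroupKey, List[str]] = defaultdict(list)
    for name in file_names:
        groups[group_key(name)].append(name)
    return dict(groups)


names = ["part-00001.orc", "part-00002.orc", "part-1.orc", "events.json"]
groups = group_by_name(names)
```

Here the two `part-0000N.orc` files fall into one group, while `part-1.orc` and `events.json` each form their own group.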
Fig. 5 is a flow diagram of a file merging method for a big data platform according to another embodiment of the invention. As shown in Fig. 5, before step S130, that is, before judging whether a set number of small files each smaller than a set integer multiple of the data block size exist in the same group of files, the method shown in Fig. 1 may further include:
Step S150: determining the set integer multiple of the data block size according to the size of the storage block configured on the big data platform.
On a big data platform such as a cluster, files are stored in blocks, in integer multiples of the cluster block (BLOCK) size, and a file smaller than one cluster block still occupies a cluster block. The set integer multiple of the data block size is determined from the size of the storage block configured on the big data platform, for example from the cluster block size. By finding and merging small files that are smaller than the set integer multiple of the data block size, or files that are larger than a single block but not an integer multiple of the block size, small files that occupy a cluster block without making full use of it can be merged into larger files. This not only reduces the number of files but also allows the size of each file to be controlled precisely.
In this embodiment, determining the set integer multiple of the data block size according to the size of the storage block configured on the big data platform allows the size of each file to be controlled precisely.
In some embodiments, the file merging method shown in Fig. 5 may further include: splitting the merged small files according to the size of the storage block configured on the big data platform. For example, the data in all ORC files with a consistent schema under a specified directory xxxx is loaded and rewritten in ORC format under directory yyyy, with a new file split off every 256M. Splitting the merged small files according to the size of the storage block configured on the big data platform can reduce the number of nodes and gives fine control over file sizes.
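The split described above can be sketched as a size plan: given the total size of the merged data and a target file size, compute the sizes of the output files. The function name is hypothetical, and the sketch only plans sizes; it does not read or write any data.

```python
from typing import List

MIB = 1024 * 1024


def split_sizes(total_size: int, target_size: int) -> List[int]:
    """Sizes of the output files when a new file is started every target_size bytes."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    full_files, remainder = divmod(total_size, target_size)
    return [target_size] * full_files + ([remainder] if remainder else [])


# 600M of merged data, split every 256M as in the example above.
plan = split_sizes(600 * MIB, 256 * MIB)
print([size // MIB for size in plan])
```

In this plan only the last output file can be smaller than the target size, so at most one file per merge leaves part of a block unused.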
Fig. 6 is a flow diagram of a method for merging the small files of the same group according to an embodiment of the invention. As shown in Fig. 6, step S140, that is, obtaining the small files of the same group and merging them when the small files exist in that group of files, may include:
Step S141: when the small files exist in the same group of files, obtaining the small files of that group, and requesting resources from the big data platform according to the number and sizes of the small files in the group;
Step S142: invoking a file merging program with the requested resources to merge the small files of the group.
In step S141, a resource list is selected according to the number and sizes of the files in the small file group, and resources are requested from the cluster. In step S142, the specified merging program can be used to perform the small file merge operation. If the configuration center provides a processing module for this kind of directory or file type, the configured module is used; otherwise a default module is called according to the file type to generate the merging program.
In this embodiment, the resources needed for the merge operation can be requested dynamically according to the actual state of the data, and the most reasonable resource allocation is selected automatically.
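One possible way to select a resource list from the number and sizes of the small files is a tier table, sketched below. The tier boundaries, the executor counts and the memory figures are purely hypothetical values chosen for illustration; the text does not specify how the resource list is defined.

```python
from typing import List, NamedTuple

MIB = 1024 ** 2
GIB = 1024 ** 3


class ResourceTier(NamedTuple):
    max_total_bytes: float   # upper bound on the total size of the group
    max_file_count: float    # upper bound on the number of files in the group
    executors: int           # hypothetical number of workers to request
    memory_mb: int           # hypothetical memory per worker


# Hypothetical resource list, ordered from smallest to largest.
RESOURCE_LIST = [
    ResourceTier(1 * GIB, 100, 1, 1024),
    ResourceTier(10 * GIB, 1000, 4, 2048),
    ResourceTier(float("inf"), float("inf"), 8, 4096),
]


def select_resources(file_sizes: List[int]) -> ResourceTier:
    """Pick the first tier that covers both the total size and the file count."""
    total, count = sum(file_sizes), len(file_sizes)
    for tier in RESOURCE_LIST:
        if total <= tier.max_total_bytes and count <= tier.max_file_count:
            return tier
    return RESOURCE_LIST[-1]
```

The selected tier would then be passed to whatever resource request mechanism the cluster provides before the merging program is invoked.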
Fig. 7 is a flow diagram of a file merging method for a big data platform according to a further embodiment of the invention. As shown in Fig. 7, the method shown in Fig. 1 may further include:
Step S160: updating the recorded directory information of the merged small files, and recording the corresponding merge information; the merge information includes the file names before and after merging.
In step S160, the file names before and after merging are the names of each file before the merge and the names of the files after the merge. The processed directory information can be registered in the record center again. The merge information may also include information such as the merge time.
In this embodiment, updating the directory information makes it easy to find the merged files. Recording the merge information makes it easy to trace the modification history of merged files.
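The bookkeeping of step S160 can be sketched with a small in-memory record center. The class and method names are hypothetical, and a real record center would persist this information rather than keep it in memory.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MergeRecord:
    directory: str
    files_before: List[str]   # names of the files before merging
    files_after: List[str]    # names of the files after merging
    merged_at: str            # merge time, kept as text for simplicity


class RecordCenter:
    """In-memory stand-in for the record center (illustrative only)."""

    def __init__(self) -> None:
        self.file_lists: Dict[str, List[str]] = {}
        self.merges: List[MergeRecord] = []

    def register_merge(self, record: MergeRecord, current_files: List[str]) -> None:
        """Store the updated file list of the directory and the merge information."""
        self.file_lists[record.directory] = sorted(current_files)
        self.merges.append(record)

    def sources_of(self, directory: str, merged_name: str) -> Optional[List[str]]:
        """Look up which original files a merged file was built from."""
        for record in reversed(self.merges):
            if record.directory == directory and merged_name in record.files_after:
                return record.files_before
        return None


center = RecordCenter()
record = MergeRecord("/data/xx", ["a.orc", "b.orc"], ["merged-0.orc"], "2018-10-11T12:00:00")
center.register_merge(record, ["merged-0.orc", "big.orc"])
```

The stored file list then serves as the history file list the next time the directory is polled.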
Fig. 8 is a flow diagram of a method for grouping the files with similar features under a changed directory according to another embodiment of the invention. As shown in Fig. 8, before step S121, that is, before determining according to the file naming rule whether the files under the changed directory have similar features, the method shown in Fig. 4 may further include:
Step S123: obtaining the file naming rule; the file naming rule is generated by configuration, or generated by classification or cluster learning on historical merge information.
In step S123, the historical merge information may include the names of the files before merging, the names of the files after merging, the merge time, the directory, and so on; it can be obtained from the merge information recorded after each merge is completed.
When the file naming rules for merge grouping are custom configured, several rules can be configured directly. For example, a rule may specify that under directory XX and its subdirectories, files with suffix xxx and a file name length of 10 are to be monitored and merged automatically. Such configured rules need to support regular expressions.
When the file naming rules for merge grouping are self-learned, similar file patterns can be analyzed automatically by machine learning algorithms. One approach relies on classification: for example, if the file names under a directory follow similar rules (the same suffix, the same file name length, a regular distribution of letters, digits and symbols in the file name, and so on), the group of files is judged to be mergeable. Another approach relies on analogy with historical processing: for example, if a certain class of files under directory A has been merged manually or by configuration, and directory B is found to contain similar files, the files under directory B are presumed to be mergeable as well. A further approach, file content analysis, judges by file format: for example, for files that contain schema information, such as JSON and ORC files, file merging can also be triggered after part of the file data has been read and the schemas of the files have been found to be consistent.
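A configured rule of the kind described above (a directory and its subdirectories, a given suffix, a file name length of 10) can be sketched with a regular expression. The helper names, the `orc` suffix and the sample paths are hypothetical, and the length is assumed to refer to the file name without its suffix.

```python
import re
from typing import List, Pattern


def build_rule(suffix: str, name_length: int) -> Pattern[str]:
    """Regex for a base name of exactly name_length characters plus the suffix."""
    return re.compile(r"[^/]{%d}\.%s" % (name_length, re.escape(suffix)))


def matching_files(paths: List[str], directory: str, rule: Pattern[str]) -> List[str]:
    """Files under `directory` (including subdirectories) whose base name fits the rule."""
    prefix = directory.rstrip("/") + "/"
    return [
        path for path in paths
        if path.startswith(prefix) and rule.fullmatch(path.rsplit("/", 1)[-1])
    ]


paths = [
    "/data/xx/part-00001.orc",
    "/data/xx/sub/part-00002.orc",
    "/data/xx/short.orc",
    "/data/xx/part-00004.txt",
    "/data/yy/part-00003.orc",
]
rule = build_rule("orc", 10)
print(matching_files(paths, "/data/xx", rule))
```

A self-learned rule could be stored in the same form, so that configured and learned rules are applied by the same matching step.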
To help those skilled in the art better understand the present invention, its implementation is illustrated below with a specific embodiment. For ease of description, the file merging method for a big data platform is illustrated using a Hadoop cluster as an example, but this does not limit the big data platform.
Fig. 9 is an interaction diagram of a file merging method for a big data platform according to an embodiment of the invention. Referring to Fig. 9, the file merging method of this embodiment may include:
Step 1: The main control program reads the configuration of the monitored directories from the configuration center, including but not limited to the trunk directories and leaf directories to be monitored, file wildcards, the file block size, the threshold for the number of files to be merged, and the directory stabilization duration.
Step 2: According to the configured directories, directory information is queried by polling (for example through the NameNode, an API, or an HDFS client) to obtain the latest modification time and the file list of each directory.
Step 3: The current directory state is compared with the historical state registered in the record center to determine whether new data files exist in the directory.
Step 4: The small files are grouped according to the file naming rules from the original configuration and those self-learned in Step 8. A small-file group is returned when its content and format are consistent and the number of files smaller than the block size reaches the configured threshold.
Step 5: The small-file group is processed and a merge plan is generated. If the configuration center provides a processing module for this kind of directory or file type, the configured module is used; otherwise a default module is called to generate the merge program according to the file type.
Step 6: A resource list is selected according to the number and size of the files in the small-file group, resources are requested from the cluster, and the small-file merge operation is executed using the merge program specified in Step 5.
Step 7: The processed directory information is registered in the record center again.
Step 8: Based on the historical merge records, the record center calls a classification or clustering algorithm to analyze the file classification rules and persists them for use in Step 4.
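The module selection in Step 5 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the names pick_merger, DEFAULT_MERGERS, default_text_merger, and default_orc_merger are hypothetical, the configured modules are assumed to be keyed by a directory wildcard, and each "module" is reduced to a function that returns a description of the merge plan.

```python
import fnmatch


def default_text_merger(paths):
    """Hypothetical default module for line-oriented files."""
    return "concatenate %d text files" % len(paths)


def default_orc_merger(paths):
    """Hypothetical default module for ORC files."""
    return "rewrite %d ORC files into one" % len(paths)


# Assumed mapping from file suffix to default module.
DEFAULT_MERGERS = {".json": default_text_merger, ".orc": default_orc_merger}


def pick_merger(directory, suffix, configured):
    """Prefer a module configured for the directory; otherwise fall back by file type.

    configured maps a directory wildcard (for example "/data/logs/*") to a module.
    """
    for pattern, merger in configured.items():
        if fnmatch.fnmatchcase(directory, pattern):
            return merger
    return DEFAULT_MERGERS.get(suffix, default_text_merger)
```

Keeping the configured lookup ahead of the type-based default mirrors the order given in Step 5: an explicit configuration wins, and the file type is only consulted when nothing is configured.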
In this embodiment, directory changes in HDFS are monitored. When the number of files under a directory has changed and has then remained stable for X minutes, it is determined whether the directory contains N files smaller than size M (X, N, and M are configurable), and whether this group of small files comes from the same data source is determined from the file names (for example by file suffix, by splitting rule and length, by wildcard or regular expression, or by classification analysis of previously processed file names). If the small files reach the threshold, resources are requested dynamically according to the number and size of the files and the merge is performed. Small-file merging for the cluster only needs to be planned in advance and requires no separate development or manual intervention, which reduces maintenance cost. Different merge rules can be specified for hot and cold data, so that the cluster's file configuration is more reasonable. File merging is more timely: the merge can be completed in the shortest possible time after the small files are generated, which greatly saves cluster resources. The resources required for the merge operation can be requested dynamically according to the actual data, and the most reasonable resource ratio is selected automatically, which is more efficient and makes resource allocation more reasonable.
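The trigger condition built from X, N, and M above can be expressed as a short check. The following Python sketch is illustrative only; the function name should_merge and the use of epoch seconds for timestamps are assumptions, not part of the source.

```python
def should_merge(file_sizes, last_change_ts, now_ts, x_minutes, n, m_bytes):
    """Return True when the directory has been stable for X minutes
    and holds at least N files smaller than M bytes.

    last_change_ts and now_ts are epoch seconds (an assumption of this sketch).
    """
    stable = (now_ts - last_change_ts) >= x_minutes * 60
    small_count = sum(1 for size in file_sizes if size < m_bytes)
    return stable and small_count >= n
```

Checking stability first avoids merging while a writer is still producing files in the directory.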
Based on the same inventive concept as the file merging method of the big data platform shown in Fig. 1, an embodiment of the invention also provides a file merging device for a big data platform, as described in the following embodiments. Because the principle by which the file merging device solves the problem is similar to that of the file merging method, the implementation of the device may refer to the implementation of the method, and repeated details are omitted.
Fig. 10 is a structural schematic diagram of the file merging device of the big data platform according to an embodiment of the invention. As shown in Fig. 10, the file merging device of the big data platform of some embodiments may include a file monitoring unit 210, a file grouping unit 220, a small-file judging unit 230, and a file merging unit 240, connected in sequence.
The file monitoring unit 210 is configured to monitor directory changes of the big data platform and to determine whether the number of files under a changed directory has changed;
The file grouping unit 220 is configured to group the files under the changed directory that have similar features when the number of files under the changed directory has changed;
The small-file judging unit 230 is configured to determine whether a set number of small files, each smaller than a set integer multiple of the data block size, exist among the files of a same group;
The file merging unit 240 is configured to obtain the small files of the same group when the small files exist among the files of the same group, and to merge the small files of the same group.
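For line-oriented formats, the merge performed by the file merging unit can be as simple as concatenation. The sketch below is a Python illustration on a local filesystem rather than HDFS; the name merge_files is hypothetical, and plain byte concatenation is an assumption that only holds for formats such as newline-delimited text or JSON, not for container formats such as ORC.

```python
import os
import shutil


def merge_files(paths, target):
    """Concatenate the files in order into target, then delete the originals.

    target must not be one of the input paths.
    """
    with open(target, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    # Remove the small files only after the merged file has been written and closed.
    for path in paths:
        os.remove(path)
    return target
```

Deleting the inputs only after the output is closed keeps the data recoverable if the copy fails midway.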
In some embodiments, the file monitoring unit 210 may include a current file list obtaining module and a file quantity judging module, which are connected to each other. The current file list obtaining module is configured to obtain the directory to be monitored, to query the directory information of the big data platform by polling according to the directory to be monitored, and to obtain the current file list of the changed directory. The file quantity judging module is configured to obtain the historical file list of the changed directory and, by comparing the current file list with the historical file list, to determine whether the number of files under the changed directory has increased, and thereby whether the number of files under the changed directory has changed.
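The comparison of the current and historical file lists can be sketched with set operations. This Python fragment is illustrative; the names new_files and file_count_increased are hypothetical, and file lists are assumed to be plain lists of file names.

```python
def new_files(current_listing, history_listing):
    """Files present in the current listing but absent from the recorded history, sorted."""
    return sorted(set(current_listing) - set(history_listing))


def file_count_increased(current_listing, history_listing):
    """Return True if the directory now holds more distinct files than recorded."""
    return len(set(current_listing)) > len(set(history_listing))
```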
In some embodiments, the file monitoring unit 210 may further include a modification time obtaining module and a change time judging module, which are connected to each other. The modification time obtaining module is configured to obtain the current modification time and the most recent historical modification time of the changed directory. The change time judging module is configured to determine whether the difference between the current modification time and the most recent historical modification time is greater than a set duration, and thereby whether the number of files under the changed directory has changed.
In some embodiments, the file grouping unit 220 may include an identity judging module and a file group dividing module, which are connected to each other. The identity judging module is configured to determine, according to a file naming rule, whether the files under the changed directory have similar features; or to read part of the data of the files under the changed directory and determine, according to the schema information contained in the data read, whether the files under the changed directory have similar features. The file naming rule includes one or more of a rule on file name length, a rule on the characters contained in the file name, and consistency of the file name suffix. The file group dividing module is configured to place the files under the changed directory that have similar features in a same group.
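One way to apply the three naming rules listed above is to derive a grouping key from each file name. The following Python sketch is an assumption-laden illustration: the names naming_key and group_by_naming_rule are hypothetical, and replacing every digit with "#" is one possible reading of "a rule on the characters contained in the file name".

```python
import os
import re
from collections import defaultdict


def naming_key(filename):
    """Build a grouping key from suffix, name length, and character pattern."""
    stem, suffix = os.path.splitext(filename)
    # Digits usually carry dates or sequence numbers, so they are masked.
    shape = re.sub(r"[0-9]", "#", stem)
    return (suffix.lower(), len(stem), shape)


def group_by_naming_rule(filenames):
    """Group file names that share the same naming key, preserving input order."""
    groups = defaultdict(list)
    for name in filenames:
        groups[naming_key(name)].append(name)
    return dict(groups)
```

For example, orders_001.json and orders_002.json share a key, while events_001.json and orders_003.orc each fall into their own group.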
In some embodiments, the file merging device of the big data platform may further include a set data block multiple size determining unit, connected to the small-file judging unit 230. The set data block multiple size determining unit is configured to determine the set integer multiple of the data block size according to the size of the storage block configured for the big data platform.
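Deriving the small-file limit from the configured block size is a one-line calculation. The Python sketch below uses hypothetical names (small_file_limit, count_small_files); the 128 MiB block size in the usage note is only an example value, not something stated in the source.

```python
def small_file_limit(block_size_bytes, multiple=1):
    """Size limit below which a file counts as small: an integer multiple of the block size."""
    if multiple < 1:
        raise ValueError("multiple must be a positive integer")
    return block_size_bytes * multiple


def count_small_files(file_sizes, limit_bytes):
    """Number of files strictly smaller than the limit."""
    return sum(1 for size in file_sizes if size < limit_bytes)
```

With a block size of 128 * 1024 * 1024 bytes and a multiple of 1, a file of exactly one block is not counted as small, while a file one byte shorter is.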
In some embodiments, the file merging unit 240 may include a resource application module and a file merging module, which are connected to each other. The resource application module is configured to obtain the small files of the same group when the small files exist among the files of the same group, and to request resources from the big data platform according to the number and file size of the small files of the group. The file merging module is configured to call a file merging program, using the requested resources, to merge the small files of the group.
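The source does not say how the number and size of the small files translate into a resource request. Purely as an assumed heuristic, the Python sketch below sizes a request as one worker per fixed amount of data, with a cap; the name plan_resources and the parameters bytes_per_worker and max_workers are hypothetical.

```python
import math


def plan_resources(file_sizes, bytes_per_worker=256 * 1024 * 1024, max_workers=8):
    """Assumed heuristic: one worker per bytes_per_worker of data, at least 1, at most max_workers."""
    total = sum(file_sizes)
    workers = max(1, min(max_workers, math.ceil(total / bytes_per_worker)))
    return {"workers": workers, "files": len(file_sizes), "total_bytes": total}
```

A real deployment would map such a plan onto the resource model of its own cluster scheduler; that mapping is outside the scope of this sketch.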
In some embodiments, the file merging device of the big data platform may further include a merge recording unit, connected to the file merging unit 240. The merge recording unit is configured to update the recorded directory information of the merged small files and to record the corresponding merge information. The merge information includes the file names before and after merging.
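A merge record holding the file names before and after merging might look like the following Python sketch. The class name MergeRecord, its field names, and the JSON serialization are assumptions for illustration; the source only requires that the file names before and after merging be recorded.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class MergeRecord:
    directory: str            # directory whose small files were merged
    files_before: List[str]   # file names before merging
    file_after: str           # file name after merging
    merged_at: str            # timestamp supplied by the caller, e.g. ISO 8601


def to_json(record):
    """Serialize a merge record so it can be stored in the record center."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping both sides of the merge in one record is what later allows naming rules to be learned from the history.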
In some embodiments, the file grouping unit 220 may further include a rule learning module, connected to the identity judging module. The rule learning module is configured to obtain the file naming rule. The file naming rule is generated by configuration, or is generated by classification or cluster learning using historical merge information.
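The source does not specify which classification or clustering algorithm is used to learn naming rules. As a deliberately simple stand-in, the Python sketch below counts how often each masked file name pattern appears among previously merged files and keeps the frequent ones; the names name_shape and learn_naming_rules and the min_support parameter are hypothetical.

```python
import os
import re
from collections import Counter


def name_shape(filename):
    """Mask digits so that names differing only in dates or sequence numbers coincide."""
    stem, suffix = os.path.splitext(filename)
    return re.sub(r"[0-9]", "#", stem) + suffix.lower()


def learn_naming_rules(merged_filenames, min_support=2):
    """Return the name patterns seen at least min_support times in the merge history.

    A frequency count stands in for the classification or clustering step.
    """
    counts = Counter(name_shape(name) for name in merged_filenames)
    return sorted(shape for shape, seen in counts.items() if seen >= min_support)
```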
An embodiment of the invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor implements the steps of the method of the above embodiments when executing the program.
An embodiment of the invention also provides a computer-readable storage medium on which a computer program is stored. The program implements the steps of the method of the above embodiments when executed by a processor.
In summary, the file merging method of the big data platform, the file merging device of the big data platform, the computer device, and the computer-readable storage medium of the embodiments of the invention monitor directory changes of the big data platform, group the files with similar features under a changed directory when the number of files under that directory has changed, and merge the small files in a same group that are smaller than a set integer multiple of the data block size. This makes it possible to monitor the cluster directories and to analyze and merge small files automatically, which reduces the number of small files, optimizes the memory usage of the NameNode, allows the big data platform (such as a cluster) to hold more files, and helps the big data platform achieve fine-grained control over file size. Moreover, because file merging is driven by the monitored directory changes of the big data platform, the merge can be completed in the shortest possible time after the small files are generated, which improves the timeliness and efficiency of file merging. When small files exist among the files of a same group, the files with similar features are grouped and merged, rather than being merged according to a time-based schedule. This greatly saves the resources of the big data platform and makes resource allocation more reasonable.
In the description of this specification, the reference terms "one embodiment", "a specific embodiment", "some embodiments", "for example", "an example", "a specific example", "some examples", and the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The order of the steps in each embodiment is used to illustrate the implementation of the invention schematically; the order of the steps is not limiting and may be adjusted as needed.
Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system, or a computer program product. Accordingly, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The invention is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The specific embodiments described above explain the purpose, technical solutions, and beneficial effects of the invention in further detail. It should be understood that the above are only specific embodiments of the invention and are not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.
Claims (10)
1. A file merging method for a big data platform, characterized by comprising:
monitoring directory changes of the big data platform, and determining whether the number of files under a changed directory has changed;
when the number of files under the changed directory has changed, grouping the files under the changed directory that have similar features;
determining whether a set number of small files, each smaller than a set integer multiple of the data block size, exist among the files of a same group;
when the small files exist among the files of the same group, obtaining the small files of the same group, and merging the small files of the same group.
2. The file merging method for a big data platform according to claim 1, characterized in that monitoring directory changes of the big data platform and determining whether the number of files under a changed directory has changed comprises:
obtaining a directory to be monitored; querying the directory information of the big data platform by polling according to the directory to be monitored, and obtaining a current file list of the changed directory;
obtaining a historical file list of the changed directory; and, by comparing the current file list with the historical file list, determining whether the number of files under the changed directory has increased, so as to determine whether the number of files under the changed directory has changed.
3. The file merging method for a big data platform according to claim 2, characterized in that monitoring directory changes of the big data platform and determining whether the number of files under a changed directory has changed further comprises:
obtaining a current modification time and a most recent historical modification time of the changed directory;
determining whether the difference between the current modification time and the most recent historical modification time is greater than a set duration, so as to determine whether the number of files under the changed directory has changed.
4. The file merging method for a big data platform according to claim 1, characterized in that grouping the files under the changed directory that have similar features comprises:
determining, according to a file naming rule, whether the files under the changed directory have similar features; or reading part of the data of the files under the changed directory, and determining, according to the schema information contained in the data read, whether the files under the changed directory have similar features; wherein the file naming rule includes one or more of a rule on file name length, a rule on the characters contained in the file name, and consistency of the file name suffix;
placing the files under the changed directory that have similar features in a same group.
5. The file merging method for a big data platform according to claim 1, characterized in that, before determining whether a set number of small files, each smaller than a set integer multiple of the data block size, exist among the files of a same group, the method further comprises:
determining the set integer multiple of the data block size according to the size of the storage block configured for the big data platform.
6. The file merging method for a big data platform according to claim 1, characterized in that, when the small files exist among the files of the same group, obtaining the small files of the same group and merging the small files of the same group comprises:
when the small files exist among the files of the same group, obtaining the small files of the same group, and requesting resources from the big data platform according to the number and file size of the small files of the group;
calling a file merging program, using the requested resources, to merge the small files of the group.
7. The file merging method for a big data platform according to claim 4, characterized in that
the method further comprises:
updating the recorded directory information of the merged small files, and recording corresponding merge information, the merge information including the file names before and after merging;
and in that grouping the files under the changed directory that have similar features further comprises, before determining according to the file naming rule whether the files under the changed directory have similar features:
obtaining the file naming rule, the file naming rule being generated by configuration, or generated by classification or cluster learning using historical merge information.
8. A file merging device for a big data platform, characterized by comprising:
a file monitoring unit, configured to monitor directory changes of the big data platform and to determine whether the number of files under a changed directory has changed;
a file grouping unit, configured to group the files under the changed directory that have similar features when the number of files under the changed directory has changed;
a small-file judging unit, configured to determine whether a set number of small files, each smaller than a set integer multiple of the data block size, exist among the files of a same group;
a file merging unit, configured to obtain the small files of the same group when the small files exist among the files of the same group, and to merge the small files of the same group.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program implements the steps of the method of any one of claims 1 to 7 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811182327.XA CN109446165B (en) | 2018-10-11 | 2018-10-11 | File merging method and device for big data platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811182327.XA CN109446165B (en) | 2018-10-11 | 2018-10-11 | File merging method and device for big data platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109446165A true CN109446165A (en) | 2019-03-08 |
CN109446165B CN109446165B (en) | 2021-05-07 |
Family
ID=65545321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811182327.XA Active CN109446165B (en) | 2018-10-11 | 2018-10-11 | File merging method and device for big data platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109446165B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110321329A (en) * | 2019-06-18 | 2019-10-11 | 中盈优创资讯科技有限公司 | Data processing method and device based on big data |
CN111159120A (en) * | 2019-12-16 | 2020-05-15 | 西门子电力自动化有限公司 | Method, device and system for processing files in power system |
CN111352897A (en) * | 2020-03-02 | 2020-06-30 | 广东科徕尼智能科技有限公司 | Real-time data storage method, equipment and storage medium |
CN111881092A (en) * | 2020-06-22 | 2020-11-03 | 武汉绿色网络信息服务有限责任公司 | Method and device for merging files based on cassandra database |
CN112948330A (en) * | 2021-02-26 | 2021-06-11 | 拉卡拉支付股份有限公司 | Data merging method, device, electronic equipment, storage medium and program product |
CN113011798A (en) * | 2021-05-24 | 2021-06-22 | 江苏荣泽信息科技股份有限公司 | Product detection information processing system based on block chain |
CN115843008A (en) * | 2023-02-15 | 2023-03-24 | 慧铁科技有限公司 | Complex data processing method for railway train record carrier |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103679002A (en) * | 2013-12-12 | 2014-03-26 | 小米科技有限责任公司 | Method and device for monitoring file change and server |
US20170208124A1 (en) * | 2011-03-08 | 2017-07-20 | Rackspace Us, Inc. | Higher efficiency storage replication using compression |
Non-Patent Citations (1)
Title |
---|
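张呈 (Zhang Cheng): "Optimized Processing of Massive Small Files in a Hadoop Cluster" (Hadoop集群下海量小文件优化处理), China Master's Theses Full-text Database, Information Science and Technology * |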
Also Published As
Publication number | Publication date |
---|---|
CN109446165B (en) | 2021-05-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CP02 | Change in the address of a patent holder | Address after: Room 702-2, No. 4811, Cao'an Highway, Jiading District, Shanghai; Patentee after: CHINA UNITECHS; Address before: 100872 5th floor, Renmin culture building, 59 Zhongguancun Street, Haidian District, Beijing; Patentee before: CHINA UNITECHS |