CN109446165A - File merging method and device for a big data platform - Google Patents

Publication number
CN109446165A
Authority
CN
China
Prior art keywords
file
directory
change
big data
small files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811182327.XA
Other languages
Chinese (zh)
Other versions
CN109446165B (en)
Inventor
毛恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unihub China Information Technology Co Ltd
Zhongying Youchuang Information Technology Co Ltd
Original Assignee
Unihub China Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unihub China Information Technology Co Ltd
Priority to CN201811182327.XA
Publication of CN109446165A
Publication of CN109446165B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a file merging method and device for a big data platform. The method comprises: monitoring directory changes of the big data platform, and judging whether the number of files under a changed directory has changed; when the number of files under the changed directory has changed, grouping the files with similar features under the changed directory; judging whether the same group of files contains a set number of small files whose size is below a set integer multiple of the data block size; and, when the same group of files contains such small files, obtaining the small files of that group and merging them. This scheme reduces small files, optimizes the memory usage of the NameNode, and allows the big data platform to hold more files.

Description

File merging method and device for a big data platform
Technical field
The present invention relates to the field of computer technology, and in particular to a file merging method and device for a big data platform.
Background
In a big data platform such as a Hadoop cluster, data directories often contain a massive number of small files when data analysis is performed. These small files put heavy pressure on the NameNode and can reduce the computing efficiency of the cluster by several times or even tens of times. In the prior art, a separate functional component has to be developed for each group of data directories or each class of target data in order to merge the files.
However, existing file merging schemes can only schedule merge jobs by time. This approach has several drawbacks. First, the development work is fragmented and costly. Second, the schedule cannot follow the actual state of the data: when a task starts there may be few small files, which wastes cluster computing resources, or new files may be written to the directory after the task has run, so the small-file count problem is not properly solved. Third, the computing resources for each merge run cannot be requested dynamically according to the actual situation.
Summary of the invention
In view of this, the present invention provides a file merging method and device for a big data platform, so as to reduce small files, optimize the memory usage of the NameNode, and allow the big data platform to hold more files.
To achieve the above object, the present invention adopts the following scheme:
In an embodiment of the present invention, a file merging method for a big data platform comprises:
monitoring directory changes of the big data platform, and judging whether the number of files under a changed directory has changed;
when the number of files under the changed directory has changed, grouping the files with similar features under the changed directory;
judging whether the same group of files contains a set number of small files whose size is below a set integer multiple of the data block size;
when the same group of files contains such small files, obtaining the small files of that group and merging them.
In an embodiment of the present invention, a file merging device for a big data platform comprises:
a file monitoring unit, configured to monitor directory changes of the big data platform and judge whether the number of files under a changed directory has changed;
a file grouping unit, configured to group the files with similar features under the changed directory when the number of files under the changed directory has changed;
a small file judging unit, configured to judge whether the same group of files contains a set number of small files whose size is below a set integer multiple of the data block size;
a file merging unit, configured to obtain the small files of the same group and merge them when the same group of files contains such small files.
In an embodiment of the present invention, a computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method of the above embodiment when executing the program.
In an embodiment of the present invention, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the method of the above embodiment.
With the file merging method, file merging device, computer device and computer-readable storage medium of the present invention, directory changes of the big data platform are monitored, files with similar features under a changed directory are grouped, and small files in the same group whose size is below a set integer multiple of the data block size are merged. This reduces small files, optimizes the memory usage of the NameNode, allows the big data platform (for example a cluster) to hold more files, and helps the platform control file size at a fine granularity. Because merging is driven by the monitored directory changes, it can be completed within the shortest time after small files are generated, which improves the timeliness and efficiency of file merging. Because merging is performed per group only when small files exist in that group, the resources of the big data platform are greatly saved and resource allocation becomes more reasonable.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort. In the drawings:
Fig. 1 is a flowchart of a file merging method for a big data platform according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for monitoring directory changes of the big data platform and judging whether the number of files under a changed directory has changed, according to an embodiment of the present invention;
Fig. 3 is a flowchart of a method for monitoring directory changes of the big data platform and judging whether the number of files under a changed directory has changed, according to another embodiment of the present invention;
Fig. 4 is a flowchart of a method for grouping the files with similar features under a changed directory according to an embodiment of the present invention;
Fig. 5 is a flowchart of a file merging method for a big data platform according to another embodiment of the present invention;
Fig. 6 is a flowchart of a method for merging the small files of the same group according to an embodiment of the present invention;
Fig. 7 is a flowchart of a file merging method for a big data platform according to yet another embodiment of the present invention;
Fig. 8 is a flowchart of a method for grouping the files with similar features under a changed directory according to another embodiment of the present invention;
Fig. 9 is an interaction diagram of a file merging method for a big data platform according to an embodiment of the present invention;
Fig. 10 is a structural diagram of a file merging device for a big data platform according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in further detail below with reference to the drawings. The illustrative embodiments and their descriptions are used to explain the present invention and are not intended to limit it.
Fig. 1 is a flowchart of a file merging method for a big data platform according to an embodiment of the present invention. As shown in Fig. 1, the file merging method of some embodiments may include:
Step S110: monitor directory changes of the big data platform, and judge whether the number of files under a changed directory has changed;
Step S120: when the number of files under the changed directory has changed, group the files with similar features under the changed directory;
Step S130: judge whether the same group of files contains a set number of small files whose size is below a set integer multiple of the data block size;
Step S140: when the same group of files contains such small files, obtain the small files of that group and merge them.
In step S110, the big data platform may be, for example, a Hadoop cluster, whose directory changes can be monitored by monitoring HDFS. Changes to the trunk directories of the big data platform may be monitored, or changes to the leaf directories of those trunk directories may be monitored as well; the specific directories to monitor can be determined in advance by configuration as needed. A changed directory may include a modified directory: for example, if under a directory a file has been modified, a file has been added, or a stable period has elapsed after a file modification, the directory is a modified directory. The number of files under a directory can be learned from its file list.
In step S120, when the number of files under a changed directory has changed, for example when the number of files has increased, redundant files are likely to appear under that directory. Similar features can be judged from the file name, file content, file schema and so on. Files with similar features may include files that share some commonality in file name, file name suffix, file format and the like, for example an identical suffix, an identical file name length, or characters in the file name that follow a specific pattern. Files with the same suffix may be placed in the same group, or files that have the same suffix and a file name of a set length may be placed in the same group.
In step S130, the data block may be the minimum storage unit configured for the big data platform. The set number and the set integer multiple of the data block size can be configured as needed. For example, the set integer multiple of the data block size is determined from the storage scheme of the big data platform, and the set number is determined from the pattern of file sizes, the set integer multiple of the data block size, and the storage scheme of the big data platform.
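The check in step S130 can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the function name `small_files_in_group`, the parameters `multiple` and `min_count`, and the choice to treat every file below `multiple * block_bytes` as small are assumptions made for the example.

```python
MIB = 1024 * 1024

def small_files_in_group(sizes, block_bytes, multiple=1, min_count=2):
    """Return the small files of one group, or [] if there are too few.

    sizes: mapping of file name -> size in bytes for one group of similar files.
    A file counts as small when it is below multiple * block_bytes, which is
    an assumed reading of "set integer multiple of the data block size".
    """
    limit = multiple * block_bytes
    small = sorted(name for name, size in sizes.items() if size < limit)
    # Only report the group when the set number of small files is reached.
    return small if len(small) >= min_count else []

group = {"a.orc": 40 * MIB, "b.orc": 12 * MIB, "c.orc": 300 * MIB}
print(small_files_in_group(group, block_bytes=128 * MIB))
```

Raising `min_count` makes the check stricter, so a group with only a couple of small files is left alone until more accumulate.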
In step S140, there may be many files under a changed directory, and the files under one changed directory may be divided into one or more groups. The small files of each group can be returned so that the big data platform merges the file data group by group. Merging can be performed with an existing or a specially designed merge program. For example, for the small files of a Hadoop cluster, the API (application programming interface) provided by Hive/Spark/Spark SQL can be used directly: the data of several small files is read at once and then rewritten into files according to a predetermined rule, thereby implementing the merge operation.
The main goal of conventional file merging is to reduce the number of files and the space they occupy. The main goal of file merging on a big data platform is neither to reduce the file count nor to reduce total disk usage, but to control the size of each file precisely, so that each merged file is exactly an integer multiple of the platform's storage block size, such as the cluster BLOCK size (usually 64M/128M/256M). If the cluster BLOCK is 128M, a 128M file occupies only one BLOCK, whereas a 129M file occupies two. A big data platform does not simply pursue a lower file count. For example, if 1G of data is placed in a single file and the cluster replication factor is 3, the data exists on only 3 different data nodes, so computation can run concurrently only on those three nodes; otherwise the network cost of pulling the data has to be considered. If the same 1G of data is placed in four 256M files, the data may be spread over 10 or more nodes, and computation can be more concurrent. Of course, too many files raise the memory load of the NameNode, so merging small files on a big data platform is a relatively complex problem.
Suppose the storage block size configured for a platform is 128M. If a file on the platform is only 40M, it is much smaller than the data block yet still occupies a data block, so it needs to be merged. If another file is 150M, it is larger than one data block but much smaller than two, which wastes storage in the second data block, so this file also needs to be merged.
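The block arithmetic in the two paragraphs above can be reproduced with a short Python sketch. It follows the block-counting view used in the text; the function names `blocks_occupied` and `unused_in_last_block` are hypothetical helpers introduced only for illustration.

```python
MIB = 1024 * 1024

def blocks_occupied(size_bytes, block_bytes):
    """Blocks spanned by a file; at least one block is always counted."""
    return max(1, (size_bytes + block_bytes - 1) // block_bytes)

def unused_in_last_block(size_bytes, block_bytes):
    """Bytes left unused in the last block (0 when the size is block-aligned)."""
    return blocks_occupied(size_bytes, block_bytes) * block_bytes - size_bytes

BLOCK = 128 * MIB
for size_mib in (40, 128, 129, 150):
    size = size_mib * MIB
    # Prints: file size in MiB, blocks occupied, unused MiB in the last block.
    print(size_mib, blocks_occupied(size, BLOCK), unused_in_last_block(size, BLOCK) // MIB)
```

Under this view a file is a merge candidate whenever `unused_in_last_block` is large relative to the block size, which covers both the 40M and the 150M case from the text.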
In this embodiment, directory changes of the big data platform are monitored; when the number of files under a changed directory has changed, the files with similar features under that directory are grouped, and the small files in the same group whose size is below a set integer multiple of the data block size are merged. This makes it possible to monitor the cluster directories and to analyze and merge small files automatically, which reduces small files, optimizes the memory usage of the NameNode, allows the big data platform (for example a cluster) to hold more files, and helps the platform control file size precisely. Moreover, because merging is driven by the monitored directory changes, it can be completed within the shortest time after small files are generated, which improves the timeliness and efficiency of file merging. When small files exist in the same group, files with similar features are merged as a group, instead of simply scheduling merges by timed task or fixed execution interval; this greatly saves the resources of the big data platform and makes resource allocation more reasonable.
Fig. 2 is a flowchart of a method for monitoring directory changes of the big data platform and judging whether the number of files under a changed directory has changed, according to an embodiment of the present invention. As shown in Fig. 2, step S110, that is, monitoring directory changes of the big data platform and judging whether the number of files under a changed directory has changed, may include:
Step S111: obtain the directories to be monitored; poll the directory information of the big data platform according to the directories to be monitored, and obtain the current file list of a changed directory;
Step S112: obtain the historical file list of the changed directory; by comparing the current file list with the historical file list, judge whether the number of files under the changed directory has increased, and thereby judge whether the number of files under the changed directory has changed.
In step S111, the directories to be monitored can be configured as needed; for example, they may include the trunk directories of the big data platform, or additionally the leaf directories of those or other trunk directories. Through polling, the directory information of the big data platform can be queried periodically in sequence, for example by querying the directory information of a Hadoop cluster via the NameNode, an API, an HDFS client or the like. The directory information reflects the current state of the directories to be monitored and may include, for example, the most recent modification time of a directory and the file list under it.
In step S112, the current file list may be the list of file names in the current directory and its subdirectories, and the historical file list may be the list of file names in the historical directory and its subdirectories. The historical state of the platform's directories, including historical modification information and historical file lists, can be recorded in the record center of the big data platform; the corresponding directory can therefore be looked up in the record center from the changed directory, and the corresponding historical file list obtained from the directory found. For each changed directory, whether there are newly added files can be judged by comparing its current file list with its historical file list. In other embodiments, whether new data files exist in a directory can be judged by comparing the current directory state with the historical state registered in the record center.
In this embodiment, comparing the current file list with the historical file list makes it easy to judge a change in the number of files. Judging whether the number of files has increased identifies the situations in which file redundancy is most likely to occur or in which fine-grained control of file size is needed.
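The comparison in step S112 amounts to a set difference between two listings. The Python sketch below assumes both listings are already available as lists of path strings; how they are fetched (NameNode, API or HDFS client) is outside the example, and the function names are hypothetical.

```python
def added_files(current_files, history_files):
    """Files present in the current listing but absent from the recorded one."""
    return sorted(set(current_files) - set(history_files))

def file_count_increased(current_files, history_files):
    """True when the directory now holds more files than the recorded listing."""
    return len(set(current_files)) > len(set(history_files))

history = ["/data/in/part-0001.orc", "/data/in/part-0002.orc"]
current = history + ["/data/in/part-0003.orc", "/data/in/part-0004.orc"]
print(added_files(current, history))
print(file_count_increased(current, history))
```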
Fig. 3 is a flowchart of a method for monitoring directory changes of the big data platform and judging whether the number of files under a changed directory has changed, according to another embodiment of the present invention. As shown in Fig. 3, the method shown in Fig. 2 may further include:
Step S113: obtain the current modification time and the most recent historical modification time of the changed directory;
Step S114: judge whether the difference between the current modification time and the most recent historical modification time is greater than a set duration, so as to judge whether the number of files under the changed directory has changed.
In step S113, the current modification time of the changed directory can be obtained by polling the directory information of the big data platform according to the directories to be monitored. It can be obtained together with the current file list of the changed directory. After the current modification time is obtained, the most recent historical modification time of the changed directory can be obtained. The most recent historical modification time is the time of the last recorded modification; it can be recorded in the record center in advance and fetched from there when needed.
In step S114, the set duration can be configured as needed. By judging whether the difference between the current modification time and the most recent historical modification time is greater than the set duration, the state of the directory since its most recent modification can be judged.
In this embodiment, judging whether the difference between the current modification time and the most recent historical modification time is greater than the set duration makes it possible to tell whether the files under the changed directory have remained stable for a period of time after the most recent modification. The resources of the big data platform can thus be used reasonably according to how files are being modified, and wasting resources on unnecessary merge actions is avoided.
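The comparison in step S114 can be written as a one-line time check. The Python sketch below follows the wording of the step literally (difference between the two modification times compared with a set duration); the function name and the ten-minute duration are assumptions for illustration, and a real deployment might instead compare the last modification time with the current clock.

```python
from datetime import datetime, timedelta

def exceeds_set_duration(current_mtime, last_history_mtime, set_duration):
    """True when the two modification times differ by more than set_duration."""
    return current_mtime - last_history_mtime > set_duration

recorded = datetime(2018, 10, 11, 9, 0, 0)    # most recent historical modification time
observed = datetime(2018, 10, 11, 9, 25, 0)   # current modification time from polling
print(exceeds_set_duration(observed, recorded, timedelta(minutes=10)))
```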
Fig. 4 is a flowchart of a method for grouping the files with similar features under a changed directory according to an embodiment of the present invention. As shown in Fig. 4, in step S120, grouping the files with similar features under the changed directory may include:
Step S121: determine whether the files under the changed directory have similar features according to a file naming rule; or read part of the data of the files under the changed directory and determine whether the files have similar features according to the schema information contained in the data read. The file naming rule includes one or more of: a rule on file name length, a rule on the characters contained in the file name, and consistency of the file name suffix;
Step S122: place the files with similar features under the changed directory into the same group.
In step S121, the rule on file name length may be that the file names of two files have the same length or have a set length. The rule on the characters contained in the file name may be that the file names of two files contain the same letters or digits. Consistency of the file name suffix may mean that two files have the same suffix. Determining similarity from the file naming rule finds mergeable files on the basis of file names.
Part of the data of the files under the changed directory is read, for example the schema information contained in files such as JSON and ORC. When analysis shows that the schema information in the data read is consistent, the files under the changed directory can be considered to have similar features, which triggers the merge action. In this way, mergeable files can be found on the basis of file content.
In this embodiment, grouping files by file name or file content makes it easy to merge them.
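Grouping by file naming rule, as in steps S121 and S122, can be sketched by keying each file on its suffix and file name length. The Python example below is one possible reading under stated assumptions: the function name is hypothetical, the file name length is taken to include the suffix, and schema-based grouping is not shown.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def group_by_naming_rule(paths):
    """Group paths whose file names share the same suffix and the same length."""
    groups = defaultdict(list)
    for path in paths:
        p = PurePosixPath(path)
        # Key = (suffix, length of the full file name including the suffix).
        groups[(p.suffix, len(p.name))].append(path)
    return dict(groups)

paths = ["/d/a0001.orc", "/d/a0002.orc", "/d/bb.json"]
for key, members in group_by_naming_rule(paths).items():
    print(key, members)
```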
Fig. 5 is a flowchart of a file merging method for a big data platform according to another embodiment of the present invention. As shown in Fig. 5, before step S130 of the method shown in Fig. 1, that is, before judging whether the same group of files contains a set number of small files whose size is below a set integer multiple of the data block size, the method may further include:
Step S150: determine the set integer multiple of the data block size according to the size of the storage block configured for the big data platform.
In a big data platform such as a cluster, files are stored in blocks in integer multiples of the cluster block (BLOCK) size, and a file smaller than one cluster block still occupies a cluster block. The set integer multiple of the data block size is determined from the size of the storage block configured for the big data platform, for example from the cluster block size. Small files smaller than the set integer multiple of the data block size, or files larger than a single block but not an integer multiple of the block size, can then be found and merged, so that small files that occupy a cluster block without making full use of it are merged into larger files. This not only reduces the number of files but also allows precise control of the size of each file.
In this embodiment, determining the set integer multiple of the data block size from the size of the storage block configured for the big data platform allows precise control of the size of each file.
In some embodiments, the file merging method shown in Fig. 5 may further include: splitting the merged small files according to the size of the storage block configured for the big data platform. For example, all data in the ORC files with a consistent schema under a specified directory xxxx is loaded and rewritten in ORC format under directory yyyy, with a new file started every 256M. Splitting the merged small files according to the configured storage block size can reduce the number of nodes and finely control file size.
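The "one file every 256M" split can be planned with simple integer arithmetic before any data is rewritten. The Python sketch below only computes target output sizes; the function name `plan_output_sizes` is hypothetical, and the actual rewrite (for example through Spark or Hive) is not shown.

```python
MIB = 1024 * 1024

def plan_output_sizes(total_bytes, target_bytes):
    """Sizes of the output files: full target-sized files plus one remainder."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    full, rest = divmod(total_bytes, target_bytes)
    return [target_bytes] * full + ([rest] if rest else [])

# 600 MiB of merged data split into files of at most 256 MiB each.
plan = plan_output_sizes(600 * MIB, 256 * MIB)
print([size // MIB for size in plan])
```

Only the final file can fall short of the target size, so at most one output per merge run is not block-aligned.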
Fig. 6 is a flowchart of a method for merging the small files of the same group according to an embodiment of the present invention. As shown in Fig. 6, step S140, that is, obtaining the small files of the same group and merging them when the same group of files contains the small files, may include:
Step S141: when the same group of files contains the small files, obtain the small files of that group, and request resources from the big data platform according to the number and size of the small files in the group;
Step S142: use the requested resources to invoke a file merge program that merges the small files of the group.
In step S141, a resource list is selected according to the number and size of the files in the small-file group, and resources are requested from the cluster. In step S142, the specified merge program can be used to perform the small-file merge operation. If the configuration center has a processing module configured for this kind of directory or file type, the configured module is used; otherwise a default module is called according to the file type to generate the merge program.
In this embodiment, the resources needed for the merge operation can be requested dynamically according to the actual data, and the most reasonable resource allocation is selected automatically.
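The text does not say how the resource list is derived from file count and size, so the Python sketch below uses an invented sizing rule purely to illustrate the idea of step S141. Every name and number in it (`ResourceRequest`, one executor per GiB, one per 1000 files, the cap of 16, 2048 MB of memory) is a hypothetical assumption.

```python
from dataclasses import dataclass

GIB = 1024 ** 3

@dataclass(frozen=True)
class ResourceRequest:
    executors: int
    memory_mb_per_executor: int

def choose_resources(file_count, total_bytes, max_executors=16):
    """Scale the request with data volume and file count (hypothetical rule)."""
    by_volume = (total_bytes + GIB - 1) // GIB   # one executor per GiB, rounded up
    by_count = (file_count + 999) // 1000        # one executor per 1000 files, rounded up
    executors = max(1, min(max_executors, max(by_volume, by_count)))
    return ResourceRequest(executors=executors, memory_mb_per_executor=2048)

print(choose_resources(file_count=50, total_bytes=3 * GIB))
print(choose_resources(file_count=5000, total_bytes=GIB // 2))
```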
Fig. 7 is a flowchart of a file merging method for a big data platform according to yet another embodiment of the present invention. As shown in Fig. 7, the file merging method shown in Fig. 1 may further include:
Step S160: update the recorded directory information of the merged small files, and record the corresponding merge information; the merge information includes the file names before and after merging.
In step S160, the file names before and after merging are the names of the individual files before merging and the names of the files after merging. The processed directory information can be registered in the record center again. The merge information may also include information such as the merge time.
In this embodiment, updating the directory information makes it easy to find the merged files, and recording the merge information makes it easy to look up the modification history of merged files.
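A merge record of the kind described in step S160 could be represented as a small data structure and serialized for the record center. The Python sketch below is illustrative only: the class name `MergeRecord`, its field names and the JSON serialization are assumptions, since the text specifies only that file names before and after merging (and optionally the merge time) are recorded.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MergeRecord:
    directory: str       # directory whose files were merged
    source_files: list   # file names before merging
    merged_files: list   # file names after merging
    merged_at: str       # merge time as an ISO 8601 string

record = MergeRecord(
    directory="/data/in",
    source_files=["part-0001.orc", "part-0002.orc", "part-0003.orc"],
    merged_files=["merged-0001.orc"],
    merged_at="2018-10-11T09:30:00",
)
serialized = json.dumps(asdict(record))
print(serialized)
```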
Fig. 8 is a flowchart of a method for grouping the files with similar features under a changed directory according to another embodiment of the present invention. As shown in Fig. 8, before step S121 of the method shown in Fig. 4, that is, before determining whether the files under the changed directory have similar features according to a file naming rule, the method may further include:
Step S123: obtain the file naming rule; the file naming rule is generated by configuration, or is generated by classification or cluster learning over historical merge information.
In step S123, the historical merge information may include the names of the files before merging, the names of the files after merging, the merge time, the directory and so on, and can be obtained by recording the merge information after each merge is completed.
When the file naming rule for merge grouping is custom-configured, several rules can be configured directly. For example, it can be specified that, under directory XX and its subdirectories, files with suffix xxx and a file name length of 10 are to be monitored and merged automatically. Such configured rules need to support regular expressions.
When the file naming rule for merge grouping is configured to be self-learned, similar file patterns can be analyzed automatically by machine learning algorithms. One approach relies on classification: for example, if the file naming rules under a directory are close (identical suffix, identical file name length, letters, digits and symbols in the file name distributed in a certain pattern, and so on), the group of files is judged to be mergeable. Another approach relies on analogy with historical processing: for example, if a class of files under directory A was once merged manually or by configuration and directory B is found to contain similar files, it is inferred that the files under directory B can also be merged. A further approach is file content analysis, which judges by file format: for files that contain schema information, such as JSON and ORC, a merge can also be triggered after part of the file data is read and the schemas of the files are found to be consistent.
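One simple way to detect that file names "follow a certain pattern" of letters, digits and symbols is to reduce each name to a character-class signature and group names with the same signature. The Python sketch below substitutes this deterministic heuristic for the classification or clustering algorithm mentioned in the text, which is not specified; the function names and the signature alphabet (`A` for letters, `9` for digits) are assumptions.

```python
from collections import defaultdict

def name_signature(name):
    """Map letters to 'A', digits to '9', and keep every other character."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch for ch in name
    )

def group_by_signature(names):
    """Group file names that share the same character-class signature."""
    groups = defaultdict(list)
    for name in names:
        groups[name_signature(name)].append(name)
    return dict(groups)

names = ["log_20181011.orc", "log_20181012.orc", "readme.txt"]
for signature, members in group_by_signature(names).items():
    print(signature, members)
```

Because equal signatures imply equal length, suffix shape and symbol positions, this one key covers the three naming criteria listed above in a single comparison.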
To help those skilled in the art better understand the present invention, its implementation is illustrated below with a specific embodiment. For ease of description, the file merging method is described using a Hadoop cluster as an example, which is not a limitation on the big data platform.
Fig. 9 is an interaction diagram of a file merging method for a big data platform according to an embodiment of the present invention. Referring to Fig. 9, the file merging method of this embodiment may include:
Step 1: primary control program reads the configuration information of monitored directory from configuration center, including but not limited to master to be monitored Dry catalogue and leaf catalogue, file asterisk wildcard, file block size, quantity of documents threshold value to be combined, catalogue stablize the configuration such as duration.
Step 2: according to config directory, poll inquiry directory information (passes through the modes such as namenode/API/HDFS client Inquiry), obtain the nearest modification time of catalogue and listed files.
Step 3: comparing the historic state of current directory state and records center registration, judge to whether there is newly in catalogue Data file.
Step 4: according to the file designation of self study in original configuration and step 8 rule, small documents being grouped, internally Appearance, format are consistent, and the quantity of documents for being less than block size reaches the small documents group return of configuration threshold value.
Step 5: processing small documents group simultaneously generates amalgamation plan, if configuration center is provided with such catalogue or files classes The processing module of type then uses configured module, otherwise calls default module to generate consolidation procedure according to file type.
Step 6: the Resources list being selected with size according to the quantity of documents in small documents group, to cluster application resource, and is made Small documents union operation is executed with the consolidation procedure specified in step 5.
Step 7: by treated, directory information is registered in records center again.
Step 8: records center merges according to history to be recorded, and the data algorithm of calling classification or cluster analyzes file point Rule-like, and solidify, it is used for step 4.
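The comparison in step 3 and the re-registration in step 7 can be sketched as below. This is a minimal illustration only: the record center is replaced by an in-memory dictionary, and the directory and file names are hypothetical. A real deployment would obtain the listing from the namenode or an HDFS client, as step 2 describes.

```python
def find_new_files(current_listing, recorded_listing):
    """Return files present now but absent from the recorded history, sorted."""
    return sorted(set(current_listing) - set(recorded_listing))


# Hypothetical in-memory stand-in for the record center: directory -> file names.
record_center = {"/data/xx": ["a.log", "b.log"]}

# Step 2 (assumed result of polling the directory).
current = ["a.log", "b.log", "c.log", "d.log"]

# Step 3: compare the current state with the registered historical state.
new_files = find_new_files(current, record_center.get("/data/xx", []))

# Step 7: register the processed directory state again.
record_center["/data/xx"] = current
```

After the re-registration, a second poll that returns the same listing reports no new files, so the directory is not processed twice.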
In this embodiment, directory changes in HDFS are monitored. When the number of files under a directory has changed and has then remained stable for X minutes, it is judged whether the directory contains N files smaller than size M (X, N and M are configurable), and the file names are used to judge whether this group of small files comes from the same kind of data source (this can be judged from the file suffix, the segmentation rule and length, wildcards or regular expressions, classification analysis of historically processed file names, and similar methods). If the small files reach the threshold, resources are requested dynamically according to the number and size of the files, and the files are merged. Once small-file merging for the cluster has been planned in advance, no separate development or manual intervention is needed, which reduces maintenance cost; different merge rules can be specified for hot and cold data and for each group of files, which makes the configuration more reasonable. The merge is also more timely: it can be completed within the shortest time after the small files are generated, which greatly saves cluster resources. The resources needed for the merge operation can be requested dynamically according to the actual data, and the most reasonable resource allocation is selected automatically, which is more efficient and allocates resources more reasonably.
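The X, N and M check described in this embodiment can be sketched as a single predicate. The function name and parameter names below are hypothetical, and times are plain seconds rather than values read from HDFS.

```python
def small_files_ready(sizes, last_change, now, stable_seconds, min_count, max_bytes):
    """Return the small files to merge, or an empty list if the conditions fail.

    sizes:          mapping of file name -> size in bytes
    last_change:    time of the directory's last modification (seconds)
    stable_seconds: X, the required quiet period
    min_count:      N, the minimum number of small files
    max_bytes:      M, the size below which a file counts as small
    """
    # The directory must have been stable for at least X.
    if now - last_change < stable_seconds:
        return []
    small = sorted(name for name, size in sizes.items() if size < max_bytes)
    # At least N files smaller than M must exist.
    return small if len(small) >= min_count else []
```

Waiting for the quiet period avoids merging a directory that a writer is still filling; the trade-off is that a directory written to continuously is never merged under this simple check.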
Based on the same inventive concept as the file merging method of the big data platform shown in Fig. 1, an embodiment of the invention also provides a file merging device for a big data platform, as described in the following embodiments. Since the principle by which the device solves the problem is similar to that of the method, the implementation of the device may refer to the implementation of the method, and overlapping details are not repeated.
Fig. 10 is a structural diagram of the file merging device of a big data platform according to one embodiment of the invention. As shown in Fig. 10, the file merging device of the big data platform of some embodiments may include a file monitoring unit 210, a file grouping unit 220, a small-file judging unit 230 and a file merging unit 240, connected in sequence.
The file monitoring unit 210 is configured to monitor directory changes of the big data platform and to judge whether the number of files under a changed directory has changed;
The file grouping unit 220 is configured to group the files having similar features under the changed directory in the case that the number of files under the changed directory has changed;
The small-file judging unit 230 is configured to judge whether a set number of small files smaller than a set integer multiple of the data block size exist among the files of a same group;
The file merging unit 240 is configured to obtain the small files of the same group and merge them in the case that the small files exist among the files of the same group.
In some embodiments, the file monitoring unit 210 may include a current file list obtaining module and a file quantity judging module, connected to each other. The current file list obtaining module is configured to obtain the directories to be monitored, to query the directory information of the big data platform by polling according to the directories to be monitored, and to obtain the current file list of a changed directory. The file quantity judging module is configured to obtain the historical file list of the changed directory and, by comparing the current file list with the historical file list, to judge whether the number of files under the changed directory has increased, thereby judging whether the number of files under the changed directory has changed.
In some embodiments, the file monitoring unit 210 may further include a modification time obtaining module and a change time judging module, connected to each other. The modification time obtaining module is configured to obtain the current modification time and the most recent historical modification time of the changed directory. The change time judging module is configured to judge whether the difference between the current modification time and the most recent historical modification time is greater than a set duration, thereby judging whether the number of files under the changed directory has changed.
In some embodiments, the file grouping unit 220 may include a similarity judging module and a file group dividing module, connected to each other. The similarity judging module is configured to determine, according to file naming rules, whether the files under the changed directory have similar features; or to read part of the data of the files under the changed directory and determine, according to the schema information contained in the data that was read, whether the files under the changed directory have similar features. The file naming rules include one or more of a rule on file name length, a rule on the characters contained in the file name, and consistency of the file name suffix. The file group dividing module is configured to place the files having similar features under the changed directory into a same group.
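The schema-based alternative can be illustrated for JSON data as follows. The sketch assumes that each file stores one JSON object per line and that the first record is representative of the file; both are assumptions made for illustration, and the function names are hypothetical. Samples are passed in as strings so that the example does not depend on any file system client.

```python
import json
from collections import defaultdict


def json_schema_key(first_record: str) -> tuple:
    """Derive a schema key (sorted field names and value types) from one record."""
    obj = json.loads(first_record)
    return tuple(sorted((key, type(value).__name__) for key, value in obj.items()))


def group_by_schema(samples: dict) -> list:
    """Group file names whose sampled record has the same schema key.

    samples: mapping of file name -> first record read from that file
    """
    groups = defaultdict(list)
    for name, record in samples.items():
        groups[json_schema_key(record)].append(name)
    return [sorted(names) for names in groups.values()]
```

Only part of each file is read, which matches the description above; the cost is that a file whose later records deviate from the first one would be grouped incorrectly.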
In some embodiments, the file merging device of the big data platform may further include a unit for determining the set integer multiple of the data block size, connected to the small-file judging unit 230. This unit is configured to determine the set integer multiple of the data block size according to the size of the storage block configured for the big data platform.
In some embodiments, the file merging unit 240 may include a resource request module and a file merging module, connected to each other. The resource request module is configured to obtain the small files of a same group in the case that the small files exist among the files of the same group, and to request resources from the big data platform according to the number and file sizes of the small files of the group. The file merging module is configured to call a file merging program with the requested resources to merge the small files of the group.
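The text does not say how the number and size of the small files map to a resource request. One plausible heuristic, shown purely as an assumption, is to request one merge task per target output file and to cap the total; `plan_resources`, `target_output_bytes` and `max_tasks` are hypothetical names.

```python
import math


def plan_resources(file_sizes, target_output_bytes, max_tasks):
    """Estimate a resource request from the count and sizes of the small files.

    Assumed heuristic: one task per target-sized output file, at least one
    task, and never more than max_tasks.
    """
    total = sum(file_sizes)
    tasks = max(1, min(max_tasks, math.ceil(total / target_output_bytes)))
    return {"tasks": tasks, "total_bytes": total, "file_count": len(file_sizes)}
```

For example, 30 files of 10 bytes with a 128-byte target yield 3 tasks, because 300 bytes need three output files of at most 128 bytes each.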
In some embodiments, the file merging device of the big data platform may further include a merge recording unit, connected to the file merging unit 240. The merge recording unit is configured to update the recorded directory information of the merged small files and to record the corresponding merge information; the merge information includes the file names before and after the merge.
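A merge record holding this information could look like the sketch below. The field names and the JSON serialization are assumptions for illustration; the source only states that the file names before and after the merge are recorded, and mentions the merge time and directory as further possible fields.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class MergeRecord:
    """Hypothetical record of one merge operation."""
    directory: str           # directory in which the merge took place
    source_files: List[str]  # file names before the merge
    merged_file: str         # file name after the merge
    merged_at: str           # merge time, for example an ISO 8601 string


def serialize(record: MergeRecord) -> str:
    """Encode a record as one JSON line for storage in a record center."""
    return json.dumps(asdict(record), sort_keys=True)


def deserialize(line: str) -> MergeRecord:
    """Decode a JSON line back into a record."""
    return MergeRecord(**json.loads(line))
```

Keeping the source file names in the record is what later allows historical merges to be used as training input for the rule learning described for the file grouping unit.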
In some embodiments, the file grouping unit 220 may further include a rule learning module, connected to the similarity judging module. The rule learning module is configured to obtain the file naming rules; the file naming rules are generated by configuration, or generated by classification or clustering learning on historical merge information.
An embodiment of the invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the steps of the method of the above embodiments when executing the program.
An embodiment of the invention also provides a computer-readable storage medium on which a computer program is stored; the program implements the steps of the method of the above embodiments when executed by a processor.
In summary, the file merging method of a big data platform, the file merging device of a big data platform, the computer device and the computer-readable storage medium of the embodiments of the invention monitor directory changes of the big data platform, group the files having similar features under a changed directory in the case that the number of files under the changed directory has changed, and merge the small files in a same group that are smaller than the set integer multiple of the data block size. This makes it possible to monitor cluster directories and to analyze and merge small files automatically, which reduces the number of small files, optimizes the memory usage of the namenode, allows the big data platform (for example, a cluster) to accommodate more files, and helps the big data platform achieve fine-grained control over file size. Moreover, because file merging is driven by the monitored directory changes of the big data platform, the merge can be completed within the shortest time after the small files are generated, which makes file merging more timely and more efficient. In the case that small files exist among the files of a same group, files having similar features are grouped and merged, rather than being merged according to a time-based schedule; this can greatly save the resources of the big data platform and make resource allocation more reasonable.
In the description of this specification, descriptions referring to the terms "one embodiment", "a specific embodiment", "some embodiments", "for example", "example", "specific example" or "some examples" mean that a particular feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The order of steps in each embodiment is used to illustrate the implementation of the invention schematically; the order of steps is not limited and may be adjusted as needed.
Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system, or a computer program product. Therefore, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and so on) containing computer-usable program code.
The invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The specific embodiments described above further explain the purpose, technical solutions and beneficial effects of the invention in detail. It should be understood that the above are only specific embodiments of the invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement and the like made within the spirit and principles of the invention shall be included within the scope of protection of the invention.

Claims (10)

1. A file merging method for a big data platform, characterized by comprising:
monitoring directory changes of the big data platform, and judging whether the number of files under a changed directory has changed;
in the case that the number of files under the changed directory has changed, grouping the files having similar features under the changed directory;
judging whether a set number of small files smaller than a set integer multiple of the data block size exist among the files of a same group;
in the case that the small files exist among the files of the same group, obtaining the small files of the same group, and merging the small files of the same group.
2. The file merging method for a big data platform according to claim 1, characterized in that monitoring directory changes of the big data platform, and judging whether the number of files under a changed directory has changed, comprises:
obtaining directories to be monitored; querying directory information of the big data platform by polling according to the directories to be monitored, and obtaining a current file list of the changed directory;
obtaining a historical file list of the changed directory; by comparing the current file list with the historical file list, judging whether the number of files under the changed directory has increased, so as to judge whether the number of files under the changed directory has changed.
3. The file merging method for a big data platform according to claim 2, characterized in that monitoring directory changes of the big data platform, and judging whether the number of files under a changed directory has changed, further comprises:
obtaining a current modification time and a most recent historical modification time of the changed directory;
judging whether the difference between the current modification time and the most recent historical modification time is greater than a set duration, so as to judge whether the number of files under the changed directory has changed.
4. The file merging method for a big data platform according to claim 1, characterized in that grouping the files having similar features under the changed directory comprises:
determining, according to file naming rules, whether the files under the changed directory have similar features; or reading part of the data of the files under the changed directory, and determining, according to schema information contained in the part of the data that was read, whether the files under the changed directory have similar features; the file naming rules comprising one or more of a rule on file name length, a rule on the characters contained in the file name, and consistency of the file name suffix;
placing the files having similar features under the changed directory into a same group.
5. The file merging method for a big data platform according to claim 1, characterized in that, before judging whether a set number of small files smaller than a set integer multiple of the data block size exist among the files of a same group, the method further comprises:
determining the set integer multiple of the data block size according to the size of the storage block configured for the big data platform.
6. The file merging method for a big data platform according to claim 1, characterized in that, in the case that the small files exist among the files of the same group, obtaining the small files of the same group, and merging the small files of the same group, comprises:
in the case that the small files exist among the files of the same group, obtaining the small files of the same group, and requesting resources from the big data platform according to the number and file sizes of the small files of the group;
calling a file merging program with the requested resources to merge the small files of the group.
7. The file merging method for a big data platform according to claim 4, characterized in that
the method further comprises:
updating the recorded directory information of the merged small files, and recording corresponding merge information; the merge information comprising the file names before and after the merge;
grouping the files having similar features under the changed directory further comprises, before determining according to file naming rules whether the files under the changed directory have similar features:
obtaining the file naming rules; the file naming rules being generated by configuration, or generated by classification or clustering learning on historical merge information.
8. A file merging device for a big data platform, characterized by comprising:
a file monitoring unit, configured to monitor directory changes of the big data platform, and to judge whether the number of files under a changed directory has changed;
a file grouping unit, configured to group the files having similar features under the changed directory in the case that the number of files under the changed directory has changed;
a small-file judging unit, configured to judge whether a set number of small files smaller than a set integer multiple of the data block size exist among the files of a same group;
a file merging unit, configured to obtain the small files of the same group, and to merge the small files of the same group, in the case that the small files exist among the files of the same group.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program implements the steps of the method of any one of claims 1 to 7 when executed by a processor.
CN201811182327.XA 2018-10-11 2018-10-11 File merging method and device for big data platform Active CN109446165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811182327.XA CN109446165B (en) 2018-10-11 2018-10-11 File merging method and device for big data platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811182327.XA CN109446165B (en) 2018-10-11 2018-10-11 File merging method and device for big data platform

Publications (2)

Publication Number Publication Date
CN109446165A true CN109446165A (en) 2019-03-08
CN109446165B CN109446165B (en) 2021-05-07

Family

ID=65545321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811182327.XA Active CN109446165B (en) 2018-10-11 2018-10-11 File merging method and device for big data platform

Country Status (1)

Country Link
CN (1) CN109446165B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321329A (en) * 2019-06-18 2019-10-11 中盈优创资讯科技有限公司 Data processing method and device based on big data
CN111159120A (en) * 2019-12-16 2020-05-15 西门子电力自动化有限公司 Method, device and system for processing files in power system
CN111352897A (en) * 2020-03-02 2020-06-30 广东科徕尼智能科技有限公司 Real-time data storage method, equipment and storage medium
CN111881092A (en) * 2020-06-22 2020-11-03 武汉绿色网络信息服务有限责任公司 Method and device for merging files based on cassandra database
CN112948330A (en) * 2021-02-26 2021-06-11 拉卡拉支付股份有限公司 Data merging method, device, electronic equipment, storage medium and program product
CN113011798A (en) * 2021-05-24 2021-06-22 江苏荣泽信息科技股份有限公司 Product detection information processing system based on block chain
CN115843008A (en) * 2023-02-15 2023-03-24 慧铁科技有限公司 Complex data processing method for railway train record carrier

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679002A (en) * 2013-12-12 2014-03-26 小米科技有限责任公司 Method and device for monitoring file change and server
US20170208124A1 (en) * 2011-03-08 2017-07-20 Rackspace Us, Inc. Higher efficiency storage replication using compression

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170208124A1 (en) * 2011-03-08 2017-07-20 Rackspace Us, Inc. Higher efficiency storage replication using compression
CN103679002A (en) * 2013-12-12 2014-03-26 小米科技有限责任公司 Method and device for monitoring file change and server

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张呈: "Hadoop集群下海量小文件优化处理", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Also Published As

Publication number Publication date
CN109446165B (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN109446165A (en) The file mergences method and device of big data platform
Xie et al. Real-time prediction of docker container resource load based on a hybrid model of ARIMA and triple exponential smoothing
Candan et al. Frontiers in information and software as services
Zhao et al. Cloud data management
CN104978228B (en) A kind of dispatching method and device of distributed computing system
US20140032517A1 (en) System and methods to configure a profile to rank search results
Benson et al. Survey of automated software deployment for computational and engineering research
CN104050042A (en) Resource allocation method and resource allocation device for ETL (Extraction-Transformation-Loading) jobs
US10135703B1 (en) Generating creation performance metrics for a secondary index of a table
US10102230B1 (en) Rate-limiting secondary index creation for an online table
NL2018627B1 (en) Cloud platform configurator
Yin et al. Opass: Analysis and optimization of parallel data access on distributed file systems
Ruíz et al. Autoscaling pods on an on-premise Kubernetes infrastructure QoS-aware
Liu et al. OnlineElastMan: self-trained proactive elasticity manager for cloud-based storage services
US9898614B1 (en) Implicit prioritization to rate-limit secondary index creation for an online table
Anjos et al. BIGhybrid: a simulator for MapReduce applications in hybrid distributed infrastructures validated with the Grid5000 experimental platform
Pingle et al. Big data processing using apache hadoop in cloud system
Seybold An automation-based approach for reproducible evaluations of distributed DBMS on elastic infrastructures
MAALA et al. Cluster trace analysis for performance enhancement in cloud computing environments
Kyryk et al. Infrastructure as Code and Microservices for Intent-Based Cloud Networking
Bottoni et al. FedUp! cloud federation as a service
Gogouvitis et al. Vision cloud: A cloud storage solution supporting modern media production
Rechert et al. An architecture for community-based curation and presentation of complex digital objects
Angelou et al. Automatic scaling of selective sparql joins using the tiramola system
US11704338B1 (en) Replication of share across deployments in database system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP02 Change in the address of a patent holder
CP02 Change in the address of a patent holder

Address after: Room 702-2, No. 4811, Cao'an Highway, Jiading District, Shanghai

Patentee after: CHINA UNITECHS

Address before: 100872 5th floor, Renmin culture building, 59 Zhongguancun Street, Haidian District, Beijing

Patentee before: CHINA UNITECHS