CN112434000A - Small file merging method, device and equipment based on HDFS - Google Patents

Small file merging method, device and equipment based on HDFS

Info

Publication number
CN112434000A
Authority
CN
China
Prior art keywords
small
file
files
merging
storage medium
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011310846.7A
Other languages
Chinese (zh)
Other versions
CN112434000B (en)
Inventor
李勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202011310846.7A priority Critical patent/CN112434000B/en
Publication of CN112434000A publication Critical patent/CN112434000A/en
Application granted granted Critical
Publication of CN112434000B publication Critical patent/CN112434000B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method, a device and equipment for merging small files based on an HDFS (Hadoop distributed File System), wherein the method comprises the following steps: merging file directories of a plurality of small files with the storage capacity within a preset threshold range in the temporary storage medium according to the same level to obtain a plurality of attached directories; inquiring the storage capacity of small files of each of the attached directories at preset time intervals, respectively collecting the small files under the attached directories of which the sum of the storage capacities is larger than a first threshold value as small file collections, screening and merging each small file collection, and transferring merged block files obtained after the processing to an HDFS (Hadoop distributed file system); and if the memory occupancy rate of the temporary storage medium exceeds a second threshold value, performing hierarchical merging processing and forced storage processing on each small file collection and the remaining small files after screening and merging processing until the memory occupancy rate of the temporary storage medium is smaller than the second threshold value. The invention effectively reduces the number of blocks occupied by small files in the HDFS and saves the storage space.

Description

Small file merging method, device and equipment based on HDFS
Technical Field
The invention relates to the technical field of distributed file systems, in particular to a method, a device and equipment for merging small files based on an HDFS (Hadoop distributed File System).
Background
The Hadoop Distributed File System (HDFS) is widely used in large-scale computing because of its high reliability, high efficiency and scalability. It comprises a NameNode and a plurality of DataNodes and is an important component of a cluster architecture. As the data scale of a cluster grows, the resident memory of the NameNode grows with it, so the NameNode heap size must be adjusted continually to accommodate the increasing demand. However, the NameNode heap cannot be enlarged without limit. For example, in a cluster with 200 million blocks, the total memory of the NameNode amounts to 113G. Each small file occupies 1 block, so merging small files can effectively reduce the number of blocks.
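As a rough worked example of the figures quoted above, the sketch below derives the average NameNode memory per block and compares block counts before and after merging. It assumes that "113G" means 113 GiB and that the memory is spread evenly over all blocks; the workload of 1,000,000 small files of 10M each and the 128M block size are hypothetical values chosen only for illustration.

```python
import math

GIB = 2 ** 30

# Figures quoted in the background (the unit interpretation is an assumption).
total_blocks = 200_000_000
namenode_memory_bytes = 113 * GIB

bytes_per_block = namenode_memory_bytes / total_blocks

# Hypothetical workload: before merging, every small file occupies one block.
small_files = 1_000_000
small_file_mb = 10
block_mb = 128

files_per_block = block_mb // small_file_mb   # whole small files that fit in one block
blocks_before = small_files
blocks_after = math.ceil(small_files / files_per_block)

print(f"average NameNode memory per block: {bytes_per_block:.0f} bytes")
print(f"blocks before merging: {blocks_before}, after merging: {blocks_after}")
```

Under these assumptions each block costs the NameNode on the order of a few hundred bytes, so the saving comes from the reduction in block count rather than from the size of any single entry.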
The currently common small file merging schemes include the following two types:
(1) An HBase-based scheme. As the number of files grows, HBase performs a large number of merge and split operations that occupy substantial system resources and seriously degrade system performance. In addition, HBase supports only simple character types well, so users must handle other types, such as pictures, on their own.
(2) A key-value file scheme. Its main disadvantage is that the keys are unordered, so random reads are inefficient and require traversing the whole file. The method also does not support file append operations, so the small files must be cached on the server before merging, and the security of those files cannot be guaranteed.
Disclosure of Invention
In view of this, the present invention provides a method, an apparatus, and a device for merging small files based on an HDFS, so as to solve the prior-art problem that memory space is easily exhausted because each small file occupies 1 block.
Based on the above purpose, the present invention provides a method for merging small files based on HDFS, which comprises the following steps:
merging file directories of a plurality of small files with the storage capacity within a preset threshold range in the temporary storage medium according to the same level to obtain a plurality of attached directories;
inquiring the storage capacity of small files of each of the attached directories at preset time intervals, respectively collecting the small files under the attached directories of which the sum of the storage capacities is larger than a first threshold value as small file collections, screening and merging each small file collection, and transferring merged block files obtained after the processing to an HDFS (Hadoop distributed file system);
if the memory occupancy rate of the temporary storage medium exceeds a second threshold value, performing hierarchical merging processing on the remaining small files after screening and merging processing is performed on each small file collection;
and if the memory occupancy rate of the temporary storage medium exceeds a second threshold value after the layering merging processing, performing forced storage processing on all other small files in each small file collection according to the longest unprocessed principle until the memory occupancy rate of the temporary storage medium is less than the second threshold value.
In some embodiments, the screening merge process comprises:
adding the storage capacities of the small files in the small file aggregate in sequence and comparing the storage capacities with a first threshold value;
if the sum of the storage capacity of every continuous n small files and the storage capacity of the (n + 1) th small file is larger than a first threshold value, the n small files are combined into a combined block file, and the operation is repeated until no (n + 1) th small file or the sum of the (n + 1) th small file and the storage capacity of the small file behind the n +1 th small file is smaller than the first threshold value;
and if the sum of the storage capacities of the (n + 1) th small file and the small files after the small file is smaller than the first threshold, taking the (n + 1) th small file and the small file set after the small file with the sum of the storage capacities smaller than the first threshold as the residual small file.
In some embodiments, the hierarchical merging process includes:
dividing the remaining small files with the storage capacity larger than a third threshold into m capacity levels in sequence from large to small;
sequentially judging whether the memory occupancy rate of the temporary storage medium is smaller than a second threshold value when the small files in the first k capacity levels are transferred to the HDFS, if so, combining the small files under each same attached directory in the first k capacity levels and transferring the small files to the HDFS;
and if the memory occupancy rate of the temporary storage medium is not less than the second threshold when k is m, merging and storing the small files under each same attached directory in the m capacity levels to the HDFS.
In some embodiments, the forcibly storing all other small files in each small file collection in the temporary storage medium according to the longest unprocessed rule until the memory occupancy rate of the temporary storage medium is less than the second threshold includes:
and in each time interval, sequentially and forcibly storing all the small files with the storage capacity smaller than a third threshold in each small file aggregate to the HDFS from long to short according to the time of staying in the temporary storage medium until the memory occupancy rate of the temporary storage medium is smaller than the second threshold.
In some embodiments, the first threshold is a capacity value of a block of the HDFS.
In some embodiments, the first threshold is 128M.
In some embodiments, the method further comprises: and updating the metadata information of the small files which are transferred to the HDFS from the temporary storage medium.
In some embodiments, the temporary storage medium includes a master node and a slave node that synchronize data with each other.
In another aspect of the present invention, a small file merging apparatus based on HDFS is further provided, including:
the file directory merging module is configured to merge file directories of a plurality of small files with storage capacity within a preset threshold range in the temporary storage medium in a same level to obtain a plurality of attached directories;
the screening and merging processing module is configured to query the storage capacity of the small files of each of the attached directories at preset time intervals, respectively collect the small files under the attached directories of which the sum of the storage capacities is greater than a first threshold value as small file collections, screen and merge each small file collection, and forward the merged block files obtained after the screening and merging processing to the HDFS;
the layering merging processing module is configured to perform layering merging processing on the remaining small files after screening and merging processing is performed on each small file collection if the memory occupancy rate of the temporary storage medium exceeds a second threshold; and
and the forced storage processing module is configured to perform forced storage processing on all other small files in each small file collection according to the longest unprocessed principle until the memory occupancy rate of the temporary storage medium is smaller than the second threshold value if the memory occupancy rate of the temporary storage medium exceeds the second threshold value after hierarchical merging processing.
In yet another aspect of the present invention, a computer device is provided, which includes a memory and a processor, the memory storing a computer program, the computer program executing any one of the above methods when executed by the processor.
The invention has at least the following beneficial technical effects:
according to the invention, by merging the file directories of the small files, screening and merging processing, layering and merging processing and forced storage processing are carried out on the small files by taking the attached directory as a unit, so that the problem that the memory space is easy to fill due to 1 block occupied by a single small file is solved; the occupied number of blocks by small files in the HDFS is effectively reduced, the storage space is saved, and hardware resources are saved; by distributing the small files in the temporary storage medium and the storage medium of the HDFS, the speed of the HDFS for responding the small files can be increased, the storage efficiency can be improved, and the service with a large number of small files can be more stable and efficient.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
FIG. 1 is a schematic diagram of an embodiment of a method for merging small files based on an HDFS provided in the present invention;
FIG. 2 is a schematic diagram of an embodiment of a small file merging apparatus based on an HDFS provided in the present invention;
FIG. 3 is a schematic hardware configuration diagram of an embodiment of a computer device for executing the HDFS-based small file merging method according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two different entities or parameters that share the same name; "first" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those steps or elements and may include other steps or elements not expressly listed or inherent to it.
In view of the above object, a first aspect of the embodiments of the present invention provides an embodiment of a small file merging method based on an HDFS. Fig. 1 is a schematic diagram illustrating an embodiment of a small file merging method based on an HDFS provided in the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:
step S10, merging the file directories of a plurality of small files with the storage capacity within the range of the preset threshold value in the temporary storage medium according to the same level to obtain a plurality of attached directories;
step S20, inquiring the storage capacity of the small files of each of the attached directories at preset time intervals, respectively collecting the small files under the attached directories of which the sum of the storage capacities is larger than a first threshold value as small file collections, screening and merging each small file collection, and transferring the merged block files obtained after the screening and merging processing to an HDFS;
step S30, if the memory occupancy rate of the temporary storage medium exceeds a second threshold value, the residual small files after screening and merging processing are subjected to layering and merging processing on each small file collection;
and step S40, if the memory occupancy rate of the temporary storage medium exceeds the second threshold after the layering and merging processing, performing forced storage processing on all other small files in each small file collection according to the longest unprocessed principle until the memory occupancy rate of the temporary storage medium is less than the second threshold.
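The escalation between steps S20, S30 and S40 can be sketched as a small decision function. This is only an illustration of the ordering described above: the function name `plan_stages` and the idea of passing the occupancy readings as plain numbers are assumptions made for the sketch, not part of the disclosed method.

```python
def plan_stages(occupancy_after_screening, occupancy_after_hierarchical, second_threshold):
    """Return the processing stages that run in one cycle.

    Occupancies are fractions of the temporary storage medium in use
    (for example 0.85 for 85 percent); all argument names are hypothetical.
    """
    stages = ["screening_merge"]                      # S20 always runs on each interval
    if occupancy_after_screening > second_threshold:
        stages.append("hierarchical_merge")           # S30: medium is still too full
        if occupancy_after_hierarchical > second_threshold:
            stages.append("forced_storage")           # S40: last resort
    return stages


print(plan_stages(0.50, 0.50, second_threshold=0.80))
print(plan_stages(0.90, 0.70, second_threshold=0.80))
print(plan_stages(0.90, 0.85, second_threshold=0.80))
```

Forced storage is reached only when hierarchical merging has already run and the occupancy still exceeds the second threshold, which mirrors the wording of step S40.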
In this embodiment, the maximum value of the preset threshold range is smaller than the first threshold. In this embodiment, a nonvolatile memory may be used as a temporary storage medium for processing the small files, and the nonvolatile memory has the characteristics of large capacity, high memory durability, high-speed storage, and high cost performance, and provides support for rapidly realizing real-time merging of the small files.
In this embodiment, the small files are merged in the same level by the routing service unit, and the routing service unit directs the initial small file to be stored in the temporary storage medium. Taking the example of storing 6 small files of 10M, the file directories are merged as follows:
/data/smallfile/a.file // storage capacity is 10M
/data/smallfile/b.file // storage capacity is 10M
/data/smallfile/subdir/sa.file // storage capacity is 10M
/data/smallfile/subdir/sb.file // storage capacity is 10M
/data/smallfile/subdir/sc.file // storage capacity is 10M
/data/bigfile/bc.file // storage capacity is 10M
The a.file and the b.file can merge into one file, the sa.file, the sb.file and the sc.file can merge into one file, and the directory/data/bigfile does not have the same level of file directory, so the bc.file cannot be processed.
The merged file directory structure is as follows:
/data/smallfile/a.file // storage capacity is 0
/data/smallfile/b.file // storage capacity is 0
(merged entry for a.file and b.file) // storage capacity is 20M
/data/smallfile/subdir/sa.file // storage capacity is 0
/data/smallfile/subdir/sb.file // storage capacity is 0
/data/smallfile/subdir/sc.file // storage capacity is 0
(merged entry for sa.file, sb.file and sc.file) // storage capacity is 30M
/data/bigfile/bc.file // storage capacity is 10M
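The same-level directory merge in this example can be sketched as a grouping by parent directory. The sketch assumes the six paths implied by the description (a.file and b.file under /data/smallfile, sa.file, sb.file and sc.file under /data/smallfile/subdir, and bc.file under /data/bigfile), and it reads "no same-level file directory" as "no sibling file in the same directory". The function name `merge_same_level` is hypothetical.

```python
import posixpath
from collections import defaultdict


def merge_same_level(files):
    """Group files by parent directory and sum their sizes.

    files: mapping of path -> size in M.
    Returns (attached, untouched): directories with more than one file and
    their combined size, and files that have no sibling to merge with.
    """
    by_dir = defaultdict(list)
    for path in sorted(files):
        by_dir[posixpath.dirname(path)].append(path)

    attached, untouched = {}, {}
    for directory, paths in by_dir.items():
        if len(paths) > 1:
            attached[directory] = sum(files[p] for p in paths)
        else:
            untouched[paths[0]] = files[paths[0]]
    return attached, untouched


files = {
    "/data/smallfile/a.file": 10,
    "/data/smallfile/b.file": 10,
    "/data/smallfile/subdir/sa.file": 10,
    "/data/smallfile/subdir/sb.file": 10,
    "/data/smallfile/subdir/sc.file": 10,
    "/data/bigfile/bc.file": 10,
}
attached, untouched = merge_same_level(files)
print(attached)
print(untouched)
```

The combined sizes of 20M and 30M match the merged entries in the listing, and bc.file keeps its own 10M because it has no sibling.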
In some embodiments, the screening merge process comprises: adding the storage capacities of the small files in the small file aggregate in sequence and comparing the storage capacities with a first threshold value; if the sum of the storage capacity of every continuous n small files and the storage capacity of the (n + 1) th small file is larger than a first threshold value, the n small files are combined into a combined block file, and the operation is repeated until no (n + 1) th small file or the sum of the (n + 1) th small file and the storage capacity of the small file behind the n +1 th small file is smaller than the first threshold value; and if the sum of the storage capacities of the (n + 1) th small file and the small files after the small file is smaller than the first threshold, taking the (n + 1) th small file and the small file set after the small file with the sum of the storage capacities smaller than the first threshold as the residual small file. For example, assuming that the first threshold is 128M, if there are 3 small files of 60M each under the/a file directory, the sum of all file capacities under the/a file directory exceeds 128M, and at this time, the first 2 files under the/a file directory are selected to be merged; if there are 6 small files under the/a file directory, which are all 60M, and the sum of their storage capacities exceeds 128M, then 2 files smaller than 128M are merged and formed.
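One way to read the screening and merging rule is as a greedy pass over the file sizes: keep accumulating consecutive files and close the current group as soon as adding the next file would push the sum above the first threshold. The sketch below follows that reading; the function name `screening_merge` and the use of plain size lists are assumptions, and it relies on every individual file being smaller than the first threshold.

```python
def screening_merge(sizes, first_threshold=128):
    """Greedy grouping of consecutive small files (sizes in M).

    Returns (merged_groups, remaining): each merged group would become one
    merged block file; remaining holds the trailing files whose combined
    size never reached the point of exceeding the threshold.
    """
    merged_groups, current = [], []
    for size in sizes:
        # Adding this file would exceed the threshold: close the group of n files.
        if current and sum(current) + size > first_threshold:
            merged_groups.append(current)
            current = []
        current.append(size)
    return merged_groups, current


# The two cases from the text: three and six files of 60M each.
print(screening_merge([60, 60, 60]))
print(screening_merge([60] * 6))
```

With three 60M files the first two are merged and one remains; with six 60M files two merged block files of 120M are formed and the last two files remain, which agrees with the example in the text.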
In some embodiments, the hierarchical merging process includes: dividing the remaining small files with the storage capacity larger than a third threshold into m capacity levels in sequence from large to small; sequentially judging whether the memory occupancy rate of the temporary storage medium is smaller than a second threshold value when the small files in the first k capacity levels are transferred to the HDFS, if so, combining the small files under each same attached directory in the first k capacity levels and transferring the small files to the HDFS; and if the memory occupancy rate of the temporary storage medium is not less than the second threshold when k is m, merging and storing the small files under each same attached directory in the m capacity levels to the HDFS. In a preferred embodiment, the third threshold is 8M. In this embodiment, assuming that the first threshold is 128M, m is 4, and k is 2, the remaining small files may be divided into 4 capacity levels, namely 128M-64M, 64M-32M, 32M-16M and 16M-8M. When it is judged that the memory occupancy rate of the temporary storage medium would be less than the second threshold once the small files in the first 2 capacity levels are transferred to the HDFS, the small files under each same attached directory in the 128M-64M and 64M-32M capacity levels are combined and transferred to the HDFS.
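The level selection can be sketched as follows. The sketch assumes that each capacity level halves the previous one (128M-64M, 64M-32M, and so on, as in the example), that a size on a boundary belongs to the level whose upper bound it equals, and that occupancy is simply used space divided by capacity. The names `capacity_level` and `choose_k` are hypothetical, and grouping by attached directory is left out for brevity.

```python
def capacity_level(size_mb, first_threshold=128, m=4):
    """Return the level 1..m of a file, or None if it is not above the lowest bound."""
    upper = first_threshold
    for level in range(1, m + 1):
        lower = upper / 2
        if lower < size_mb <= upper:
            return level
        upper = lower
    return None


def choose_k(sizes_mb, used_mb, capacity_mb, second_threshold, first_threshold=128, m=4):
    """Smallest k such that moving levels 1..k brings occupancy below the threshold.

    Falls back to k = m when even moving all m levels is not enough.
    """
    freed_per_level = [0] * (m + 1)
    for size in sizes_mb:
        level = capacity_level(size, first_threshold, m)
        if level is not None:
            freed_per_level[level] += size

    freed = 0
    for k in range(1, m + 1):
        freed += freed_per_level[k]
        if (used_mb - freed) / capacity_mb < second_threshold:
            return k
    return m


remaining = [100, 70, 50, 40, 20, 10]
print(choose_k(remaining, used_mb=600, capacity_mb=1000, second_threshold=0.4))
```

In this made-up scenario, moving the first level alone leaves the occupancy at 0.43, while moving the first two levels brings it to 0.34, so k is 2, the same value used in the example above.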
In some embodiments, the forcibly storing all other small files in each small file collection in the temporary storage medium according to the longest unprocessed rule until the memory occupancy rate of the temporary storage medium is less than the second threshold includes: and in each time interval, sequentially and forcibly storing all the small files with the storage capacity smaller than a third threshold in each small file aggregate to the HDFS from long to short according to the time of staying in the temporary storage medium until the memory occupancy rate of the temporary storage medium is smaller than the second threshold.
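The longest-unprocessed rule amounts to sorting the small files below the third threshold by how long they have stayed in the temporary storage medium and moving them out, oldest first, until the occupancy drops below the second threshold. The sketch below is a minimal illustration; the dictionary fields `name`, `size_mb` and `arrived_at`, and the function name `forced_store`, are assumptions made for the example.

```python
def forced_store(files, used_mb, capacity_mb, second_threshold, third_threshold=8):
    """Move files smaller than third_threshold out of temporary storage, oldest first.

    files: list of dicts with hypothetical keys name, size_mb, arrived_at
           (a smaller arrived_at means a longer stay).
    Returns (moved_names, used_mb_after).
    """
    candidates = sorted(
        (f for f in files if f["size_mb"] < third_threshold),
        key=lambda f: f["arrived_at"],
    )
    moved = []
    for f in candidates:
        if used_mb / capacity_mb < second_threshold:
            break                      # occupancy is low enough, stop early
        moved.append(f["name"])
        used_mb -= f["size_mb"]
    return moved, used_mb


files = [
    {"name": "a", "size_mb": 5, "arrived_at": 30},
    {"name": "b", "size_mb": 6, "arrived_at": 10},
    {"name": "c", "size_mb": 7, "arrived_at": 20},
    {"name": "d", "size_mb": 20, "arrived_at": 1},   # not below the third threshold
]
print(forced_store(files, used_mb=100, capacity_mb=1000, second_threshold=0.09))
```

File d is skipped despite being the oldest because it is not below the third threshold, and file a stays because the occupancy is already under the second threshold once b and c have been moved.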
In some embodiments, the first threshold is a capacity value of a block of the HDFS. In a preferred embodiment, the first threshold is 128M. In this embodiment, the block is the minimum storage unit of the HDFS, and the block size of the HDFS is generally 128M.
In some embodiments, the method further comprises: and updating the metadata information of the small files which are transferred to the HDFS from the temporary storage medium. In this embodiment, the metadata information includes a file number, a storage location, a storage type, a size, an offset, and the like, and therefore, the metadata information of a small file that is transferred from the temporary storage medium to the HDFS needs to be updated.
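A metadata update of this kind might look like the sketch below. The record fields follow the list in the text (file number, storage location, storage type, size, offset); the class name `SmallFileMeta`, the storage-type labels "temporary" and "hdfs", the example path, and the choice to lay files out back to back inside the merged block file are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SmallFileMeta:
    file_number: int
    storage_location: str
    storage_type: str      # hypothetical labels: "temporary" or "hdfs"
    size: int              # size in M
    offset: int            # position inside the merged block file


def relocate(metas, merged_path):
    """Return updated records for files packed consecutively into merged_path."""
    updated, offset = [], 0
    for meta in metas:
        updated.append(
            replace(meta, storage_location=merged_path, storage_type="hdfs", offset=offset)
        )
        offset += meta.size
    return updated


before = [
    SmallFileMeta(1, "/data/smallfile/a.file", "temporary", 10, 0),
    SmallFileMeta(2, "/data/smallfile/b.file", "temporary", 20, 0),
]
after = relocate(before, "/merged/block-0001")   # hypothetical HDFS path
for meta in after:
    print(meta)
```

Keeping the offset and size per small file is what would let a reader locate an individual file inside the merged block file after the transfer.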
In some embodiments, the temporary storage medium includes a master node and a slave node that synchronize data with each other. In this embodiment, the temporary storage medium is generally deployed in a one-master, multiple-slave mode. All nodes start in the same state, and each node takes on master-node or slave-node functions according to the role for which it is selected. The master node provides data reading and writing, while the slave nodes provide data backup and support disaster recovery, so that the data service can be provided continuously.
In a second aspect of the embodiments of the present invention, a device for merging small files based on an HDFS is also provided. Fig. 2 is a schematic diagram illustrating an embodiment of a small file merging apparatus based on an HDFS according to the present invention. An HDFS-based small file merging device comprises: the file directory merging module 10 is configured to merge file directories of a plurality of small files with storage capacities within a preset threshold range in the temporary storage medium in the same level to obtain a plurality of attached directories; the screening and merging processing module 20 is configured to query the storage capacities of the respective small files of the multiple attached directories at preset time intervals, respectively collect the small files under the attached directories of which the sum of the storage capacities is greater than a first threshold as small file collections, perform screening and merging processing on each small file collection, and forward the merged block files obtained after the processing to the HDFS; the hierarchical merging processing module 30 is configured to perform hierarchical merging processing on the remaining small files after the screening and merging processing is performed on each small file aggregation if the memory occupancy rate of the temporary storage medium exceeds a second threshold; and a forced storage processing module 40 configured to, if the memory occupancy rate of the temporary storage medium exceeds the second threshold after the hierarchical merging processing, perform forced storage processing on all other small files in each small file aggregation according to the longest unprocessed rule until the memory occupancy rate of the temporary storage medium is less than the second threshold.
The small file merging device based on the HDFS of this embodiment merges the file directories of the small files and then performs screening and merging processing, layering and merging processing, and forced storage processing on the small files by taking attached directories as units. This effectively reduces the number of blocks occupied by the small files in the HDFS, saves storage space, and saves hardware resources. By distributing the small files between the temporary storage medium and the storage medium of the HDFS, the speed at which the HDFS responds to small files can be increased, the storage efficiency can be improved, and services with a large number of small files can run more stably and efficiently.
In a third aspect of the embodiments of the present invention, there is also provided a computer device, including a memory 302 and a processor 301, where the memory stores therein a computer program, and the computer program, when executed by the processor, implements any one of the above-mentioned method embodiments.
Fig. 3 is a schematic diagram of a hardware structure of an embodiment of a computer device for executing the HDFS-based small file merging method according to the present invention. Taking the computer device shown in fig. 3 as an example, the computer device includes a processor 301 and a memory 302, and may further include: an input device 303 and an output device 304. The processor 301, the memory 302, the input device 303 and the output device 304 may be connected by a bus or other means, and fig. 3 illustrates the connection by a bus as an example. The input device 303 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the HDFS-based small file merging device. The output means 304 may comprise a display device such as a display screen. The processor 301 executes various functional applications and data processing of the server by running the nonvolatile software programs, instructions and modules stored in the memory 302, that is, implements the HDFS-based small file merging method of the above-described method embodiment.
Finally, it should be noted that, as one of ordinary skill in the art can appreciate, all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing related hardware. The program of the HDFS-based small file merging method may be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of the embodiments of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. An HDFS-based small file merging method, characterized by comprising the following steps:
merging, at the same directory level, the file directories of a plurality of small files in a temporary storage medium whose storage capacity is within a preset threshold range, to obtain a plurality of attached directories;
querying the storage capacity of the small files under each attached directory at preset time intervals; for each attached directory whose small files have a total storage capacity larger than a first threshold, collecting those small files as a small file collection; performing screening and merging processing on each small file collection; and transferring the merged block files obtained after the processing to an HDFS (Hadoop Distributed File System);
if the memory occupancy rate of the temporary storage medium exceeds a second threshold, performing hierarchical merging processing on the small files remaining after the screening and merging processing of each small file collection;
and if the memory occupancy rate of the temporary storage medium still exceeds the second threshold after the hierarchical merging processing, performing forced storage processing on all other small files in each small file collection according to a longest-pending principle until the memory occupancy rate of the temporary storage medium is less than the second threshold.
2. The method of claim 1, wherein the screening merge process comprises:
adding up the storage capacities of the small files in the small file collection in sequence and comparing the running sum with the first threshold;
if the sum of the storage capacities of n consecutive small files plus the storage capacity of the (n+1)-th small file is larger than the first threshold, merging the n small files into a merged block file, and repeating until there is no (n+1)-th small file or the sum of the storage capacities of the (n+1)-th small file and the small files after it is smaller than the first threshold;
and if the sum of the storage capacities of the (n+1)-th small file and the small files after it is smaller than the first threshold, taking the (n+1)-th small file and the small files after it as the remaining small files.
3. The method of claim 1, wherein the hierarchical merging process comprises:
dividing the remaining small files whose storage capacity is larger than a third threshold into m capacity levels, ordered from large to small;
sequentially judging whether the memory occupancy rate of the temporary storage medium would be smaller than the second threshold if the small files in the first k capacity levels were transferred to the HDFS, and if so, merging the small files under each same attached directory in the first k capacity levels and transferring them to the HDFS;
if the memory occupancy rate of the temporary storage medium is still not smaller than the second threshold when k = m, merging the small files under each same attached directory in the m capacity levels and transferring them to the HDFS.
4. The method according to claim 3, wherein the step of performing forced storage processing on all other small files in each small file collection in the temporary storage medium according to a longest-pending principle until the memory occupancy rate of the temporary storage medium is smaller than the second threshold value comprises:
in each time interval, forcibly storing to the HDFS, in order from the longest to the shortest residence time in the temporary storage medium, all small files in each small file collection whose storage capacity is smaller than the third threshold, until the memory occupancy rate of the temporary storage medium is smaller than the second threshold.
5. The method of claim 1, wherein the first threshold is a capacity value of a block of the HDFS.
6. The method of claim 5, wherein the first threshold is 128 MB.
7. The method of claim 1, further comprising:
updating the metadata information of the small files transferred from the temporary storage medium to the HDFS.
8. The method of claim 1, wherein the temporary storage medium comprises a master node and a slave node that synchronize data with each other.
9. An HDFS-based small file merging apparatus, comprising:
a file directory merging module configured to merge, at the same directory level, the file directories of a plurality of small files in a temporary storage medium whose storage capacity is within a preset threshold range, to obtain a plurality of attached directories;
a screening and merging processing module configured to query the storage capacity of the small files under each attached directory at preset time intervals, to collect, for each attached directory whose small files have a total storage capacity larger than a first threshold, those small files as a small file collection, to perform screening and merging processing on each small file collection, and to transfer the merged block files obtained after the processing to the HDFS;
a hierarchical merging processing module configured to perform hierarchical merging processing on the small files remaining after the screening and merging processing of each small file collection if the memory occupancy rate of the temporary storage medium exceeds a second threshold; and
a forced storage processing module configured to perform, if the memory occupancy rate of the temporary storage medium still exceeds the second threshold after the hierarchical merging processing, forced storage processing on all other small files in each small file collection according to a longest-pending principle until the memory occupancy rate of the temporary storage medium is smaller than the second threshold.
10. A computer device comprising a memory and a processor, characterized in that the memory has stored therein a computer program which, when executed by the processor, performs the method according to any one of claims 1-8.
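For illustration only, the screening and merging processing of claim 2 and the forced storage processing of claim 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the `SmallFile` record, the function names, the `arrived_at` timestamp and the `used`/`capacity` bookkeeping are all hypothetical, no real HDFS client is called, and the leftover tail of the screening pass is treated as the remaining small files whenever its total does not exceed the first threshold.

```python
from dataclasses import dataclass
from typing import List, Tuple

# First threshold: one HDFS block, 128M per claims 5 and 6 (expressed here in bytes).
FIRST_THRESHOLD = 128 * 1024 * 1024


@dataclass
class SmallFile:
    name: str          # hypothetical identifier
    size: int          # storage capacity of the file
    arrived_at: float  # hypothetical time at which the file entered the temporary storage medium


def screening_merge(
    files: List[SmallFile], first_threshold: int = FIRST_THRESHOLD
) -> Tuple[List[List[SmallFile]], List[SmallFile]]:
    """Group consecutive files into merged blocks (claim 2).

    A group of n files is closed as soon as adding the (n+1)-th file would
    push the running sum above the first threshold. Whatever is left at the
    end is returned as the remaining small files.
    """
    merged_blocks: List[List[SmallFile]] = []
    current: List[SmallFile] = []
    current_size = 0
    for f in files:
        if current and current_size + f.size > first_threshold:
            merged_blocks.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size
    return merged_blocks, current


def force_store(
    files: List[SmallFile],
    used: int,
    capacity: int,
    second_threshold: float,
    third_threshold: int,
) -> Tuple[List[SmallFile], int]:
    """Pick files to flush under the longest-pending principle (claim 4).

    Files smaller than the third threshold are selected oldest first until
    the occupancy rate used / capacity drops below the second threshold.
    Returns the selected files and the resulting used capacity.
    """
    flushed: List[SmallFile] = []
    for f in sorted(files, key=lambda item: item.arrived_at):
        if used / capacity < second_threshold:
            break
        if f.size < third_threshold:
            flushed.append(f)
            used -= f.size
    return flushed, used
```

In this sketch the caller would write each group returned by `screening_merge` to the HDFS as one merged block file, and would pass the remaining small files on to the hierarchical merging and forced storage steps only when the occupancy rate of the temporary storage medium requires it.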
CN202011310846.7A 2020-11-20 2020-11-20 Small file merging method, device and equipment based on HDFS Active CN112434000B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011310846.7A CN112434000B (en) 2020-11-20 2020-11-20 Small file merging method, device and equipment based on HDFS

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011310846.7A CN112434000B (en) 2020-11-20 2020-11-20 Small file merging method, device and equipment based on HDFS

Publications (2)

Publication Number Publication Date
CN112434000A (en) 2021-03-02
CN112434000B (en) 2022-12-27

Family

ID=74693193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011310846.7A Active CN112434000B (en) 2020-11-20 2020-11-20 Small file merging method, device and equipment based on HDFS

Country Status (1)

Country Link
CN (1) CN112434000B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679177A (en) * 2017-09-29 2018-02-09 郑州云海信息技术有限公司 A kind of small documents storage optimization method based on HDFS, device, equipment
US20180121127A1 (en) * 2016-02-06 2018-05-03 Huawei Technologies Co., Ltd. Distributed storage method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
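YANG ZHANG et al.: "Improving the Efficiency of Storing for Small Files in HDFS", 2012 International Conference on Computer Science and Service System *
PAN Hao et al.: "Application of a Small File Storage Optimization Method in a Cloud Storage Platform", Light Industry Science and Technology (《轻工科技》) *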

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448946A (en) * 2021-07-05 2021-09-28 星辰天合(北京)数据科技有限公司 Data migration method and device and electronic equipment
CN113448946B (en) * 2021-07-05 2024-01-12 北京星辰天合科技股份有限公司 Data migration method and device and electronic equipment
CN113448938A (en) * 2021-07-20 2021-09-28 恒安嘉新(北京)科技股份公司 Data processing method and device, electronic equipment and storage medium
CN113836224A (en) * 2021-09-07 2021-12-24 南方电网大数据服务有限公司 Method and device for processing synchronous files from OGG (Oracle GoldenGate) to HDFS (Hadoop Distributed File System) and computer equipment
CN114564149A (en) * 2022-02-25 2022-05-31 上海英方软件股份有限公司 Data storage method, device, equipment and storage medium
CN114564149B (en) * 2022-02-25 2024-03-26 上海英方软件股份有限公司 Data storage method, device, equipment and storage medium
CN115499426A (en) * 2022-07-29 2022-12-20 天翼云科技有限公司 Method, device, equipment and medium for transmitting mass small files
CN116069741A (en) * 2023-02-20 2023-05-05 北京集度科技有限公司 File processing method, apparatus and computer program product

Also Published As

Publication number Publication date
CN112434000B (en) 2022-12-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant