CN116069741A - File processing method, apparatus and computer program product - Google Patents

Info

Publication number
CN116069741A
Authority
CN
China
Prior art keywords
file
files
identification information
value
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310184421.3A
Other languages
Chinese (zh)
Inventor
王范
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jidu Technology Co Ltd
Original Assignee
Beijing Jidu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jidu Technology Co Ltd filed Critical Beijing Jidu Technology Co Ltd
Priority to CN202310184421.3A priority Critical patent/CN116069741A/en
Publication of CN116069741A publication Critical patent/CN116069741A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/172Caching, prefetching or hoarding of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a file processing method, a file processing apparatus and a computer program product, and relates to the technical field of data management. The method comprises: acquiring a first file directory of a file storage system, wherein the first file directory comprises file identification information and file size information of N files; and performing first processing on the file identification information of the N files by adopting a backtracking algorithm, wherein the first processing divides the file identification information of the N files into M heaps, and, for each of at least M-1 of the M heaps, the total size value of the files corresponding to all the file identification information of the heap meets the following conditions: its difference from a first target value is smaller than or equal to a first threshold value, and it is smaller than or equal to the storage space value corresponding to the minimum storage unit of the file storage system. In this way, the total size of the files corresponding to each heap is close to the target size, so that after the files of each heap are merged, the size of the merged file is controllable.

Description

File processing method, apparatus and computer program product
Technical Field
The present application relates to the field of data management technologies, and in particular, to a file processing method, apparatus, and computer program product.
Background
In the current big data era, data governance is an important aspect of big data technology, and data quality directly reflects the effect of data governance. For a file storage system, data governance mainly means governing the small files in the system. A large number of small files burdens the management of the file storage system, consumes storage resources and slows down file queries, so the files in the file storage system need to be merged.
In the prior art, file merging is generally implemented by sequential scanning. Although this reduces the number of small files to some extent, the size of the merged files is not controllable, so the effect of small file governance is poor.
Disclosure of Invention
The application provides a file processing method, a file processing apparatus and a computer program product, which solve the problem that the size of the merged file is uncontrollable in existing file merging approaches.
According to a first aspect of the present application, there is provided a file processing method applied to a file storage system, the method comprising:
acquiring a first file directory of the file storage system, wherein the first file directory comprises file identification information and file size information of N files, the size of each file of the N files is smaller than or equal to a first target value, and N is an integer larger than 1;
performing first processing on the file identification information of the N files by adopting a backtracking algorithm, wherein the first processing divides the file identification information of the N files into M heaps, and M is an integer greater than 1;
wherein, for each of at least M-1 of the M heaps, the total size value of the files corresponding to all the file identification information of the heap meets the following conditions: its difference from the first target value is smaller than or equal to a first threshold value, and it is smaller than or equal to the storage space value corresponding to the minimum storage unit of the file storage system.
According to a second aspect of the present application, there is provided a file processing apparatus applied to a file storage system, the apparatus comprising:
an acquisition module, configured to acquire a first file directory of the file storage system, wherein the first file directory comprises file identification information and file size information of N files, the size of each file of the N files is smaller than or equal to a first target value, and N is an integer larger than 1;
a first processing module, configured to perform first processing on the file identification information of the N files by adopting a backtracking algorithm, wherein the first processing divides the file identification information of the N files into M heaps, and M is an integer greater than 1;
wherein, for each of at least M-1 of the M heaps, the total size value of the files corresponding to all the file identification information of the heap meets the following conditions: its difference from the first target value is smaller than or equal to a first threshold value, and it is smaller than or equal to the storage space value corresponding to the minimum storage unit of the file storage system.
According to a third aspect of the present application, there is provided a computer program product comprising a computer program or instructions which, when executed by a processor, implement the method of the first aspect.
In the embodiments of the present application, a backtracking algorithm is used to divide the file identification information of a plurality of files into heaps, so that, for each heap, the total size value of the files corresponding to all the file identification information of the heap differs from a preset target value by no more than a preset threshold value and is smaller than or equal to the storage space value corresponding to the minimum storage unit of the file storage system. The total file size corresponding to each heap is therefore close to the target size. After the heap splitting, the subsequent file merging can be performed according to the heap-splitting result, and the size of each merged file is close to the target size. The merged file size is thus controllable, which improves the small file governance effect.
It should be understood that the description of this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
FIG. 1 is a flowchart of a method for processing a file according to an embodiment of the present application;
FIG. 2 is a complete flowchart of a file processing method provided in an embodiment of the present application;
FIG. 3 is a block diagram of a file processing apparatus according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the current big data era, data governance is an important aspect of big data technology, and data quality directly reflects the effect of data governance. For a file storage system, data governance mainly means governing the small files in the system. Taking the open-source big data storage platform Hadoop Distributed File System (HDFS) as an example, a large number of small files in HDFS brings the following problems:
firstly, HDFS is designed for storing big data and is not suited to storing small files; too many small files increase the seek time of the hard disk and slow down data queries;
secondly, too many small files result in data blocks that are too small and too numerous, so that more data block information has to be maintained. This places a heavy burden on the Hadoop name node (NameNode), increases the memory consumption of the NameNode and slows down its restart. The role of the NameNode in HDFS is to maintain the metadata of the HDFS cluster, including the directory tree of files, the list of data blocks corresponding to each file, the permission settings of each file, the number of replicas of each file and other information.
The usual remedy is to merge small files into large files. At present, small files are generally merged by sequential scanning. This reduces the number of small files to some extent, but the size of each merged file depends on the order in which the files are read and is therefore uncontrollable, so the small file governance effect is poor.
In view of this, the present inventors propose a small file governance scheme applied to a file storage system to improve the small file governance effect.
The file processing method, apparatus and computer program product provided in the embodiments of the present application are described below with reference to the accompanying drawings and detailed description.
Referring to fig. 1, fig. 1 is a flowchart of a file processing method according to an embodiment of the present application, where the file processing method is applied to a file storage system.
As shown in fig. 1, the file processing method includes the steps of:
step 101: acquiring a first file directory of a file storage system, wherein the first file directory comprises file identification information and file size information of N files, the size of each file of the N files is smaller than or equal to a first target value, and N is an integer larger than 1;
step 102: performing first processing on the file identification information of the N files by adopting a backtracking algorithm, wherein the first processing divides the file identification information of the N files into M heaps, and M is an integer greater than 1;
wherein, for each of at least M-1 of the M heaps, the total size value of the files corresponding to all the file identification information of the heap meets the following conditions: its difference from the first target value is smaller than or equal to a first threshold value, and it is smaller than or equal to the storage space value corresponding to the minimum storage unit of the file storage system.
The embodiments of the present application use a backtracking algorithm to divide files with small file sizes (for example, file sizes smaller than or equal to the first target value) in a file storage system into heaps. The backtracking algorithm, also called the search-and-backtrack method or trial-and-error method, is a best-first search: it searches forward according to a selection condition in order to reach the goal. When the search reaches a step at which the earlier choice turns out not to be optimal or cannot reach the goal, it goes back one step and chooses again. This technique of stepping back and trying another path when the current one does not work is called backtracking, and a point in a state that satisfies the backtracking condition is called a backtracking point.
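The heap splitting described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `split_into_heaps`, the integer size units, the largest-first ordering, and the rule that the single heap allowed to miss the conditions is the leftover whose total is too small to form another heap are all choices made for this example.

```python
from typing import List, Optional, Sequence


def split_into_heaps(sizes: Sequence[int], target: int, threshold: int,
                     unit: int) -> Optional[List[List[int]]]:
    """Return heaps of file indices, or None if the search finds no valid split.

    Every heap except at most one has a total t with
    |t - target| <= threshold and t <= unit (the minimum storage unit).
    """
    lo, hi = target - threshold, min(target + threshold, unit)
    # Assumption: trying large files first tends to prune the search earlier.
    order = sorted(range(len(sizes)), key=lambda i: -sizes[i])
    used = [False] * len(sizes)
    heaps: List[List[int]] = []

    def fill(heap: List[int], total: int, start: int) -> bool:
        if lo <= total <= hi:
            heaps.append(list(heap))
            if next_heap():
                return True
            heaps.pop()        # backtrack: closing the heap here blocks later heaps
        for pos in range(start, len(order)):
            i = order[pos]
            if used[i] or total + sizes[i] > hi:
                continue
            used[i] = True
            heap.append(i)
            if fill(heap, total + sizes[i], pos + 1):
                return True
            heap.pop()         # backtrack: undo this choice and try the next file
            used[i] = False
        return False

    def next_heap() -> bool:
        rest = [i for i in order if not used[i]]
        if sum(sizes[i] for i in rest) < lo:
            if rest:
                heaps.append(rest)   # the one heap allowed to miss the conditions
            return True
        return fill([], 0, 0)

    return heaps if next_heap() else None
```

For example, `split_into_heaps([60, 50, 40, 30, 25, 20, 15, 10], 125, 3, 128)` searches for heaps whose totals lie between 122 and 128. The worst-case running time of such a search is exponential in the number of files, which is why a time limit on the search matters in practice.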
In step 101, the first file directory may be the file directory of part of the files of the file storage system or the file directory of all the files of the file storage system. Specifically, the corresponding file directory may be acquired according to the processing capability or the processing range. The first file directory records N pieces of file identification information, that is, the identification information of N files: each piece of file identification information corresponds to one file, and the N pieces of file identification information correspond to the N files to be merged.
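One possible in-memory form of such a file directory is sketched below. The names `FileEntry` and `select_small_files` and the example paths are hypothetical; the only property taken from the text is that each of the N files to be merged has a size smaller than or equal to the first target value.

```python
from typing import List, NamedTuple


class FileEntry(NamedTuple):
    file_id: str   # file identification information, e.g. a path
    size: int      # file size, in any consistent unit


def select_small_files(directory: List[FileEntry],
                       first_target: int) -> List[FileEntry]:
    """Keep only the entries whose size does not exceed the first target value."""
    return [entry for entry in directory if entry.size <= first_target]


directory = [FileEntry("/data/a", 4), FileEntry("/data/b", 300), FileEntry("/data/c", 12)]
candidates = select_small_files(directory, first_target=128)
```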
In step 102, the first processing may be understood as heap-splitting processing whose aim is to make the difference between the total size value of the files corresponding to all the file identification information of each heap and the first target value smaller than or equal to the first threshold value, so that the total size of the files corresponding to each heap is close to the first target value. The first target value and the first threshold value are both preset values; the first target value may be set according to the file size requirement of the file storage system. The first threshold value measures the precision requirement of the first processing: the smaller the first threshold value, the closer the total size value of each heap after the first processing is to the first target value, and the higher the precision requirement. Correspondingly, the smaller the first threshold value, the more controllable the merged file size, and the better the file merging and file governance effects.
Here, in addition to the condition that its difference from the first target value is smaller than or equal to the first threshold value (the difference condition for short), the total size value of the files corresponding to all the file identification information of each heap also needs to satisfy the condition that it is smaller than or equal to the storage space value corresponding to the minimum storage unit of the file storage system (the minimum storage condition for short).
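The two conditions can be written as small predicates. The function names below are illustrative, and the absolute difference is an assumption about how "difference" is meant.

```python
def meets_difference_condition(total: int, target: int, threshold: int) -> bool:
    """Difference condition: the heap total is within `threshold` of the target."""
    return abs(total - target) <= threshold


def meets_minimum_storage_condition(total: int, unit_size: int) -> bool:
    """Minimum storage condition: the heap total fits in one minimum storage unit."""
    return total <= unit_size
```

With a target of 128, a threshold of 5 and a unit size of 128, a heap total of 129 passes the first predicate but fails the second.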
The minimum storage condition is explained as follows. Taking HDFS as the file storage system, the concept of a block is introduced to facilitate the management and backup of files; the block is the smallest storage unit in HDFS. According to the storage and reading principle of Hadoop files, the closer the file size is to the block size, the higher the storage and reading efficiency. Merging works best, therefore, if files are merged to a size close to the block size, which requires dividing thousands of files into heaps whose total file size is close to the block size.
By making the total size value of the files in each heap satisfy the minimum storage condition, the size of each merged file can be close to the storage space value corresponding to the minimum storage unit of the file storage system. This greatly reduces the overall number of files, relieves the pressure on the Hadoop NameNode, and improves file read throughput and data query efficiency.
The heap splitting may produce the following special case: among the M heaps obtained, the total size values of M-1 heaps satisfy both the difference condition and the minimum storage condition, while the total size value of only one heap fails at least one of the two conditions (for example, only one heap fails the difference condition). In this case, the heap splitting can still be regarded as successful.
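The success criterion with its single permitted exception can be sketched as a check over the heap totals. The function name `split_succeeded` is hypothetical.

```python
from typing import Sequence


def split_succeeded(heap_totals: Sequence[int], target: int,
                    threshold: int, unit_size: int) -> bool:
    """True if more than one heap exists and at most one heap fails a condition."""
    failing = sum(
        1 for total in heap_totals
        if not (abs(total - target) <= threshold and total <= unit_size)
    )
    return len(heap_totals) > 1 and failing <= 1
```

A split with totals 125, 124 and 40 passes for a target of 125, a threshold of 3 and a unit size of 128, because only the last heap misses the difference condition; totals of 125, 40 and 30 do not pass.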
In the embodiments of the present application, the backtracking algorithm divides the file identification information of the files into heaps such that the total file size of each heap differs from the preset target value by no more than the preset threshold value and does not exceed the storage space value corresponding to the minimum storage unit of the file storage system, so the total file size of each heap is close to the target size. The subsequent file merging is then performed according to the heap-splitting result, and the size of each merged file is close to the target size. The merged file size is thus controllable, which improves the small file governance effect.
In the embodiment of the application, the file storage system may be, but is not limited to, HDFS.
In some embodiments, the total size value of the files corresponding to all the file identification information of at most one of the M heaps satisfies the following conditions: its difference from the first target value is smaller than or equal to the first threshold value, and it is larger than the storage space value corresponding to the minimum storage unit of the file storage system.
In this embodiment, the difference condition may be treated as a mandatory requirement and the minimum storage condition as a non-mandatory requirement. That is, provided that the difference between the total size value of each heap and the first target value is smaller than or equal to the first threshold value, the total size value of one heap may be allowed to exceed the storage space value corresponding to the minimum storage unit of the file storage system. The precision requirement of the heap splitting is then fully satisfied, and the heap splitting can be regarded as successful.
In the embodiments of the present application, the first target value may be equal to or close to the storage space value corresponding to the minimum storage unit of the file storage system, which is not limited here. Taking the Hadoop default block size of 128 MB as an example, the first target value may be 125 MB and the first threshold value may be 3 MB; the total size value of the files corresponding to all the file identification information of each heap obtained through the first processing is then between 122 MB and 128 MB.
In some embodiments, the first target value is a storage space value corresponding to a minimum storage unit of the file storage system.
Taking the Hadoop default block size as an example, the first target value may be 128 MB and the first threshold value may be 5 MB; the total size value of the files corresponding to all the file identification information of each heap obtained through the first processing is then between 123 MB and 128 MB.
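The two numeric examples follow from intersecting the difference condition with the minimum storage condition. A short sketch, with the hypothetical helper `allowed_total_range` and all values in MB:

```python
from typing import Tuple


def allowed_total_range(target: int, threshold: int, unit_size: int) -> Tuple[int, int]:
    """Inclusive range of heap totals that satisfy both conditions."""
    lower = max(0, target - threshold)
    upper = min(target + threshold, unit_size)
    return lower, upper


print(allowed_total_range(125, 3, 128))  # (122, 128)
print(allowed_total_range(128, 5, 128))  # (123, 128)
```

In the second case the upper bound comes from the block size rather than from the threshold, since 128 + 5 would exceed one block.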
It should be noted that the smaller the files stored in the file storage system and the larger their number, the better suited the files are to heap splitting with the backtracking algorithm provided by the embodiments of the present application. In view of this, in some embodiments, the size of each of the N files is less than or equal to 10% of the first target value.
To make the heap-splitting precision, that is, the first threshold value, more reasonable, the first threshold value can be determined from the file sizes when the files are all small. In some embodiments, the first threshold value is a preset multiple of the average file size or of the median file size of the N files.
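Deriving the threshold from the file sizes could look like the sketch below. The function name and the default multiple are assumptions; the text does not give a value for the preset multiple.

```python
import statistics
from typing import Sequence


def derive_first_threshold(sizes: Sequence[float], multiple: float = 1.0,
                           use_median: bool = False) -> float:
    """First threshold as a preset multiple of the mean or median file size."""
    base = statistics.median(sizes) if use_median else statistics.mean(sizes)
    return multiple * base
```

The median is less sensitive than the mean to a few unusually large files in the batch, which may make it the steadier choice when sizes are skewed.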
It should be noted that, if the preset first threshold value is too small, the first processing with the backtracking algorithm may never reach its goal; that is, no split exists in which the total size value of each of at least M-1 heaps satisfies both the difference condition and the minimum storage condition. In this case, the first processing may run indefinitely without completing.
In view of the above problem, the embodiments of the present application propose the following implementations.
In some embodiments, the method further comprises:
and under the condition that the duration of the first processing does not exceed the first preset duration, merging the N files according to the M stacks.
In this embodiment, the problem of infinite execution of the first process is eliminated by setting a timeout period (i.e., a first preset period) of the backtracking algorithm. As an example, the first preset duration may be 5 seconds or more than 5 seconds.
In this embodiment, the duration of the first processing not exceeding the first preset duration means that the backtracking algorithm has, within the first preset duration, divided the N pieces of file identification information into M heaps with the precision defined by the first threshold value. The N files can then be merged according to the M heaps. Merging according to the M heaps means that the files corresponding to the one or more pieces of file identification information assigned to the same heap are merged into one large file.
In this embodiment, setting a timeout for the backtracking algorithm effectively verifies whether the parameters of the first processing are reasonable, and thus whether the goal set for the first processing is feasible, so that the files can be merged.
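One way to bound the running time of a backtracking search is to check a deadline at every search step. The sketch below applies this to a simplified search for a single heap; the names `SplitTimeout` and `find_heap` are hypothetical, and the application does not state how its timeout is enforced.

```python
import time
from typing import List, Optional, Sequence


class SplitTimeout(Exception):
    """Raised when the search runs past its deadline."""


def find_heap(sizes: Sequence[int], lo: int, hi: int,
              deadline: float) -> Optional[List[int]]:
    """Backtracking search for indices whose sizes sum to a value in [lo, hi]."""
    chosen: List[int] = []

    def search(start: int, total: int) -> bool:
        if time.monotonic() > deadline:
            raise SplitTimeout
        if lo <= total <= hi:
            return True
        for i in range(start, len(sizes)):
            if total + sizes[i] <= hi:
                chosen.append(i)
                if search(i + 1, total + sizes[i]):
                    return True
                chosen.pop()   # backtrack
        return False

    return list(chosen) if search(0, 0) else None


# Five seconds matches the example value of the first preset duration.
heap = find_heap([60, 50, 15, 40], 122, 128, time.monotonic() + 5.0)
```

A caller would catch `SplitTimeout` and treat it as the signal to end the current processing.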
In some embodiments, the method further comprises:
controlling the first processing to finish under the condition that the duration of the first processing exceeds a first preset duration;
performing second processing on the file identification information of the N files by adopting a backtracking algorithm, wherein the second processing divides the file identification information of the N files into K heaps, and K is an integer greater than 1;
wherein, for each of at least K-1 of the K heaps, the total size value of the files corresponding to all the file identification information of the heap meets the following conditions: its difference from the first target value is smaller than or equal to a second threshold value, and it is smaller than or equal to the storage space value corresponding to the minimum storage unit of the file storage system, wherein the second threshold value is larger than the first threshold value.
In this embodiment, the duration of the first processing exceeding the first preset duration means that the backtracking algorithm cannot, within the first preset duration, divide the N pieces of file identification information into heaps with the precision defined by the first threshold value. This is probably because the first threshold value is set too small, that is, the preset first threshold value may not suit this batch of files. In this case, the first processing may be stopped, that is, controlled to end, to avoid spending too much time on heap splitting.
Further, the first threshold value can be relaxed to a second threshold value that is larger than the first threshold value, and the backtracking algorithm is used to split the file identification information of the N files into heaps again based on the second threshold value; this is the second processing. The goal of the second processing is that the difference between the total size value of the files corresponding to all the file identification information of each heap and the first target value is smaller than or equal to the second threshold value.
As an example, the second threshold may be twice the first threshold.
In this embodiment, since the second threshold value is larger than the first threshold value, the second processing is less precise than the first processing; moderately lowering the precision improves the processing efficiency and the probability of success of the second processing.
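The retry with a relaxed threshold can be sketched as a small driver around any heap-splitting routine. Here `solver` is a hypothetical callable that returns the heaps, or `None` when it fails or times out; the factor of 2 follows the example in which the second threshold is twice the first.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Heaps = List[List[int]]
Solver = Callable[[Sequence[int], int, int], Optional[Heaps]]


def split_with_relaxation(sizes: Sequence[int], target: int, first_threshold: int,
                          solver: Solver, factor: int = 2
                          ) -> Tuple[Optional[Heaps], Optional[int]]:
    """Try the first threshold, then a relaxed one; return (heaps, threshold used)."""
    for threshold in (first_threshold, first_threshold * factor):
        heaps = solver(sizes, target, threshold)
        if heaps is not None:
            return heaps, threshold
    return None, None   # both attempts failed; a further fallback is needed
```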
In addition, the heap splitting may produce the following special case: among the K heaps obtained, the total size values of K-1 heaps satisfy both the difference condition and the minimum storage condition, while the total size value of only one heap fails at least one of the two conditions (for example, only one heap fails the difference condition). In this case, the heap splitting can still be regarded as successful.
In some embodiments, the total size value of the files corresponding to all the file identification information of at most one of the K heaps satisfies the following conditions: its difference from the first target value is smaller than or equal to the second threshold value, and it is larger than the storage space value corresponding to the minimum storage unit of the file storage system.
Regarding the above-mentioned second process, reference may be made to the above-mentioned description of the first process, and for avoiding repetition, description thereof will be omitted.
In some embodiments, the method further comprises:
and under the condition that the duration of the second processing does not exceed a second preset duration, merging the N files according to the K stacks.
In this embodiment, the problem of infinite execution of the second process is eliminated by setting a timeout period (i.e., a second preset period) of the backtracking algorithm. As an example, the second preset time period may be the same as the first preset time period or may be different from the first preset time period.
In this embodiment, the duration of the second processing not exceeding the second preset duration means that the backtracking algorithm has, within the second preset duration, divided the N pieces of file identification information into K heaps with the precision defined by the second threshold value. The N files can then be merged according to the K heaps.
In this embodiment, setting a timeout for the backtracking algorithm effectively verifies whether the parameters of the second processing are reasonable, and thus whether the goal set for the second processing is feasible, so that the files can be merged.
In some embodiments, the method further comprises:
controlling the second processing to finish under the condition that the duration of the second processing exceeds a second preset duration;
and performing third processing on the file identification information of the N files, wherein the third processing divides the file identification information of the N files into heaps according to the arrangement order of the file identification information, and the total size value of the files corresponding to all the file identification information of each heap is smaller than or equal to a second target value.
In this embodiment, if the duration of the second processing exceeds the second preset duration, the backtracking algorithm cannot complete the splitting of the N pieces of file identification information within the second preset duration at the precision defined by the second threshold. In this case, the second processing may be stopped, i.e., controlled to end, to avoid spending excessive time on the splitting process.
At this point, the backtracking algorithm has been tried twice, and neither attempt completed the splitting of the N pieces of file identification information according to the preset target. In this case, a sequential algorithm may further be used to split the N pieces of file identification information into stacks. Specifically, the N pieces of file identification information are placed into stacks in their arrangement order: if adding the file corresponding to the next piece of file identification information would make the total size of the current stack exceed the second target value, the next piece of file identification information starts a new stack; otherwise, it is added to the current stack.
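The sequential algorithm can be sketched in Python as follows. This is a minimal illustration rather than the implementation of this application: the function name `sequential_split` is hypothetical, `cap` stands for the second target value, and every individual file is assumed to be no larger than `cap`.

```python
def sequential_split(sizes, cap):
    """Walk the files in their given order and close a stack before it overflows.

    Returns a list of stacks, each a list of indices into `sizes`, such that
    the total size of every stack is at most `cap` (assuming no single file
    is larger than `cap`).
    """
    stacks = []
    current, total = [], 0
    for i, size in enumerate(sizes):
        # The next file would push the current stack over the cap:
        # close the current stack and start a new one with this file.
        if current and total + size > cap:
            stacks.append(current)
            current, total = [], 0
        current.append(i)
        total += size
    if current:
        stacks.append(current)
    return stacks
```

For example, `sequential_split([4, 3, 3, 5, 5, 2], 10)` yields the stacks `[[0, 1, 2], [3, 4], [5]]`. The pass is linear in N and always terminates, but it makes no attempt to bring each stack close to the target, which is why it is used only as the last fallback.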
In this embodiment of the present application, the first target value and the second target value may be equal or unequal.
It should be noted that, in terms of the file merging effect, the first processing outperforms the second processing, and the second processing outperforms the third processing.
It should be further noted that testing shows that, in most cases, the first processing successfully splits the N pieces of file identification information, and even when it does not, the second processing almost always succeeds. Therefore, the probability that the third processing is needed is close to zero.
In some embodiments, the file storage system includes a plurality of partitions, and the first file directory is a file directory of a first partition of the file storage system.
In this embodiment, for a file storage system that includes multiple partitions, such as HDFS, the same partition directory may contain thousands of files, and files in the same partition can be merged with one another. The first file directory is therefore the file directory of a first partition of the file storage system.
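As a minimal Python sketch of restricting merging to files of the same partition, the listing can be grouped by parent directory. The assumptions are that a directory listing is already available as (path, size) pairs and that each partition corresponds to one parent directory; the function name and the example paths are hypothetical, and no HDFS client API is involved.

```python
import posixpath
from collections import defaultdict


def group_by_partition(listing):
    """Group (path, size) pairs by parent directory.

    Each parent directory is assumed to be one partition directory. Returns a
    dict mapping the directory to a list of (file name, size) pairs.
    """
    groups = defaultdict(list)
    for path, size in listing:
        directory = posixpath.dirname(path)
        groups[directory].append((posixpath.basename(path), size))
    return dict(groups)
```

Each value of the returned dictionary then plays the role of one "first file directory": its file names are the file identification information and its sizes are the file size information.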
A specific embodiment is provided below in conjunction with fig. 2 to illustrate the overall flow of the file processing method.
As shown in fig. 2, the file processing method includes the steps of:
step 201: acquiring the file directory of a first partition;
step 202: using a backtracking algorithm to split the file identification information of the file directory into stacks according to the set target value, the set threshold, and the set timeout; if successful, step 205 is performed, and if unsuccessful, step 203 is performed;
step 203: using the backtracking algorithm to split the file identification information of the file directory into stacks according to the target value, twice the threshold, and the timeout; if successful, step 205 is performed, and if unsuccessful, step 204 is performed;
step 204: splitting the file identification information of the file directory into stacks using a sequential algorithm;
step 205: merging the files associated with the file directory according to the resulting stacks.
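The control flow of steps 202 to 204 can be sketched in Python as below. This is a hedged illustration only: `plan_stacks` and its parameters are hypothetical names, the two splitting strategies are passed in as functions, and the backtracking strategy is assumed to return `None` whenever it fails or times out. Step 205 (the actual merge) is left to the caller.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Stacks = List[List[int]]
# Assumed contract: returns stacks on success, None on failure or timeout.
Backtracker = Callable[[Sequence[int], int, int, float], Optional[Stacks]]
Sequential = Callable[[Sequence[int], int], Stacks]


def plan_stacks(
    sizes: Sequence[int],
    target: int,
    threshold: int,
    timeout_s: float,
    backtrack: Backtracker,
    sequential: Sequential,
) -> Tuple[str, Stacks]:
    """Return (strategy label, stacks) following the three-stage fallback."""
    # Step 202: backtracking with the configured threshold.
    stacks = backtrack(sizes, target, threshold, timeout_s)
    if stacks is not None:
        return "backtrack", stacks
    # Step 203: backtracking again with twice the threshold.
    stacks = backtrack(sizes, target, 2 * threshold, timeout_s)
    if stacks is not None:
        return "backtrack-relaxed", stacks
    # Step 204: the sequential split, which always produces a result.
    return "sequential", sequential(sizes, target)
```

Returning the strategy label alongside the stacks lets the caller log how often each fallback is actually reached.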
In summary, through the above process of the embodiments of the present application, the size of the merged files is controllable, which improves the effect of small-file processing.
Referring to fig. 3, fig. 3 is a block diagram of a file processing apparatus according to an embodiment of the present application.
As shown in fig. 3, the file processing apparatus 300 includes:
an obtaining module 301, configured to obtain a first file directory of the file storage system, where the first file directory includes file identification information and file size information of N files, where a size of each file of the N files is smaller than or equal to a first target value, and N is an integer greater than 1;
the first processing module 302 is configured to perform a first process on file identification information of the N files by using a backtracking algorithm, where the first process is configured to split the file identification information of the N files to obtain M stacks, where M is an integer greater than 1;
wherein, the total size value of the files corresponding to all the file identification information of each of at least M-1 stacks in the M stacks meets the following conditions: the difference value between the first target value and the second target value is smaller than or equal to a first threshold value and smaller than or equal to a storage space value corresponding to the minimum storage unit of the file storage system.
Optionally, the file processing device 300 further includes:
and the third processing module is used for merging the N files according to the M stacks under the condition that the duration of the first processing does not exceed the first preset duration.
Optionally, the file processing device 300 further includes:
the first control module is used for controlling the first processing to be ended when the duration of the first processing exceeds a first preset duration;
a fourth processing module, configured to perform a second process on file identification information of the N files by using a backtracking algorithm, where the second process is configured to split the file identification information of the N files to obtain K stacks, where K is an integer greater than 1;
wherein, in the K stacks, the total size value of the files corresponding to all the file identification information of each stack of at least K-1 stacks satisfies the following conditions: the difference value between the first target value and the second target value is smaller than or equal to a second threshold value, and smaller than or equal to a storage space value corresponding to the minimum storage unit of the file storage system, wherein the second threshold value is larger than the first threshold value.
Optionally, the file processing device 300 further includes:
and the fifth processing module is used for merging the N files according to the K stacks under the condition that the duration of the second processing does not exceed the second preset duration.
Optionally, the file processing device 300 further includes:
the second control module is used for controlling the second processing to be ended when the duration of the second processing exceeds a second preset duration;
and a sixth processing module, configured to perform third processing on the file identification information of the N files, where the third processing is configured to split the file identification information of the N files according to the arrangement sequence of the file identification information, and a total size value of the files corresponding to all the file identification information of each stack is less than or equal to a second target value.
Optionally, the total size value of the files corresponding to all the file identification information of at most one of the M stacks satisfies the following condition: the difference value between the first target value and the second target value is smaller than or equal to a first threshold value and is larger than a storage space value corresponding to the minimum storage unit of the file storage system.
Optionally, the size of each of the N files is less than or equal to 10% of the first target value.
Optionally, the first threshold is a file size average value or a preset multiple of a file size median of the N files.
Optionally, the file storage system includes a plurality of partitions, and the first file directory is a file directory of a first partition of the file storage system.
Optionally, the file storage system is a Hadoop distributed file system HDFS.
Optionally, the first target value is a storage space value corresponding to a minimum storage unit of the file storage system.
The file processing device in the embodiments of the present application can implement each process of the file processing method embodiments and achieve the same beneficial effects. To avoid repetition, the details are not described again here.
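The optional parameter choices listed above (each file no larger than 10% of the first target value, and a first threshold derived from the mean or from a preset multiple of the median of the file sizes) can be sketched in Python as follows. The function names, the default values, and the 128 MiB target used in the example are assumptions for illustration only.

```python
from statistics import mean, median


def select_small_files(sizes, target, ratio=0.10):
    """Return the indices of files whose size is at most ratio * target."""
    limit = ratio * target
    return [i for i, size in enumerate(sizes) if size <= limit]


def first_threshold(sizes, use_median=False, multiple=1.0):
    """Derive the first threshold from the file sizes.

    By default the mean file size is used; with use_median=True the result is
    `multiple` times the median file size.
    """
    if use_median:
        return multiple * median(sizes)
    return mean(sizes)
```

For instance, with an assumed target of 128 MiB and file sizes of 1 MiB, 20 MiB, 5 MiB, and 12 MiB, the 20 MiB file is excluded by the 10% rule, the mean of the remaining sizes is 6 MiB, and twice their median is 10 MiB.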
The methods in this application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the methods may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer programs or instructions are loaded and executed on a computer, the processes or functions described herein are performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, a network device, a user device, a core network device, an operations, administration, and maintenance (OAM) device, or another programmable device.
The computer programs or instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another. For example, the computer programs or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means. The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium, such as a floppy disk, hard disk, or magnetic tape; an optical medium, such as a digital video disc; or a semiconductor medium, such as a solid state disk. The computer readable storage medium may be a volatile or nonvolatile storage medium, or may include both volatile and nonvolatile types of storage medium.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (12)

1. A method of file processing, for application to a file storage system, the method comprising:
acquiring a first file directory of the file storage system, wherein the first file directory comprises file identification information and file size information of N files, the size of each file of the N files is smaller than or equal to a first target value, and N is an integer larger than 1;
performing first processing on file identification information of the N files by adopting a backtracking algorithm, wherein the first processing is used for carrying out split-stacking on the file identification information of the N files to obtain M stacks, and M is an integer greater than 1;
wherein, the total size value of the files corresponding to all the file identification information of each of at least M-1 stacks in the M stacks meets the following conditions: the difference value between the first target value and the second target value is smaller than or equal to a first threshold value and smaller than or equal to a storage space value corresponding to the minimum storage unit of the file storage system.
2. The method according to claim 1, wherein the method further comprises:
and under the condition that the duration of the first processing does not exceed the first preset duration, merging the N files according to the M stacks.
3. The method according to claim 1, wherein the method further comprises:
controlling the first processing to finish under the condition that the duration of the first processing exceeds a first preset duration;
performing second processing on the file identification information of the N files by adopting a backtracking algorithm, wherein the second processing is used for carrying out split-stacking on the file identification information of the N files to obtain K stacks, and K is an integer greater than 1;
wherein, in the K stacks, the total size value of the files corresponding to all the file identification information of each stack of at least K-1 stacks satisfies the following conditions: the difference value between the first target value and the second target value is smaller than or equal to a second threshold value, and smaller than or equal to a storage space value corresponding to the minimum storage unit of the file storage system, wherein the second threshold value is larger than the first threshold value.
4. A method according to claim 3, characterized in that the method further comprises:
and under the condition that the duration of the second processing does not exceed a second preset duration, merging the N files according to the K stacks.
5. A method according to claim 3, characterized in that the method further comprises:
controlling the second processing to finish under the condition that the duration of the second processing exceeds a second preset duration;
and performing third processing on the file identification information of the N files, wherein the third processing is used for splitting the file identification information of the N files into stacks according to the arrangement order of the file identification information, and the total size value of the files corresponding to all the file identification information of each stack is smaller than or equal to a second target value.
6. The method of claim 1, wherein the total size value of the files corresponding to all file identification information of at most one of the M stacks satisfies the following condition: the difference value between the first target value and the second target value is smaller than or equal to a first threshold value and is larger than a storage space value corresponding to the minimum storage unit of the file storage system.
7. The method of claim 1, wherein the size of each of the N files is less than or equal to 10% of the first target value.
8. The method of claim 7, wherein the first threshold is a file size average or a preset multiple of a file size median of the N files.
9. The method of claim 1, wherein the file storage system comprises a plurality of partitions, and wherein the first file directory is a file directory of a first partition of the file storage system.
10. The method of claim 1, wherein the first target value is a storage space value corresponding to a minimum storage unit of the file storage system.
11. A file processing apparatus, applied to a file storage system, the apparatus comprising:
an acquisition module, configured to acquire a first file directory of the file storage system, wherein the first file directory comprises file identification information and file size information of N files, the size of each file of the N files is smaller than or equal to a first target value, and N is an integer larger than 1;
a first processing module, configured to perform first processing on the file identification information of the N files by adopting a backtracking algorithm, wherein the first processing is used for splitting the file identification information of the N files into stacks to obtain M stacks, and M is an integer greater than 1;
wherein, the total size value of the files corresponding to all the file identification information of each of at least M-1 stacks in the M stacks meets the following conditions: the difference value between the first target value and the second target value is smaller than or equal to a first threshold value and smaller than or equal to a storage space value corresponding to the minimum storage unit of the file storage system.
12. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the method of any one of claims 1 to 10.
CN202310184421.3A 2023-02-20 2023-02-20 File processing method, apparatus and computer program product Pending CN116069741A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310184421.3A CN116069741A (en) 2023-02-20 2023-02-20 File processing method, apparatus and computer program product

Publications (1)

Publication Number Publication Date
CN116069741A true CN116069741A (en) 2023-05-05

Family

ID=86183706

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090083344A1 (en) * 2007-09-26 2009-03-26 Hitachi, Ltd. Computer system, management computer, and file management method for file consolidation
CN110321329A (en) * 2019-06-18 2019-10-11 中盈优创资讯科技有限公司 Data processing method and device based on big data
JP2019204473A (en) * 2018-05-22 2019-11-28 広東技術師範学院 Method for writing plurality of small files of 2 mb or smaller to hdfs having data merge module and hbase cash module on the basis of hadoop
CN112434000A (en) * 2020-11-20 2021-03-02 苏州浪潮智能科技有限公司 Small file merging method, device and equipment based on HDFS
CN113177024A (en) * 2021-06-29 2021-07-27 南京烽火星空通信发展有限公司 Data global merging method under mass data scene
CN113468128A (en) * 2021-07-21 2021-10-01 上海浦东发展银行股份有限公司 Data processing method and device, electronic equipment and storage medium
CN113568877A (en) * 2020-04-28 2021-10-29 杭州海康威视数字技术股份有限公司 File merging method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230505