CN112965939A - File merging method, device and equipment - Google Patents

File merging method, device and equipment

Info

Publication number
CN112965939A
Authority
CN
China
Prior art keywords
merged
file
files
target
merging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110174031.9A
Other languages
Chinese (zh)
Inventor
张中源
李超
王佳典
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110174031.9A
Publication of CN112965939A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Abstract

The embodiment of the specification provides a file merging method, a file merging device and file merging equipment, wherein the method comprises the following steps: acquiring an information set of files to be merged of a target partition; the information set of the files to be merged comprises the sizes and paths of a plurality of files to be merged, and the size of the files to be merged is smaller than or equal to a preset threshold value; determining the number of merging partitions according to the size of each file to be merged; merging the files to be merged by using a distributed stream data stream engine based on the path of each file to be merged and the merging partition number to obtain a target file set; wherein the number of files contained in the target file set is equal to the number of the merged partitions; and writing the target file set into the target partition. In the embodiment of the specification, the files to be merged in the target partition can be efficiently merged, so that the number of the files in the target partition is effectively reduced, and the reading and writing efficiency can be improved.

Description

File merging method, device and equipment
Technical Field
The embodiment of the specification relates to the technical field of big data, in particular to a file merging method, device and equipment.
Background
In the field of big data, the number of data files grows geometrically as data continues to accumulate, and the files in a distributed file system vary widely in size. In a distributed file system, a file that is much smaller than the configured block (BLOCK) size is considered a small file. Because different files in the distributed file system may have to be read from different servers, the more small files the system holds, the slower the corresponding read and write speed becomes, which in turn limits the scalability of the cluster. The prior art offers no effective way of merging small files in a distributed file system.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the specification provides a file merging method, a file merging device and file merging equipment, and aims to solve the problem that small files in a distributed file system cannot be effectively merged in the prior art.
An embodiment of the present specification provides a file merging method, including: acquiring an information set of files to be merged of a target partition; the information set of the files to be merged comprises the sizes and paths of a plurality of files to be merged, and the size of the files to be merged is smaller than or equal to a preset threshold value; determining the number of merging partitions according to the size of each file to be merged; merging the files to be merged by using a distributed stream data stream engine based on the path of each file to be merged and the merging partition number to obtain a target file set; wherein the number of files contained in the target file set is equal to the number of the merged partitions; and writing the target file set into the target partition.
An embodiment of the present specification further provides a file merging device, including: the acquisition module is used for acquiring the information set of the files to be merged of the target partition; the information set of the files to be merged comprises the sizes and paths of a plurality of files to be merged, and the size of the files to be merged is smaller than or equal to a preset threshold value; the determining module is used for determining the number of the merging partitions according to the size of each file to be merged; the merging module is used for merging the files to be merged by utilizing a distributed stream data stream engine based on the path of each file to be merged and the number of the merging partitions to obtain a target file set; wherein the number of files contained in the target file set is equal to the number of the merged partitions; and the processing module is used for writing the target file set into the target partition.
The embodiment of the specification also provides a file merging device, which comprises a processor and a memory for storing processor executable instructions, wherein the processor executes the instructions to realize the steps of the file merging method.
The present specification also provides a computer readable storage medium, on which computer instructions are stored, and when executed, the instructions implement the steps of the file merging method.
The embodiment of the present specification provides a file merging method that obtains an information set of files to be merged of a target partition, where the information set includes the sizes and paths of a plurality of files to be merged, and the size of each file to be merged is smaller than or equal to a preset threshold. To keep the merged files at a moderate size, the number of merging partitions can be determined according to the size of each file to be merged. The plurality of files to be merged are then merged by a distributed stream data stream engine, based on the path of each file to be merged and the number of merging partitions, to obtain a target file set; performing the merge with the distributed stream data stream engine can effectively improve merging efficiency. Further, the target file set may be written to the target partition so that data can subsequently be read from the merged files. In this way, the files to be merged in the target partition are merged efficiently, the number of files in the target partition is effectively reduced, and read and write efficiency can be improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the disclosure, are incorporated in and constitute a part of this specification, and are not intended to limit the embodiments of the disclosure. In the drawings:
FIG. 1 is a schematic diagram illustrating steps of a file merging method provided in an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a file merging device provided in an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of file merging equipment provided in an embodiment of the present specification.
Detailed Description
The principles and spirit of the embodiments of the present specification will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are presented merely to enable those skilled in the art to better understand and to implement the embodiments of the present description, and are not intended to limit the scope of the embodiments of the present description in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, implementations of the embodiments of the present description may be embodied as a system, an apparatus, a method, or a computer program product. Therefore, the disclosure of the embodiments of the present specification can be embodied in the following forms: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
Although the flow described below includes operations that occur in a particular order, it should be appreciated that the processes may include more or fewer operations, which may be performed sequentially or in parallel (e.g., using parallel processors or a multi-threaded environment).
Referring to fig. 1, the present embodiment can provide a file merging method. The file merging method can be used for merging the files with the size smaller than or equal to the preset threshold value in the distributed file system by taking the partitions as units, so that the operation efficiency of the distributed file system is effectively improved. The file merging method may include the following steps.
S101: acquiring an information set of files to be merged of a target partition; the information set of the files to be merged comprises the sizes and paths of the files to be merged, and the size of the files to be merged is smaller than or equal to a preset threshold value.
In this embodiment, the information set of the file to be merged of the target partition may be acquired. The information set of the files to be merged may include a plurality of groups of data, each group of data includes a size and a path of one file to be merged, and the file to be merged may be a file whose size is smaller than or equal to a preset threshold.
In this embodiment, the target partition may be a partition in a Hadoop Distributed File System (HDFS) whose small files are to be merged. HDFS adopts a master-slave structure: an HDFS cluster consists of one NameNode and a plurality of DataNodes. The NameNode serves as the master server and manages the namespace of the file system and client access to files; the DataNodes in the cluster manage the stored data. In HDFS, every file, no matter how large or small, must record its metadata (the description information of the file) in the memory of the NameNode, and each record occupies about 150 bytes. The HDFS has only one NameNode, whose memory is very limited. The more files there are, the more content must be recorded; enlarging the memory of the NameNode by technical means also increases cost. In addition, the files in the distributed file system need to be read from different servers (DataNodes), so the larger the number of files, the slower the read and write speed.
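As a rough illustration of why a large number of small files burdens the NameNode, the following sketch multiplies a file count by the approximately 150 bytes per file quoted above. The function name and the file counts are assumptions chosen for illustration only, and the real footprint depends on further factors.

```python
# Rough estimate only: assumes about 150 bytes of NameNode memory per file,
# the figure quoted in the text above.
METADATA_BYTES_PER_FILE = 150


def namenode_metadata_bytes(file_count: int) -> int:
    """Return the approximate metadata footprint for file_count files."""
    return file_count * METADATA_BYTES_PER_FILE


# Hypothetical case: the same data held as 10,000,000 small files
# versus 100,000 merged files.
before = namenode_metadata_bytes(10_000_000)
after = namenode_metadata_bytes(100_000)
print(f"before merge: {before / 1024**2:.1f} MiB, after merge: {after / 1024**2:.1f} MiB")
```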
In this embodiment, because the amount of data in the system is large, the data is usually stored in partitions for convenient storage and lookup. The partitioning manner may be preset, and partitioning is generally performed by time. For example, the service data of a certain service type generated on the same day is written into one folder; the folder can be used as a partition, and a partition can also include a plurality of subfolders. One folder of the HDFS may correspond to one Hive table, and a partition may be understood as a subfolder under the current folder, e.g., a subfolder 2020-01-01 under the folder of the loan service, which holds all loan service data of 2020-01-01. Hive is a data warehouse tool based on Hadoop that is used for data extraction, transformation and loading, and provides a mechanism for storing, querying and analyzing large-scale data stored in Hadoop; Hadoop itself is a distributed system infrastructure.
In this embodiment, the data in the HDFS is stored in the DataNodes in the form of blocks (BLOCK). The default BLOCK size is 128M and may be adjusted according to actual requirements. The preset threshold may be determined according to the configured BLOCK size of the HDFS: a file whose size is far smaller than the BLOCK size may be considered a small file, that is, a file to be merged. For example, when the BLOCK size is 128M, the preset threshold may be set to 64M. It is understood that other values, for example 32M, may also be set according to actual situations, and this is not limited in this embodiment of the present specification.
In this embodiment, the size and the path of each file in the target partition may be queried by using a structured query statement, and for convenience of comparing sizes of the files, the size may be uniformly in units of bytes, and of course, other units may be used, which may be determined specifically according to actual situations, and this is not limited in this specification.
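A minimal sketch of such a query is given below. It uses an in-memory SQLite table as a stand-in for wherever the file metadata is actually held; the table name `file_meta`, its columns and the sample paths are hypothetical and are not taken from the specification.

```python
import sqlite3

# Hypothetical metadata table standing in for the real file catalog.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_meta (partition_name TEXT, path TEXT, size_bytes INTEGER)"
)
conn.executemany(
    "INSERT INTO file_meta VALUES (?, ?, ?)",
    [
        ("2020-01-01", "/loan/2020-01-01/part-0", 5 * 1024 * 1024),
        ("2020-01-01", "/loan/2020-01-01/part-1", 200 * 1024 * 1024),
        ("2020-01-02", "/loan/2020-01-02/part-0", 1 * 1024 * 1024),
    ],
)


def query_partition_files(connection, partition_name):
    """Return (path, size_bytes) rows for one partition, with sizes in bytes."""
    cursor = connection.execute(
        "SELECT path, size_bytes FROM file_meta "
        "WHERE partition_name = ? ORDER BY path",
        (partition_name,),
    )
    return cursor.fetchall()


rows = query_partition_files(conn, "2020-01-01")
```

Keeping every size in one unit (bytes here) means the later comparison against the preset threshold needs no conversion.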
S102: and determining the number of merging partitions according to the size of each file to be merged.
In this embodiment, in order to avoid an oversized merged file, the number of merging partitions may be determined according to the size of each file to be merged. The number of merging partitions represents how many files the files to be merged are merged into.
S103: merging a plurality of files to be merged by using a distributed stream data stream engine based on the path and merging partition number of each file to be merged to obtain a target file set; and the number of files contained in the target file set is equal to the number of the merged partitions.
In this embodiment, a plurality of files to be merged may be merged by using a distributed stream data stream engine based on the path and the number of merging partitions of each file to be merged, so as to obtain a target file set. The target file set may include at least one merged file, and the number of files included in the target file set is equal to the number of merged partitions.
In this embodiment, the core of Flink is a distributed stream data stream engine (that is, a distributed streaming dataflow engine) written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime system can execute both batch and stream processing programs. Hive is a data warehouse based on Hadoop that can map structured data in Hadoop to data tables and process files by turning Structured Query Language (SQL) statements into MapReduce programs. As a stream engine, Flink differs in concept from the Hive data warehouse: Flink SQL computes in memory and is faster than Hive SQL; according to this embodiment, for the same SQL statement Flink SQL is 10 to 100 times as fast as Hive SQL. In essence, Hive SQL converts SQL into MapReduce, which is disk based, while the underlying operations of Flink SQL are memory based. MapReduce is a programming model for parallel computation over large-scale data sets.
In this embodiment, a merging partition may refer to a partition of Flink, and the number of merging partitions may refer to how many Flink partitions are used to process the data in order to carry out the merge. When data is merged, the maximum size of a merged file may be equal to the BLOCK size, and the merge follows, as far as possible, the principle that the size of each merged file equals the BLOCK size. For example, if the number of merging partitions is 2 and the total size of the files to be merged is 200M, one merging partition may be filled to 128M during merging, and the remaining files to be merged are written into the other merging partition.
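The 200M example can be checked with a short calculation. The sketch below fills merging partitions up to the BLOCK size one after another and leaves the remainder to the last one. It treats the data as freely divisible, which is a simplifying assumption, and the function name is hypothetical.

```python
def planned_partition_sizes(total_size: int, block_size: int, partition_count: int) -> list[int]:
    """Fill partitions up to block_size in turn; the last one takes the remainder."""
    sizes = []
    remaining = total_size
    for _ in range(partition_count):
        portion = min(block_size, remaining)
        sizes.append(portion)
        remaining -= portion
    return sizes


MB = 1024 * 1024
# The example from the text: 200M in total, 128M blocks, 2 merging partitions.
print([size // MB for size in planned_partition_sizes(200 * MB, 128 * MB, 2)])  # prints [128, 72]
```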
S104: the target file set is written to the target partition.
In this embodiment, after the files to be merged are successfully merged, the target file set may be written into the target partition, so that data may be subsequently read based on the merged files, thereby effectively reducing the number of files in the target partition, and correspondingly reducing the total amount of metadata that needs to be recorded in the NameNode.
In this embodiment, after the target file set is written into the target partition, the files to be merged that were originally stored in the target partition may be deleted in order to avoid storing the data twice. In some embodiments, whether the merged target file set has lost any data may be verified first; if the verification passes, the files to be merged previously stored in the target partition may be deleted, which completes the merging of the files.
From the above description, it can be seen that the embodiments of the present specification achieve the following technical effects. An information set of files to be merged of a target partition can be obtained, where the information set includes the sizes and paths of a plurality of files to be merged, and the size of each file to be merged is smaller than or equal to a preset threshold. To keep the merged files at a moderate size, the number of merging partitions can be determined according to the size of each file to be merged. The plurality of files to be merged are then merged by a distributed stream data stream engine, based on the path of each file to be merged and the number of merging partitions, to obtain a target file set; performing the merge with the distributed stream data stream engine can effectively improve merging efficiency. Further, the target file set may be written to the target partition so that data can subsequently be read from the merged files. In this way, the files to be merged in the target partition are merged efficiently, the number of files in the target partition is effectively reduced, and read and write efficiency can be improved.
In one embodiment, obtaining the information set of the files to be merged of the target partition may include: determining a target partition, where the merging state of the target partition is unmerged. The size and the path of each file in the target partition can then be obtained, and, according to the size of each file in the target partition, the files whose size is smaller than or equal to the preset threshold are taken as the files to be merged. Further, the information set of the files to be merged may be generated based on the size and path of each file to be merged.
In this embodiment, the merging state of each partition can be marked in the database of the distributed file system, so that the same partition is not operated on repeatedly and resources are effectively saved. The merging state may be either unmerged or merged, and each state may be marked with a number or a character so that a computer can recognize it easily. For example, the unmerged state may be marked with 0 and the merged state with 1. Of course, the marking manner is not limited to the above example; those skilled in the art may make other modifications in light of the technical spirit of the embodiments of the present specification, and such modifications fall within the scope of the embodiments of the present specification as long as the functions and effects achieved are the same as or similar to those of the embodiments of the present specification.
In this embodiment, the partitions identified as unmerged may be used as target partitions. There may be one or more target partitions, and whether small file merging is performed on the target partitions in parallel or in series may be decided according to the amount of resources currently available. This may be determined according to actual situations and is not limited in this embodiment of the present specification.
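One possible way of choosing between serial and parallel processing is sketched below. The `available_workers` parameter is an assumed stand-in for whatever resource measure a deployment uses, and `merge_partition` is only a placeholder for the real per-partition merge.

```python
from concurrent.futures import ThreadPoolExecutor


def merge_partition(partition: str) -> str:
    """Placeholder for the per-partition merge; it only reports the name."""
    return f"merged {partition}"


def merge_all(partitions: list[str], available_workers: int) -> list[str]:
    """Run serially when only one worker is available, otherwise in parallel."""
    if available_workers <= 1:
        return [merge_partition(p) for p in partitions]
    with ThreadPoolExecutor(max_workers=available_workers) as pool:
        # pool.map returns results in the order of the input partitions.
        return list(pool.map(merge_partition, partitions))


results = merge_all(["2020-01-01", "2020-01-02"], available_workers=2)
```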
In this embodiment, the files in the target partition may be screened according to their sizes. Since files whose size is smaller than or equal to the preset threshold are regarded as small files, these files are taken as the files to be merged. In order to merge files efficiently, the file merging operation is performed only when the number of files to be merged is 2 or more. The path of the file may be: Of course, the file path is not limited to the above example; those skilled in the art may make other modifications in light of the technical spirit of the embodiments of the present specification, and such modifications fall within the scope of the embodiments of the present specification as long as the functions and effects achieved are the same as or similar to those of the embodiments of the present specification.
In this embodiment, the files can be screened for the target partition that is not merged, and the file to be merged with the file size smaller than or equal to the preset threshold value is screened out, so that the information set of the file to be merged can be accurately obtained, and a data basis is provided for subsequent file merging.
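The screening step can be sketched as a simple filter. The 64M threshold is the example value given earlier for a 128M BLOCK; the function name and the sample paths are assumptions for illustration.

```python
MB = 1024 * 1024
SMALL_FILE_THRESHOLD = 64 * MB  # example threshold from the text; adjustable


def select_files_to_merge(files, threshold=SMALL_FILE_THRESHOLD):
    """Keep (path, size) pairs whose size is at most threshold.

    Returns an empty list when fewer than two files qualify, because the
    merge is only performed for two or more files to be merged.
    """
    small = [(path, size) for path, size in files if size <= threshold]
    return small if len(small) >= 2 else []


files = [
    ("/loan/2020-01-01/part-0", 5 * MB),
    ("/loan/2020-01-01/part-1", 64 * MB),   # equal to the threshold, so included
    ("/loan/2020-01-01/part-2", 120 * MB),  # above the threshold, so skipped
]
to_merge = select_files_to_merge(files)
```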
In one embodiment, after writing the target set of files to the target partition, the method further comprises: and updating the merging state of the target partition into merged state. Therefore, the merging state of the target partition can be updated in time, the repeated file merging operation of the target partition is effectively avoided, and the resource waste is avoided.
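A minimal sketch of the state bookkeeping follows, again with an in-memory SQLite table standing in for the database mentioned above. The table name `partition_state` and its columns are hypothetical; the values 0 and 1 follow the marking example in the text.

```python
import sqlite3

UNMERGED, MERGED = 0, 1  # marker values taken from the example in the text

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE partition_state (partition_name TEXT PRIMARY KEY, merge_state INTEGER)"
)
conn.executemany(
    "INSERT INTO partition_state VALUES (?, ?)",
    [("2020-01-01", UNMERGED), ("2020-01-02", MERGED)],
)


def unmerged_partitions(connection):
    """Return the names of all partitions still marked as unmerged."""
    rows = connection.execute(
        "SELECT partition_name FROM partition_state "
        "WHERE merge_state = ? ORDER BY partition_name",
        (UNMERGED,),
    )
    return [name for (name,) in rows]


def mark_merged(connection, partition_name):
    """Update the merging state of one partition to merged."""
    connection.execute(
        "UPDATE partition_state SET merge_state = ? WHERE partition_name = ?",
        (MERGED, partition_name),
    )


targets = unmerged_partitions(conn)
for name in targets:
    # The actual merge of the partition would run here before the update.
    mark_merged(conn, name)
remaining = unmerged_partitions(conn)
```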
In an embodiment, after obtaining the information set of the file to be merged of the target partition, the method may further include: and backing up the files to be merged to a target directory according to the path of each file to be merged.
In this embodiment, in order to avoid that data cannot be retrieved when data is lost in the process of merging files, the file to be merged in the target partition may be backed up in advance. In some embodiments, the whole folder of the target partition may also be directly backed up, so as to ensure the integrity of the data, which may be determined according to actual situations, and this is not limited in this specification.
In this embodiment, when a backup is made, the file may be stored in its original storage format, or it may be compressed before being stored. For example, the backup may be stored in the ORC (Optimized Row Columnar) file format, which is self-describing and compresses the data in the file as much as possible to reduce the consumption of storage space. The target directory may be a newly created directory or an existing backup directory in the system, which may be determined according to actual situations and is not limited in this embodiment of the present specification.
In the embodiment, the file to be merged of the target partition can be backed up in advance, so that the situation that the data cannot be retrieved when the data is lost in the process of merging the files is effectively avoided, and the integrity of the data is ensured.
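The backup step can be sketched with ordinary file copies. The example below works on a temporary local directory rather than on HDFS, which is an assumption made so that it can run anywhere; the function and directory names are hypothetical.

```python
import os
import shutil
import tempfile


def backup_files(paths, backup_dir):
    """Copy each file into backup_dir in its original format; return the backup paths."""
    os.makedirs(backup_dir, exist_ok=True)
    backups = []
    for path in paths:
        destination = os.path.join(backup_dir, os.path.basename(path))
        shutil.copy2(path, destination)
        backups.append(destination)
    return backups


# Demonstration with temporary local files standing in for a partition.
workdir = tempfile.mkdtemp()
partition_dir = os.path.join(workdir, "2020-01-01")
os.makedirs(partition_dir)
sources = []
for i in range(2):
    source_path = os.path.join(partition_dir, f"part-{i}")
    with open(source_path, "w", encoding="utf-8") as handle:
        handle.write(f"row {i}\n")
    sources.append(source_path)

backups = backup_files(sources, os.path.join(workdir, "backup"))
```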
In one embodiment, after writing the target set of files to the target partition, the method may further include: and deleting the backups of the files to be merged under the target directory under the condition that the total size of the files in the target file set is determined to be equal to the total size of the files to be merged. Further, the file name of each file to be merged may be determined according to the path of each file to be merged, and the plurality of files to be merged in the target partition may be deleted based on the file name of each file to be merged.
In this embodiment, whether file merging is successfully completed or not may be verified by determining whether the total size of the files in the target file set is equal to the total size of the files to be merged, and the backup in the target directory and the original files to be merged in the target partition may be deleted if the determination is successful, so that data loss after file merging may be effectively avoided, file duplicate storage may be avoided, and a storage space is saved.
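The check-then-delete logic can be sketched as follows. The example runs on temporary local files and compares raw byte sizes, so it assumes a storage format in which merging preserves the total size; all names are hypothetical, and the files are addressed by full path rather than by file name.

```python
import os
import tempfile


def total_size(paths):
    """Sum of the sizes, in bytes, of the given files."""
    return sum(os.path.getsize(p) for p in paths)


def cleanup_after_merge(merged_paths, original_paths, backup_paths):
    """Delete backups and originals only if the total sizes match.

    Returns True when the check passed and the deletions were carried out.
    """
    if total_size(merged_paths) != total_size(original_paths):
        return False  # keep everything so that the data can be recovered
    for path in list(backup_paths) + list(original_paths):
        os.remove(path)
    return True


def write_bytes(path, data):
    """Helper for the demonstration: write data and return the path."""
    with open(path, "wb") as handle:
        handle.write(data)
    return path


workdir = tempfile.mkdtemp()
originals = [write_bytes(os.path.join(workdir, f"part-{i}"), b"row\n") for i in range(3)]
backups = [write_bytes(os.path.join(workdir, f"backup-{i}"), b"row\n") for i in range(3)]
merged = [write_bytes(os.path.join(workdir, "merged-0"), b"row\n" * 3)]
ok = cleanup_after_merge(merged, originals, backups)
```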
In an embodiment, determining the number of merging partitions according to the size of each file to be merged may include: determining the total size of the files to be merged according to the size of each file to be merged, and obtaining a preset block size. Further, the integer part of the quotient obtained by dividing the total size of the files to be merged by the preset block size, plus 1, may be used as the number of merging partitions.
In this embodiment, the data in the HDFS is stored in the DataNodes in the form of blocks (BLOCK). The preset block size defaults to 128M and may be adjusted according to actual requirements, which is not limited in this embodiment of the present specification.
In this embodiment, the number of merging partitions may be calculated according to the following formula: number of merging partitions = (total size of the plurality of files to be merged / preset block size) + 1, where the quotient (total size of the plurality of files to be merged / preset block size) is truncated to its integer part. In some embodiments, the number of merging partitions may be calculated using a structured query language, for example: sumLen/(1024 × 128) + 1 as num, where num is the number of merging partitions and sumLen is the total size of the plurality of files to be merged. Here the divisor 1024 × 128 corresponds to a 128M block with sumLen expressed in KB.
In this embodiment, the number of merging partitions may be determined by combining the total size of the plurality of files to be merged with the preset block size, so that the determined number of merging partitions is more reasonable.
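The calculation can be written out directly. The sketch below assumes sizes in bytes and the default 128M block; the function name is hypothetical.

```python
MB = 1024 * 1024


def merge_partition_count(file_sizes, block_size=128 * MB):
    """Integer part of (total size / block size), plus 1."""
    return sum(file_sizes) // block_size + 1


# Three small files totalling 200M: 200 // 128 = 1, plus 1 gives 2.
print(merge_partition_count([50 * MB, 70 * MB, 80 * MB]))  # prints 2
```

Note that when the total size is an exact multiple of the block size, this formula yields one partition more than the quotient; for example, 256M with 128M blocks gives 3.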
In an embodiment, merging the plurality of files to be merged by using a distributed stream data stream engine, based on the path of each file to be merged and the number of merging partitions, to obtain a target file set may include: designating at least one distributed stream data stream engine partition, where the number of designated distributed stream data stream engine partitions is equal to the number of merging partitions. Further, according to the path of each file to be merged, an API operator of the distributed stream data stream engine may be used to read the files to be merged and register them as a temporary table in memory. Based on the temporary table, the plurality of files to be merged are then written into the at least one distributed stream data stream engine partition by using a structured query statement, to obtain the target file set.
In this embodiment, Flink (the distributed stream data stream engine) provides 3 different APIs (application program interfaces) at different levels of abstraction. Each API strikes a different balance between conciseness and expressiveness and targets different application scenarios, so the corresponding API operator can be called according to actual requirements. The partitions of the distributed stream data stream engine are the partitions used for merging, and each partition yields one merged file after processing. A distributed stream data stream engine partition and a target partition are two different concepts: the former is a resource for data processing, and the latter is a unit of data storage.
In the embodiment, the file merging can be performed by using the distributed stream data flow engine, so that the efficiency of file merging is effectively improved.
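To make the data flow concrete, the sketch below imitates the merge in plain Python on local text files. It is not Flink code and uses none of Flink's APIs: the in-memory list stands in for the temporary table, and the output files stand in for the engine partitions. Rows are distributed round-robin for simplicity, which is an assumption that differs from filling each merged file up to the BLOCK size. All names are hypothetical.

```python
import os
import tempfile


def merge_files(paths, partition_count, output_dir):
    """Read all rows from paths and write them into partition_count output files."""
    os.makedirs(output_dir, exist_ok=True)
    rows = []
    for path in paths:  # stands in for reading the files into a temporary table
        with open(path, "r", encoding="utf-8") as source:
            rows.extend(source.readlines())
    outputs = [os.path.join(output_dir, f"merged-{i}") for i in range(partition_count)]
    handles = [open(p, "w", encoding="utf-8") for p in outputs]
    try:
        for index, row in enumerate(rows):  # round-robin over the partitions
            handles[index % partition_count].write(row)
    finally:
        for handle in handles:
            handle.close()
    return outputs


# Demonstration: five small files merged into two target files.
workdir = tempfile.mkdtemp()
inputs = []
for i in range(5):
    input_path = os.path.join(workdir, f"small-{i}.txt")
    with open(input_path, "w", encoding="utf-8") as handle:
        handle.write(f"record {i}\n")
    inputs.append(input_path)

outputs = merge_files(inputs, partition_count=2, output_dir=os.path.join(workdir, "merged"))
```

The number of output files equals the number of merging partitions, which matches the property required of the target file set.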
Based on the same inventive concept, the embodiment of the present specification further provides a file merging device, as described in the following embodiments. Because the principle by which the file merging device solves the problem is similar to that of the file merging method, the implementation of the file merging device can refer to the implementation of the file merging method, and repeated parts are not described again. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated. FIG. 2 is a block diagram of the structure of a file merging device according to an embodiment of the present specification. As shown in FIG. 2, the file merging device may include an obtaining module 201, a determining module 202, a merging module 203, and a processing module 204, whose structure is described below.
An obtaining module 201, configured to obtain an information set of a file to be merged of a target partition; the information set of the files to be merged comprises the sizes and paths of a plurality of files to be merged, and the size of the files to be merged is smaller than or equal to a preset threshold value;
the determining module 202 may be configured to determine the number of merging partitions according to the size of each file to be merged;
the merging module 203 may be configured to merge the multiple files to be merged by using a distributed stream data stream engine based on the path of each file to be merged and the number of merging partitions, so as to obtain a target file set; wherein the number of files contained in the target file set is equal to the number of the merged partitions;
the processing module 204 may be configured to write the target set of files to the target partition.
In one embodiment, the obtaining module 201 may include: a first determination unit configured to determine a target partition; wherein the merging state of the target partition is not merged; the acquisition unit is used for acquiring the size and the path of each file in the target partition; the first processing unit is used for taking the file with the file size smaller than or equal to the preset threshold value as a file to be merged according to the size of each file in the target partition; and the generating unit is used for generating the information set of the files to be merged based on the size and the path of each file to be merged.
In one embodiment, the file merging apparatus may further include: and the backup unit is used for backing up the files to be merged to a target directory according to the paths of the files to be merged.
In one embodiment, the file merging apparatus may further include: a first deleting unit, configured to delete the backups of the multiple files to be merged in the target directory if it is determined that the total size of the files in the target file set is equal to the total size of the multiple files to be merged; a second determining unit, configured to determine, according to the path of each file to be merged, a file name of each file to be merged; and the second deleting unit is used for deleting the files to be merged in the target partition based on the file names of the files to be merged.
In one embodiment, the merging module 203 may include: a specifying unit, configured to specify at least one distributed stream data flow engine partition, wherein the number of specified distributed stream data flow engine partitions is equal to the number of merging partitions; a second processing unit, configured to read the plurality of files to be merged and register them as a temporary table in memory by using an API operator of the distributed stream data stream engine, according to the path of each file to be merged; and a writing unit, configured to write the plurality of files to be merged into the at least one distributed stream data flow engine partition by using a structured query statement based on the temporary table, so as to obtain the target file set.
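By way of a non-limiting illustration, the effect of this embodiment (records from many small files gathered into one in-memory table and then written out as exactly as many outputs as there are merging partitions) can be imitated in plain Python. This is a stand-in only: it does not use the engine's API operators or a structured query statement, the round-robin distribution is an assumption, and all names and paths are hypothetical.

```python
from typing import Dict, List


def merge_into_partitions(file_records: Dict[str, List[str]],
                          partition_count: int) -> List[List[str]]:
    """Gather all records, then spread them over partition_count outputs."""
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    # Stand-in for the in-memory temporary table: every record of every file.
    temp_table = [record
                  for path in sorted(file_records)
                  for record in file_records[path]]
    # One output list per partition; each list stands for one target file.
    outputs: List[List[str]] = [[] for _ in range(partition_count)]
    for index, record in enumerate(temp_table):
        outputs[index % partition_count].append(record)  # round-robin (assumed)
    return outputs


# Hypothetical input: three small files keyed by path.
small_files = {
    "/warehouse/t/dt=2021-02-07/part-0001": ["r1", "r2"],
    "/warehouse/t/dt=2021-02-07/part-0002": ["r3"],
    "/warehouse/t/dt=2021-02-07/part-0003": ["r4", "r5"],
}
target_file_set = merge_into_partitions(small_files, partition_count=2)
```

The number of output lists always equals `partition_count`, mirroring the requirement that the number of files in the target file set equal the number of merging partitions.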
An embodiment of the present specification further provides an electronic device. Fig. 3 is a schematic structural diagram of an electronic device based on the file merging method provided by the embodiments of the present specification; the electronic device may specifically include an input device 31, a processor 32, and a memory 33. The input device 31 may specifically be configured to input an information set of files to be merged of a target partition; the information set of the files to be merged comprises the sizes and paths of a plurality of files to be merged, and the size of each file to be merged is smaller than or equal to a preset threshold value. The processor 32 may be specifically configured to: determine the number of merging partitions according to the size of each file to be merged; merge the plurality of files to be merged by using a distributed stream data stream engine based on the path of each file to be merged and the number of merging partitions to obtain a target file set, wherein the number of files contained in the target file set is equal to the number of merging partitions; and write the target file set into the target partition. The memory 33 may be specifically configured to store parameters such as the number of merging partitions.
In this embodiment, the input device may be one of the main apparatuses for exchanging information between a user and a computer system. The input device may include a keyboard, a mouse, a camera, a scanner, a light pen, a handwriting input board, a voice input device, and the like, and is used to input raw data and the programs that process the data into the computer. The input device can also acquire and receive data transmitted by other modules, units and devices. The processor may be implemented in any suitable way. For example, the processor may take the form of a microprocessor or a processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The memory may in particular be a memory device used in modern information technology for storing information. The memory may include multiple levels: in a digital system, anything that can store binary data may serve as a memory; in an integrated circuit, a circuit that has a storage function but no physical form of its own, such as a RAM or a FIFO, is also called a memory; in a system, a storage device in physical form, such as a memory bank or a TF card, is also called a memory.
In this embodiment, the functions and effects specifically realized by the electronic device can be explained by comparing with other embodiments, and are not described herein again.
Embodiments of the present specification further provide a computer storage medium based on the file merging method, where the computer storage medium stores computer program instructions that, when executed, implement: acquiring an information set of files to be merged of a target partition, wherein the information set of the files to be merged comprises the sizes and paths of a plurality of files to be merged, and the size of each file to be merged is smaller than or equal to a preset threshold value; determining the number of merging partitions according to the size of each file to be merged; merging the plurality of files to be merged by using a distributed stream data stream engine based on the path of each file to be merged and the number of merging partitions to obtain a target file set, wherein the number of files contained in the target file set is equal to the number of merging partitions; and writing the target file set into the target partition.
In this embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk Drive (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects specifically realized by the program instructions stored in the computer storage medium can be explained by comparing with other embodiments, and are not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present specification described above may be implemented by a general-purpose computing device, and may be centralized on a single computing device or distributed over a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that described herein, or the modules or steps may be separately fabricated into individual integrated circuit modules, or several of them may be fabricated into a single integrated circuit module. Thus, embodiments of the present description are not limited to any specific combination of hardware and software.
Although the embodiments herein provide the method steps as described in the above embodiments or flowcharts, more or fewer steps may be included in the method based on conventional or non-inventive efforts. In the case of steps where no causal relationship is logically necessary, the order of execution of the steps is not limited to that provided by the embodiments of the present description. When the method is executed in an actual device or end product, the method can be executed sequentially or in parallel according to the embodiment or the method shown in the figure (for example, in the environment of a parallel processor or a multi-thread processing).
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of embodiments of the present specification should, therefore, be determined not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
The above description is only a preferred embodiment of the embodiments of the present disclosure, and is not intended to limit the embodiments of the present disclosure, and it will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the embodiments of the present disclosure should be included in the protection scope of the embodiments of the present disclosure.

Claims (10)

1. A method for merging files, comprising:
acquiring an information set of files to be merged of a target partition; the information set of the files to be merged comprises the sizes and paths of a plurality of files to be merged, and the size of the files to be merged is smaller than or equal to a preset threshold value;
determining the number of merging partitions according to the size of each file to be merged;
merging the files to be merged by using a distributed stream data stream engine based on the path of each file to be merged and the merging partition number to obtain a target file set; wherein the number of files contained in the target file set is equal to the number of the merged partitions;
and writing the target file set into the target partition.
2. The method of claim 1, wherein obtaining the information set of the file to be merged of the target partition comprises:
determining a target partition; wherein the merging state of the target partition is not merged;
acquiring the size and the path of each file in the target partition;
taking the file with the file size smaller than or equal to the preset threshold value as a file to be merged according to the size of each file in the target partition;
and generating the information set of the files to be merged based on the size and the path of each file to be merged.
3. The method of claim 2, further comprising, after writing the target file set to the target partition: updating the merging state of the target partition to merged.
4. The method according to claim 1, further comprising, after obtaining the information set of the file to be merged of the target partition:
and backing up the files to be merged to a target directory according to the path of each file to be merged.
5. The method of claim 4, further comprising, after writing the target set of files to the target partition:
deleting the backups of the files to be merged under the target directory under the condition that the total size of the files in the target file set is determined to be equal to the total size of the files to be merged;
determining the file name of each file to be merged according to the path of each file to be merged;
deleting the plurality of files to be merged in the target partition based on the file names of the files to be merged.
6. The method according to claim 1, wherein determining the number of merged partitions according to the size of each file to be merged comprises:
determining the total size of the files to be merged according to the size of each file to be merged;
acquiring a preset block size;
and taking, as the number of merging partitions, the integer part of the quotient obtained by dividing the total size of the files to be merged by the preset block size, plus 1.
7. The method of claim 1, wherein merging the plurality of files to be merged using a distributed stream data stream engine based on the path of each file to be merged and the number of merging partitions to obtain a target file set comprises:
designating at least one distributed stream data flow engine partition; wherein the number of designated distributed stream data flow engine partitions is equal to the number of merged partitions;
reading and registering the plurality of files to be merged as a temporary table in a memory by using an API operator of a distributed stream data stream engine according to the path of each file to be merged;
and writing the plurality of files to be merged into the at least one distributed stream data flow engine partition by using a structured query statement based on the temporary table to obtain the target file set.
8. A file merging apparatus, comprising:
the acquisition module is used for acquiring the information set of the files to be merged of the target partition; the information set of the files to be merged comprises the sizes and paths of a plurality of files to be merged, and the size of the files to be merged is smaller than or equal to a preset threshold value;
the determining module is used for determining the number of the merging partitions according to the size of each file to be merged;
the merging module is used for merging the files to be merged by utilizing a distributed stream data stream engine based on the path of each file to be merged and the number of the merging partitions to obtain a target file set; wherein the number of files contained in the target file set is equal to the number of the merged partitions;
and the processing module is used for writing the target file set into the target partition.
9. A file merging device comprising a processor and a memory for storing processor-executable instructions, the processor implementing the steps of the method of any one of claims 1 to 7 when executing the instructions.
10. A computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 7.
CN202110174031.9A 2021-02-07 2021-02-07 File merging method, device and equipment Pending CN112965939A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110174031.9A CN112965939A (en) 2021-02-07 2021-02-07 File merging method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110174031.9A CN112965939A (en) 2021-02-07 2021-02-07 File merging method, device and equipment

Publications (1)

Publication Number Publication Date
CN112965939A true CN112965939A (en) 2021-06-15

Family

ID=76284259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110174031.9A Pending CN112965939A (en) 2021-02-07 2021-02-07 File merging method, device and equipment

Country Status (1)

Country Link
CN (1) CN112965939A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117632860A (en) * 2024-01-25 2024-03-01 云粒智慧科技有限公司 Method and device for merging small files based on Flink engine and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109947712A (en) * 2019-03-08 2019-06-28 北京京东尚科信息技术有限公司 Automatically merge method, system, equipment and the medium of file in Computational frame
CN110321329A (en) * 2019-06-18 2019-10-11 中盈优创资讯科技有限公司 Data processing method and device based on big data
CN111488323A (en) * 2020-04-14 2020-08-04 中国农业银行股份有限公司 Data processing method and device and electronic equipment
CN112231293A (en) * 2020-09-14 2021-01-15 杭州数梦工场科技有限公司 File reading method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109254733B (en) Method, device and system for storing data
US11169978B2 (en) Distributed pipeline optimization for data preparation
US10558615B2 (en) Atomic incremental load for map-reduce systems on append-only file systems
US10891264B2 (en) Distributed, scalable key-value store
US11461304B2 (en) Signature-based cache optimization for data preparation
DE102016013248A1 (en) Reference block accumulation in a reference quantity for deduplication in storage management
CN111324610A (en) Data synchronization method and device
CN107665219B (en) Log management method and device
US10642815B2 (en) Step editor for data preparation
EP3362808B1 (en) Cache optimization for data preparation
CN112965939A (en) File merging method, device and equipment
CN115114370B (en) Master-slave database synchronization method and device, electronic equipment and storage medium
US11803525B2 (en) Selection and movement of data between nodes of a distributed storage system
CN114297196A (en) Metadata storage method and device, electronic equipment and storage medium
CN114490509A (en) Tracking change data capture log history
US10706012B2 (en) File creation
US11288447B2 (en) Step editor for data preparation
US11755538B2 (en) Distributed management of file modification-time field
US20220335030A1 (en) Cache optimization for data preparation
CN117891796A (en) HDFS mass small file storage method suitable for multi-read-less-write scene
WO2017007511A1 (en) Data management using index change events
CN116301597A (en) Data storage method, device, equipment and storage medium
CN117194337A (en) Method, device, computer equipment and storage medium for selecting new source file
CN115994148A (en) Multi-table data updating method and device, electronic equipment and readable storage medium
CN116303259A (en) Method, device, equipment and medium for avoiding data repetition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination