CN114661668A - File management method and related device - Google Patents


Info

Publication number
CN114661668A
CN114661668A CN202210270375.4A CN202210270375A CN114661668A CN 114661668 A CN114661668 A CN 114661668A CN 202210270375 A CN202210270375 A CN 202210270375A CN 114661668 A CN114661668 A CN 114661668A
Authority
CN
China
Prior art keywords
file
directory
file set
task
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210270375.4A
Other languages
Chinese (zh)
Inventor
申鹏
付庆午
黄伟益
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huantai Technology Co Ltd
Original Assignee
Shenzhen Huantai Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huantai Technology Co Ltd
Priority to CN202210270375.4A
Publication of CN114661668A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1727 Details of free space management performed by the file system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a file management method and a related device. The method comprises: starting, by a driving end, a main thread to traverse a temporary folder generated by an execution end to obtain a task directory; starting a first sub Job to concurrently traverse the task directory to obtain a first file set; performing a filtering operation on the first file set to obtain a second file set; and starting a second sub Job to concurrently execute a merging operation on the second file set to obtain a third file set, deleting the second file set, and creating a _SUCCESS marker file after the target files are moved. With this scheme, the number of files in the storage system can be reduced adaptively, which facilitates data life cycle management; in addition, the probability of an OOM at the driving end is reduced, and the stability and execution efficiency of Spark tasks are improved.

Description

File management method and related device
Technical Field
The present application relates to the field of electronic technologies, and in particular, to a file management method and a related apparatus.
Background
In existing file management schemes, a scheduling tool or system is generally used to configure a periodic scheduled task that merges files, so as to manage the data files.
However, this approach requires additional tools or systems as well as manual operation, which harms system stability and performance. Moreover, the file merging and deleting process is isolated from the state in which users access the files, so data can become inconsistent and the user experience suffers. A file management method is therefore needed to address these problems.
Disclosure of Invention
The embodiments of the present application provide a file management method and a related device. Small data files that need to be merged are determined by screening a data file set, and are then grouped and merged according to partition information. This ensures that the small data files occupy contiguous storage space in the storage system, adaptively reduces the number of files in the storage system, and facilitates data life cycle management. In addition, the merged data files improve the memory utilization of the driving end, reduce the probability of an OOM at the driving end, and improve the stability and execution efficiency of Spark tasks.
In a first aspect, an embodiment of the present application provides a file management method, which is applied to a driving end of a file management system, and the method includes:
starting a main thread to traverse the Job temporary folder to obtain a task directory, wherein the task directory is generated by executing task submission operation by the execution end;
starting a first sub Job and traversing the task directory to obtain a first file set, wherein the first file set comprises data files to be written;
performing filtering operation on the first file set to obtain a second file set, wherein the second file set comprises a target file to be merged;
starting a second sub Job to concurrently execute a merging operation on the second file set to obtain a third file set;
and deleting the second file set after the third file set is sequentially moved to a target directory.
In a second aspect, an embodiment of the present application provides a file management method, which is applied to an execution end of a file management system, and the method includes:
and executing task submitting operation on the temporary folder to generate a task directory.
In a third aspect, an embodiment of the present application provides a file management method, which is applied to a driving end of a file management system, and the method includes:
starting a main thread to traverse a small file merging working directory under a user directory to obtain a clone directory, wherein the clone directory is generated by an execution end by copying a task directory, and the task directory is generated by the execution end executing task submitting operation;
starting a third sub Job and traversing the clone directory to obtain a fourth file set, wherein the fourth file set comprises clone files of the data files to be written;
analyzing the fourth file set to obtain a fifth file set, wherein the fifth file set comprises real files of target data files to be merged;
starting a fourth sub Job to concurrently execute a merging operation on the fifth file set to obtain a sixth file set;
and deleting the fifth file set after the sixth file set is sequentially moved to a target directory.
In a fourth aspect, an embodiment of the present application provides a file management method, which is applied to an execution end of a file management system, and the method includes:
executing the writing operation of the data file to generate a temporary folder;
executing task submitting operation on the temporary folder to generate a task directory;
and copying the task directory to obtain a clone directory.
In a fifth aspect, an embodiment of the present application provides a file management apparatus, applied to a driving end of a file management system, where the apparatus includes:
the acquisition unit is used for starting a main thread to traverse the Job temporary directory and acquire a task directory generated by the execution end executing the task submitting operation;
the acquisition unit is further configured to start, by the main thread, a first sub Job to traverse the task directory and acquire a first file set;
the traversing unit is used for traversing the first file set to obtain a second file set, and the second file set comprises one or more files to be merged;
a merging unit, configured to start a second sub Job by the main thread to perform merging operation on the one or more files to be merged to obtain a third file set;
and the processing unit is used for deleting the second file set after the third file set is moved to the target directory.
In a sixth aspect, an embodiment of the present application provides a file management apparatus, which is applied to an execution end of a file management system, and the apparatus includes:
and the submitting unit is used for executing task submitting operation on the temporary folder and generating a task directory.
In a seventh aspect, an embodiment of the present application provides a file management apparatus, applied to a driving end of a file management system, where the apparatus includes:
the starting unit is used for starting a main thread to traverse the small file merging working directory under a user directory to obtain a clone directory, wherein the clone directory is generated by an execution end by copying a task directory, and the task directory is generated by the execution end executing a task submitting operation;
the starting unit is further configured to start a third sub Job and traverse the clone directory to obtain a fourth file set, where the fourth file set includes clone files of the data files to be written;
the analysis unit is used for analyzing the fourth file set to obtain a fifth file set, where the fifth file set includes the real files of the target data files to be merged;
the starting unit is further configured to start a fourth sub Job to concurrently execute a merging operation on the fifth file set to obtain a sixth file set;
and the processing unit is used for deleting the fifth file set after the sixth file set is sequentially moved to the target directory.
In an eighth aspect, an embodiment of the present application provides a file management apparatus, which is applied to an execution end of a file management system, and the apparatus includes:
a writing unit for performing a writing operation of the data file to generate a temporary folder;
the submitting unit is used for executing task submitting operation on the temporary folder and generating a task directory;
and the replication unit is used for obtaining the clone directory by copying the task directory.
In a ninth aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more executable program codes, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the executable program codes include instructions for performing any of the steps in the first aspect or the second aspect of the embodiment of the present application.
In a tenth aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more executable program codes, where the one or more executable program codes are stored in the memory and configured to be executed by the processor, and the program includes instructions for performing any step of the third aspect or the fourth aspect of the embodiment of the present application.
In an eleventh aspect, the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program makes a computer perform some or all of the steps described in the first aspect or the second aspect of the present application.
In a twelfth aspect, the present invention provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program makes a computer perform some or all of the steps described in the third or fourth aspect of the embodiments of the present application.
In a thirteenth aspect, the present application provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform some or all of the steps described in any of the methods of the first or second aspects of the embodiments of the present application. The computer program product may be a software installation package.
In a fourteenth aspect, the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform some or all of the steps of the method according to any one of the third aspect and the fourth aspect of the present application. The computer program product may be a software installation package.
The embodiments of the present application disclose a file management method and a related device. In the method, the driving end starts a main thread to traverse the temporary folder generated by the execution end and obtain a task directory; a first sub Job is started to concurrently traverse the task directory to obtain a first file set; a filtering operation is performed on the first file set to obtain a second file set; and a second sub Job is started to concurrently execute a merging operation on the second file set to obtain a third file set, after which the second file set is deleted and a marker file is created for the moved files. With this scheme, the number of files in the storage system can be reduced adaptively, which facilitates data life cycle management; in addition, the probability of an OOM at the driving end is reduced, and the stability and execution efficiency of Spark tasks are improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1A is a schematic diagram of a commonly used file merging process according to an embodiment of the present application;
FIG. 1B is a schematic flowchart of a Hadoop file merging method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a file management method according to an embodiment of the present application;
FIG. 3 is a schematic interaction flow diagram of a file management method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of another file management method provided in the embodiments of the present application;
FIG. 5 is a block diagram illustrating functional units of a file management apparatus according to an embodiment of the present application;
FIG. 6 is a block diagram illustrating functional units of another file management apparatus according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
1) The electronic device may be a portable electronic device that also contains other functionality, such as personal digital assistant and/or music player functionality, for example a cell phone, a tablet computer, or a wearable electronic device with wireless communication capabilities (e.g., a smart watch). Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running iOS, Android, a Microsoft operating system, or another operating system. The portable electronic device may also be another kind of portable device, such as a laptop computer. It should also be understood that in other embodiments the electronic device may not be portable, but may instead be a desktop computer.
2) A Distributed File System (DFS) is a file system in which the physical storage resources it manages are not necessarily attached directly to the local node, but are connected to nodes (which can be thought of simply as computers) through a computer network; alternatively, it is a complete hierarchical file system formed by combining several different logical disk partitions or volumes. A DFS provides a logical tree-structured file system for resources distributed anywhere on the network, making it easier for users to access shared files distributed across the network. An individual DFS shared folder serves as an access point to other shared folders on the network.
3) Out of memory (OOM), also called memory overflow, means that an application holds memory that cannot be reclaimed or uses too much memory, so that the memory needed to run the program eventually exceeds the maximum that can be provided. The program then cannot run and the system reports a memory overflow; sometimes the software closes automatically. Restarting the computer or the software releases part of the memory so that the software can run normally again, but a memory overflow caused by system configuration, data flow, user code, and similar factors cannot be avoided simply by re-executing the task.
4) Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is a general-purpose parallel framework similar to Hadoop MapReduce, open-sourced by the UC Berkeley AMPLab (the AMP laboratory at the University of California, Berkeley). Spark is a very popular computing engine in the big data field. The number of files Spark produces depends mainly on the number of RDD partitions and the number of table partitions. Note that these two kinds of partition are different concepts: RDD partitions relate to the degree of concurrency of Spark task execution, whereas table partitions are a Hive concept. The number of files generated by a Spark program is generally the product of the number of RDD partitions and the number of Hive table partitions. When the Spark task parallelism is too high or the number of table partitions is too large, too many small files are easily produced, which causes a series of further problems. Spark has the advantages of Hadoop MapReduce; unlike MapReduce, however, the intermediate output of a Job can be kept in memory, so HDFS reads and writes are no longer needed for it, which makes Spark better suited to iterative algorithms such as those used in data mining and machine learning.
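The file-count relationship stated above can be illustrated with a small calculation. This is only a rough sketch: the function name and the example figures are hypothetical, and the product should be read as the general estimate the text describes rather than an exact count for any particular job.

```python
def estimated_output_files(rdd_partitions: int, table_partitions: int) -> int:
    """Rough estimate: each RDD partition may write one file per table partition."""
    if rdd_partitions < 0 or table_partitions < 0:
        raise ValueError("partition counts must be non-negative")
    return rdd_partitions * table_partitions


# Hypothetical job: 200 RDD partitions writing into 30 Hive table partitions.
print(estimated_output_files(200, 30))  # 6000
```

Even moderate parallelism therefore multiplies quickly into thousands of files, which is the small-file problem the scheme targets.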
The Spark framework mainly comprises the following parts:
Application: a user-written Spark application.
Driving end (Driver): the Driver in Spark runs the main function of the Application and creates the SparkContext. The SparkContext is created to prepare the running environment of the Spark application; it is responsible for communicating with the Cluster Manager to apply for resources and to distribute and monitor tasks. After the Executor part has finished running, the Driver is also responsible for closing the SparkContext.
Execution end (Executor): a process running on a worker node (WorkerNode) that is responsible for running Tasks.
RDD: Resilient Distributed Dataset, an abstraction of distributed memory that provides a highly restricted shared-memory model.
DAG: directed acyclic graph, which reflects the dependency relationships between RDDs.
Task: a unit of work running on an Executor.
Job: a Job contains multiple RDDs and the various operations that act on them.
Stage: the basic scheduling unit of a Job. A Job is divided into several groups of Tasks; each group is called a Stage, or TaskSet, and represents a group of associated Tasks with no Shuffle dependency between them.
Cluster Manager: an external service that acquires resources on the cluster. There are currently three types:
1) Standalone: Spark's native resource management, in which the Master is responsible for allocating resources;
2) Apache Mesos: a resource scheduling framework with good compatibility with Hadoop MapReduce;
3) Hadoop YARN: mainly refers to the ResourceManager in YARN.
At present, the conventional method for merging small files is to merge the small data files in a specified directory with an offline Spark task. The main idea is shown in the flowchart of a commonly used file merging process in FIG. 1A, and the specific process includes:
A scheduling tool or system is used to configure a periodic scheduled task. After the original data files have been written and persisted to disk, a Spark task is started at the specified time to traverse and filter the specified directory and obtain the small data files to be merged. The merging operation is then performed and a large data file is generated in an intermediate directory. If the merge succeeds, the large data file generated by the merge is moved to the target directory and the original small data files are deleted. If the Spark task fails because of an exception during merging, the temporary data directory and files generated during merging must be cleaned up manually, and the Spark task must then be rescheduled and executed to merge the small data files. In order to better understand the technical solution of the present application, a detailed description is given below with reference to specific embodiments.
The above scheme requires an additional tool or system. Moreover, the file merging and deleting process is isolated from the state in which users access the files, which leads to data inconsistency. The scheme of the present application therefore builds on the conventional file merging scheme and implements the file merging operation within the Spark framework.
When Spark performs computation, the FileOutputCommitter V1 file commit algorithm of Hadoop is generally adopted. In brief, data is written to the final target directory through two rename operations. Referring to FIG. 1B, which is a schematic flowchart of a Hadoop file merging method, the process includes the following steps.
Each task writes data to the following path (temporary directory for parallel data writes):
tableDir/_temporary/appAttemptDir/_temporary/taskAttemptDir/dataFile
After each task finishes reading and writing data, it executes the commitTask method to perform the first rename operation, moving the data file from the task's temporary directory to the following path (task target directory):
tableDir/_temporary/appAttemptDir/taskDir/dataFile
When all tasks have executed the commitTask method to submit their data files, the driver end executes the commitJob method to perform the second rename operation, moving the data files in order from the temporary directory of the Spark Job to the following target path and generating a _SUCCESS flag file (final target directory):
tableDir/dataFile
tableDir/_SUCCESS
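The two rename operations described above can be sketched on a local file system as follows. This is a minimal illustration under stated assumptions, not Hadoop's implementation: the function names, the attempt and task identifiers, and the use of `os.rename` on local directories are all hypothetical stand-ins for the committer's behaviour on a distributed file system.

```python
import os
import shutil
import tempfile


def write_task_output(table_dir, app_attempt, task_attempt, name, data):
    """A task writes into its own attempt directory, so parallel writes do not collide."""
    attempt_dir = os.path.join(table_dir, "_temporary", app_attempt,
                               "_temporary", task_attempt)
    os.makedirs(attempt_dir, exist_ok=True)
    with open(os.path.join(attempt_dir, name), "w") as f:
        f.write(data)


def commit_task(table_dir, app_attempt, task_attempt, task_id):
    """First rename: task attempt directory -> committed task directory."""
    src = os.path.join(table_dir, "_temporary", app_attempt, "_temporary", task_attempt)
    dst = os.path.join(table_dir, "_temporary", app_attempt, task_id)
    os.rename(src, dst)


def commit_job(table_dir, app_attempt):
    """Second rename: committed files -> table directory, then write _SUCCESS."""
    job_tmp = os.path.join(table_dir, "_temporary", app_attempt)
    for task_id in sorted(os.listdir(job_tmp)):
        if task_id == "_temporary":  # leftover (now empty) attempt area
            continue
        task_dir = os.path.join(job_tmp, task_id)
        for name in sorted(os.listdir(task_dir)):
            os.rename(os.path.join(task_dir, name), os.path.join(table_dir, name))
    shutil.rmtree(os.path.join(table_dir, "_temporary"))
    with open(os.path.join(table_dir, "_SUCCESS"), "w"):
        pass  # empty marker file


def demo():
    with tempfile.TemporaryDirectory() as table_dir:
        for i in range(2):
            write_task_output(table_dir, "app_0", f"attempt_{i}",
                              f"part-{i:05d}", f"row{i}\n")
            commit_task(table_dir, "app_0", f"attempt_{i}", f"task_{i}")
        commit_job(table_dir, "app_0")
        return sorted(os.listdir(table_dir))


print(demo())  # ['_SUCCESS', 'part-00000', 'part-00001']
```

The committed task directories under the Job temporary folder are what the driving end traverses in the scheme described below, before the second rename takes place.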
Based on the Spark job execution principle under the Hadoop FileOutputCommitter V1 file commit algorithm summarized above, the specific process corresponding to FIG. 2 is as follows.
For a better understanding of the content of the solution of the present application, the following description of the solution will be made with reference to specific examples.
Referring to fig. 2, fig. 2 is a schematic flowchart of a file management method applied to a driver end of a file management system according to an embodiment of the present application, and as shown in fig. 2, the file management method in the present application specifically includes the following operation steps.
S201, starting a main thread to traverse the Job temporary folder to obtain a task directory.
The task directory is generated by the execution end executing a task submitting operation.
Specifically, the scheme of the present application is implemented mainly on the Spark architecture. The method can be applied to any distributed file system that is compatible with the Hadoop file commit protocol.
Specifically, when the execution end (Executor) writes data to the file system, a temporary folder for storing the data files is generated. One or more subfolders exist under the temporary folder, and each subfolder has a corresponding directory path.
Furthermore, after the Executor finishes the data reading operation, it executes the data submitting operation to store the data in the temporary directory of the execution end. The task submitting operation of the execution end is performed in parallel, so multiple data files can be submitted at the same time. During this parallel operation the execution end generates a task directory, which is used to represent the submission event.
S202, starting the first sub Job and traversing the task directory to obtain a first file set.
Specifically, the first file set includes data files to be written.
In particular, directories are another important concept in file systems. A "directory" or "folder" is a virtual container that holds a set of files and other directories. A typical file system may include thousands of directories, and multiple files may be stored in a directory for the purpose of storing the files in an organized manner. Also, additional directories (called subdirectories) may be maintained in a directory, with the directories and files forming a hierarchical structure (i.e., a directory tree).
Specifically, the driving end starts a main thread to traverse the task directory and obtain the data submission directory of each task. The data submission directory includes data file information and a data file path, where the data file information includes the size, name, and so on of the data file.
Specifically, the data files to be written in the first file set include, but are not limited to: large data files that can be stored directly in the file system, small data files on which a merging operation needs to be performed, and so on.
S203, filtering operation is carried out on the first file set to obtain a second file set.
Specifically, the second file set includes a target file to be merged.
Specifically, the scheme of the present application mainly performs the merging operation on small data files, so as to reduce problems such as the waste of resources caused by fragmented files occupying discontinuous disk space. Therefore, when the data files to be written are persisted to disk, they need to be filtered, and the target files that need to be merged are screened out to form the second file set.
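The filtering step can be sketched as a size-based selection. This passage does not state the actual filtering criterion, so the threshold value, the `DataFile` type, and the function name below are assumptions made purely for illustration.

```python
from typing import List, NamedTuple


class DataFile(NamedTuple):
    path: str
    size: int  # size in bytes


# Hypothetical cut-off; the source does not specify what counts as "small".
SMALL_FILE_THRESHOLD = 16 * 1024 * 1024


def filter_merge_candidates(first_file_set: List[DataFile],
                            threshold: int = SMALL_FILE_THRESHOLD) -> List[DataFile]:
    """Keep only the files small enough to be worth merging (the second file set)."""
    return [f for f in first_file_set if f.size < threshold]


first_file_set = [
    DataFile("task_0/part-00000", 4 * 1024),           # small: merge candidate
    DataFile("task_1/part-00001", 64 * 1024 * 1024),   # large: stored directly
]
print(filter_merge_candidates(first_file_set))
```

Files at or above the threshold stay out of the second file set and can be stored directly, matching the distinction drawn between large and small data files in the first file set.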
And S204, starting a second sub Job to concurrently execute a merging operation on the second file set to obtain a third file set.
Specifically, after the data files to be merged are selected in step S203, the driving end starts a merging sub Job and performs the merging operation on the filtering result, obtaining a merged third file set that contains at least one merged data file.
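A concurrent merge grouped by partition might look like the sketch below. It uses a local thread pool in place of a Spark sub Job and concatenates bytes, which only suits line-oriented text files; columnar formats would need a format-aware merge. The function names, the `(partition, path)` input shape, and the `merged-<partition>` naming are hypothetical.

```python
import os
import shutil
import tempfile
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def merge_group(partition, paths, out_dir):
    """Concatenate all files of one partition into a single merged file."""
    merged_path = os.path.join(out_dir, f"merged-{partition}")
    with open(merged_path, "wb") as out:
        for path in sorted(paths):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return merged_path


def merge_concurrently(second_file_set, out_dir, workers=4):
    """second_file_set: iterable of (partition, path) pairs; one merge per partition."""
    groups = defaultdict(list)
    for partition, path in second_file_set:
        groups[partition].append(path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(merge_group, partition, paths, out_dir)
                   for partition, paths in sorted(groups.items())]
        return [f.result() for f in futures]  # the third file set


def demo():
    with tempfile.TemporaryDirectory() as work:
        pairs = []
        for i, partition in enumerate(["p0", "p0", "p1"]):
            path = os.path.join(work, f"part-{i}")
            with open(path, "w") as f:
                f.write(f"row{i}\n")
            pairs.append((partition, path))
        out_dir = os.path.join(work, "merged")
        os.makedirs(out_dir)
        contents = []
        for merged_path in merge_concurrently(pairs, out_dir):
            with open(merged_path) as f:
                contents.append(f.read())
        return contents


print(demo())  # ['row0\nrow1\n', 'row2\n']
```

Grouping before merging keeps each merged file inside a single partition, so the merge does not mix data that belongs to different table partitions.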
S205, after the third file set is sequentially moved to the target directory, the second file set is deleted.
Specifically, after step S204 finishes, the merged third file set is moved to the target directory.
Further, the second file set is deleted to release the memory resources it occupies. At the same time, a marker file is created for the moved third file set, indicating that the current data files have been successfully stored in the target directory.
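The move, delete, and mark sequence of step S205 can be sketched as follows. The `finalize` function and its arguments are hypothetical, and local file operations stand in for the distributed file system calls; the marker name `_SUCCESS` follows the abstract.

```python
import os
import shutil
import tempfile


def finalize(third_file_set, second_file_set, target_dir):
    """Move merged files in order, delete the merged-away inputs, then write the marker."""
    os.makedirs(target_dir, exist_ok=True)
    for path in third_file_set:  # sequential move into the target directory
        shutil.move(path, os.path.join(target_dir, os.path.basename(path)))
    for path in second_file_set:  # the small files are no longer needed
        os.remove(path)
    with open(os.path.join(target_dir, "_SUCCESS"), "w"):
        pass  # empty marker file: data is now in the target directory


def demo():
    with tempfile.TemporaryDirectory() as work:
        small = [os.path.join(work, f"part-{i}") for i in range(2)]
        merged = os.path.join(work, "merged-0")
        for path in small + [merged]:
            with open(path, "w") as f:
                f.write("data\n")
        target = os.path.join(work, "table")
        finalize([merged], small, target)
        return sorted(os.listdir(target)), sorted(os.listdir(work))


print(demo())  # (['_SUCCESS', 'merged-0'], ['table'])
```

Writing the marker last means a reader that sees `_SUCCESS` can assume the merged files are already in place.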
The embodiment of the application discloses a file management method, which can start a main thread through a driving end to traverse a Job temporary folder generated by an execution end to obtain a task directory; starting a first sub Job to concurrently traverse the task directory to obtain a first file set; performing filtering operation on the first file set to obtain a second file set; and starting the second sub Job to concurrently execute merging operation on the second file set to obtain a third file set, deleting the second file set, and creating a marker file for the moved files. Therefore, according to the scheme of the embodiment of the application, the data file set needing to be written into the storage system is determined by acquiring the task directory generated by the execution end, the small data files needing to be combined are determined by screening the data file set and grouped and combined according to the partition information, so that the small data files can be ensured to occupy continuous storage space in the storage system, the number of the files in the storage system is reduced in a self-adaptive manner, and the life cycle management of data is facilitated; meanwhile, the merged data file can improve the utilization rate of the memory space of the Spark drive end, reduce the probability of OOM occurring at the drive end, and improve the stability and the execution efficiency of Spark tasks.
In one possible example, the task directory includes at least one subdirectory; starting the first sub Job and traversing the task directory to obtain the first file set may comprise: the driving end starts the first sub Job; the first sub Job concurrently traverses the data files under the at least one subdirectory according to the naming rule of the data files to obtain the data files to be written; and the first file set is constructed from the data files to be written.
Specifically, when a file system is organized using a directory tree, a file name needs to be specified by some method. Pathnames describe how to identify the location of a file in a file system, and are a sequence of component names separated by separators (usually slashes). A component name is a sequence of characters that specifies a name that is uniquely contained in a prefix (parent) portion. A full pathname starts with a separator (for ease of discussion, the slash "/" is used later as a separator) and specifies a file that can be found starting from the root of the file system (directory without ancestors) and following the branch of the file tree where the successor component name of the pathname resides.
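As a small illustration of the pathname structure described above, the following Python sketch splits a full pathname into its component names. The path itself is hypothetical and is not taken from the application.

```python
from pathlib import PurePosixPath

# Hypothetical full pathname, used only to illustrate component names.
path = PurePosixPath("/user/warehouse/orders/dt=2022-03-18/part-00000")

components = path.parts   # root separator followed by each component name
parent = path.parent      # the prefix (parent) portion of the pathname
name = path.name          # the final component name, i.e. the file name
```

Each component name is unique within its parent, so the ordered sequence of components identifies exactly one file starting from the root.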
Specifically, the driving end starts the first sub Job to traverse and filter, according to the naming rule, the task directory obtained in step S201 when the execution end performed the task submitting operation; that is, the parent directory and the subdirectories in the task directory are traversed concurrently to obtain all the data files to be written. The data files to be written obtained through the traversal and filtering operation constitute the first file set.
In the embodiment of the application, the task directories are traversed concurrently by the first sub Job according to the naming rules of the directories, so that all data files to be written can be acquired in a short time, and the next task operation is performed.
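The concurrent traversal can be sketched in plain Python as follows. This is only an illustration under stated assumptions: a thread pool stands in for the first sub Job, the naming rule is assumed to be a `part-` file name prefix, and the function names `list_data_files` and `collect_first_file_set` are hypothetical. Files located directly in the task directory, outside any subdirectory, are not collected by this sketch.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def list_data_files(subdir, prefix="part-"):
    """Walk one subdirectory and return files matching the assumed naming rule."""
    found = []
    for root, _dirs, files in os.walk(subdir):
        for fname in files:
            if fname.startswith(prefix):  # assumed naming rule, not from the source
                found.append(os.path.join(root, fname))
    return found

def collect_first_file_set(task_dir, max_workers=4):
    """Traverse the subdirectories of task_dir concurrently and combine the results."""
    subdirs = [str(p) for p in Path(task_dir).iterdir() if p.is_dir()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(list_data_files, subdirs)
        return sorted(path for chunk in results for path in chunk)
```

Each subdirectory is handled by its own worker, so the wall-clock time is bounded by the largest subdirectory rather than by the sum of all of them.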
In a possible example, after obtaining the data file to be written, the method further includes the following steps: and analyzing the storage path of the data file to be written according to a table partition directory structure to obtain a mapping information table, wherein the mapping information table comprises the name of the data file to be written and table partition information corresponding to the name.
Specifically, when a data file is written into a data table, the data table allocates corresponding storage space, and table partitioning is a data organization scheme. Each data partition is stored separately. These storage objects may be in different tablespaces or in the same tablespace. Query processing can also take advantage of the separated data to avoid scanning irrelevant data, so that many data-warehouse-style queries achieve better query performance.
Illustratively, after the data files to be written are obtained, the storage path of each data file is analyzed according to the table partition directory structure to obtain the table partition information corresponding to each data file, i.e., the mapping information table.
It can be seen that, in the embodiment of the present application, partition information corresponding to each data file can be obtained through analysis of the data file to be written, and then subsequent file merging operation can be performed according to the partition information.
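A minimal sketch of this path analysis is given below. It assumes a Hive-style `key=value` partition directory layout, which the application does not spell out, and the names `parse_partition` and `build_mapping_table` are hypothetical. The table is keyed by the full path rather than the bare file name so that identically named files in different partitions do not collide.

```python
from pathlib import PurePosixPath

def parse_partition(file_path):
    """Return (file name, partition dict) for an assumed key=value directory layout."""
    path = PurePosixPath(file_path)
    partition = {}
    for component in path.parent.parts:
        if "=" in component:  # assumed partition directory form, e.g. dt=2022-03-18
            key, value = component.split("=", 1)
            partition[key] = value
    return path.name, partition

def build_mapping_table(file_paths):
    """Map each data file path to its table partition information."""
    return {p: parse_partition(p)[1] for p in file_paths}
```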
In a possible example, after obtaining the data file to be written, the method further includes the following steps: filtering the data files to be written according to a merging rule to obtain the target files to be merged, wherein the merging rule is used for screening the data files which do not meet a file size threshold; and constructing the second file set according to the target file to be merged.
In particular, the storage space occupied by the larger data file in the data file to be written is continuous and larger, and thus can be directly stored in the target directory. For smaller data files, which usually occupy smaller space and all occupy storage space in a fragmented and scattered manner, in order to better ensure that the disk resources of the storage system are fully utilized, a merge operation may be performed on such files, so that several small data files are merged into a larger data file for storage.
Specifically, after all data files to be written, i.e., the first file set, are acquired, all data files to be written are concurrently traversed and filtered.
In practical applications, the above operations can be performed by setting a merge rule. The merge rule includes: the size of the data file, the type of data file, etc. It should be noted that the merging rule may be performed according to the original rule of the system, or may be set manually, and is not limited specifically here. In the present application, the size of the data file is described as an example of the merge rule.
Specifically, if the data file size threshold is set to 1024 KB, then during the concurrent traversal and filtering the driving end adds a data file to the second file set as a target file to be merged when it detects that the size of the data file is smaller than 1024 KB. If the size of the current data file is larger than 1024 KB, the next data storage operation can be carried out on it directly.
Another possible method further includes traversing and filtering the data files according to a preset file type to obtain a second file set. The priority can be specifically set according to the file type, and the data files are filtered according to the priority.
Therefore, in the embodiment of the application, all the data files to be written can be divided into the data files which can be directly stored and the target merged files which need to be merged by traversing and filtering the data files to be written according to the merging rules. Meanwhile, simple configuration rules are provided for users to use, and the method has strong adaptability and wide application range aiming at services of different scales.
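The size-based merging rule can be sketched as follows, using the 1024 KB example threshold from the text. The function name `split_by_size` is hypothetical, and treating a file whose size is exactly equal to the threshold as directly storable is an assumption, since the text only covers the smaller and larger cases.

```python
SIZE_THRESHOLD_BYTES = 1024 * 1024  # 1024 KB, the example threshold from the text

def split_by_size(file_sizes, threshold=SIZE_THRESHOLD_BYTES):
    """Split {path: size_in_bytes} into (files to merge, files to store directly)."""
    to_merge = [p for p, size in file_sizes.items() if size < threshold]
    # Assumption: a file exactly at the threshold is stored directly.
    store_directly = [p for p, size in file_sizes.items() if size >= threshold]
    return to_merge, store_directly
```

The first list corresponds to the second file set; the second list holds the files that skip merging.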
In one possible example, the method further comprises the following operational steps: performing grouping operation on the second file set according to the mapping information table to obtain a grouping result, wherein the grouping result comprises at least one group, and each group comprises the target file to be merged corresponding to the same table partition; and starting the second sub Job, and performing merging operation on the grouping result to obtain the third file set, wherein the merging operation is used for merging the target files to be merged in each group.
Specifically, the target files to be merged in the second file set are grouped according to the mapping information table obtained in the above step, so as to obtain a grouping result. The grouping result comprises at least one group according to the table partition information corresponding to each target file to be merged. Therefore, each group contains the target file to be merged in the same partition.
It can be seen that, in the embodiment of the present application, a grouping operation is performed according to the mapping information to obtain the group corresponding to each target file to be merged. The merge operation is then executed according to the grouping result, which guarantees that data files in the same partition are merged into the same data file and improves the efficiency of moving the files into the target directory.
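The grouping step can be sketched as below. The mapping table is assumed to be a dictionary from file path to a partition dictionary, and `group_by_partition` is a hypothetical name used only for this illustration.

```python
from collections import defaultdict

def group_by_partition(files_to_merge, mapping_table):
    """Group files so that each group holds files of exactly one table partition."""
    groups = defaultdict(list)
    for path in files_to_merge:
        # Build a hashable key from the partition's key/value pairs.
        key = tuple(sorted(mapping_table[path].items()))
        groups[key].append(path)
    return dict(groups)
```

Because the key is derived from the full partition information, two files land in the same group only when every partition column matches.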
In one possible example, after sequentially moving the third set of files to the target directory, the method further includes the following steps: and generating a marking information file according to the third file set, wherein the marking information file is used for indicating that the third file set is moved completely.
Specifically, after moving the merged data files to the final target directory, the driving end deletes the target files to be merged in the second file set, releasing the disk space occupied by the data files in the current second file set.
In practical applications, if the execution end concurrently submits a large number of data files to be written, the job still finishes only after a long wait even once all task submissions have completed; this time is mainly spent by the Spark driver end performing the second rename operation.
In this scheme, if some task submission operations succeed during Spark Job execution and the Spark Job then fails, part of the data may be externally visible; a data consumer therefore needs to judge the integrity of the data according to whether a _SUCCESS mark file has been newly generated. Accordingly, in the present embodiment, after the driving end moves the merged data files to the final target directory, an empty mark file _SUCCESS is created for the moved data files.
As can be seen, in the embodiment of the present application, after the move operation is completed, the data files are stored in the target directory, and the _SUCCESS mark file is regenerated to indicate that the Spark job was executed successfully.
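The order of operations (move, then delete, then write the marker) can be sketched on a local file system as follows. The function name `move_and_mark` is hypothetical; a real deployment would go through the distributed file system's API rather than `shutil`.

```python
import shutil
from pathlib import Path

def move_and_mark(merged_files, source_files, target_dir):
    """Move merged files into target_dir, delete their sources, then write _SUCCESS."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for f in merged_files:       # move one file at a time, in order
        shutil.move(str(f), str(target / Path(f).name))
    for f in source_files:       # the small files that have already been merged
        Path(f).unlink()
    marker = target / "_SUCCESS"
    marker.touch()               # empty marker file, created last
    return marker
```

Writing the marker last means a consumer that sees `_SUCCESS` can assume every merged file is already in place.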
In one possible example, the method is also applied to an execution end of a file management system; the method comprises the following steps: and executing task submitting operation on the temporary folder to generate a task directory.
Specifically, after writing the data file to be written into the temporary folder, the execution end submits the data file to be written from the temporary folder to the target directory of the execution end.
Therefore, in the embodiment of the application, the execution end moves the temporary file to the target directory of the execution end, so that the task directory generated in the submission process is used as the basis for traversal of the driving end.
In one possible example, before the performing the task submission operation on the temporary folder, the method further includes: under the user directory, performing clone operation on the data file and the partition directory to which the data file belongs to generate a temporary directory and a file; and after the task data file and the directory to which the task data file belongs are all cloned, executing the task submitting operation.
Therefore, in the embodiment of the application, the data file to be written is written into the temporary folder through the execution end, and the data file in the temporary folder is further moved into the target directory of the execution end, so that the task directory is generated. Thus, the drive end can obtain the data file to be stored according to the task directory.
Some or all of the steps of any of the file management methods described in the above method embodiments will be described in detail below with reference to fig. 3. Fig. 3 is an interaction flow diagram of a file management method according to an embodiment of the present application.
Steps S301 to S302 and steps S308 to S311 execute part or all of the steps described in fig. 2. Steps S303 to S307 execute part or all of the steps of the execution end, which are not described again herein.
It can be seen that, in the embodiment of the application, after each task at the execution end (executor) completes its commit task operation, and before the driving end (driver) executes the commit job operation that moves the data files to be written to the target directory, a file merging action is inserted. By traversing the data files to be written, the merging of data files and the deletion after merging are performed adaptively, so that the number of files in the storage system is reduced while the data files are still written to the target directory normally, which facilitates data life cycle management. In addition, because the merging operation is executed at the driver end before the commit job operation, the probability of OOM at the driver end is reduced, Spark task stability is improved, and the execution efficiency of the user's Spark jobs is improved.
Referring to fig. 4, fig. 4 is a schematic flowchart of another file processing method provided in the embodiment of the present application, which is applied to a driver end of a file management system, as shown in fig. 4, the file management method in the present application includes the following operation steps:
S401, starting a main thread to traverse the small file merging working directory under the user directory, and obtaining a clone directory.
The clone directory is generated by copying a task directory by an execution end, and the task directory is generated by executing task submission operation by the execution end.
Specifically, each time all tasks of the executor end complete writing of one temporary data file, the corresponding table partition directory structure, file names and other basic information are collected, and after all tasks are executed and the data files to be written are submitted, the task directory, the table partition directory and the corresponding empty data files of the submitted task are cloned in the small file merging work directory under the current user directory.
Further, the driving end concurrently traverses the subdirectories of the small file merging working directory under the user directory and acquires the clone directory of the task directory generated by submitting the task.
S402, starting a third sub Job and traversing the clone directory to obtain a fourth file set.
The fourth file set comprises clone files of data files to be written.
S403, analyzing the fourth file set to obtain a fifth file set.
And the fifth file set comprises real files of target data files to be merged.
S404, starting a fourth sub Job to concurrently execute a merging operation on the fifth file set to obtain a sixth file set.
S405, after the sixth file set is sequentially moved to a target directory, the fifth file set is deleted.
Specifically, the steps S402 to S405 are substantially similar to the steps S202 to S205 and the related descriptions described in fig. 2, and are not repeated herein.
Through the above process, for a Spark job, if the data source consists of a large number of small files, the Spark job generates a large number of tasks when reading data from the source, and each executor may have to execute many tasks in sequence. If the data source instead consists of large data files with a balanced size distribution, the number of tasks generated is greatly reduced, and each executor may only need to execute a small number of tasks in sequence, which avoids switching among multiple tasks, improves data reading efficiency, and in turn improves the execution efficiency of the Spark job.
As shown in table 1 below, in the practical production environment test of S3, the data reading efficiency of Spark operation according to the solution of the present application is improved.
TABLE 1
[Table 1 appears only as an image in the original publication: Figure BDA0003554429630000101]
As shown in Table 1, with the average data volume and the task parallelism held the same (1737216599 rows and a parallelism of 80), the data file writing operation was performed both with the file merging method proposed in the present application and with the conventional approach without file merging. Compared with not using the scheme of the present application, the average number of files is reduced by 99.2%, the average total memory consumption of the Spark job is reduced by 61.5%, and the average time consumed reading data in the Spark job is reduced by 62.8%.
From the statistical results, it can be seen that introducing the Spark-based dynamic self-adaptive small file merging method greatly reduces the number of small files generated by Spark jobs and the memory and CPU resources consumed, effectively improves the data reading efficiency of Spark jobs, and further improves the execution efficiency of the whole Spark job.
Therefore, in the embodiment of the application, the Spark-based dynamic self-adaptive small file merging method deeply integrates the process of merging small data files into the execution process of the Spark job, which guarantees both the atomicity of the operation and the consistency of the data. The method is applied to Spark jobs as a plug-in by extending Spark's native interfaces, so it is simple and convenient to use. Simple configuration rules can also be provided to users, so the method adapts well to services of different scales and has a wide application range. The temporary folder in which the execution end stores the data files to be written in the distributed file system is innovatively reused for temporarily storing small files and merging the temporary data, which avoids dependence on third-party components and ensures good execution efficiency.
In one possible example, the clone directory includes at least one subdirectory; the step of starting a third child Job and traversing the clone directory to obtain a fourth file set comprises the following steps: the driving end starts the third sub Job; the third sub Job concurrently traverses the clone directory, and the third sub Job performs filtering operation on the path of the data file to obtain the clone file of the data file to be written; and constructing the fourth file set according to the clone file of the data file to be written.
Specifically, the small file merging program executed by the driver end starts a third sub Job to concurrently traverse all clone directories, and obtains the clone files of all data files to be written by filtering and matching file paths. The clone files correspond to, among others, larger data files and smaller data files on which a merge operation needs to be performed.
Therefore, in the embodiment of the application, the driver end concurrently traverses the clone directory, so that all the clone files of the data files to be written can be acquired in a short time before the next task operation is performed.
In one possible example, after obtaining the clone file of the data file to be written, the method further includes the following steps: analyzing the fourth file set according to a table partition directory structure to obtain a real file of the data file to be written; acquiring relevant information of a real file of the data file to be written, wherein the relevant information comprises: file size, path and table partition information corresponding to the real file of the data file to be written.
Specifically, for all the obtained clone files, the file paths are analyzed and reconstructed according to the table partition directory structure, the table partitions corresponding to all the real data files are obtained, and further basic information such as the paths of the real data files, the file sizes and the corresponding table partitions are obtained.
Therefore, in the embodiment of the application, the clone files can be analyzed according to the related information of the clone files, so that the partition information corresponding to each real data file can be obtained, and the subsequent file merging operation can be performed according to the partition information.
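One way to picture the path reconstruction is sketched below. It assumes that the clone directory mirrors the real directory tree, so that a clone file and its real file share the same relative path, and that partition directories use a `key=value` form. Both are assumptions for illustration, and `describe_real_file` is a hypothetical name.

```python
import os
from pathlib import Path

def describe_real_file(clone_file, clone_root, real_root):
    """Map a clone (placeholder) file to its real file and collect basic information."""
    relative = Path(clone_file).relative_to(clone_root)
    real_path = Path(real_root) / relative  # assumed: same relative layout
    partition = dict(
        part.split("=", 1) for part in relative.parent.parts if "=" in part
    )
    return {
        "path": str(real_path),
        "size": os.path.getsize(real_path),  # size of the real file, not the clone
        "partition": partition,
    }
```

The clone file itself is empty, so the size must be read from the real file that the reconstructed path points to.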
In a possible example, after obtaining the actual file of the data file to be written, the method further includes the following steps: according to the relevant information, filtering the real file of the data file to be written to obtain the real file of the target data file to be merged; constructing the fifth file set according to the real file of the target data file to be merged; and performing grouping operation on the fifth file set according to the table partition information to obtain a grouping result, wherein the grouping result comprises at least one group, and each group comprises a real file of the target data file to be merged corresponding to the same table partition.
Therefore, in the embodiment of the application, a grouping operation is performed according to the table partition information to obtain the group corresponding to each target file to be merged. The merging operation is then executed according to the grouping result, which guarantees that data files in the same partition are merged into the same data file and improves the efficiency of moving the data files to the target directory.
In one possible example, the method further comprises the steps of: and starting the fourth sub Job, and executing a merging operation on the grouping results to obtain the sixth file set, wherein the merging operation is used for merging the target files to be merged in each grouping.
In practical application, for Spark operation, the more the number of small files distributed in a data source is, the more the number of tasks generated when data is acquired from the data source is, and the data fragment information and the corresponding generated task meta-information are both stored in the memory of the driver end, which may bring great pressure to the driver, and even cause the operation execution failure due to the OOM abnormality.
Specifically, the driving end filters out data files which do not meet a file size threshold value according to configured small file merging rules for the obtained basic information of the real data files, groups the filtered data files according to table partition information, and classifies and aggregates the data files to be written into the same table partition.
Therefore, in the embodiment of the application, the merging operation is further executed according to the grouping result, so that the data files in the same partition can be guaranteed to be merged into the same data file, and the execution efficiency of moving the data file to the target directory is improved. Meanwhile, the probability of OOM occurring in the driver can be reduced, and the Spark task stability is improved.
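A per-group merge can be sketched as below. This illustration concatenates bytes, which is only valid for line-oriented formats such as plain text or CSV without headers; columnar formats such as Parquet would have to be read and rewritten instead. A thread pool stands in for the sub Job, and the names `merge_group` and `merge_groups` are hypothetical.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def merge_group(group_name, files, out_dir):
    """Concatenate the files of one group into a single merged file."""
    out_path = Path(out_dir) / f"merged-{group_name}"
    with open(out_path, "wb") as out:
        for f in files:
            with open(f, "rb") as src:
                shutil.copyfileobj(src, out)  # append this file's bytes
    return str(out_path)

def merge_groups(groups, out_dir, max_workers=4):
    """Merge every group concurrently and return the merged file paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(merge_group, name, files, out_dir)
                   for name, files in groups.items()]
        return sorted(f.result() for f in futures)
```

Each group produces exactly one output file, so the number of files written to the target directory equals the number of partitions that contained small files.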
In one possible example, after sequentially moving the sixth set of files to the target directory, the method further comprises the steps of: and generating a mark file according to the sixth file set, wherein the mark file is used for indicating that the sixth file set is moved completely.
As can be seen, in the embodiment of the present application, a mark file indicating a successful move is created for the moved data files, so that a data consumer can determine the integrity of the data according to whether a _SUCCESS mark file has been newly generated.
In one possible example, the method is also used for an execution end of a file management system, and the method comprises the following steps: executing the writing operation of the data file to generate a temporary folder; executing task submitting operation on the temporary folder to generate a task directory; and copying the task directory to obtain a clone directory.
Specifically, each time all tasks of the execution end executor complete the writing of one temporary data file, the corresponding table partition directory structure, file name and other basic information are acquired, and after the task execution is finished and the data file is submitted, the task directory, the table partition directory and the corresponding empty data file are cloned in the small file merging working directory under the current user directory.
Therefore, in the embodiment of the application, the data file to be written is written into the temporary folder through the execution end, the temporary folder is further moved into the target directory of the execution end, the task directory is generated, and the task directory is copied to obtain the clone directory. Therefore, the consistency of the written data can be ensured, and the drive end can acquire the data file to be stored according to the clone directory.
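The cloning of a task directory into empty placeholder files can be sketched on a local file system as follows. The function name `clone_directory` is hypothetical, and the sketch assumes the clone directory lies outside the task directory.

```python
import os
from pathlib import Path

def clone_directory(task_dir, clone_dir):
    """Recreate task_dir's directory tree under clone_dir with empty placeholder files."""
    task_dir, clone_dir = Path(task_dir), Path(clone_dir)
    created = []
    for root, _dirs, files in os.walk(task_dir):
        rel = Path(root).relative_to(task_dir)
        (clone_dir / rel).mkdir(parents=True, exist_ok=True)
        for fname in files:
            placeholder = clone_dir / rel / fname
            placeholder.touch()  # empty file: same name and location, no data
            created.append(str(placeholder))
    return sorted(created)
```

Because only names and directory structure are copied, the clone is cheap to create yet still carries the partition layout that the driving end needs.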
The above description has introduced the solution of the embodiment of the present application mainly from the perspective of the method-side implementation process. It is understood that, in order to realize the above functions, the electronic device comprises corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the present application can be implemented in hardware or in a combination of hardware and computer software for the various illustrative units and algorithm steps described in connection with the embodiments provided herein. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the present application can be applied to a file management system according to the above method example, where the file management system includes a driver and an executor:
the driving end is used for starting a main thread to traverse the Job temporary folder and acquiring a task directory, wherein the task directory is generated by the execution end executing task submitting operation; starting a first sub Job and traversing the task directory to obtain a first file set, wherein the first file set comprises data files to be written; performing filtering operation on the first file set to obtain a second file set, wherein the second file set comprises a target file to be merged; starting a second sub Job to concurrently execute a merging operation on the second file set to obtain a third file set; deleting the second file set after the third file set is sequentially moved to a target directory;
the execution end is used for executing task submitting operation on the temporary folder and generating a task directory.
The embodiment of the present application can be applied to another file management system according to the above method example, where the file management system includes a driving end and an execution end:
the driving end is used for starting a main thread to traverse a small file merging folder under a user directory to obtain a clone directory, wherein the clone directory is generated by an execution end through copying a task directory, and the task directory is generated by the execution end executing task submission operation; starting a third sub Job and traversing the clone directory to obtain a fourth file set, wherein the fourth file set comprises clone files of data files to be written; analyzing the fourth file set to obtain a fifth file set, wherein the fifth file set comprises real files of target data files to be merged; starting a fourth sub Job to perform merging operation on the fifth file set concurrently to obtain a sixth file set; after the sixth file set is sequentially moved to a target directory, deleting the fifth file set;
the execution terminal is used for executing the writing operation of the data file and generating a temporary folder; executing task submitting operation on the temporary folder to generate a task directory; and copying the task directory to obtain the clone directory.
In the embodiment of the present application, the electronic device may be divided into the functional units according to the method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
In the case of dividing each functional module by corresponding functions, consistent with the embodiment of fig. 2, fig. 5 shows a block diagram of functional units of a file management apparatus provided in the embodiment of the present application. As shown in fig. 5, the file management apparatus 500 may include a driving end 510 and an execution end 520, wherein the driving end 510 includes an acquisition unit 501, a traversal unit 502, a merging unit 503 and a processing unit 504, and the execution end 520 includes a submission unit 505.
The acquisition unit 501 may be used to support the electronic device in performing steps S201 and S202 described above, and/or other processes for the techniques described herein.
Traversal unit 502 can be used to support an electronic device to perform step S203 described above, and/or other processes for the techniques described herein.
Merging unit 503 may be used to enable the electronic device to perform step S204 described above, and/or other processes for the techniques described herein.
Processing unit 504 may be used to support the electronic device in performing step S205 described above, and/or other processes for the techniques described herein.
The submission unit 505 may be configured to support the electronic device to perform a task submission operation on the temporary folder, and generate a task directory.
Therefore, in the file management device provided in the embodiment of the present application, the acquisition unit at the driving end starts the main thread to traverse the temporary folder generated by the submission unit at the execution end and obtains the task directory; the traversal unit starts a first sub Job to concurrently traverse the task directory and obtain a first file set, and performs a filtering operation on the first file set to obtain a second file set; and after the merging unit concurrently executes a merging operation on the second file set to obtain a third file set, the processing unit deletes the second file set and creates a mark file for the moved files. Therefore, the device of the embodiment of the application can adaptively reduce the number of small files in the storage system, which facilitates data life cycle management; meanwhile, the probability of OOM occurring at the driving end can be reduced, and the stability and the execution efficiency of Spark tasks are improved.
Consistent with the embodiment of fig. 4, fig. 6 shows a block diagram of functional units of another file management apparatus provided in the embodiment of the present application. As shown in fig. 6, the file management apparatus 600 may include a driving end 610 and an execution end 620, wherein the driving end 610 includes a starting unit 601, a parsing unit 602 and a processing unit 603, and the execution end 620 includes a writing unit 604, a submission unit 605 and a copying unit 606.
The starting unit 601 may be used to support the electronic device in performing steps S401-S403 described above, and/or other processes for the techniques described herein.
Parsing unit 602 may be used to enable the electronic device to perform step S404 described above, and/or other processes for the techniques described herein.
The processing unit 603 may be used to enable the electronic device to perform step S405 described above, and/or other processes for the techniques described herein.
The writing unit 604 may be used to support the electronic device to perform writing operations of data files, resulting in temporary folders.
The submission unit 605 may be configured to support the electronic device to perform a task submission operation on the temporary folder, so as to generate a task directory.
The copying unit 606 may be used to support the electronic device to copy the task directory, resulting in a clone directory.
Therefore, in the file management device provided by the embodiment of the application, the starting unit of the driving end starts the main thread to traverse the temporary folder generated by the submitting unit of the execution end, and obtains the clone directory; a third sub Job is then started to traverse the clone directory concurrently to obtain a fourth file set; the parsing unit parses the fourth file set to obtain a fifth file set comprising the real files of the target data files to be merged; a fourth sub Job is started to concurrently perform a merging operation on the fifth file set to obtain a sixth file set; and the processing unit deletes the fifth file set after sequentially moving the sixth file set to the target directory. Therefore, the device of the embodiment of the application can adaptively reduce the number of files in the storage system, which facilitates data life cycle management; meanwhile, the probability of an out-of-memory (OOM) error occurring at the driving end can be reduced, and the stability and the execution efficiency of Spark tasks are improved.
It should be noted that all relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again.
The electronic device provided by the embodiment is used for executing the file management method as shown in fig. 2 or fig. 5, so that the same effect as the implementation method can be achieved.
Where an integrated unit is employed, the electronic device may include a processing module, a memory module, and a communication module. The processing module may be configured to control and manage actions of the electronic device, for example, may be configured to support the electronic device to execute the steps executed by the obtaining unit 501, the traversing unit 502, the merging unit 503, the processing unit 504, and the submitting unit 505. The memory module can be used to support the electronic device in executing stored program codes and data, etc. The communication module can be used for supporting the communication between the electronic equipment and other equipment.
The electronic device provided by the embodiment is used for executing the file management method as shown in fig. 4 or fig. 6, so that the same effect as the implementation method can be achieved.
Where an integrated unit is employed, the electronic device may include a processing module, a memory module, and a communication module. The processing module may be configured to control and manage actions of the electronic device, and for example, may be configured to support the electronic device to execute steps executed by the starting unit 601, the parsing unit 602, and the processing unit 603, and executed by the writing unit 604, the submitting unit 605, and the copying unit 606. The memory module may be used to support the electronic device in executing stored program codes and data, etc. The communication module can be used for supporting the communication between the electronic equipment and other equipment.
The processing module may be a processor or a controller, which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor may also be a combination that implements a computing function, for example a combination of one or more microprocessors, or a combination of a digital signal processor (DSP) and a microprocessor. The storage module may be a memory. The communication module may specifically be a radio frequency circuit, a Bluetooth chip, a Wi-Fi chip, or another device that interacts with other electronic devices.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application, as shown in fig. 7, the electronic device includes a processor, a memory, a communication interface, and one or more executable program codes, where the one or more executable program codes are stored in the memory and configured to be executed by the processor.
In one possible example, the program includes instructions for performing the steps of:
starting a main thread to traverse the Job temporary folder to obtain a task directory, wherein the task directory is generated by executing task submission operation by the execution end;
starting a first sub Job and traversing the task directory to obtain a first file set, wherein the first file set comprises data files to be written;
performing filtering operation on the first file set to obtain a second file set, wherein the second file set comprises a target file to be merged;
starting a second sub Job to concurrently execute a merging operation on the second file set to obtain a third file set;
and deleting the second file set after the third file set is sequentially moved to a target directory.
It can be seen that, in the electronic device described in the embodiment of the present application, the driving end starts the main thread to traverse the temporary folder generated by the execution end and obtains a task directory; a first sub Job is started to traverse the task directory concurrently and obtain a first file set; a filtering operation is performed on the first file set to obtain a second file set; a second sub Job is started to perform a merging operation on the second file set and obtain a third file set; and, after the third file set has been moved to the target directory, the second file set is deleted and a mark file is created for the moved files. Therefore, by the scheme of the embodiment of the application, the number of files in the storage system can be reduced adaptively, which facilitates data life cycle management; meanwhile, the probability of an out-of-memory (OOM) error occurring at the driving end can be reduced, and the stability and the execution efficiency of Spark tasks are improved.
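As a non-limiting illustration, the sequence of steps above can be sketched on a local filesystem. This is a minimal sketch under stated assumptions, not the implementation of the present application: a thread pool stands in for the sub Jobs, byte concatenation stands in for the merging operation, and the names `compact`, `list_data_files`, `merge_files`, `SMALL_FILE_BYTES`, the `part-*` pattern and the `merged-00000` file name are all hypothetical.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SMALL_FILE_BYTES = 64  # hypothetical threshold; the source gives no value


def list_data_files(task_dir: Path) -> list[Path]:
    """Stand-in for the first sub Job: collect data files under one task directory."""
    return sorted(p for p in task_dir.rglob("part-*") if p.is_file())


def merge_files(sources: list[Path], merged: Path) -> Path:
    """Stand-in for the second sub Job: concatenate the small files into one file."""
    with merged.open("wb") as out:
        for src in sources:
            out.write(src.read_bytes())
    return merged


def compact(job_tmp: Path, target_dir: Path) -> list[Path]:
    # Main thread: traverse the job temporary folder to find task directories.
    task_dirs = [d for d in sorted(job_tmp.iterdir()) if d.is_dir()]
    # Concurrent traversal of the task directories yields the first file set.
    with ThreadPoolExecutor() as pool:
        first_set = [f for files in pool.map(list_data_files, task_dirs) for f in files]
    # Filtering yields the second file set (files small enough to merge).
    second_set = [f for f in first_set if f.stat().st_size < SMALL_FILE_BYTES]
    if not second_set:
        return []
    # Merging yields the third file set (a single merged file in this sketch).
    third_set = [merge_files(second_set, job_tmp / "merged-00000")]
    target_dir.mkdir(parents=True, exist_ok=True)
    moved = [Path(shutil.move(str(f), str(target_dir / f.name))) for f in third_set]
    # The second file set is deleted only after the move has completed.
    for f in second_set:
        f.unlink()
    return moved
```

Deleting the originals last means that a failure during merging or moving leaves the small files in place, so the operation can be retried.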
An embodiment of the present application provides a computer-readable storage medium, in which a computer program for electronic data exchange is stored, where the computer program includes an execution instruction for executing part or all of the steps of any one of the file management methods described in the above file management method embodiments, and the computer includes an electronic terminal device.
Embodiments of the present application provide a computer program product, wherein the computer program product comprises a computer program operable to cause a computer to perform some or all of the steps of any of the file management methods as described in the above method embodiments, and the computer program product may be a software installation package.
It should be noted that, for the sake of simplicity, any of the above embodiments of the file management method is described as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described action sequence, because some steps may be performed in other sequences or simultaneously according to the present application. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily required by the present application.
The embodiments of the present application have been described in detail above, and specific examples are used herein to explain the principles and implementations of the file management method and related apparatus; the description of the foregoing embodiments is only intended to help understand the method of the present application and its core ideas. Meanwhile, those skilled in the art may, according to the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, hardware products and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. The memory may include: flash memory disks, read-only memories (ROMs), random access memories (RAMs), magnetic or optical disks, and the like.
While the present application has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
It will be understood by those skilled in the art that all or part of the steps of the various methods of any of the above embodiments of the file management method may be performed by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash memory disks, read-only memory, random access memory, magnetic or optical disks, and the like.
It will be appreciated that all products controlled or configured to perform the processing methods of the flowcharts described in the embodiments of the file management method of the present application, such as the apparatuses of the flowcharts described above and computer program products, fall within the scope of the related products described herein.
It is apparent that those skilled in the art can make various changes and modifications to a file management method and apparatus provided in the present application without departing from the spirit and scope of the present application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include such modifications and variations.

Claims (25)

1. A file management method is characterized in that the method is applied to a drive end of a file management system; the method comprises the following steps:
starting a main thread to traverse a Job temporary folder and acquiring a task directory, wherein the task directory is generated by an execution end executing a task submission operation;
starting a first sub Job and traversing the task directory to obtain a first file set, wherein the first file set comprises data files to be written;
performing filtering operation on the first file set to obtain a second file set, wherein the second file set comprises files to be merged;
starting a second sub Job to concurrently execute a merging operation on the second file set to obtain a third file set;
and deleting the second file set after the third file set is sequentially moved to a target directory.
2. The method of claim 1, wherein the task directory comprises at least one subdirectory;
the starting of the first sub Job and the concurrent traversal of the task directory to obtain a first file set comprises:
the driving end starts the first sub Job;
the first sub Job conducts concurrent traversal on the data files under the at least one sub directory according to the naming rule of the data files to obtain the data files to be written;
and constructing the first file set according to the data file to be written.
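As a non-limiting illustration of selecting data files by a naming rule, the following sketch uses a regular expression. The rule itself is an assumption: the claim does not state the naming rule, and the pattern below merely resembles the `part-NNNNN` names commonly produced by Spark writers. `DATA_FILE_RULE` and `is_data_file` are hypothetical names.

```python
import re

# Hypothetical naming rule: names such as "part-00000" or
# "part-00000-<hex id>.snappy.parquet"; hidden and marker files do not match.
DATA_FILE_RULE = re.compile(r"^part-\d{5}(-[0-9a-f-]+)?(\..+)?$")


def is_data_file(name: str) -> bool:
    """Return True when a file name satisfies the (assumed) naming rule."""
    return DATA_FILE_RULE.match(name) is not None
```

Applying such a predicate during traversal keeps checksum files, marker files and other by-products out of the first file set.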
3. The method of claim 2, wherein after said obtaining the data file to be written, the method further comprises:
and analyzing the storage path of the data file to be written according to a table partition directory structure to obtain a mapping information table, wherein the mapping information table comprises the name of the data file to be written and table partition information corresponding to the name.
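As a non-limiting illustration of parsing a storage path into table partition information, the following sketch assumes a Hive-style `key=value` directory layout. That layout is an assumption; the claim only refers to a table partition directory structure. The names `partition_of` and `build_mapping` are hypothetical.

```python
from pathlib import PurePosixPath


def partition_of(path: str) -> dict[str, str]:
    """Extract `key=value` partition segments from the directories of a storage path."""
    parts = PurePosixPath(path).parent.parts
    return dict(seg.split("=", 1) for seg in parts if "=" in seg)


def build_mapping(paths: list[str]) -> dict[str, dict[str, str]]:
    """Map each data file name to the table partition information parsed from its path."""
    return {PurePosixPath(p).name: partition_of(p) for p in paths}
```

The table is keyed by file name, as the claim describes. If two partitions could contain files with the same name, keying by the full path would avoid collisions.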
4. The method according to claim 2 or 3, wherein after obtaining the data file to be written, the method further comprises:
filtering the data files to be written according to a merging rule to obtain the target files to be merged, wherein the merging rule is used for screening the data files which do not meet a file size threshold;
and constructing the second file set according to the target file to be merged.
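As a non-limiting illustration of the merging rule, the following sketch keeps only the files whose size is below a threshold. Reading the rule as "smaller than the threshold means merge candidate" is an interpretation of the claim, and `DataFile` and `select_merge_candidates` are hypothetical names.

```python
from typing import NamedTuple


class DataFile(NamedTuple):
    name: str
    size: int  # size in bytes


def select_merge_candidates(files: list[DataFile], threshold: int) -> list[DataFile]:
    """Keep files smaller than the threshold; files at or above it are left untouched."""
    return [f for f in files if f.size < threshold]
```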
5. The method of claim 4, further comprising:
performing grouping operation on the second file set according to the mapping information table to obtain a grouping result, wherein the grouping result comprises at least one group, and each group comprises the target file to be merged corresponding to the same table partition;
and starting the second sub Job, and executing a merging operation on the grouping results to obtain the third file set, wherein the merging operation is used for merging the target files to be merged in each grouping.
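As a non-limiting illustration of grouping by table partition and merging each group, the following sketch uses a thread pool in place of the second sub Job and byte concatenation in place of the merging operation. Both substitutions, and the names `group_by_partition`, `merge_group` and `merge_all`, are assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_partition(names, mapping):
    """Group file names so that each group holds files of the same table partition."""
    groups = defaultdict(list)
    for name in names:
        key = tuple(sorted(mapping[name].items()))  # hashable partition key
        groups[key].append(name)
    return dict(groups)


def merge_group(contents):
    """Merge one group; plain concatenation stands in for a format-aware merge."""
    return b"".join(contents)


def merge_all(groups, read):
    """Merge every group concurrently; `read` maps a file name to its bytes."""
    keys = list(groups)
    with ThreadPoolExecutor() as pool:
        merged = pool.map(lambda k: merge_group([read(n) for n in groups[k]]), keys)
        return dict(zip(keys, merged))
```

Grouping first ensures that a merged file never mixes records from different partitions, so it can be moved into a single partition directory.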
6. The method of claim 1, wherein after sequentially moving the third set of files to a target directory, the method further comprises:
and generating a mark information file according to the third file set, wherein the mark information file is used for indicating that the third file set is moved completely.
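As a non-limiting illustration of the mark information file, the following sketch writes a marker listing the moved files. The file name `_MERGE_DONE`, the JSON content and the function `write_marker` are hypothetical; the claim only requires that the file indicate the move is complete.

```python
import json
from pathlib import Path

MARKER_NAME = "_MERGE_DONE"  # hypothetical marker file name


def write_marker(target_dir: Path, moved: list[Path]) -> Path:
    """Record the moved files in a marker that appears only once it is fully written."""
    marker = target_dir / MARKER_NAME
    tmp = marker.with_name(MARKER_NAME + ".tmp")
    tmp.write_text(json.dumps(sorted(p.name for p in moved)))
    tmp.replace(marker)  # rename is atomic on a local POSIX filesystem
    return marker
```

Writing to a temporary name and renaming prevents a reader from observing a half-written marker. The rename guarantees of a distributed file system may differ and would need to be checked separately.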
7. A file management method is characterized in that the method is applied to an execution end of a file management system; the method comprises the following steps:
and executing task submitting operation on the temporary folder to generate a task directory.
8. The method of claim 7, wherein prior to performing a task submission operation on the temporary folder, the method further comprises:
under a user directory, performing clone operation on the data file and the partition directory to which the data file belongs to generate a temporary directory and a file;
and after the task data file and the directory to which the task data file belongs are all cloned, executing the task submitting operation.
9. A file management method is characterized in that the method is applied to a drive end of a file management system; the method comprises the following steps:
starting a main thread to traverse a small file merging working directory under a user directory to obtain a clone directory, wherein the clone directory is generated by an execution end by copying a task directory, and the task directory is generated by the execution end executing task submitting operation;
starting a third sub Job and traversing the clone directory to obtain a fourth file set, wherein the fourth file set comprises clone files of data files to be written;
analyzing the fourth file set to obtain a fifth file set, wherein the fifth file set comprises real files of target data files to be merged;
starting a fourth sub Job to concurrently execute a merging operation on the fifth file set to obtain a sixth file set;
and deleting the fifth file set after the sixth file set is sequentially moved to a target directory.
10. The method of claim 9, wherein the clone directory comprises at least one subdirectory;
the starting of the third sub Job and the concurrent traversal of the clone directory to obtain a fourth file set comprises:
the driving end starts the third sub Job;
the third sub Job concurrently traverses the clone directory;
the third sub Job obtains a clone file of the data file to be written by executing a filtering operation on the path of the data file;
and constructing the fourth file set according to the clone file of the data file to be written.
11. The method of claim 10, wherein after obtaining the clone file of the data file to be written, the method further comprises:
analyzing the fourth file set according to a table partition directory structure to obtain a real file of the data file to be written;
acquiring relevant information of the real file of the data file to be written, wherein the relevant information comprises: file size, path and table partition information corresponding to the real file of the data file to be written.
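As a non-limiting illustration of resolving a clone file to its real file, the following sketch assumes that the clone directory mirrors the relative layout of the task directory, so that the partition sub-path is identical on both sides. This mirroring is an assumption, not something the claims state, and `real_path` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def real_path(clone_file: str, clone_root: str, task_root: str) -> str:
    """Map a clone file back to the real file by re-rooting its relative path."""
    rel = PurePosixPath(clone_file).relative_to(clone_root)  # e.g. dt=1/part-00000
    return str(PurePosixPath(task_root) / rel)
```

Once the real path is known, the file size and table partition information can be read from the real file and its parent directories.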
12. The method according to claim 11, wherein after obtaining the real file of the data file to be written, the method further comprises:
according to the relevant information, filtering the real file of the data file to be written to obtain the real file of the target data file to be merged;
constructing the fifth file set according to the real file of the target data file to be merged;
and performing grouping operation on the fifth file set according to the table partition information to obtain a grouping result, wherein the grouping result comprises at least one group, and each group comprises a real file of the target data file to be merged corresponding to the same table partition.
13. The method of claim 12, further comprising:
and starting the fourth sub Job, and executing a merging operation on the grouping results to obtain the sixth file set, wherein the merging operation is used for merging the target files to be merged in each grouping.
14. The method of claim 9, wherein after sequentially moving the sixth set of files to the target directory, the method further comprises:
and generating mark information according to the sixth file set, wherein the mark information is used for indicating that the sixth file set is moved completely.
15. A file management method is characterized in that the method is applied to an execution end of a file management system; the method comprises the following steps:
executing the writing operation of the data file to generate a temporary folder;
executing task submitting operation on the temporary folder to generate a task directory;
and copying the task directory to obtain a clone directory.
16. A file management system, characterized in that the file management system comprises a drive end and an execution end:
the driving end is used for starting a main thread to traverse the Job temporary folder and acquiring a task directory, wherein the task directory is generated by the execution end executing task submitting operation; starting a first sub Job and traversing the task directory to obtain a first file set, wherein the first file set comprises data files to be written; performing filtering operation on the first file set to obtain a second file set, wherein the second file set comprises files to be merged; starting a second sub Job to concurrently execute a merging operation on the second file set to obtain a third file set; after the third file set is sequentially moved to a target directory, deleting the second file set;
the execution end is used for executing task submitting operation on the temporary folder and generating a task directory.
17. A file management system, characterized in that the file management system comprises a drive end and an execution end:
the driving end is used for starting a main thread to traverse a small file merging folder under a user directory to obtain a clone directory, wherein the clone directory is generated by an execution end through copying a task directory, and the task directory is generated by the execution end executing task submitting operation; starting a third sub Job and traversing the clone directory to obtain a fourth file set, wherein the fourth file set comprises clone files of data files to be written; analyzing the fourth file set to obtain a fifth file set, wherein the fifth file set comprises real files of target data files to be merged; starting a fourth sub Job to concurrently execute a merging operation on the fifth file set to obtain a sixth file set; after the sixth file set is sequentially moved to a target directory, deleting the fifth file set;
the execution end is used for executing the writing operation of the data file and generating a temporary folder; executing task submitting operation on the temporary folder to generate a task directory; and copying the task directory to obtain the clone directory.
18. A file management apparatus applied to a drive side of a file management system, the apparatus comprising:
the acquisition unit is used for the driving end to start a main thread to traverse the temporary directory and acquire a task directory generated by the execution end executing the task submitting operation;
the acquiring unit is further configured to start a first sub Job to traverse the task directory by the main thread, and acquire a first file set;
the traversing unit is used for traversing the first file set to obtain a second file set, and the second file set comprises one or more files to be merged;
a merging unit, configured to start a second sub Job by the main thread to perform merging operation on the one or more files to be merged to obtain a third file set;
and the processing unit is used for deleting the second file set after the third file set is moved to the target directory.
19. A file management apparatus applied to an execution side of a file management system, the apparatus comprising:
and the submitting unit is used for executing task submitting operation on the temporary folder and generating a task directory.
20. A file management apparatus applied to a drive side of a file management system, the apparatus comprising:
a starting unit, used for starting a main thread to traverse a small file merging working directory under a user directory and obtain a clone directory, wherein the clone directory is generated by an execution end by copying a task directory, and the task directory is generated by the execution end executing a task submitting operation;
the starting unit is further configured to start a third sub Job and traverse the clone directory to obtain a fourth file set, where the fourth file set includes clone files of data files to be written;
the starting unit is further configured to start a fourth sub Job to concurrently execute a merge operation on the fifth file set, so as to obtain a sixth file set;
the analysis unit is used for analyzing the fourth file set to obtain a fifth file set, and the fifth file set comprises real files of the target data files to be merged;
and the processing unit is used for deleting the fifth file set after the sixth file set is sequentially moved to the target directory.
21. A file management apparatus applied to an execution side of a file management system, the apparatus comprising:
a writing unit for performing a writing operation of the data file to generate a temporary folder;
the submitting unit is used for executing task submitting operation on the temporary folder and generating a task directory;
and the replication unit is used for obtaining the clone directory by copying the task directory.
22. An electronic device comprising a processor, memory, a communication interface, and one or more programs, the one or more executable program codes being stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 1-6 or 7-8.
23. An electronic device comprising a processor, memory, a communication interface, and one or more programs, the one or more executable program codes being stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 9-14 or 15.
24. A computer-readable storage medium, characterized in that a computer program for electronic data exchange is stored, wherein the computer program causes a computer to perform the method according to any of claims 1-6 or 7-8.
25. A computer-readable storage medium, in which a computer program for electronic data exchange is stored, wherein the computer program causes a computer to perform the method according to any one of claims 9-14 or 15.
CN202210270375.4A 2022-03-18 2022-03-18 File management method and related device Pending CN114661668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210270375.4A CN114661668A (en) 2022-03-18 2022-03-18 File management method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210270375.4A CN114661668A (en) 2022-03-18 2022-03-18 File management method and related device

Publications (1)

Publication Number Publication Date
CN114661668A true CN114661668A (en) 2022-06-24

Family

ID=82028872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210270375.4A Pending CN114661668A (en) 2022-03-18 2022-03-18 File management method and related device

Country Status (1)

Country Link
CN (1) CN114661668A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952140A (en) * 2023-01-09 2023-04-11 弘泰信息技术(天津)有限公司 Computer resource management system and method based on big data
CN115952140B (en) * 2023-01-09 2023-10-27 华苏数联科技有限公司 Big data-based computer resource management system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination