CN118227571A - File repartition method, device, equipment and storage medium - Google Patents

Publication number
CN118227571A
Authority
CN
China
Prior art keywords
file
partition
data
repartitioning
identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410413340.0A
Other languages
Chinese (zh)
Inventor
杨晨
孙喜锋
李杨
杨得力
周锋
李响
曹闯
冯彦明
王志敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Zhongyuan Consumption Finance Co ltd
Original Assignee
Henan Zhongyuan Consumption Finance Co ltd
Application filed by Henan Zhongyuan Consumption Finance Co ltd
Priority to CN202410413340.0A
Publication of CN118227571A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/164 File meta data generation
    • G06F16/166 File name conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a file repartitioning method, apparatus, device, and storage medium, relating to the field of data storage. The method comprises: generating a file repartitioning plan according to a preset file size partitioning rule, and determining the file partition to be repartitioned and its corresponding file identifier; constructing a temporary file storage path in the file partition, and repartitioning the data files of the file partition according to the file repartitioning plan to generate hidden files in the temporary file storage path; and determining the current file identifier and, when the two file identifiers are the same, moving the data files that existed before repartitioning to the temporary file storage path and moving the hidden files in the temporary file storage path into the file partition. In the application, data in the partition is repartitioned by means of hidden files, so only the current file partition is repartitioned and other data is unaffected; old and new data are switched through a temporary path, so that queries neither over-count nor under-count data at any point during repartitioning.

Description

File repartition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data storage, and in particular, to a method, an apparatus, a device, and a storage medium for file repartitioning.
Background
When an enterprise runs a big-data warehouse with Hadoop as data storage, Hive as the database tool, and Spark SQL as the data query engine, the same data under a Hadoop data partition directory can be divided and stored as different numbers of files, and the choice of division strongly affects computing efficiency, storage performance, and cost. However, dividing the data into more files is not always better: the more files there are, the larger the cluster metadata catalog becomes and the higher the cluster management cost; moreover, when cluster resources are tight, too many files aggravate resource contention and reduce overall computing efficiency. It is therefore necessary to repartition Hadoop files regularly to strike a balance between computation and storage.
However, dividing Hadoop partition files unavoidably involves copying and rewriting the files in the current partition directory. Because Hive is used as the database management tool in the data warehouse, if the files are first copied out of the target partition and divided on a new path, queries issued through Spark SQL together with Hive cannot be redirected to the new partition path unless the main path of the table is changed. The newly divided files must therefore be copied back to the old partition path while the old partition's data is deleted. A Spark SQL query executing during this process cannot tell whether the data has been moved, so the query result may be inconsistent with the actual data.
Existing solutions to this problem include the following two methods:
1. making the whole table unavailable while the partition is being divided, and restoring access after the data has been copied and deleted;
2. copying the whole table's data (comprising multiple partitions) to a new path, repartitioning the data there, and, after repartitioning is finished, switching tables by renaming the new and old tables.
However, scheme 1 sacrifices the availability of part of the data in the table. In scheme 2 the table switch itself is brief, but operating on the whole data table in order to divide a single partition is costly, and computations that use only non-repartitioned partitions are also affected during the switch, which is a serious drawback. Therefore, how to improve file repartitioning while guaranteeing that non-repartitioned data remains valid is a problem to be solved in the art.
Disclosure of Invention
Accordingly, the present invention aims to provide a file repartitioning method, apparatus, device, and storage medium that repartition the data in a file partition by means of hidden files, so that only the current file partition is repartitioned without affecting other data, and that switch old and new data through a temporary path, so that queries neither over-count nor under-count data at any point during repartitioning. The specific scheme is as follows:
in a first aspect, the present application provides a file repartitioning method, including:
Generating a file repartitioning plan according to a preset file size partitioning rule, determining a file partition to be partitioned according to the file repartitioning plan, and determining a first file identifier of the file partition;
Constructing a temporary file storage path in the file partition, and re-dividing the data files of the file partition according to the file re-dividing plan to generate corresponding hidden files so as to store the hidden files in the temporary file storage path;
Determining a second file identifier of the current file partition, and comparing the first file identifier with the second file identifier;
and when the first file identifier and the second file identifier are the same, moving the data file in the file partition before the file repartition to the temporary file storage path, and moving the hidden file in the temporary file storage path to the file partition.
Optionally, the generating a file repartitioning plan according to a preset file size partitioning rule includes:
determining the calculation complexity and the data volume of the data file in the file partition, and judging whether the calculation complexity and the data volume meet preset file dividing conditions;
and determining the size of the repartitioning file according to the judging result, and generating the file repartitioning plan according to the size of the repartitioning file.
Optionally, the repartitioning the data file of the file partition according to the file repartitioning plan includes:
determining a partition file directory of the file partition according to the file repartition plan, and acquiring all file lists under the partition file directory through a preset interface;
and carrying out file segmentation and/or file merging on the file list according to the size of the repartitioning file.
Optionally, the generating the corresponding hidden file after the repartitioning the data file of the file partition includes:
and adding a preset identifier before the file name of the data file after repartitioning so as to take the data file after repartitioning as the hidden file.
Optionally, the determining the first file identifier of the file partition includes:
Determining a first file number of first data files in the file partition, and generating a corresponding first MD5 information code based on an MD5 information summary algorithm according to the first file number and the first data files;
Correspondingly, the determining the second file identifier of the current file partition includes:
And determining a second file number of second data files of the current file partition, and generating a corresponding second MD5 information code based on the MD5 information summarization algorithm according to the second file number and the second data files.
Optionally, after comparing the first file identifier and the second file identifier, the method further includes:
And deleting the hidden file in the temporary file storage path when the first file identifier and the second file identifier are different, and jumping to the step of determining the file partition to be divided according to the file repartitioning plan after the preset waiting time.
Optionally, before the moving the data file in the file partition before the file repartitioning to the temporary file storing path, the method further includes:
Generating a non-hidden invalid directory in the file partition so as to set the state of the data file in the file partition before the file is repartitioned to be prohibited from being read through the invalid directory based on the data reading rule of Hadoop; the invalid directory is a readable directory;
and after moving the hidden file in the temporary file storage path into the file partition, further comprising:
And deleting the invalid directory, and deleting the data file in the file partition before the file repartition in the temporary file storage path after the preset cleaning time.
In a second aspect, the present application provides a file repartitioning device, including:
the plan generation module is used for generating a file repartitioning plan according to a preset file size dividing rule, determining file partitions to be divided according to the file repartitioning plan, and determining first file identifiers of the file partitions;
the file dividing module is used for constructing a temporary file storage path in the file partition, and generating a corresponding hidden file after the data file of the file partition is re-divided according to the file re-dividing plan so as to store the hidden file in the temporary file storage path;
A file comparison module for determining a second file identifier of the current file partition and comparing the first file identifier with the second file identifier;
And the file moving module is used for moving the data file in the file partition before the file repartition to the temporary file storage path and moving the hidden file in the temporary file storage path to the file partition when the first file identifier and the second file identifier are the same.
In a third aspect, the present application provides an electronic device comprising a processor and a memory; the memory is used for storing a computer program, and the computer program is loaded and executed by the processor to realize the file repartition method.
In a fourth aspect, the present application provides a computer readable storage medium storing a computer program which when executed by a processor implements the aforementioned file repartition method.
In the present application, a file repartitioning plan is generated according to a preset file size partitioning rule, the file partition to be repartitioned is determined according to the plan, and a first file identifier of the file partition is determined; a temporary file storage path is constructed in the file partition, and the data files of the file partition are repartitioned according to the plan to generate corresponding hidden files, which are stored in the temporary file storage path; a second file identifier of the current file partition is then determined and compared with the first file identifier, and when the two are the same, the data files that were in the file partition before repartitioning are moved to the temporary file storage path and the hidden files in the temporary file storage path are moved into the file partition. In this way, the application exploits the configurable Spark SQL behavior of not querying hidden files in a partition and repartitions the data in the partition by means of hidden files. Because the data is repartitioned under the partition's own directory, only the current file partition is repartitioned and whole-table data need not be moved, so reads of other data are unaffected, repartitioning efficiency improves, and resource consumption over the whole process is reduced. In addition, old and new data are switched through a temporary path using these Spark characteristics, so that queries neither over-count nor under-count data at any point during repartitioning.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for repartitioning files provided by the present application;
FIG. 2 is a flow chart of file repartitioning according to the present application;
FIG. 3 is a flowchart of a specific file repartitioning method according to the present application;
FIG. 4 is a schematic diagram of a file repartitioning device according to the present application;
FIG. 5 is a block diagram of an electronic device according to the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Dividing Hadoop partition files involves copying and rewriting the files in the current partition directory. Because the data warehouse uses Hive as the database management tool, if the files are first copied out of the target partition and divided on a new path, queries issued through Spark SQL together with Hive cannot be redirected to the new partition path unless the main path of the table is changed, and a Spark SQL query executing during the process cannot tell whether the data has been moved. The present application exploits the configurable Spark SQL behavior of not querying hidden files in a partition and repartitions the data in the partition by means of hidden files. The data is repartitioned under the current partition directory, so only the current file partition is repartitioned, whole-table data is not moved, and reads of other data are unaffected. During the switch between old and new data, these Spark characteristics ensure that queries neither over-count nor under-count data at any point during repartitioning.
Referring to fig. 1, the embodiment of the invention discloses a file repartitioning method, which comprises the following steps:
step S11, a file repartitioning plan is generated according to a preset file size partitioning rule, a file partition to be partitioned is determined according to the file repartitioning plan, and a first file identifier of the file partition is determined.
It should be noted that this embodiment is implemented on a stack with Hadoop as data storage, Hive as the database tool, and Spark SQL as the data query engine, and relies on the following characteristics of Spark SQL.
When Spark SQL queries HDFS partition data files, two configurable behaviors are available:
1. hidden files are not queried;
2. if the partition data contains a non-hidden directory, the data read fails.
The partition file repartitioning process in this embodiment takes advantage of both of these Spark SQL characteristics.
First, after the data file partition that needs repartitioning is obtained, a file repartitioning plan is generated according to the preset file size partitioning rule, the file partition to be repartitioned is determined according to the plan, and a first file identifier of the file partition is determined. To generate the plan, the computational complexity and the data volume of the data files in the file partition are determined first, it is judged whether they meet the preset file division conditions, the repartitioning file size is determined according to the result, and the plan is generated according to that size. Specifically, for partitions that are computed through Spark SQL, tuning experience shows that data reads are relatively efficient when individual files are evenly distributed in the 80 MB to 200 MB range. For SQL with higher computational complexity, the partition can use 100 MB as the size of each file; for larger data volumes, for example when reading a table whose data exceeds 1 TB, files are divided at 200 MB; by default, files are divided at 120 MB. For data that must be backed up but is not read frequently, the data of the entire partition is typically divided into 1-2 files.
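The size rule above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `PartitionProfile` fields, the function name, the reading of "M" as mebibytes, and the precedence given to the 1 TB rule over the complexity rule are all hypothetical choices that the text does not specify.

```python
from dataclasses import dataclass

MB = 1024 * 1024          # assumption: "M" in the text is read as MiB
TB = 1024 * 1024 * MB


@dataclass
class PartitionProfile:
    """Hypothetical description of a partition; field names are illustrative."""
    total_bytes: int            # size of all data files in the partition
    table_bytes: int            # size of the table the partition belongs to
    high_complexity: bool       # partition feeds computation-heavy SQL
    backup_only: bool = False   # data kept for backup and rarely read


def choose_target_file_size(p: PartitionProfile) -> int:
    """Return the target size in bytes of each file after repartitioning."""
    if p.backup_only:
        # Rarely read backup data: aim for at most 2 files in the partition.
        return max(1, -(-p.total_bytes // 2))  # ceiling division
    if p.table_bytes > 1 * TB:
        # Assumption: the large-table rule takes precedence over complexity.
        return 200 * MB
    if p.high_complexity:
        return 100 * MB
    return 120 * MB  # default from the text
```

The thresholds are the ones quoted in the description; a real deployment would presumably tune them per cluster.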
When the data files of the file partition are repartitioned according to the plan, the partition file directory of the file partition is determined from the plan, the list of all files under that directory is obtained through a preset interface, and the files in the list are split and/or merged according to the repartitioning file size. That is, once the partition directory path to be repartitioned is determined, the list of all files in the original partition is obtained through the Hadoop FileSystem API, and the original partition's data is split or merged into files that follow the file size standard defined in the generated file repartitioning plan.
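As a rough sketch of the split-or-merge decision, the function below takes the sizes of the files already listed for a partition and a target file size, and reports how many output files are needed and whether that implies splitting or merging. The function name and the returned labels are hypothetical; obtaining the file list from HDFS is outside the sketch.

```python
import math
from typing import List, Tuple


def plan_output_files(file_sizes: List[int], target_size: int) -> Tuple[int, str]:
    """Return (number of output files, action) for one partition.

    file_sizes: sizes in bytes of the current data files (already listed).
    target_size: repartitioning file size in bytes taken from the plan.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    total = sum(file_sizes)
    n_out = max(1, math.ceil(total / target_size))
    if n_out > len(file_sizes):
        action = "split"   # fewer, larger files become more files
    elif n_out < len(file_sizes):
        action = "merge"   # many small files become fewer files
    else:
        action = "keep"    # file count already matches the target
    return n_out, action
```

For example, one hundred 10-byte files with a 120-byte target would be merged into 9 files, while a single 1000-byte file with the same target would be split into 9.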
It should be noted that in this embodiment the first file identifier of the file partition needs to be determined. Specifically, the number of all files in the current partition and the MD5 (message-digest algorithm) value of the combined data of all those files can be collected: the first file number of the first data files in the file partition is determined, and a corresponding first MD5 information code is generated from the first file number and the first data files using the MD5 message-digest algorithm. The identifier makes it possible to confirm whether the file data has been modified, which ensures that the data read by the user does not change during repartitioning.
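A minimal sketch of such an identifier follows, using a local directory as a stand-in for an HDFS partition. The function names are hypothetical, and treating names that start with "_" or "." as hidden follows the usual Hadoop input-format convention rather than anything stated in this text.

```python
import hashlib
import os


def is_hidden(name: str) -> bool:
    # Assumption: hidden means a name starting with "_" or ".",
    # the convention Hadoop input formats use when listing files.
    return name.startswith("_") or name.startswith(".")


def partition_identifier(partition_dir: str) -> str:
    """MD5 over the count and contents of the non-hidden files in a directory."""
    names = sorted(
        n for n in os.listdir(partition_dir)
        if not is_hidden(n) and os.path.isfile(os.path.join(partition_dir, n))
    )
    digest = hashlib.md5()
    digest.update(str(len(names)).encode("utf-8"))   # file count
    for name in names:                               # file data, in name order
        with open(os.path.join(partition_dir, name), "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Because hidden files are excluded, writing the repartitioned output as hidden files leaves the identifier unchanged, while any change to the visible data files alters it.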
And step S12, constructing a temporary file storage path in the file partition, and re-dividing the data files of the file partition according to the file re-dividing plan to generate corresponding hidden files so as to store the hidden files in the temporary file storage path.
In this embodiment, the Spark SQL characteristic of not querying hidden files when querying HDFS partition data files is used: a temporary file storage path is constructed in the file partition, and the data files in the file partition are repartitioned according to the file repartitioning plan to generate corresponding hidden files, which are stored in the temporary file storage path, as shown in fig. 2. When the repartitioned data files are generated as hidden files, a preset identifier is added before the file name of each repartitioned data file; with this prefix the file is hidden, and Spark SQL skips it by default when querying the partition. In this scheme the new hidden files are generated on a temporary path and do not participate in Spark SQL computation, so users do not perceive them and cannot read file data that is still being repartitioned.
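The naming step can be illustrated with a few helper functions on a local file system. The prefix "." is an assumption made for this sketch, since the preset identifier is not legible in this text; the `part-NNNNN` file names and all function names are likewise hypothetical.

```python
import os
from typing import Iterable, List

HIDDEN_PREFIX = "."  # assumption: the actual preset identifier is not given here


def hidden_name(name: str, prefix: str = HIDDEN_PREFIX) -> str:
    """Prepend the hiding prefix unless the name already carries it."""
    return name if name.startswith(prefix) else prefix + name


def visible_name(name: str, prefix: str = HIDDEN_PREFIX) -> str:
    """Strip the hiding prefix, if present."""
    return name[len(prefix):] if name.startswith(prefix) else name


def write_hidden_files(chunks: Iterable[bytes], temp_dir: str) -> List[str]:
    """Write each repartitioned chunk as a hidden file under temp_dir."""
    os.makedirs(temp_dir, exist_ok=True)
    written = []
    for i, chunk in enumerate(chunks):
        path = os.path.join(temp_dir, hidden_name(f"part-{i:05d}"))
        with open(path, "wb") as fh:
            fh.write(chunk)
        written.append(path)
    return written
```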
Step S13, determining a second file identifier of the current file partition, and comparing the first file identifier with the second file identifier.
In this embodiment, the second file identifier of the current file partition, which excludes the hidden files, is determined in the same way as the first file identifier in step S11: the second file number of the second data files of the current file partition is determined, and a corresponding second MD5 information code is generated from the second file number and the second data files using the MD5 message-digest algorithm. This verifies that the file partition is unchanged after the hidden files have been generated, that is, that no user modified the file data in the file partition while repartitioning was in progress.
And S14, when the first file identifier and the second file identifier are the same, moving the data file in the file partition before the file repartition to the temporary file storage path, and moving the hidden file in the temporary file storage path to the file partition.
In this embodiment, as shown in fig. 2, the MD5 values of the file partition before and after repartitioning are compared. When the first file identifier and the second file identifier are the same, the data files that were in the file partition before repartitioning are moved to the temporary file storage path, and the hidden files in the temporary file storage path are moved into the file partition. The original non-hidden data under the partition is moved to a temporary storage directory (e.g., /tmp/filecompress/{original path}), and the temporary hidden data is moved into the partition by changing its path. This process only modifies HDFS metadata and does not actually create or delete data, so the operation completes within seconds.
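The switch can be pictured as two batches of renames. The sketch below uses `os.rename` on a local file system as a stand-in for an HDFS rename; separating the staging directory (where hidden files were generated) from the backup directory (where old files go) is an assumption of this sketch, as are the function name and the "." prefix.

```python
import os


def swap_partition_files(partition_dir: str, staging_dir: str,
                         backup_dir: str, prefix: str = ".") -> None:
    """Move visible old files out, then move hidden new files in, unhidden.

    Only renames are used, so no file data is copied or rewritten.
    """
    os.makedirs(backup_dir, exist_ok=True)
    # 1. Old, visible data files leave the partition.
    for name in os.listdir(partition_dir):
        src = os.path.join(partition_dir, name)
        if os.path.isfile(src) and not name.startswith(prefix):
            os.rename(src, os.path.join(backup_dir, name))
    # 2. New hidden files enter the partition under visible names.
    for name in os.listdir(staging_dir):
        src = os.path.join(staging_dir, name)
        if os.path.isfile(src) and name.startswith(prefix):
            os.rename(src, os.path.join(partition_dir, name[len(prefix):]))
```

Since both steps are renames within one file system, the cost depends on the number of files rather than on the amount of data, which is consistent with the metadata-only behavior described above.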
In this embodiment, the computational complexity and the data volume of the data files in the file partition are determined first, it is judged whether they meet the preset file division conditions, the repartitioning file size is determined according to the result, and the file repartitioning plan is generated according to that size. The file partition to be repartitioned is determined according to the plan and its first file identifier is determined; a temporary file storage path is constructed in the file partition, and the data files of the file partition are repartitioned according to the plan to generate corresponding hidden files, which are stored in the temporary file storage path. A second file identifier of the current file partition is then determined and compared with the first; when the two are the same, the data files that were in the file partition before repartitioning are moved to the temporary file storage path and the hidden files in the temporary file storage path are moved into the file partition. In this way, this embodiment exploits the configurable Spark SQL behavior of not querying hidden files in a partition and repartitions data in the partition by means of hidden files. Repartitioning under the partition's own directory reduces network IO (Input/Output), and only the target partition is repartitioned without moving whole-table data, which improves repartitioning efficiency and reduces resource consumption over the whole process. In addition, the switch between old and new data uses these Spark characteristics so that queries neither over-count nor under-count data at any point during repartitioning.
Based on the above embodiment, the present application repartitions data under the partition directory by means of hidden files, so that only the current file partition is repartitioned and whole-table data is not moved. This embodiment describes in detail the file transfer process after file repartitioning. Referring to fig. 3, the embodiment of the application discloses a specific file repartitioning method, which includes:
Step S21, determining a first file identifier of a file partition before file repartitioning and a second file identifier of a current file partition after file repartitioning, and comparing the first file identifier with the second file identifier.
And S22, deleting the hidden file in the temporary file storage path when the first file identifier and the second file identifier are different, and jumping to the step of determining the file partition to be divided according to the file repartitioning plan after the preset waiting time.
In this embodiment, when the first file identifier and the second file identifier are different, the hidden files in the temporary file storage path are deleted and, after a preset waiting time, the process jumps back to the step of determining the file partition to be repartitioned according to the file repartitioning plan. In other words, the number of non-hidden files in the current partition and the MD5 value of the file data are compared with the previously recorded values. If they are inconsistent, the hidden file data just generated is out of date, so the hidden data in the temporary directory is deleted, the process waits 5 minutes (since the data may still be undergoing modification), and then jumps back to the file repartitioning step. The waiting time can be adjusted according to users' historical operating habits.
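The compare-and-retry control flow might look like the following sketch. All callables are injected so the logic can be exercised without a cluster; their names are hypothetical, and the `max_attempts` bound is an assumption added here because the text does not say how many retries are allowed.

```python
import time
from typing import Callable


def repartition_with_retry(
    identify: Callable[[], str],
    build_hidden_files: Callable[[], None],
    discard_hidden_files: Callable[[], None],
    switch: Callable[[], None],
    wait_seconds: float = 300.0,      # 5 minutes, as in the text
    max_attempts: int = 3,            # assumption: the text gives no bound
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the switch has been performed, False if all attempts fail."""
    for attempt in range(max_attempts):
        first_id = identify()          # identifier before repartitioning
        build_hidden_files()           # write hidden files to the temp path
        second_id = identify()         # identifier after repartitioning
        if first_id == second_id:
            switch()                   # partition unchanged: swap old and new
            return True
        discard_hidden_files()         # partition changed: output is stale
        if attempt + 1 < max_attempts:
            sleep(wait_seconds)
    return False
```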
Step S23, when the first file identifier and the second file identifier are the same, generating a non-hidden invalid directory in the file partition so as to set the state of the data file in the file partition before the file is repartitioned to be prohibited from being read through the invalid directory based on the data reading rule of Hadoop; the invalid directory is a readable directory.
In this embodiment, as shown in fig. 2, when the first file identifier and the second file identifier are the same, a non-hidden invalid directory is generated in the file partition so that, under Hadoop's data reading rules, the data files that were in the file partition before repartitioning cannot be read; the invalid directory is a readable directory. That is, if the number of non-hidden files in the current partition and the MD5 value of the file data are consistent with the previously recorded values, the repartitioned hidden data is valid, so an invalid non-hidden directory (lockdirectory) is generated in the partition immediately, and by the Spark SQL characteristic described above the partition data can then be neither read nor modified.
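The locking effect can be simulated on a local file system. The reader below is not Spark's code; it is a hypothetical stand-in that mimics the two behaviors the scheme relies on (hidden entries are skipped, and a non-hidden directory makes the read fail). The directory name `lockdirectory` comes from the text; the other names are illustrative.

```python
import os
from typing import List


class PartitionLockedError(RuntimeError):
    """Raised by the simulated reader when a non-hidden directory is present."""


def list_readable_files(partition_dir: str) -> List[str]:
    """Simulated reader: skip hidden entries, fail on a non-hidden directory."""
    readable = []
    for name in sorted(os.listdir(partition_dir)):
        if name.startswith("_") or name.startswith("."):
            continue  # hidden entries are not queried
        if os.path.isdir(os.path.join(partition_dir, name)):
            raise PartitionLockedError(f"non-hidden directory in partition: {name}")
        readable.append(name)
    return readable


def lock_partition(partition_dir: str, lock_name: str = "lockdirectory") -> str:
    """Create the non-hidden invalid directory and return its path."""
    path = os.path.join(partition_dir, lock_name)
    os.mkdir(path)
    return path


def unlock_partition(lock_path: str) -> None:
    """Remove the invalid directory so reads succeed again."""
    os.rmdir(lock_path)
```

Creating and removing an empty directory is a single metadata operation, so the lock can bracket the file switch without adding noticeable delay.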
And step S24, moving the data file in the file partition before the file repartition to the temporary file storage path, and moving the hidden file in the temporary file storage path to the file partition.
In this embodiment, after the data files that were in the file partition before repartitioning have been moved to the temporary file storage path and the hidden files in the temporary file storage path have been moved into the file partition, the invalid directory is deleted, and after a preset cleaning time the pre-repartitioning data files in the temporary file storage path are deleted. Once the non-hidden invalid directory under the current partition is deleted, the partition resumes normal Spark SQL reads and writes. The data files in the temporary storage directory (/tmp/filecompress/{original path}) are deleted after 24 hours; this deletion time can be adjusted flexibly according to actual conditions.
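Delayed cleanup of the old files could be driven by the time of the switch, as in this sketch. The function name and the idea of passing the switch timestamp explicitly are assumptions; a scheduler or recorded metadata would have to supply that timestamp in practice.

```python
import os
import shutil
import time
from typing import Optional


def purge_backup_if_due(backup_dir: str, switched_at: float,
                        cleanup_delay_seconds: float = 24 * 3600,
                        now: Optional[float] = None) -> bool:
    """Delete backup_dir once the cleanup delay has elapsed since the switch.

    switched_at: POSIX timestamp of the old/new data switch (caller-supplied).
    Returns True if the directory was purged, False if it is not yet due.
    """
    now = time.time() if now is None else now
    if now - switched_at < cleanup_delay_seconds:
        return False
    shutil.rmtree(backup_dir, ignore_errors=True)
    return True
```

Keeping the old files for the full delay gives queries that started before the switch a place to find the data they recorded.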
Based on the above process, in this embodiment, when a Spark SQL is started, the location of the data path in the data table is first read and recorded, and when a partition file is repartitioned at this time, it may cause failure in reading data by the Spark SQL being executed. Therefore, in order to reduce the occurrence of the situation, the Spark SQL source code is modified, when the file read in the org.apache.spark.sql.category.catalyst class is found to be absent, the original file path is automatically attempted to be increased (/ tmp/filecompress) to search the file, so that the abnormal situation that the data cannot be found due to the repartitioning of the file in the Spark SQL operation is greatly reduced through optimization.
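The fallback lookup can be illustrated independently of Spark. The function below is a hypothetical local-file-system sketch of the idea, not the modified Spark source: if the original path is missing, it retries under the backup root with the original path appended.

```python
import os

BACKUP_ROOT = "/tmp/filecompress"  # backup root named in the text


def resolve_with_fallback(path: str, backup_root: str = BACKUP_ROOT) -> str:
    """Return path if it exists, else the same path under backup_root.

    Raises FileNotFoundError if the file is in neither location.
    """
    if os.path.exists(path):
        return path
    # Re-root the original absolute path under the backup directory,
    # mirroring the /tmp/filecompress/{original path} layout.
    fallback = os.path.join(backup_root, path.lstrip("/"))
    if os.path.exists(fallback):
        return fallback
    raise FileNotFoundError(path)
```

A query that recorded file paths before the switch would then still find the old files for as long as the backup directory is retained.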
It should be noted that modifying a data path and changing whether data is hidden are millisecond-level operations in this embodiment. Of the series of operations above, this switch is the only part that Spark SQL users can perceive; in a real scenario with 10,000 files in an HDFS partition, it completes in about 1 second, because only the structural information of the data is modified. The remaining time-consuming steps are not perceived by users. In this embodiment the data is therefore unavailable only during the final switch of the hidden states of the new and old data, which completes within seconds, so the whole process has very little impact on use of the partition data.
For more specific processing in step S21, reference may be made to the corresponding content disclosed in the foregoing embodiment, and no further description is given here.
In this embodiment, a first file identifier of the file partition before file repartitioning and a second file identifier of the current file partition after file repartitioning are determined first, and the two identifiers are compared. When the first file identifier and the second file identifier are different, the hidden file in the temporary file storage path is deleted, and, after a preset waiting time, the process jumps back to the step of determining the file partition to be partitioned according to the file repartitioning plan. When the first file identifier and the second file identifier are the same, a non-hidden invalid directory is generated in the file partition, and, based on the data reading rules of Hadoop, the invalid directory sets the state of the data files that were in the file partition before the repartitioning to read-prohibited; then, after the hidden file in the temporary file storage path has been moved into the file partition, the invalid directory is deleted, and, after a preset cleaning time, the pre-repartitioning data files held in the temporary file storage path are deleted. This technical solution exploits two behaviors of Spark SQL: it does not query hidden files in a partition, and a partition cannot be read while it contains a non-hidden directory. The data in the partition is repartitioned using hidden files, and the non-hidden directory enables a fast switch between new and old data, which keeps the time during which the partition is unavailable as short as possible. During file repartitioning, the data is unavailable only during the final switch of the hidden states of the new and old data, which completes within seconds, so the whole process has very little impact on use of the partition data.
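The control flow summarized above (compare the identifiers, then either switch or discard and retry) can be sketched as follows. The sketch is an assumption-laden illustration: the four callbacks, the function name `repartition_with_retry`, and the `max_attempts` cap are hypothetical and are not specified in this embodiment, which only states that the process jumps back after a preset waiting time.

```python
import time
from typing import Callable


def repartition_with_retry(
    fingerprint: Callable[[], str],   # computes the partition's file identifier
    rewrite: Callable[[], None],      # writes repartitioned hidden files to the temp path
    discard: Callable[[], None],      # deletes the hidden files in the temp path
    swap: Callable[[], None],         # performs the switch of step S24
    wait_seconds: float,              # the preset waiting time
    max_attempts: int = 3,            # assumption: the embodiment sets no limit
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once a swap happened, False if every attempt was discarded."""
    for _ in range(max_attempts):
        first_id = fingerprint()      # identifier before repartitioning
        rewrite()
        second_id = fingerprint()     # identifier of the current partition
        if first_id == second_id:
            swap()                    # partition unchanged: staged data is valid
            return True
        discard()                     # partition changed: staged data is stale
        sleep(wait_seconds)
    return False
```

Passing `sleep` as a parameter is only a testing convenience; in normal use the default `time.sleep` applies the waiting time before the next attempt.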
Referring to fig. 4, the embodiment of the application also discloses a file repartitioning device, which includes:
The plan generating module 11 is configured to generate a file repartitioning plan according to a preset file size partitioning rule, determine a file partition to be partitioned according to the file repartitioning plan, and determine a first file identifier of the file partition;
The file dividing module 12 is configured to construct a temporary file storage path in the file partition, and generate a corresponding hidden file after re-dividing the data file of the file partition according to the file re-dividing plan, so as to store the hidden file in the temporary file storage path;
A file comparison module 13, configured to determine a second file identifier of the current file partition, and compare the first file identifier with the second file identifier;
and a file moving module 14, configured to, when the first file identifier and the second file identifier are the same, move the data file in the file partition before the file repartitioning to the temporary file storage path, and move the hidden file in the temporary file storage path to the file partition.
In this embodiment, a file repartitioning plan is first generated according to a preset file size partitioning rule, a file partition to be partitioned is determined according to the file repartitioning plan, and a first file identifier of the file partition is determined. A temporary file storage path is then constructed in the file partition, and the data files of the file partition are repartitioned according to the file repartitioning plan to generate corresponding hidden files, which are stored in the temporary file storage path. A second file identifier of the current file partition is then determined and compared with the first file identifier; when the two identifiers are the same, the data files that were in the file partition before the repartitioning are moved to the temporary file storage path, and the hidden files in the temporary file storage path are moved into the file partition. This technical solution exploits the Spark SQL behavior of not querying hidden files in a partition, so the data in the partition is repartitioned using hidden files. Because the data is repartitioned under the partition's own directory, only the current file partition needs to be repartitioned and the data of the whole table does not have to be moved at the same time; reads of other data are therefore unaffected, repartitioning efficiency improves, and resource consumption over the whole process is reduced. In addition, because the repartitioned data is prepared under a temporary path before the new and old data are switched, and the characteristics of Spark are used, no data is counted more than once or omitted from computation at any point during the repartitioning.
In some specific embodiments, the plan generating module 11 specifically includes:
The condition judging unit is used for determining the calculation complexity and the data volume of the data file in the file partition and judging whether the calculation complexity and the data volume meet preset file dividing conditions or not;
And the plan generation unit is used for determining the size of the repartitioning file according to the judging result and generating the file repartitioning plan according to the size of the repartitioning file.
In some specific embodiments, the file dividing module 12 specifically includes:
The list acquisition unit is used for determining a partition file directory of the file partition according to the file repartition plan and acquiring all file lists under the partition file directory through a preset interface;
And the file dividing unit is used for dividing and/or merging the files of the file list according to the size of the repartitioned file.
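As a rough illustration of how a file list might be divided and/or merged according to a repartitioning file size, the following sketch plans the work without touching any data. The greedy grouping rule, the function name `plan_repartition`, and the representation of the file list as (name, size) pairs are assumptions for the example; the embodiment does not prescribe a particular packing strategy.

```python
import math
from typing import Dict, List, Tuple


def plan_repartition(
    files: List[Tuple[str, int]], target_size: int
) -> Tuple[List[List[str]], Dict[str, int]]:
    """Plan merges and splits for a partition's file list.

    Returns (merge_groups, split_counts):
      merge_groups: lists of file names to be merged into one output file each
      split_counts: oversized file name -> number of pieces to split it into
    """
    merge_groups: List[List[str]] = []
    split_counts: Dict[str, int] = {}
    current: List[str] = []
    current_size = 0

    for name, size in files:
        if size > target_size:
            # Oversized file: split into pieces no larger than the target.
            split_counts[name] = math.ceil(size / target_size)
            continue
        if current and current_size + size > target_size:
            # Adding this file would overflow the group, so close the group.
            merge_groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size

    if current:
        merge_groups.append(current)
    return merge_groups, split_counts
```

With a target of 128 units, three small files of 10, 20 and 30 units would be merged into one output, a 100-unit file would stay on its own, and a 300-unit file would be split into three pieces.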
In some embodiments, the file dividing module 12 specifically includes:
And the file generation unit is used for adding a preset identifier before the file name of the data file subjected to the repartitioning so as to take the data file subjected to the repartitioning as the hidden file.
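A minimal sketch of this naming step follows. The embodiment does not name the preset identifier; the underscore used here is an assumption, chosen because Hadoop's file input formats skip paths whose names begin with `_` or `.`. The helper names `hide` and `is_hidden` are hypothetical.

```python
from pathlib import PurePosixPath

HIDDEN_PREFIXES = ("_", ".")  # name prefixes that Hadoop input formats skip


def hide(path: str, prefix: str = "_") -> str:
    """Return `path` with `prefix` added in front of the file name."""
    p = PurePosixPath(path)
    return str(p.with_name(prefix + p.name))


def is_hidden(path: str) -> bool:
    """Report whether the file name starts with a hidden prefix."""
    return PurePosixPath(path).name.startswith(HIDDEN_PREFIXES)
```

For example, `hide("/warehouse/t/dt=1/part-0000.parquet")` yields `/warehouse/t/dt=1/_part-0000.parquet`; only the file name changes, not its directory.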
In some specific embodiments, the plan generating module 11 specifically includes:
A first identifier determining unit, configured to determine a first number of files of first data files in the file partition, and generate a corresponding first MD5 information code based on the MD5 message-digest algorithm according to the first number of files and the first data files;
Correspondingly, the file comparison module 13 specifically includes:
A second identifier determining unit, configured to determine a second number of files of second data files in the current file partition, and generate a corresponding second MD5 information code based on the MD5 message-digest algorithm according to the second number of files and the second data files.
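One way such an identifier could be computed is sketched below with Python's `hashlib`. How the file count and the file data are combined into a single MD5 value is not specified in this embodiment; feeding the count, then each file name and its content in sorted order, is an assumption, as are the function name `partition_fingerprint` and the use of a local directory in place of an HDFS partition.

```python
import hashlib
from pathlib import Path


def partition_fingerprint(partition: Path) -> str:
    """MD5 over the count, names and contents of the non-hidden data files."""
    files = sorted(
        p for p in partition.iterdir()
        if p.is_file() and not p.name.startswith(("_", "."))
    )
    digest = hashlib.md5()
    digest.update(str(len(files)).encode())       # number of files
    for f in files:
        digest.update(f.name.encode())
        with f.open("rb") as fh:
            # Read in 1 MiB chunks so large files are not loaded at once.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Calling the function once before repartitioning and once afterwards gives the first and second identifiers; hidden files staged in the meantime do not affect the value, whereas any added, removed or modified data file does.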
In some embodiments, the file comparison module 13 further includes:
A file deleting unit, configured to delete the hidden file in the temporary file storage path when the first file identifier and the second file identifier are different, and, after a preset waiting time, jump to the step of determining the file partition to be partitioned according to the file repartitioning plan.
In some embodiments, the file movement module 14 further comprises:
a directory generating unit configured to generate a non-hidden invalid directory in the file partition, so that a state of a data file in the file partition before file repartition is set to be prohibited from being read by the invalid directory based on a data reading rule of Hadoop; the invalid directory is a readable directory;
And the catalog deleting unit is used for deleting the invalid catalog and deleting the data files in the file partition before the file repartition in the temporary file storage path after the preset cleaning time.
Further, the embodiment of the present application further discloses an electronic device, and fig. 5 is a block diagram of an electronic device 20 according to an exemplary embodiment, where the content of the figure is not to be considered as any limitation on the scope of use of the present application.
Fig. 5 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is configured to store a computer program that is loaded and executed by the processor 21 to implement the relevant steps in the file repartitioning method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 is used for managing and controlling the various hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, NetWare, Unix, Linux, etc. In addition to the computer program that can be used to perform the file repartitioning method performed by the electronic device 20 as disclosed in any of the previous embodiments, the computer program 222 may further include computer programs that can be used to perform other specific tasks.
Further, the application also discloses a computer readable storage medium for storing a computer program; wherein the computer program, when executed by a processor, implements the file repartitioning method of the foregoing disclosure. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be understood by referring to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The application has been described in detail above, and its principles and embodiments have been set out so that they may be better understood. Meanwhile, those skilled in the art may, in accordance with the ideas of the present application, make changes to the specific embodiments and to the scope of application. In view of the above, this description should not be construed as limiting the present application.

Claims (10)

1. A method for repartitioning a file, comprising:
Generating a file repartitioning plan according to a preset file size partitioning rule, determining a file partition to be partitioned according to the file repartitioning plan, and determining a first file identifier of the file partition;
Constructing a temporary file storage path in the file partition, and re-dividing the data files of the file partition according to the file re-dividing plan to generate corresponding hidden files so as to store the hidden files in the temporary file storage path;
Determining a second file identifier of the current file partition, and comparing the first file identifier with the second file identifier;
and when the first file identifier and the second file identifier are the same, moving the data file in the file partition before the file repartition to the temporary file storage path, and moving the hidden file in the temporary file storage path to the file partition.
2. The file repartitioning method according to claim 1, wherein said generating a file repartitioning plan according to a preset file size partitioning rule includes:
determining the calculation complexity and the data volume of the data file in the file partition, and judging whether the calculation complexity and the data volume meet preset file dividing conditions;
and determining the size of the repartitioning file according to the judging result, and generating the file repartitioning plan according to the size of the repartitioning file.
3. The method of file repartitioning according to claim 2, wherein said repartitioning the data file of the file partition according to the file repartitioning plan comprises:
determining a partition file directory of the file partition according to the file repartition plan, and acquiring all file lists under the partition file directory through a preset interface;
and carrying out file segmentation and/or file merging on the file list according to the size of the repartitioning file.
4. The method for repartitioning a file according to claim 1, wherein said repartitioning the data file of the file partition to generate a corresponding hidden file includes:
and adding a preset identifier before the file name of the data file after repartitioning so as to take the data file after repartitioning as the hidden file.
5. The method of file repartitioning according to claim 1, wherein said determining a first file identifier of said file partition comprises:
Determining a first file number of first data files in the file partition, and generating a corresponding first MD5 information code based on the MD5 message-digest algorithm according to the first file number and the first data files;
Correspondingly, the determining the second file identifier of the current file partition includes:
And determining a second file number of second data files of the current file partition, and generating a corresponding second MD5 information code based on the MD5 message-digest algorithm according to the second file number and the second data files.
6. The method of file repartitioning according to claim 1, wherein said comparing said first file identifier with said second file identifier further comprises:
And deleting the hidden file in the temporary file storage path when the first file identifier and the second file identifier are different, and jumping to the step of determining the file partition to be divided according to the file repartitioning plan after the preset waiting time.
7. The method according to any one of claims 1 to 6, wherein before moving the data file in the file partition before the file repartitioning to the temporary file storage path, further comprising:
Generating a non-hidden invalid directory in the file partition so as to set the state of the data file in the file partition before the file is repartitioned to be prohibited from being read through the invalid directory based on the data reading rule of Hadoop; the invalid directory is a readable directory;
and after moving the hidden file in the temporary file storage path into the file partition, further comprising:
And deleting the invalid directory, and deleting the data file in the file partition before the file repartition in the temporary file storage path after the preset cleaning time.
8. A file repartitioning device, characterized by comprising:
the plan generation module is used for generating a file repartitioning plan according to a preset file size dividing rule, determining file partitions to be divided according to the file repartitioning plan, and determining first file identifiers of the file partitions;
the file dividing module is used for constructing a temporary file storage path in the file partition, and generating a corresponding hidden file after the data file of the file partition is re-divided according to the file re-dividing plan so as to store the hidden file in the temporary file storage path;
A file comparison module for determining a second file identifier of the current file partition and comparing the first file identifier with the second file identifier;
And the file moving module is used for moving the data file in the file partition before the file repartition to the temporary file storage path and moving the hidden file in the temporary file storage path to the file partition when the first file identifier and the second file identifier are the same.
9. An electronic device comprising a processor and a memory; wherein the memory is for storing a computer program that is loaded and executed by the processor to implement the file repartition method as claimed in any one of claims 1 to 7.
10. A computer readable storage medium for storing a computer program which when executed by a processor implements the method of file repartition according to any one of claims 1 to 7.
CN202410413340.0A 2024-04-08 2024-04-08 File repartition method, device, equipment and storage medium Pending CN118227571A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410413340.0A CN118227571A (en) 2024-04-08 2024-04-08 File repartition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410413340.0A CN118227571A (en) 2024-04-08 2024-04-08 File repartition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118227571A true CN118227571A (en) 2024-06-21

Family

ID=91503644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410413340.0A Pending CN118227571A (en) 2024-04-08 2024-04-08 File repartition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118227571A (en)

Similar Documents

Publication Publication Date Title
US11809726B2 (en) Distributed storage method and device
US10896102B2 (en) Implementing secure communication in a distributed computing system
US20190303382A1 (en) Distributed database systems and methods with pluggable storage engines
US5819272A (en) Record tracking in database replication
JP4117265B2 (en) Method and system for managing file system versions
CN101639835A (en) Method and device for partitioning application database in multi-tenant scene
CN104615606A (en) Hadoop distributed file system and management method thereof
CN106210015B (en) Cloud storage method for hot data caching in hybrid cloud structure
CN112789606A (en) Data redistribution method, device and system
CN111324606B (en) Data slicing method and device
CN112181309A (en) Online capacity expansion method for mass object storage
CN106796588B (en) The update method and equipment of concordance list
US11151081B1 (en) Data tiering service with cold tier indexing
US10817203B1 (en) Client-configurable data tiering service
CN111917834A (en) Data synchronization method and device, storage medium and computer equipment
CN103607424A (en) Server connection method and server system
US6055534A (en) File management system and file management method
CN109684270A (en) Database filing method, apparatus, system, equipment and readable storage medium storing program for executing
CN115114294A (en) Self-adaption method and device of database storage mode and computer equipment
CN112840334A (en) Method and device for managing data of partition table, management node and storage medium
US9898518B2 (en) Computer system, data allocation management method, and program
Arrieta-Salinas et al. Classic replication techniques on the cloud
CN101483668A (en) Network storage and access method, device and system for hot spot data
CN111459913B (en) Capacity expansion method and device of distributed database and electronic equipment
CN108304555A (en) Distributed maps data processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination