
CN116185995A - Data migration method, device, electronic equipment and storage medium

Info

Publication number: CN116185995A
Application number: CN202310208533.8A
Authority: CN
Inventors: 林鹏程
Original assignee: Hangzhou Dt Dream Technology Co Ltd
Current assignee: Hangzhou Dt Dream Technology Co Ltd
Priority date: 2023-02-24
Filing date: 2023-02-24
Publication date: 2023-05-30

Abstract

The application provides a data migration method, a data migration device, and an electronic device, and relates to the technical field of data processing. The method comprises: determining a first heat value of each first file set according to first metadata of each first file set in object storage in a data lake; determining, from the first file sets and according to their first heat values, a second file set to be migrated, wherein the first heat value of the second file set is greater than a first heat threshold; and migrating each first file in the second file set to a distributed file system of the data lake. The heat value of each first file set can thus be determined from the metadata of that file set in the object storage, the second file set belonging to hot data can be identified from those heat values, and every file in that second file set can be migrated as a whole to the distributed file system, thereby improving the performance of the data storage system in processing hot data.

Description

Data migration method, device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of data migration technologies, and in particular, to a data migration method, a data migration device, an electronic device, and a storage medium.

Background

A data lake is a repository or system that stores data in its raw format, as is, without requiring the data to be structured in advance. Current data lakes employ object storage and a distributed file system (e.g., HDFS (Hadoop Distributed File System)) as the data storage system. The distributed file system offers considerably better computational performance than object storage, but because it stores multiple copies of the data, its storage cost is higher than that of object storage. Thus, to balance the computational performance and storage cost of the data storage system, hot data may be stored in the distributed file system of the data lake and cold data in the object storage of the data lake.

Currently, in some scenarios, all files in the same file set in the object storage need to be migrated as a whole, with the file set as the unit. For example, the training data of a neural network model is usually stored in one file set. During or before training, some files in the file set (i.e., some of the training data) may be updated, so that those files change from cold data to hot data while the remaining files are still cold data. If only the hot files in the file set were migrated to the distributed file system, the training data would become harder to manage and training of the model could be affected.

In the related art, files are migrated only individually, with the file as the unit, and no technical scheme exists for migrating a file set as a whole. How to migrate hot data in the object storage to the distributed file system in units of file sets is therefore very important.

Disclosure of Invention

The object of the present application is to solve at least to some extent one of the above technical problems.

To this end, the application provides a data migration method, a device, an electronic device, and a storage medium, which determine the heat value of each first file set based on the metadata of each first file set in object storage, determine from those heat values a second file set belonging to hot data, and migrate every file in that second file set as a whole to a distributed file system, thereby improving the performance of the data storage system in processing hot data.

An embodiment of a first aspect of the present application provides a data migration method, including:

acquiring first metadata of at least one first file set in object storage in a data lake; wherein the first set of files includes at least one first file;

determining a first heat value of each first file set according to the first metadata of each first file set;

determining a second file set to be migrated from each first file set according to the first heat value of each first file set; wherein the first heat value of the second file set is greater than a first heat threshold;

and migrating each first file in the second file set to a distributed file system of the data lake.

An embodiment of a second aspect of the present application provides a data migration apparatus, including:

the first acquisition module is used for acquiring first metadata of at least one first file set in object storage in the data lake; wherein the first set of files includes at least one first file;

the first determining module is used for determining a first heat value of each first file set according to the first metadata of each first file set;

the second determining module is used for determining a second file set to be migrated from each first file set according to the first heat value of each first file set; wherein the first heat value of the second file set is greater than a first heat threshold;

and the migration module is used for migrating each first file in the second file set to the distributed file system of the data lake.

An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the data migration method as described in the first aspect when executing the program.

An embodiment of a fourth aspect of the present application proposes a non-transitory computer readable storage medium, on which a computer program is stored, which program, when executed by a processor, implements a data migration method according to the first aspect.

An embodiment of a fifth aspect of the present application proposes a computer program product comprising a computer program which, when executed by a processor, implements a data migration method according to the first aspect of the present application.

The technical scheme provided by the embodiment of the application at least brings the following beneficial effects:

A first heat value of each first file set is determined according to first metadata of each first file set in object storage in a data lake, and a second file set to be migrated is determined from the first file sets according to their first heat values, wherein the first heat value of the second file set is greater than the first heat threshold; each first file in the second file set is then migrated to a distributed file system of the data lake. The heat value of each first file set can thus be determined from the metadata of that file set in the object storage, the second file set belonging to hot data can be identified from those heat values, and every file in that second file set can be migrated as a whole to the distributed file system, thereby improving the performance of the data storage system in processing hot data.

Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic flow chart of a data migration method according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating another data migration method according to an embodiment of the present disclosure;

FIG. 3 is a flowchart illustrating another data migration method according to an embodiment of the present disclosure;

FIG. 4 is a flowchart illustrating another data migration method according to an embodiment of the present disclosure;

FIG. 5 is a flowchart illustrating another data migration method according to an embodiment of the present disclosure;

FIG. 6 is a flowchart illustrating another data migration method according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of a data lake according to an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of a data migration apparatus according to one embodiment of the present application;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.

Current data lakes employ object storage and a distributed file system (e.g., HDFS) as the data storage system. The distributed file system offers considerably better computational performance than object storage, but because it stores multiple copies of the data, its storage cost is higher than that of object storage. Thus, to balance the computational performance and storage cost of the data storage system, hot data may be stored in the distributed file system of the data lake and cold data in the object storage of the data lake.

For a set of files or multiple files under a folder, distributed file systems and object stores have different implementations:

A folder in the object storage is only a logical concept. When a folder is created through an API (Application Programming Interface) or SDK (Software Development Kit), the key value of the object (such as abc/1.jpg) can be specified so that a folder is formed logically. For example, defining the key of an object as abc/1.jpg creates a folder abc under the bucket and a file 1.jpg under that folder.

A folder in the object storage is in fact an empty object of size 0 KB, so when the user creates an object whose key value is 1/, folder 1 is defined. If the user instead creates the file abc/1.jpg, the system does not create an object for the folder abc/, so after abc/1.jpg is deleted, the folder abc no longer exists.

Because object storage is distributed, objects are not physically stored by folder; that is, the files under a folder are not all stored together. In back-end storage, files under different folders differ only in their key prefix, so under this architecture, summary information for a folder, such as the size of the folder or its access frequency, cannot be collected conveniently. To traverse all files under a folder, the key values of all files under it must first be obtained through the ListObject interface (with the folder specified by prefix), and the operation is then performed on those keys.
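
The prefix scan described above can be illustrated with a minimal sketch. The object store is modeled here as a plain in-memory dictionary from key to size in bytes; `BUCKET`, `list_keys_by_prefix`, and `folder_size` are hypothetical names introduced for illustration and do not correspond to any real SDK call.

```python
from typing import Dict, List

# Hypothetical flat key space: object key -> size in bytes.
# A real object store would be queried through its list interface instead.
BUCKET: Dict[str, int] = {
    "abc/1.jpg": 2048,
    "abc/2.jpg": 4096,
    "abcd/3.jpg": 512,
    "logs/2023/01.txt": 100,
}


def list_keys_by_prefix(bucket: Dict[str, int], prefix: str) -> List[str]:
    """Return every key that starts with the prefix, in sorted order."""
    return sorted(key for key in bucket if key.startswith(prefix))


def folder_size(bucket: Dict[str, int], folder: str) -> int:
    """Sum the sizes of all objects under a logical folder."""
    # The trailing slash keeps "abc/" from also matching "abcd/...".
    prefix = folder.rstrip("/") + "/"
    return sum(bucket[key] for key in list_keys_by_prefix(bucket, prefix))
```

Because the store has no folder record to consult, even a simple total such as the folder size requires visiting every key that shares the prefix.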

Because summary information for folders in the object storage cannot be collected conveniently, the heat value of a folder cannot be calculated from it. As a result, a file set consisting of the files under one folder cannot be effectively measured as hot or cold as a unit; that is, the hot/cold evaluation index (the heat value) of the file set cannot be effectively calculated. Whether the file set is hot data or cold data therefore cannot be judged from that index, and hot and cold data cannot be migrated accordingly.

In view of the above problems, embodiments of the present application provide a data migration method, a data migration device, and an electronic device. Before describing the embodiments of the present application in detail, for ease of understanding, the general technical words are first introduced:

Object storage, also known as object-based storage, is a generic term for a method of addressing and processing discrete units, which are called objects. Objects resemble files in that both contain data; they differ from files in that objects are not organized in a hierarchy. Every object sits at the same level of a flat address space called a storage pool, and no object belongs to a level beneath another object.

Wherein both the file and the object have metadata associated with their own contained data, and the object is characterized by extended metadata. Each object is assigned a unique identifier that allows a server or end user to retrieve the object without having to know the physical address of the data.

HDFS is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant, is suitable for deployment on inexpensive machines, provides high-throughput data access, and is well suited to applications with large data sets. HDFS relaxes some POSIX (Portable Operating System Interface) constraints to enable streaming access to file system data.

The data migration method provided in the present application is described in detail below with reference to fig. 1.

Fig. 1 is a flow chart of a data migration method according to an embodiment of the present application.

The data migration method of the embodiment of the application can be executed by the data migration device provided by the embodiment of the application. The data migration device can be applied to electronic equipment to execute a data migration function. Alternatively, the data migration apparatus may be configured in an application of the electronic device, so that the application may perform the data migration function.

The electronic device may be any device with computing capability, and the device or an application in the device may be capable of performing a data migration function. The device with computing capability may be, for example, a personal computer (Personal Computer, abbreviated as PC), a mobile terminal, a server, etc., and the mobile terminal may be, for example, a vehicle-mounted device, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, etc., and may be a hardware device with various operating systems, a touch screen, and/or a display screen.

As shown in fig. 1, the data migration method includes the steps of:

Step S101, acquiring first metadata of at least one first file set in object storage in a data lake; wherein the first set of files includes at least one first file.

In embodiments of the present application, unified metadata information may be established for a set of files stored to an object store and distributed file system (e.g., HDFS) in a data lake, maintained by a unified data service (e.g., metadata service). That is, metadata information for each set of files in the object storage and distributed file system may be maintained by the data service (e.g., metadata service) described above.

In an embodiment of the present application, each file set in the object store (denoted as a first file set in the present application) may include at least one file (denoted as a first file in the present application), where the first file may include, but is not limited to: picture files, document files, PDF (Portable Document Format) files, audio files, video files, etc.

For example, each file in the object store in the data lake may be parsed to determine each file belonging to the same file set, e.g., each file belonging to the same file set may be determined according to key values of each file.
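
One way to group files into file sets by key value is sketched below. The grouping rule (everything before the last `/` in the key names the file set) and the function name `group_into_file_sets` are assumptions made for illustration; the source does not fix a particular rule.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_into_file_sets(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Group object keys by everything before the last '/' in the key."""
    file_sets: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        if key.endswith("/"):
            continue  # skip zero-byte folder marker objects
        folder, sep, _name = key.rpartition("/")
        # Keys without a '/' are placed in a set named "" (the bucket root).
        file_sets[folder if sep else ""].append(key)
    return dict(file_sets)
```

For instance, the keys `train/a.jpg` and `train/b.jpg` would both land in the file set `train` under this rule.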

In the embodiment of the present application, metadata information of the first file set (referred to as first metadata in the present application) may be determined according to the file metadata of each first file in the first file set. File metadata includes, but is not limited to: creation time, update time, access time, number of accesses, access average interval, data source, lake entry time (i.e., the time at which the data entered the lake), lake entry task (e.g., full synchronization or incremental synchronization), integration job, file type, author information, storage creation time, storage address, etc.

The creation time and the lake entry time may be collectively called the lake entry creation time, which refers to the time at which the file was created when it was brought into the data lake. The data source refers to the connection address and connection type of the data source from which the file was extracted outside the data lake. The lake entry task refers to the name (and IP (Internet Protocol) address) of the data integration tool and the name of the data integration task used when the file was extracted from outside the data lake. The file type refers to the type to which the file belongs. The storage creation time is the time at which the file was migrated to its new storage system, or the time at which it entered the lake. It needs to be updated on migration because migrating cold data from the distributed file system (e.g., HDFS) to the object storage creates a new file in the object storage and deletes the old file in the distributed file system, and migrating hot data from the object storage to the distributed file system (e.g., HDFS) creates a new file in the distributed file system and deletes the old file in the object storage.
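
The source states that set-level metadata is derived from per-file metadata but does not give the aggregation rule. The sketch below assumes one plausible rule: take the latest creation, update, and access times, sum the access counts, and use the access-weighted mean of the per-file average intervals. `FileMetadata` and `aggregate_set_metadata` are hypothetical names, and only the time and access fields are modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileMetadata:
    creation_time: float        # seconds since the epoch
    update_time: float
    access_time: float
    access_count: int
    avg_access_interval: float  # seconds


def aggregate_set_metadata(files: List[FileMetadata]) -> FileMetadata:
    """Roll per-file metadata up into one record for the file set (assumed rule)."""
    if not files:
        raise ValueError("a file set contains at least one file")
    total_accesses = sum(f.access_count for f in files)
    if total_accesses > 0:
        # Access-weighted mean of the per-file average intervals.
        avg_interval = (
            sum(f.avg_access_interval * f.access_count for f in files) / total_accesses
        )
    else:
        # Never accessed: treat the interval as unbounded rather than zero.
        avg_interval = float("inf")
    return FileMetadata(
        creation_time=max(f.creation_time for f in files),
        update_time=max(f.update_time for f in files),
        access_time=max(f.access_time for f in files),
        access_count=total_accesses,
        avg_access_interval=avg_interval,
    )
```

Taking the maximum of each timestamp means a single recently touched file makes the whole set look recent, which matches the training-data scenario in which a partial update should warm the entire set; other rules (for example, the mean) would be equally compatible with the text.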

The access average interval can be updated from the current access time $T_t$, the most recent access time before the update (i.e., the previous access time) $T_{t-1}$, the access average interval before the update, and the number of accesses before the update $N$: the interval between the current and previous accesses is added to the product of the previous access average interval and $N$, and the sum is divided by the number of accesses after the update, $N + 1$.

For example, denoting the access average interval corresponding to the current access as $S_t$ and the access average interval before the update as $S_{t-1}$:

$$S_t = \frac{(T_t - T_{t-1}) + S_{t-1} \cdot N}{N + 1}$$
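
The running-average update above can be written as a small function. This is a minimal sketch; the function and parameter names are illustrative, and the input checks are an added assumption rather than part of the source.

```python
def update_average_interval(
    prev_avg: float,        # S_{t-1}: average interval before this access
    prev_count: int,        # N: number of accesses before this access
    prev_access: float,     # T_{t-1}: previous access time
    current_access: float,  # T_t: current access time
) -> float:
    """Return S_t = ((T_t - T_{t-1}) + S_{t-1} * N) / (N + 1)."""
    if prev_count < 0:
        raise ValueError("prev_count must be non-negative")
    if current_access < prev_access:
        raise ValueError("current_access must not precede prev_access")
    return ((current_access - prev_access) + prev_avg * prev_count) / (prev_count + 1)
```

With a previous average of 10 seconds over 2 accesses and a new gap of 40 seconds, the updated average is (40 + 10 * 2) / 3 = 20 seconds. The update needs only the stored average and count, so no per-access history has to be kept.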

Step S102, according to the first metadata of each first file set, determining a first heat value of each first file set.

In this embodiment of the present application, for any one first file set, the first heat value of that first file set may be determined according to the first metadata of the first file set.

As an example, the first heat value of the first file set may be determined according to at least one of a creation time (denoted as a first creation time in this application), a storage creation time (denoted as a first storage creation time in this application), an update time (denoted as a first update time in this application), an access time (denoted as a first access time in this application), a number of accesses (denoted as a first access number in this application), and an access average interval (denoted as a first access average interval in this application) in the first metadata of the first file set.

The closer the first creation time is to the current time, the greater the first heat value; the closer the first storage creation time is to the current time, the greater the first heat value; the closer the first update time is to the current time, the greater the first heat value; the closer the first access time is to the current time, the greater the first heat value; the larger the first access number, the greater the first heat value; and the shorter the first access average interval, the greater the first heat value.

Step S103, determining a second file set to be migrated from each first file set according to the first heat value of each first file set; wherein the first heat value of the second set of files is greater than the first heat threshold.

The first heat threshold is a preset heat threshold, and the first heat threshold is a relatively large heat value.

The number of the second file sets may be one or may be multiple, which is not limited in this application.

In this embodiment of the present application, for any first file set in the object storage, it may be determined whether the first heat value of the first file set is greater than the first heat threshold. If it is, the first file set is hot data and may be taken as a second file set to be migrated. If the first heat value is less than or equal to the first heat threshold, the first file set is cold data and no processing is required.
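
The threshold comparison can be sketched as a simple filter. The function name `select_sets_to_migrate` and the dictionary from file set name to heat value are assumptions made for illustration.

```python
from typing import Dict, List


def select_sets_to_migrate(heat_by_set: Dict[str, float], threshold: float) -> List[str]:
    """Return the names of file sets whose heat value is strictly above the threshold."""
    return sorted(name for name, heat in heat_by_set.items() if heat > threshold)
```

A file set whose heat value equals the threshold is treated as cold data and is left in place, in line with the strict comparison described above.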

Step S104, each first file in the second file set is migrated to the distributed file system of the data lake.

In this embodiment of the present application, each first file of the second file set in the object storage may be migrated to the distributed file system in the data lake. For example, the first files (such as hot files) in the second file set may be traversed, a directory may be created in the distributed file system, each file may be copied to the distributed file system, and the original file in the object storage may be deleted.

Optionally, after migrating the first files in the second file set to the distributed file system, file metadata (e.g., storage address, storage creation time) of each first file in the second file set may be updated, and first metadata (e.g., first storage creation time) of the second file set may be updated.
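
The copy, delete, and metadata-update sequence can be sketched with both storage systems modeled as in-memory dictionaries. Everything here is hypothetical: `migrate_file_set`, the `/lake` root, and the metadata field names are illustrative, and a dictionary has no directories, so directory creation is implied by the target path.

```python
from typing import Dict, Iterable, List


def migrate_file_set(
    keys: Iterable[str],
    object_store: Dict[str, bytes],
    dfs: Dict[str, bytes],
    file_metadata: Dict[str, dict],
    now: float,
    dfs_root: str = "/lake",
) -> List[str]:
    """Move every file of one file set from the object store to the file system."""
    keys = list(keys)
    new_paths = [f"{dfs_root}/{key}" for key in keys]
    # Copy every file first so a failed copy leaves the object store untouched.
    for key, path in zip(keys, new_paths):
        dfs[path] = object_store[key]
    # Delete the originals and update metadata only after all copies exist.
    for key, path in zip(keys, new_paths):
        del object_store[key]
        file_metadata[key]["storage_address"] = path
        file_metadata[key]["storage_creation_time"] = now
    return new_paths
```

Copying the whole set before deleting anything is a design choice of this sketch, not something the source prescribes; it keeps the file set intact in one place if the copy is interrupted.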

It can be understood that when a file set includes a plurality of files, all of them may be hot files, or some may be hot files and others cold files. In this application, all files in a file set with a relatively high heat value in the object storage are migrated as a whole, with the file set as the unit; even if the file set contains cold files, those cold files are also migrated to the distributed file system. That is, the heat value is calculated uniformly per file set, and all files in the file set receive the same hot/cold treatment according to the heat value of the file set.

According to the data migration method of the embodiment of the present application, a first heat value of each first file set is determined according to first metadata of each first file set in object storage in a data lake, and a second file set to be migrated is determined from the first file sets according to their first heat values, wherein the first heat value of the second file set is greater than the first heat threshold; each first file in the second file set is then migrated to a distributed file system of the data lake. The heat value of each first file set can thus be determined from the metadata of that file set in the object storage, the second file set belonging to hot data can be identified from those heat values, and every file in that second file set can be migrated as a whole to the distributed file system, thereby improving the performance of the data storage system in processing hot data.

In order to clearly explain how to calculate the heat value of each first file set in the above embodiment of the present application, the present application further proposes a data migration method.

Fig. 2 is a flow chart of another data migration method according to an embodiment of the present application.

As shown in fig. 2, the data migration method may include the steps of:

Step S201, obtaining first metadata of at least one first file set in object storage in a data lake.

Wherein the first set of files includes at least one first file.

For an explanation of step S201, refer to the related description in any embodiment of the present application; it is not repeated here.

Step S202, for any first file set, determining a usage heat value of that first file set according to at least one of the first creation time, the first update time and the first access time in the corresponding first metadata.

In this embodiment of the present application, for any one of the first file sets in the object storage, the usage heat value of the first file set may be determined according to at least one of the first creation time (or the first storage creation time), the first update time, and the first access time in the first metadata of the first file set.

The closer the first creation time is to the current time, the greater the usage heat value; the closer the first storage creation time is to the current time, the greater the usage heat value; the closer the first update time is to the current time, the greater the usage heat value; and the closer the first access time is to the current time, the greater the usage heat value.

In one possible implementation manner of the embodiment of the present application, the usage heat value of the first file set is calculated, for example, as follows: the first creation time (or first storage creation time), the first update time, and the first access time in the first metadata of the first file set are weighted and summed to obtain a target time, and the time difference between the target time and a set time is calculated, that is, time difference = target time - set time. The usage heat value of the first file set may then be determined according to this time difference, with which it is positively correlated: the smaller the time difference, the smaller the usage heat value, and the larger the time difference, the larger the usage heat value.

That is, the time difference is positive when the target time is later than the set time and negative when the target time is earlier than the set time, so a target time later than the set time yields a greater usage heat value than a target time earlier than the set time.

Determining the usage heat value of the first file set by a weighted sum of the time-related information in the first metadata can thus improve the reliability of the calculated usage heat value.
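
The weighted-sum calculation can be sketched as follows. The source requires only that the usage heat value be positively correlated with the time difference; this sketch assumes the simplest such mapping, returning the time difference itself, and assumes the weights sum to 1 so that the target time is itself a timestamp. The name `usage_heat` is hypothetical.

```python
from typing import Sequence


def usage_heat(times: Sequence[float], weights: Sequence[float], set_time: float) -> float:
    """Return (weighted sum of timestamps) - set_time as the usage heat value."""
    if len(times) != len(weights):
        raise ValueError("times and weights must have the same length")
    # Target time: weighted sum of creation, update, and access times.
    target_time = sum(t * w for t, w in zip(times, weights))
    # Positive when the target time is later than the set time, negative otherwise.
    return target_time - set_time
```

For timestamps 1000, 2000, and 3000 with weights 0.25, 0.25, and 0.5 and a set time of 2000, the target time is 2250 and the usage heat value is 250. Giving the access time the largest weight is only an example; the source does not specify the weights.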

Step S203, determining a frequency heat value of that first file set according to at least one of the first access number and the first access average interval in the corresponding first metadata.

In this embodiment of the present application, for any one first file set in the object storage, the frequency heat value of the first file set may be determined according to at least one of the first access number and the first access average interval in the first metadata of the first file set.

The larger the first access times, the larger the frequency heat value, namely the frequency heat value and the first access times are in positive correlation; the shorter the first access average interval, the greater the frequency heat value, i.e. the frequency heat value is inversely related to the first access average interval.

In one possible implementation manner of the embodiment of the present application, the frequency heat value of the first file set is calculated, for example, as follows. A first sub-heat value of the first file set is determined according to the first access number in the first metadata of the first file set, where the first access number and the first sub-heat value are positively correlated: the larger the first access number, the larger the first sub-heat value, and the smaller the first access number, the smaller the first sub-heat value. A second sub-heat value of the first file set may be determined according to the first access average interval in the first metadata of the first file set, where the first access average interval and the second sub-heat value are negatively correlated: the shorter the first access average interval, the larger the second sub-heat value, and the longer the first access average interval, the smaller the second sub-heat value. In this application, the frequency heat value of the first file set may then be determined according to at least one of the first sub-heat value and the second sub-heat value.

As an example, the first sub-heat value may be used as the frequency heat value of the first file set.

As another example, the second sub-heat value may be used as the frequency heat value of the first file set.

As yet another example, a frequency heat value of the first set of files may be determined from the first sub-heat value and the second sub-heat value. For example, the sum of the first sub-heat value and the second sub-heat value may be taken as the frequency heat value of the first file set, or the average of the first sub-heat value and the second sub-heat value may be taken as the frequency heat value of the first file set, or the first sub-heat value and the second sub-heat value may be weighted and summed to obtain the frequency heat value of the first file set.

Therefore, according to the information related to access in the first metadata, the frequency heat value of the first file set is determined, and the reliability of frequency heat value calculation can be improved.
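
A minimal sketch of the two sub-heat values and their weighted combination follows. The source fixes only the direction of each correlation; the specific functions chosen here (`log1p` of the access number, and the reciprocal of one plus the interval) and all names are assumptions for illustration.

```python
import math


def count_sub_heat(access_count: int) -> float:
    """First sub-heat value: increases with the number of accesses."""
    if access_count < 0:
        raise ValueError("access_count must be non-negative")
    return math.log1p(access_count)


def interval_sub_heat(avg_interval: float) -> float:
    """Second sub-heat value: decreases as the average interval grows."""
    if avg_interval < 0:
        raise ValueError("avg_interval must be non-negative")
    return 1.0 / (1.0 + avg_interval)


def frequency_heat(
    access_count: int,
    avg_interval: float,
    w_count: float = 0.5,
    w_interval: float = 0.5,
) -> float:
    """Weighted sum of the two sub-heat values."""
    return w_count * count_sub_heat(access_count) + w_interval * interval_sub_heat(avg_interval)
```

Setting `w_count=1.0, w_interval=0.0` reduces this to the first sub-heat value alone, and the reverse setting to the second alone, which covers the single-value examples given above.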

Step S204, determining the first heat value of that first file set according to at least one of the usage heat value and the frequency heat value.

In the embodiment of the present application, the first heat value of the first file set may be determined according to at least one of the usage heat value and the frequency heat value.

As an example, the usage heat value may be taken as the first heat value of the first file set.

As another example, the frequency heat value may be taken as the first heat value of the first file set.

As yet another example, the first heat value of the first file set may be determined from both the usage heat value and the frequency heat value. For example, the sum of the usage heat value and the frequency heat value may be taken as the first heat value of the first file set, or the average of the two may be taken as the first heat value, or the two may be weighted and summed to obtain the first heat value of the first file set.
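
The three combination options (sum, average, weighted sum) can be sketched in one function. The name `first_heat`, the `mode` parameter, and the default weights are assumptions for illustration.

```python
def first_heat(
    usage: float,
    frequency: float,
    mode: str = "weighted",
    w_usage: float = 0.5,
    w_frequency: float = 0.5,
) -> float:
    """Combine the usage and frequency heat values into the first heat value."""
    if mode == "sum":
        return usage + frequency
    if mode == "mean":
        return (usage + frequency) / 2
    if mode == "weighted":
        return w_usage * usage + w_frequency * frequency
    raise ValueError(f"unknown mode: {mode}")
```

Because the usage heat value may be on a very different scale from the frequency heat value, in practice the two would likely need to be normalized or weighted before being combined; the source leaves this choice open.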

Step S205, determining a second file set to be migrated from each first file set according to the first heat value of each first file set; wherein the first heat value of the second set of files is greater than the first heat threshold.

Step S206, each first file in the second file set is migrated to the distributed file system of the data lake.

The explanation of steps S205 to S206 may be referred to the related description in any embodiment of the present application, and will not be repeated here.

The data migration method can be used for determining the first heat value of the first file set by combining multiple items of information in the first metadata of the first file set, and improves the accuracy and reliability of calculation of the first heat value.

In one possible implementation manner of the embodiment of the present application, in order to consider both the computing performance and the storage cost (or the storage overhead) of the data storage system in the data lake, the cold data in the distributed file system may also be migrated to the object storage. The above process will be described in detail with reference to fig. 3.

Fig. 3 is a flow chart of another data migration method according to an embodiment of the present application.

As shown in fig. 3, on the basis of any one of the above embodiments, the data migration method may further include the following steps:

step S301, obtaining second metadata of at least one third file set in the distributed file system in the data lake, where the third file set includes at least one second file.

In an embodiment of the present application, each file set (denoted as a third file set in the present application) in the distributed file system may include at least one file (denoted as a second file in the present application), where the second file may include, but is not limited to: picture files, document files, PDF files, audio files, video files, etc.

In the embodiment of the present application, the second metadata of the third file set may be determined according to file metadata of each second file in the third file set.

As one possible implementation, the second metadata of the third file set may include, but is not limited to: a second creation time, a second storage creation time, a second update time, a second access time, a second number of accesses, a second access average interval, and the like.

As an example, the second creation time of the third file set may be selected from creation times in file metadata of the second files in the third file set, for example, the latest creation time in the creation times of the second files in the third file set may be used as the second creation time of the third file set.

As an example, the second storage creation time of the third file set may be selected from storage creation times in file metadata of the second files in the third file set, for example, the latest storage creation time in the storage creation time of the second files in the third file set may be used as the second storage creation time of the third file set.

As an example, the second update time of the third file set may be selected from update times in file metadata of the second files in the third file set, for example, the latest update time in update times of the second files in the third file set may be used as the second update time of the third file set.

As an example, the second access time of the third file set may be selected from access times in file metadata of the second files in the third file set, for example, the latest access time in the access times of the second files in the third file set may be used as the second access time of the third file set.

As an example, the second number of accesses of the third file set may be determined according to the number of accesses in the file metadata of each second file in the third file set. For example, the mean of the numbers of accesses in the file metadata of the second files in the third file set may be used as the second number of accesses of the third file set, or the median of those numbers of accesses may be used as the second number of accesses of the third file set, or those numbers of accesses may be weighted and summed to obtain the second number of accesses of the third file set.

As an example, the second access average interval of the third file set may be determined according to the access average interval in the file metadata of each second file in the third file set, for example, the average value of the access average intervals in the file metadata of each second file in the third file set may be taken as the second access average interval of the third file set, or the median value of the access average intervals in the file metadata of each second file in the third file set may be taken as the second access average interval of the third file set, or the access average intervals in the file metadata of each second file in the third file set may be weighted and summed to obtain the second access average interval of the third file set.

Similarly, the first metadata of the first file set may be determined according to the file metadata of each first file in the first file set. The determining manner of the first metadata is similar to that of the second metadata, and will not be described herein.
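The aggregation of per-file metadata into file set metadata described above can be sketched as follows, using the "latest time" option for the time fields and the "mean" option for the number of accesses and the access average interval. The class name `FileMeta`, its field names, and the function name `aggregate_file_set` are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Sequence


@dataclass
class FileMeta:
    # Hypothetical field names; the application lists these items only in prose.
    creation_time: datetime
    storage_creation_time: datetime
    update_time: datetime
    access_time: datetime
    access_count: float
    access_avg_interval_days: float


def aggregate_file_set(files: Sequence[FileMeta]) -> FileMeta:
    """Derive file set metadata from the metadata of its files."""
    if not files:
        raise ValueError("a file set contains at least one file")
    return FileMeta(
        # Time fields: take the latest value among all files in the set.
        creation_time=max(f.creation_time for f in files),
        storage_creation_time=max(f.storage_creation_time for f in files),
        update_time=max(f.update_time for f in files),
        access_time=max(f.access_time for f in files),
        # Count and interval: take the mean over all files in the set.
        access_count=mean(f.access_count for f in files),
        access_avg_interval_days=mean(f.access_avg_interval_days for f in files),
    )
```

Replacing `mean` with `statistics.median` would give the median variant mentioned above.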

Step S302, determining a second heat value of each third file set according to the second metadata of each third file set.

In this embodiment, for any one third file set, the second heat value of the third file set may be determined according to the second metadata of the third file set, where a calculation manner of the second heat value is similar to a calculation manner of the first heat value, which is not described herein.

Step S303, determining a fourth file set to be migrated from each third file set according to the second heat value of each third file set; wherein the second heat value of the fourth set of files is less than or equal to the second heat threshold.

The second heat threshold is also a preset heat threshold, and the second heat threshold may be smaller than the first heat threshold, or the second heat threshold may be equal to the first heat threshold.

The number of the fourth file sets may be one or may be plural, which is not limited in this application.

In this embodiment of the present application, for any third file set in the distributed file system, it may be determined whether the second heat value of the third file set is less than or equal to the second heat threshold. When the second heat value is less than or equal to the second heat threshold, the third file set in the distributed file system is cold data and may be taken as a fourth file set to be migrated. When the second heat value is greater than the second heat threshold, the third file set in the distributed file system is hot data, and no processing is required.
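The two threshold comparisons used in the application (hot when the first heat value is greater than the first heat threshold, cold when the second heat value is less than or equal to the second heat threshold) can be sketched as follows. The function names and the dictionary layout mapping a file set name to its heat value are assumptions for illustration.

```python
from typing import Dict, List


def select_hot(heat_by_set: Dict[str, float], first_threshold: float) -> List[str]:
    """File sets in object storage whose heat value exceeds the first threshold."""
    return [name for name, heat in heat_by_set.items() if heat > first_threshold]


def select_cold(heat_by_set: Dict[str, float], second_threshold: float) -> List[str]:
    """File sets in the distributed file system at or below the second threshold."""
    return [name for name, heat in heat_by_set.items() if heat <= second_threshold]
```

Note the asymmetry at the boundary: a file set whose heat value exactly equals the threshold is treated as cold, not hot.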

Step S304, each second file in the fourth file set is migrated to the object storage of the data lake.

In this embodiment of the present application, each second file in the fourth file set in the distributed file system may be migrated to the object storage of the data lake; for example, each second file (e.g., a cold file) in the fourth file set may be copied to the object storage as an individual object.

Optionally, after migrating the second files in the fourth file set to the object storage, file metadata (e.g., storage address, storage creation time) of each second file in the fourth file set may also be updated, and second metadata (e.g., second storage creation time) of the fourth file set may also be updated.

It can be understood that when a file set includes a plurality of files, the files may all be cold files, or some may be cold files and others hot files. In the present application, all files in a file set with a relatively low heat value in the distributed file system are migrated as a whole, with the file set as the unit; even if hot files exist in the file set, they are migrated to the object storage together with the rest. That is, the heat value is calculated uniformly per file set, and all files in a file set receive the same cold or hot treatment according to the heat value of that file set.

It should be noted that, the execution timing of each step in the embodiment shown in fig. 3 is not limited in this application, for example, each step in the embodiment shown in fig. 3 may be executed sequentially with each step in the embodiment shown in fig. 1, or may be executed in parallel with each step in the embodiment shown in fig. 1.

According to the data migration method, not only can the hot data in the object storage be migrated to the distributed file system, but the cold data in the distributed file system can also be migrated to the object storage, so that both the computing performance and the storage cost (or storage overhead) of the data storage system in the data lake are taken into account.

In order to clearly illustrate the above embodiments of the present application, the present application further proposes a data migration method.

Fig. 4 is a flow chart of another data migration method according to an embodiment of the present application.

As shown in fig. 4, the data migration method may further include the following steps, based on the embodiment shown in fig. 1 or fig. 2:

step S401, obtaining second metadata of at least one third file set in the distributed file system in the data lake, wherein the third file set comprises at least one second file.

Step S402, determining a second heat value of each third file set according to the second metadata of each third file set.

Step S403, determining a fourth file set to be migrated from each third file set according to the second heat value of each third file set; wherein the second heat value of the fourth set of files is less than or equal to the second heat threshold.

The explanation of steps S401 to S403 may be referred to the related description in any embodiment of the present application, and will not be repeated here.

Step S404, determining whether the fourth file set is a periodically accessed file set according to the second access time and the second access average interval in the second metadata of the fourth file set.

In the embodiment of the application, whether the fourth file set is a periodically accessed file set or not (i.e. whether the fourth file set is periodically accessed cold data or not is determined) may be determined according to the second access time and the second access average interval in the second metadata of the fourth file set.

As an example, the second access time may include a current access time (or a last access time) and a previous access time, and the current access interval may be determined according to the time difference between the current access time and the previous access time. In the case where the current access interval is equal to the second access average interval (the access interval being measured in days), the fourth file set may be determined to be a periodically accessed file set; in the case where the current access interval is not equal to the second access average interval, the fourth file set may be determined not to be a periodically accessed file set.

In this embodiment of the present application, when the fourth file set is not a periodically accessed file set, each second file in the fourth file set may be directly migrated to the object storage, and when the fourth file set is a periodically accessed file set, a subsequent step may be performed.

In step S405, when the fourth file set is a periodically accessed file set, the next access time of the fourth file set is determined according to the second access time and the second access average interval in the second metadata of the fourth file set.

In the embodiment of the present application, in the case where the fourth file set is a periodically accessed file set, the next access time of the fourth file set may be determined according to the second access time and the second access average interval in the second metadata of the fourth file set.

For example, the current access time (or the last access time) in the second access time may be added to the second access average interval to obtain the next access time of the fourth file set.
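The periodicity check and the next access time calculation can be sketched as follows. The sketch assumes that a file set counts as periodically accessed when its current access interval, in whole days, equals its access average interval, and that the next access time is the current access time plus that interval. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def is_periodic(current_access: datetime, previous_access: datetime,
                avg_interval_days: int) -> bool:
    """Assumed rule: periodic if the current interval (in days) equals the average."""
    current_interval_days = (current_access - previous_access).days
    return current_interval_days == avg_interval_days


def next_access_time(current_access: datetime, avg_interval_days: int) -> datetime:
    """Next access time = current (last) access time + access average interval."""
    return current_access + timedelta(days=avg_interval_days)
```

For instance, a file set last accessed on 2023-02-08 and previously on 2023-02-01, with an average interval of 7 days, would be treated as periodic with a next access time of 2023-02-15.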

In step S406, if the difference between the current time and the next access time is greater than the first difference threshold, each second file in the fourth file set is migrated to the object store of the data lake.

In this embodiment of the present application, it may be determined whether the difference (i.e., the time difference) between the current time and the next access time of the fourth file set is greater than a set first difference threshold. When the difference between the current time and the next access time is less than or equal to the first difference threshold, the next access time of the fourth file set is about to be reached, and the fourth file set need not be migrated to the object storage.

When the difference (i.e., the time difference) between the current time and the next access time of the fourth file set is greater than the first difference threshold, a relatively long time remains before the next access time of the fourth file set; in this case, to save storage overhead in the distributed file system, each second file in the fourth file set may be migrated to the object storage.
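The migration decision for a periodically accessed cold file set can be sketched as a single comparison against the first difference threshold. The function name and the use of `timedelta` for the threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta


def should_migrate_to_object_storage(now: datetime, next_access: datetime,
                                     first_diff_threshold: timedelta) -> bool:
    """Migrate only if the next access is further away than the threshold."""
    return (next_access - now) > first_diff_threshold
```

With a threshold of two days, a file set whose next access is six days away would be migrated, while one whose next access is one day away would stay in the distributed file system.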

In one possible implementation manner of the embodiment of the present application, after each second file in the fourth file set is migrated to the object storage, each second file in the fourth file set in the object storage may be migrated back to the distributed file system when the next access time of the fourth file set is about to be reached.

As one example, in response to reaching a target time, the fourth file set in the object storage is migrated to the distributed file system, where the difference (i.e., the time difference) between the target time and the next access time is less than or equal to a second difference threshold.

The second difference threshold is a preset difference threshold, where the second difference threshold may be equal to the first difference threshold, or the second difference threshold may be smaller than the first difference threshold.

Therefore, when the next access time of the periodically accessed cold data is about to be reached, the periodically accessed cold data is migrated from the object storage to the distributed file system, so that the processing performance of the data storage system on the periodically accessed cold data is improved.

According to the data migration method, periodically accessed cold data can be identified among the cold data. The periodically accessed cold data is migrated to the object storage when its next access time is still far off, and is not migrated to the object storage when its next access time is about to be reached, so that both the computing performance and the storage cost (or storage overhead) of the data storage system in the data lake are taken into account.

In any embodiment of the present application, for a file newly entering the data lake (hereinafter, a newly-entered file), the metadata of the file may be compared for similarity with the metadata of each file set in the data lake, so as to determine a target file set whose metadata is similar to that of the file, and the newly-entered file may be stored in the data storage system in which the target file set is stored. This avoids first storing the newly-entered file in the storage system designated by the lake-entering task and then migrating it again according to its cold or hot state. The above process will be described in detail with reference to fig. 5.

Fig. 5 is a flow chart of another data migration method according to an embodiment of the present application.

As shown in fig. 5, on the basis of the embodiment shown in any one of fig. 1 to 4, the data migration method may further include the following steps:

in step S501, file metadata of at least one file to be stored is obtained.

In the embodiments of the present application, explanation of the metadata of the file may be referred to the related description in any of the foregoing embodiments, which is not repeated herein.

Step S502, determining the similarity between the file metadata and the third metadata of each fifth file set in the data lake.

In this embodiment of the present application, for any one file to be stored, the similarity between the file metadata of the file to be stored and the third metadata of each fifth file set (including each file set stored in the object storage and each file set stored in the distributed file system) in the data lake may be calculated.

In step S503, a target file set is determined from each fifth file set according to the similarity of each fifth file set.

In the embodiment of the present application, the target file set may be determined from each fifth file set according to the similarity of each fifth file set, for example, the fifth file set with the highest similarity may be used as the target file set.

Step S504, storing the file to be stored in the target storage system in which the target file set is stored in the data lake.

In the embodiment of the application, the file to be stored may be stored in the target storage system in which the target file set is stored in the data lake. For example, when the target file set is stored in the object storage, the file to be stored may be stored in the object storage; when the target file set is stored in the distributed file system, the file to be stored may be stored in the distributed file system.
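Steps S501 to S504 can be sketched as follows. The application does not specify a similarity measure, so this sketch assumes a simple one: the fraction of compared metadata fields that match exactly. The field names, the `storage` key, and the storage system labels are likewise hypothetical.

```python
from typing import Mapping, Sequence

# Hypothetical set of compared fields.
FIELDS = ("data_source", "lake_task", "file_type", "author")


def similarity(file_meta: Mapping[str, str], set_meta: Mapping[str, str]) -> float:
    """Assumed measure: share of fields with identical, non-missing values."""
    matches = sum(
        1 for key in FIELDS
        if file_meta.get(key) is not None and file_meta.get(key) == set_meta.get(key)
    )
    return matches / len(FIELDS)


def choose_storage(file_meta: Mapping[str, str],
                   file_sets: Sequence[Mapping[str, str]]) -> str:
    """Pick the storage system of the file set with the highest similarity."""
    target = max(file_sets, key=lambda s: similarity(file_meta, s))
    return target["storage"]
```

A production system would likely also require a minimum similarity before reusing an existing file set's storage system; that refinement is omitted here.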

In summary, for the newly-entered file, the newly-entered file may be stored in the data storage system where the target file set with metadata similar to that of the newly-entered file is stored, without storing the newly-entered file in the designated storage system of the lake-entering task, and then migrating the file again according to the cold and hot states of the file, so that frequent file migration can be avoided, and secondary data migration consumption can be reduced.

In order to clearly explain how to obtain the first metadata of each first file set in the distributed file system in the above embodiment of the present application, the present application further proposes a data migration method.

Fig. 6 is a flowchart of another data migration method according to an embodiment of the present application.

As shown in fig. 6, the data migration method may include the steps of:

in step S601, in response to a configuration operation, the target level corresponding to the file set is configured.

In this embodiment of the present application, the range of a file set may be user-defined; that is, the user may define the directory level corresponding to a file set, denoted in this application as the target level, such as the first level, the second level, the last level, and so on.

In the embodiment of the application, the target level corresponding to the file set may be configured in response to a configuration operation triggered by a user.

In step S602, each first folder in the target hierarchical directory in the distributed file system is determined.

In embodiments of the present application, each first folder in the distributed file system that is located in the target hierarchical directory may be determined.

Step S603, generating a first file set according to the files in the same folder.

In this embodiment of the present application, the number of folders in the target hierarchical directory may be at least one, and a first file set may be generated according to each file under the same folder, that is, each first file set includes each file under the same folder.

In step S604, the first metadata of each first file set is determined according to the file metadata of each first file in each first file set.

In this embodiment of the present application, for any one first file set, the first metadata of the first file set may be determined according to the file metadata of each first file in the first file set. The determining manner of the first metadata is similar to that of the second metadata, and will not be described herein.

Of course, the range of the file set may be preset, that is, the directory level corresponding to the file set may be preset, and in this application, the directory level is denoted as a designated level.

Thus, the range of each first file set can be determined according to different modes, and the flexibility and applicability of the method can be improved.
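Grouping files into file sets by a configured directory level (steps S602 and S603) can be sketched as follows. The sketch assumes POSIX-style paths, that level 1 denotes the first-level directory, and that files nested more deeply are attributed to their ancestor folder at the target level; the function name is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def group_by_level(paths: Iterable[str], level: int) -> Dict[str, List[str]]:
    """Group file paths into file sets keyed by their folder at the target level."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).parts
        # Directory components only: drop the file name and the root "/".
        dirs = [part for part in parts[:-1] if part != "/"]
        if len(dirs) < level:
            continue  # the file sits above the target level; not part of any set
        groups["/".join(dirs[:level])].append(path)
    return dict(groups)
```

The metadata of each resulting file set would then be derived from the file metadata of its members, as in step S604.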

Step S605 determines a first heat value of each first file set according to the first metadata of each first file set.

Step S606, determining a second file set to be migrated from each first file set according to the first heat value of each first file set; wherein the first heat value of the second set of files is greater than the first heat threshold.

In step S607, each first file in the second file set is migrated to the distributed file system of the data lake.

The explanation of steps S605 to S607 may be referred to the related description in any embodiment of the present application, and will not be repeated here.

The data migration method allows the range of a file set to be user-defined, which can meet the personalized use requirements of users and improve the user experience.

In any embodiment of the application, taking HDFS as an example of the distributed file system, unified metadata may be maintained for the file sets stored in HDFS and in the object storage, recording information such as lake-entering time, lake-entering mode, integration task, file type, creation time, modification time, and number of accesses. On this basis, a more reasonable cold and hot data evaluation method is established, unified information maintenance of file sets is realized, and files newly entering the lake are automatically stored in tiers.

Metadata information for file sets may be established in the object storage, and operations on objects (e.g., files) within a file set may be captured to update the metadata information of the file set, so that the file set can be evaluated as cold or hot data. For newly-entered files, the files are associated with the cold or hot state of existing file sets according to information such as data source, integration task, and file type, thereby realizing automatic tiered storage.

As an example, the architecture of a data lake may be as shown in fig. 7, and metadata information of a file set may be stored at a data processing layer of the data lake.

Specifically, migration of a file set in a data lake and hierarchical storage of a newly entered lake file can be achieved by the following steps:

step 1, unified metadata information is established in the data lake for the file sets and files stored in the object storage and HDFS, and is maintained by a unified data service (such as a metadata service).

Step 2, by default, the first-level directory or the last-level directory is defined as the file set range in the data lake. User-defined file set ranges are also supported, which may point to a directory at any level under the first-level directory.

Step 3, when a key in the object storage denotes an empty file (serving as a directory) and is set as a file set, a metadata entity for that object is generated in the metadata service (i.e., the above-mentioned data service that uniformly maintains metadata information), and the following information is recorded in the metadata table shown in table 1.

Table 1 metadata table


The metadata table includes not only the file metadata of files but also the metadata of file sets; for example, the metadata table in table 1 includes the metadata of the file set abc and the file metadata of the file 1.Jpg in the file set abc.

When a file operation is performed in the data storage system, including updating or accessing a file, it is necessary to synchronize and update information such as update time, access times, access average intervals, etc. of the data file in the metadata table.

Step 4, when a stored file or object is newly added to the data lake, a related record is added to the metadata table, including the data source, lake-entering task, file type, author information, and the like. For the author information, the data lake management system needs to read the information carried by the file itself to obtain details such as the author.

Step 5, based on the metadata table, the data lake periodically (an evaluation period, such as daily, can be set) establishes a cold and hot degree evaluation index (i.e., a heat value) for each file set. Because one file set contains a plurality of files, the rows of table 1 can be compared and summarized to obtain the creation time, update time, access time, number of accesses, and access average interval of the file set. The creation time, update time, and access time of the file set may each be the latest such time among all files contained in the file set, and the number of accesses and access average interval of the file set may each be the average over the file metadata of all files.

Step 6, the data lake comprehensively evaluates the creation time, update time, access time, number of accesses, access average interval, and the like in the metadata of the file set to establish a multi-dimensional cold and hot data evaluation. For example, the creation time, update time, and access time of the file set are weighted to obtain a time value, which is compared with a preset time threshold: if the time value is later than the time threshold, the file set is hot data, and if the time value is earlier than the time threshold, the file set is cold data. In this way, cold data that has not been used for a long time can be identified. In addition, for cold data that has not been used for a long time, whether the file set is periodically accessed cold data may be determined according to the access time and the access average interval of the file set.
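The time-based evaluation in step 6 can be sketched as follows. The weights, the function names, and the treatment of a time value exactly equal to the threshold (classified here as cold) are assumptions; the application only states that the three times are weighted and that the result is compared with a preset time threshold. Timezone-aware datetimes are assumed so that the comparison is well defined.

```python
from datetime import datetime, timezone
from typing import Tuple


def weighted_time(creation: datetime, update: datetime, access: datetime,
                  weights: Tuple[float, float, float] = (0.2, 0.3, 0.5)) -> datetime:
    """Weighted combination of three timezone-aware times (hypothetical weights)."""
    w_c, w_u, w_a = weights
    if abs(w_c + w_u + w_a - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ts = (w_c * creation.timestamp()
          + w_u * update.timestamp()
          + w_a * access.timestamp())
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def classify(time_value: datetime, time_threshold: datetime) -> str:
    """Later than the threshold means hot; otherwise cold (equality is an assumption)."""
    return "hot" if time_value > time_threshold else "cold"
```

Giving the access time the largest weight reflects the intuition that recent access matters more than creation date, but any weighting that sums to 1 fits the description.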

Step 7, after the data lake determines that a file set in the object storage is hot data, the file set needs to be migrated from the object storage to HDFS by file migration. For example, for the hot files of the file set, the data lake traverses the metadata table, creates the corresponding directories, and copies the files to HDFS by file copy; it then deletes the original files from the object storage and updates the metadata of the files in the file set, updating the storage address and storage creation time of each file while keeping the metadata in the other columns unchanged. The updated metadata table may be as shown in table 2.

It should be noted that file operations caused by heterogeneous data migration within the data lake do not affect the update time, the access time, or the number of accesses.
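The metadata update that accompanies a migration can be sketched as follows: only the storage address and the storage creation time change, while the remaining columns are inherited unchanged. The column names and the example addresses are hypothetical, since the contents of the metadata table are not reproduced here.

```python
from datetime import datetime
from typing import Any, Dict


def update_row_after_migration(row: Dict[str, Any], new_address: str,
                               migrated_at: datetime) -> Dict[str, Any]:
    """Return a copy of a metadata row with only the storage fields updated."""
    updated = dict(row)  # leave the caller's row untouched
    updated["storage_address"] = new_address
    updated["storage_creation_time"] = migrated_at
    return updated
```

Because the update time, access time, and number of accesses are carried over as-is, the migration itself does not make a file set look hotter or colder in the next evaluation period.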

Table 2 metadata table

Step 8, correspondingly, when a file set on HDFS is evaluated as cold data, the file set needs to be migrated to the object storage. The difference is that all files under the file set directory in HDFS need to be copied to the object storage as individual objects, and an empty file with a size of 0 KB also needs to be created to serve as the directory.

Step 9, after the data lake has established cold and hot data evaluation information for the file sets, association analysis is performed on incremental files newly entering the data lake during data integration, to achieve more intelligent cold and hot data storage. For an incremental file that a data integration tool integrates periodically, metadata information of the incremental file, including the file path, data source, lake-entering task, file type, author information, and the like, is obtained and compared for similarity with the metadata table of the data lake. If the same metadata information exists, the incremental file newly entering the data lake can be stored directly in the same data storage system, according to the cold or hot state and the storage address of the stored file set, without first being stored in the storage system designated by the lake-entering task and then migrated again according to its cold or hot state.

In summary, in the object storage of the data lake, based on independent object files, metadata information of a file set is maintained by establishing a metadata table, so that cold and hot evaluation of the file set is realized; in the data lake, the cold and hot degree evaluation of the file set is realized through metadata summarization statistics on the file set stored in the object storage, and the method is used for carrying out heterogeneous migration between the object storage and the HDFS, so that the balance of the data storage cost and the data calculation performance in the data lake is realized. For the incremental files, the files entering the lake can be automatically stored in a grading mode according to the cold and hot states of the similar files, and the secondary migration consumption of the data is reduced. Through unified metadata management in the data lake, updatable and inheritable fields are defined in the heterogeneous data migration process, and the correctness and consistency of information are ensured.

Corresponding to the data migration methods provided in the above embodiments, an embodiment of the present application further provides a data migration apparatus. Since the data migration apparatus provided in the embodiment of the present application corresponds to the data migration methods provided in the above several embodiments, implementation manners of the data migration method are also applicable to the data migration apparatus provided in the embodiment, and will not be described in detail in the embodiment.

Fig. 8 is a schematic structural diagram of a data migration apparatus according to an embodiment of the present application.

As shown in fig. 8, the data migration apparatus 800 may include: a first acquisition module 801, a first determination module 802, a second determination module 803, and a migration module 804.

The first obtaining module 801 is configured to obtain first metadata of at least one first file set in object storage in a data lake; wherein the first set of files includes at least one first file.

A first determining module 802, configured to determine a first heat value of each of the first file sets according to the first metadata of each of the first file sets.

A second determining module 803, configured to determine, from each of the first file sets, a second file set to be migrated according to a first heat value of each of the first file sets; wherein the first heat value of the second set of files is greater than a first heat threshold.

A migration module 804, configured to migrate each of the first files in the second file set to a distributed file system of the data lake.

As a possible implementation manner of the embodiment of the present application, the first determining module 802 is specifically configured to: determining a using heat value of any first file set according to at least one of first creation time, first update time and first access time in corresponding first metadata aiming at any first file set; determining a frequency heat value of any first file set according to at least one of the first access times and the first access average interval in the corresponding first metadata; and determining a first heat value of any first file set according to at least one of the use heat value and the frequency heat value.

As a possible implementation manner of the embodiment of the present application, the first determining module 802 is specifically configured to: for any first file set, perform a weighted summation of the first creation time, the first update time, and the first access time in the corresponding first metadata to obtain a target time; determine a time difference between the target time and a set time; and determine the usage heat value of the first file set according to the time difference, wherein the usage heat value is positively correlated with the time difference.
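The weighted-summation step can be sketched in Python as follows. This is only an illustration under stated assumptions: the weights, the linear scaling by one day, the `SetMetadata` class, and the function name `usage_heat` are hypothetical. The text requires only that the usage heat value be positively correlated with the difference between the target time and the set time.

```python
from dataclasses import dataclass


@dataclass
class SetMetadata:
    creation_time: float  # Unix seconds (assumed unit)
    update_time: float
    access_time: float


def usage_heat(meta, set_time, weights=(0.2, 0.3, 0.5), scale=86400.0):
    """Usage heat value from a weighted sum of the three timestamps.

    weights and scale are hypothetical tuning parameters.
    """
    w_create, w_update, w_access = weights
    # Weighted summation of the three times gives the target time.
    target_time = (
        w_create * meta.creation_time
        + w_update * meta.update_time
        + w_access * meta.access_time
    )
    # A larger time difference means more recent activity, hence more heat.
    time_diff = target_time - set_time
    return time_diff / scale
```

With the same set time, a file set whose timestamps are more recent receives a larger usage heat value, which is the positive correlation described above.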

As a possible implementation manner of the embodiment of the present application, the first determining module 802 is specifically configured to: determine a first sub-heat value of any first file set according to the first number of accesses in the corresponding first metadata, wherein the first sub-heat value is positively correlated with the first number of accesses; determine a second sub-heat value of the first file set according to the first access average interval in the corresponding first metadata, wherein the second sub-heat value is negatively correlated with the first access average interval; and determine the frequency heat value of the first file set according to at least one of the first sub-heat value and the second sub-heat value.
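One way to realize the two sub-heat values is sketched below in Python. The logarithmic and reciprocal mappings, the equal weights, and the function name `frequency_heat` are assumptions chosen for illustration; the text specifies only the direction of each correlation.

```python
import math


def frequency_heat(access_count, avg_interval, w_count=0.5, w_interval=0.5):
    """Frequency heat value from access count and average access interval.

    The mappings and weights are hypothetical; only their monotonic
    direction follows the description.
    """
    # First sub-heat value: grows with the number of accesses.
    first_sub = math.log1p(access_count)
    # Second sub-heat value: shrinks as the average interval grows.
    second_sub = 1.0 / (1.0 + avg_interval)
    return w_count * first_sub + w_interval * second_sub
```

Setting either weight to zero corresponds to using only one of the two sub-heat values, which the "at least one of" wording permits.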

As a possible implementation manner of the embodiment of the present application, the data migration apparatus 800 may further include:

A second acquisition module, configured to acquire second metadata of at least one third file set in the distributed file system in the data lake, wherein the third file set includes at least one second file.

A third determining module, configured to determine a second heat value of each third file set according to the second metadata of each third file set.

A fourth determining module, configured to determine a fourth file set to be migrated from each of the third file sets according to the second heat value of each of the third file sets; wherein the second heat value of the fourth set of files is less than or equal to a second heat threshold.

The migration module 804 is further configured to migrate each of the second files in the fourth file set to the object store of the data lake.

As a possible implementation manner of the embodiment of the present application, the migration module 804 is specifically configured to: determine whether the fourth file set is a periodically accessed file set according to a second access time and a second access average interval in the second metadata of the fourth file set; when the fourth file set is a periodically accessed file set, determine a next access time of the fourth file set according to the second access time and the second access average interval; and when the difference between the current time and the next access time is greater than a first difference threshold, migrate each second file in the fourth file set to the object storage of the data lake.

As a possible implementation manner of the embodiment of the present application, the migration module 804 is further configured to: in response to reaching a target time, migrate the fourth file set in the object storage to the distributed file system; wherein a difference between the target time and the next access time is less than or equal to a second difference threshold.
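The timing logic of the two implementations above can be sketched in Python. Several points are assumptions: the next access time is estimated as the last access time plus the average interval, the "difference" is taken as the time remaining until that next access, and the file set is assumed to have already been identified as periodically accessed (the sketch does not perform that check). All function names are hypothetical.

```python
def next_access_time(last_access, avg_interval):
    """Estimate the next access as last access plus average interval (assumption)."""
    return last_access + avg_interval


def should_move_to_object_storage(now, last_access, avg_interval, first_diff_threshold):
    """True when the next expected access is further away than the first threshold."""
    remaining = next_access_time(last_access, avg_interval) - now
    return remaining > first_diff_threshold


def move_back_time(last_access, avg_interval, second_diff_threshold):
    """A target time that precedes the next access by exactly the second threshold.

    Any time no further than second_diff_threshold from the next access
    would satisfy the condition; this picks the earliest such time.
    """
    return next_access_time(last_access, avg_interval) - second_diff_threshold
```

For example, with a weekly interval of 604800 seconds, a last access at time 1000, and the current time 2000, the set would be moved to object storage under a one-day first threshold and scheduled to return one hour before the estimated next access under a 3600-second second threshold.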

As a possible implementation manner of the embodiment of the present application, the data migration apparatus 800 may further include:

A third acquisition module, configured to acquire file metadata of at least one file to be stored.

A fifth determining module, configured to determine a similarity between the file metadata and third metadata of each fifth file set in the data lake.

A sixth determining module, configured to determine a target file set from the fifth file sets according to the similarity of each fifth file set.

A storage module, configured to store the file to be stored in a target storage system, the target storage system being the storage system in the data lake in which the target file set is stored.
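The placement of a new file by metadata similarity can be sketched in Python as follows. The similarity measure used here (the fraction of metadata attributes with matching values), the dictionary layout, and the names `similarity` and `choose_target_storage` are hypothetical; this section of the text does not specify how the similarity is computed.

```python
def similarity(file_meta, set_meta):
    """Fraction of the file's metadata attributes that match the set's (hypothetical)."""
    if not file_meta:
        return 0.0
    matches = sum(1 for key, value in file_meta.items() if set_meta.get(key) == value)
    return matches / len(file_meta)


def choose_target_storage(file_meta, fifth_sets):
    """Return the storage system of the most similar file set.

    fifth_sets is a non-empty list of dicts with 'meta' and 'storage' keys.
    """
    target_set = max(fifth_sets, key=lambda s: similarity(file_meta, s["meta"]))
    return target_set["storage"]


# Hypothetical metadata for one incoming file and two existing file sets.
new_file = {"owner": "etl", "format": "parquet", "project": "sales"}
existing_sets = [
    {"meta": {"owner": "etl", "format": "parquet", "project": "sales"},
     "storage": "distributed_file_system"},
    {"meta": {"owner": "audit", "format": "csv", "project": "archive"},
     "storage": "object_storage"},
]
target = choose_target_storage(new_file, existing_sets)
```

The effect is that a new file lands in the same storage system as the file set it most resembles, so it inherits that set's hot or cold placement without waiting for its own access history.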

As a possible implementation manner of the embodiment of the present application, the first obtaining module 801 is specifically configured to: in response to a configuration operation, configure a target level corresponding to the file sets; determine each first folder located at the target level of the directory hierarchy in the distributed file system; generate a first file set from the files under the same first folder; and determine the first metadata of each first file set according to the file metadata of each first file in that first file set.

Alternatively, the first obtaining module 801 is specifically configured to: acquire a specified level; determine each second folder located at the specified level of the directory hierarchy in the distributed file system; generate a first file set from the files under the same second folder; and determine the first metadata of each first file set according to the file metadata of each first file in that first file set.
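Grouping files into file sets by directory level can be sketched in Python. The sketch assumes absolute POSIX-style paths, counts levels from the root (level 1 is a folder directly under `/`), and places every file beneath a folder at the chosen level into that folder's set; these conventions and the function name `group_by_level` are assumptions, not details stated in the text.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_level(file_paths, level):
    """Group absolute file paths by their ancestor folder at the given level."""
    file_sets = defaultdict(list)
    for raw in file_paths:
        # Directory components of the file, without the leading "/".
        dirs = PurePosixPath(raw).parent.parts[1:]
        if len(dirs) < level:
            continue  # the file sits above the chosen level; skip it
        folder = "/" + "/".join(dirs[:level])
        file_sets[folder].append(raw)
    return dict(file_sets)


# Hypothetical paths in the data lake.
paths = [
    "/warehouse/sales/2023/a.parquet",
    "/warehouse/sales/2024/b.parquet",
    "/warehouse/hr/c.csv",
    "/top.txt",
]
sets_at_level_2 = group_by_level(paths, level=2)
```

Each resulting group corresponds to one file set, whose metadata would then be derived from the file metadata of its members.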

As one possible implementation manner of the embodiments of the present application, the first metadata includes at least one of the following:

a first creation time, selected from the creation times in the file metadata of the first files in the first file set;

a first update time, selected from the update times in the file metadata of the first files in the first file set;

a first access time, selected from the access times in the file metadata of the first files in the first file set;

a first number of accesses, determined according to the numbers of accesses in the file metadata of the first files in the first file set;

a first access average interval, determined according to the access average intervals in the file metadata of the first files in the first file set.
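Deriving set-level metadata from file-level metadata can be sketched in Python. The aggregation rules chosen here (earliest creation time, latest update and access times, summed access counts, mean of the per-file average intervals) are assumptions: the text says only that the times are "selected from" the file values and that the count and interval are "determined according to" them. The `FileMeta` class and `aggregate` function are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class FileMeta:
    creation_time: float
    update_time: float
    access_time: float
    access_count: int
    avg_access_interval: float


def aggregate(files):
    """Build file-set metadata from a non-empty list of per-file metadata."""
    return FileMeta(
        creation_time=min(f.creation_time for f in files),   # earliest creation
        update_time=max(f.update_time for f in files),       # latest update
        access_time=max(f.access_time for f in files),       # latest access
        access_count=sum(f.access_count for f in files),     # total accesses
        avg_access_interval=mean(f.avg_access_interval for f in files),
    )


# Two hypothetical files in the same file set.
set_meta = aggregate([
    FileMeta(100.0, 200.0, 300.0, 4, 50.0),
    FileMeta(150.0, 250.0, 280.0, 6, 30.0),
])
```

Other selection rules, such as taking the latest creation time, would fit the wording equally well; the choice affects how quickly a set with mixed old and new files is treated as hot.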

The data migration apparatus of the embodiment of the present application determines a first heat value of each first file set according to the first metadata of each first file set in the object storage of a data lake, and determines a second file set to be migrated from the first file sets according to those first heat values, wherein the first heat value of the second file set is greater than a first heat threshold; it then migrates each first file in the second file set to the distributed file system of the data lake. Because the heat value of each first file set is determined from its metadata in the object storage, a second file set belonging to hot data can be identified by its heat value, and all files in that second file set can be migrated together to the distributed file system, thereby improving the processing performance of the data storage system on hot data.

In order to implement the above embodiment, the present application further provides an electronic device, and fig. 9 is a schematic structural diagram of the electronic device provided in the embodiment of the present application. The electronic device includes:

Memory 901, processor 902, and a computer program stored on memory 901 and executable on processor 902.

The processor 902, when executing the program, implements the data migration method provided in any of the embodiments described above.

Further, the electronic device further includes:

a communication interface 903 for communication between the memory 901 and the processor 902.

The memory 901 is configured to store a computer program executable on the processor 902.

The memory 901 may include high-speed RAM and may also include non-volatile memory, such as at least one disk memory.

The processor 902 is configured to implement the data migration method according to any one of the foregoing embodiments when executing the program.

If the memory 901, the processor 902, and the communication interface 903 are implemented independently, they may be connected to and communicate with one another through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses may be classified as address buses, data buses, control buses, and so on. For ease of illustration, the bus is shown as a single thick line in fig. 9, but this does not mean that there is only one bus or only one type of bus.

Alternatively, in a specific implementation, if the memory 901, the processor 902, and the communication interface 903 are integrated on a chip, the memory 901, the processor 902, and the communication interface 903 may communicate with each other through internal interfaces.

The processor 902 may be a central processing unit (Central Processing Unit, abbreviated as CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or one or more integrated circuits configured to implement embodiments of the present application.

In order to implement the above embodiments, the embodiments of the present application also propose a non-transitory computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements a data migration method as provided in any of the embodiments above.

In order to implement the above embodiments, the embodiments of the present application further propose a computer program product; when instructions in the computer program product are executed by a processor, the data migration method provided in any of the above embodiments is implemented.

In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may join and combine the different embodiments or examples described in this specification, and the features of those embodiments or examples, provided they do not contradict one another.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.

Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, the steps or methods may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.

Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.

The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like. Although embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application, and that changes, modifications, substitutions, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims

1. A method of data migration, the method comprising:

2. The method of claim 1, wherein determining a first heat value for each of the first filesets based on the first metadata for each of the first filesets comprises:

for any first file set, determining a usage heat value of the first file set according to at least one of a first creation time, a first update time, and a first access time in the corresponding first metadata;

determining a frequency heat value of the first file set according to at least one of the first number of accesses and the first average interval of accesses in the corresponding first metadata;

and determining a first heat value of the first file set according to at least one of the usage heat value and the frequency heat value.

3. The method of claim 2, wherein the determining, for any first set of files, a usage heat value for the any first set of files based on at least one of a first creation time, a first update time, and a first access time in the corresponding first metadata, comprises:

for any first file set, carrying out weighted summation on the first creation time, the first update time and the first access time in the corresponding first metadata to obtain target time;

determining a time difference between the target time and a set time;

and determining the usage heat value of the any first file set according to the time difference, wherein the usage heat value is positively correlated with the time difference.

4. The method of claim 2, wherein determining the frequency heat value of the any one of the first filesets according to at least one of the first number of accesses and the first average interval of accesses in the corresponding first metadata comprises:

determining a first sub-heat value of any one of the first file sets according to the first access times in the corresponding first metadata, wherein the first access times and the first sub-heat value form a positive correlation;

determining a second sub-heat value of any one of the first file sets according to a first access average interval in the corresponding first metadata, wherein the first access average interval and the second sub-heat value are in a negative correlation;

and determining the frequency heat value of any first file set according to at least one of the first sub heat value and the second sub heat value.

5. The method according to claim 1, characterized in that the method further comprises:

acquiring second metadata of at least one third file set in the distributed file system in the data lake, wherein the third file set comprises at least one second file;

determining a second heat value of each third file set according to the second metadata of each third file set;

determining a fourth file set to be migrated from each third file set according to the second heat value of each third file set; wherein the second heat value of the fourth set of files is less than or equal to a second heat threshold;

and migrating each second file in the fourth file set to the object storage of the data lake.

6. The method of claim 5, wherein said migrating each of said second files in said fourth set of files into said object store of said data lake comprises:

determining whether the fourth file set is a periodically accessed file set according to a second access time and a second access average interval in second metadata of the fourth file set;

determining the next access time of the fourth file set according to the second access time and the second access average interval in the second metadata of the fourth file set under the condition that the fourth file set is the periodically accessed file set;

and under the condition that the difference between the current time and the next access time is larger than a first difference threshold value, migrating each second file in the fourth file set into the object storage of the data lake.

7. The method of claim 6, wherein, in the event that the difference between the current time and the next access time is greater than a first difference threshold, after migrating each of the second files in the fourth set of files into the object store of the data lake, the method further comprises:

in response to reaching a target time, migrating the fourth set of files in the object store into the distributed file system;

wherein the difference between the target time and the next access time is less than or equal to a second difference threshold.

8. The method according to any one of claims 1-7, further comprising:

acquiring file metadata of at least one file to be stored;

determining the similarity between the file metadata and third metadata of each fifth file set in the data lake;

determining a target file set from each fifth file set according to the similarity of each fifth file set;

and storing the files to be stored into a target storage system stored by the target file set in the data lake.

9. The method of any of claims 1-7, wherein the obtaining the first metadata of the at least one first set of files in the distributed file system in the data lake comprises:

in response to a configuration operation, configuring a target level corresponding to the file sets;

determining each first folder in the distributed file system, which is located in the target hierarchical directory;

generating a first file set according to each file under the same folder;

determining first metadata of each first file set according to the file metadata of each first file in each first file set;

or,

acquiring a designated level;

determining each second folder in the distributed file system located in the specified hierarchical directory;

generating a first file set according to each file under the same second folder;

and determining the first metadata of each first file set according to the file metadata of each first file in each first file set.

10. The method of any of claims 1-7, wherein the first metadata comprises at least one of:

the first updating time is selected from updating time in file metadata of each first file in the first file set;

11. A data migration apparatus, the apparatus comprising:

12. An electronic device, comprising:

memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the data migration method according to any one of claims 1-10 when executing the program.

13. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the data migration method according to any one of claims 1-10.