CN117370272A

CN117370272A - File management method, device, equipment and storage medium based on file heat

Info

Publication number: CN117370272A
Application number: CN202311389337.1A
Authority: CN
Inventors: 梁尔真; 袁学群; 夏磊; 陈平刚; 郑望献; 蔡利华; 周蕾; 曹军
Original assignee: Zhejiang Xinghan Information Technology Ltd By Share Ltd
Current assignee: Zhejiang Xinghan Information Technology Ltd By Share Ltd
Priority date: 2023-10-25
Filing date: 2023-10-25
Publication date: 2024-01-09

Abstract

The invention discloses a file management method, device, equipment and storage medium based on file heat. The method comprises the steps of obtaining access record data of files to be managed in a preset past time period; inputting the access record data into a pre-trained LSTM model to perform access frequency prediction, and obtaining an access frequency prediction result; determining a predicted heat level of the file to be managed based on the access frequency prediction result and a preset access heat level; based on the predicted heat level, the files to be managed are moved to the corresponding solid state disk, mechanical hard disk or magnetic tape, so that reasonable distribution of storage resources is realized, the access efficiency of files with high access frequency is improved, and the overall storage cost of the files is reduced.

Description

File management method, device, equipment and storage medium based on file heat

Technical Field

The embodiment of the invention relates to a data processing technology, in particular to a file management method, a device, equipment and a storage medium based on file heat.

Background

In the information age, rapid data growth has become the norm. Enterprises, organizations, and individuals all face the challenge of handling large volumes of electronic archives. These archives may include text documents, images, audio, video, and other data in a variety of formats. When handling such large amounts of data, efficient archive management becomes critical.

In most cases, archives are not accessed uniformly. Some are accessed frequently, while others are rarely or almost never accessed. Traditional storage methods are usually static and tend to cause the following problems. (1) Resource waste: storing all archives in the same location means that high-heat and low-heat archives occupy the same storage resources, wasting valuable storage space. (2) Inefficient access: because high-heat archives share a location with low-heat archives, they compete with a large number of low-heat archives for access resources, which may slow access to them. (3) Complex data management: backup, migration, or deletion may require manual intervention, adding to the complexity and cost of management.

Disclosure of Invention

The invention provides a file management method, device, equipment and storage medium based on file heat, so as to realize dynamic management of files, and enable the files to have higher access efficiency and resource utilization rate.

In a first aspect, an embodiment of the present invention provides an archive management method based on archive heat, including:

acquiring access record data of files to be managed in a preset past time period;

inputting the access record data into a pre-trained LSTM model to conduct access frequency prediction, and obtaining an access frequency prediction result;

determining the predicted heat level of the file to be managed based on the access frequency prediction result and a preset access heat level;

and moving the files to be managed to corresponding solid state disks, mechanical hard disks or magnetic tapes based on the predicted heat level.

Optionally, after obtaining the access record data of the files to be managed in the preset past time period, the method includes:

structuring the access record data according to a preset data structure to obtain processed access record data with a unified data structure;

and quantizing the processed access record data using one-hot encoding to obtain target access record data.

Optionally, the pre-trained LSTM model is obtained by:

processing sample archives used to train the LSTM model for access frequency prediction to obtain a sample set;

initializing the weights and biases of a preset LSTM model based on a random seed;

and training and testing the LSTM model by using the sample set to obtain a target LSTM model which meets the consistency requirement and takes the access frequency of the file as an output target.

Optionally, processing the sample archives used to train the LSTM model for access frequency prediction to obtain a training set and a test set includes:

taking historical access record data of files within a first preset time length as sample data, and taking file access frequency of a second preset time length after the first preset time length as a sample label of the sample data to obtain a training set and a test set which are composed of the sample data and the sample label.

Optionally, processing the sample archives used to train the LSTM model for access frequency prediction to obtain a sample set further includes:

randomly partitioning the sample set into a training set and a test set, and standardizing the data with the standard z-score method.

Optionally, a cross entropy loss function is selected in the LSTM model as a loss function in the training process.

Optionally, after training and testing the LSTM model by using the sample set to obtain a target LSTM model with the access frequency of the archive as an output target, the method further includes:

calculating Kappa coefficient and model accuracy of the target LSTM model;

and updating the target LSTM model based on a preset Kappa threshold value and an accuracy rate threshold value.

In a second aspect, an embodiment of the present invention further provides an archive management device based on archive heat, including:

the acquisition module is used for acquiring access record data of the files to be managed in a preset past time period;

the prediction module is used for inputting the access record data into a pre-trained LSTM model to perform access frequency prediction, and obtaining an access frequency prediction result;

the determining module is used for determining the predicted heat level of the file to be managed based on the access frequency prediction result and a preset access heat level;

and the execution module is used for moving the files to be managed to the corresponding solid state disk, mechanical hard disk or magnetic tape based on the predicted heat level.

In a third aspect, an embodiment of the present invention further provides an archive management device based on archive heat, where the device includes:

one or more processors;

a storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the archive management method based on archive heat as described in the first aspect.

In a fourth aspect, embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing the archive management method based on archive heat as described in the first aspect.

According to the invention, access record data of the files to be managed in a preset past time period are obtained, access frequency prediction is carried out by utilizing a pre-trained LSTM model, an access frequency prediction result and a predicted heat level of the files to be managed are obtained, and the files to be managed are moved to corresponding solid state disks, mechanical hard disks or magnetic tapes based on the predicted heat level, so that reasonable allocation of storage resources is realized, the access efficiency of the files with high access frequency is improved, and the overall storage cost of the files is reduced.

Drawings

FIG. 1 is a flowchart of a file management method based on file hotness according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a file management apparatus based on file heat according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a file management apparatus based on file hotness according to an embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.

Fig. 1 is a flowchart of a file management method based on file heat according to an embodiment of the present invention. The embodiment is applicable to scenarios in which files are managed dynamically, and the method may be executed by a file management device based on file heat. It specifically includes the following steps:

Step 110, access record data of the files to be managed in a preset past time period are obtained.

With the development of the information age, businesses, organizations, and individuals create, access, and process a vast array of electronic archives every day, which may include text documents, images, audio, video, and other data in a variety of formats. When handling such large amounts of data, efficient archive management becomes critical.

The heat (or access heat) of an archive is a key concept: it refers to the frequency with which the archive is accessed or used. In most cases, archives are not accessed uniformly; some are accessed frequently, while others are rarely or almost never accessed. Conventional storage methods are generally static. They do not take differences in archive heat into account and instead store all archives in the same location or device, so that archives with low access heat can seriously degrade the efficiency with which users access archives with high access heat. Archives with high access heat may even be stored on a device with low access efficiency while archives with low access heat are stored on a device with high access efficiency; in that case, users accessing the high-heat archives are held back by the slower device, which severely reduces the efficiency of obtaining and accessing the target archives.

In a specific implementation, each access operation on an archive generates corresponding access record data, which may record archive information, user information, archive creation, opening and closing operations, and every file pointer movement and data read or write.

Step 120, inputting the access record data into a pre-trained LSTM model to conduct access frequency prediction, and obtaining an access frequency prediction result.

In the embodiment of the invention, the access frequency is predicted based on the access record data of the file by adopting a pre-trained LSTM model, so that an access frequency prediction result is obtained.

In the embodiment of the invention, prediction with an LSTM model better matches the temporal characteristics of archive access, which improves the prediction of archive heat and makes archive migration and classified storage better founded and more practical.

Step 130, determining the predicted heat level of the file to be managed based on the access frequency prediction result and the preset access heat level.

In the embodiment of the invention, different access frequency prediction results are divided into different access heat levels, and files with different access heat levels are stored by adopting different storage strategies so as to match the access requirements of users on the files, so that files with higher access frequency can be accessed more efficiently.

Step 140, moving the files to be managed to the corresponding solid state disk, mechanical hard disk or magnetic tape based on the predicted heat level.

Illustratively, the predicted heat level is divided into cold archives, warm archives, and hot archives. Cold archives are periodically migrated to tape. To further distinguish warm archives from hot archives, an access frequency threshold gamma is defined: archives with a frequency less than or equal to gamma are defined as warm archives, which the migration system periodically migrates to a mechanical hard disk; archives with a frequency greater than gamma are defined as hot archives, which the migration system periodically migrates to a solid state disk.
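The tiering rule in this step can be sketched as a small function. This is only a minimal illustration: it assumes, following the 30-day example given later in this description, that a predicted frequency of zero marks a cold archive. The function name `storage_tier` and the tier labels `"tape"`, `"hdd"` and `"ssd"` are hypothetical and do not come from the source.

```python
def storage_tier(predicted_frequency: float, gamma: float) -> str:
    """Map a predicted access frequency to a storage tier.

    Assumptions (not from the source): tiers are plain string labels,
    and a predicted frequency of zero marks a cold archive.
    """
    if predicted_frequency < 0:
        raise ValueError("access frequency cannot be negative")
    if predicted_frequency == 0:
        return "tape"  # cold archive
    if predicted_frequency <= gamma:
        return "hdd"   # warm archive: 0 < frequency <= gamma
    return "ssd"       # hot archive: frequency > gamma
```

With `gamma = 10`, a predicted frequency of 0 maps to tape, 10 to the mechanical hard disk, and 11 to the solid state disk.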

According to the technical scheme, access record data of the files to be managed in a preset past time period are obtained, access frequency prediction is carried out in a pre-trained LSTM model, an access frequency prediction result and a predicted heat level of the files to be managed are obtained, the files to be managed are moved to corresponding solid state disks, mechanical hard disks or magnetic tapes based on the predicted heat level, reasonable distribution of storage resources is achieved, access efficiency of files with high access frequency is improved, and overall storage cost of the files is reduced.

In an embodiment of the present invention, by way of example, n archive storage categories may be defined, each with different access performance and resource allocation. Correspondingly, n archive access heat levels (0, 1, ..., n-1) are defined. Using one-hot encoding, the heat label of an archive is converted to a sparse vector y = {0, ..., 1, ..., 0}.
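As a minimal sketch of the one-hot conversion just described, a heat level k out of n levels becomes a vector of length n with a single 1 at position k. The function name `one_hot` is hypothetical.

```python
from typing import List


def one_hot(level: int, n_levels: int) -> List[int]:
    """Encode a heat level in 0..n_levels-1 as a one-hot vector."""
    if not 0 <= level < n_levels:
        raise ValueError("level must lie in [0, n_levels)")
    vector = [0] * n_levels
    vector[level] = 1
    return vector
```

For three levels (cold, warm, hot), level 1 is encoded as `[0, 1, 0]`.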

The archive access records of the archive storage server for the past 30 days are taken, and the access features extracted from the access log of the first 27 days are used as the input of the prediction model. The access frequency Q of each archive over the following 3 days is divided into several intervals based on the aforementioned access heat levels. Archives with Q equal to 0 are defined as cold archives, and the archive migration system periodically stores them to tape. To further distinguish warm archives from hot archives, an access frequency threshold gamma is defined: archives with a frequency less than or equal to gamma are defined as warm archives, which the migration system periodically migrates to a mechanical hard disk; archives with a frequency greater than gamma are defined as hot archives, which the migration system periodically migrates to a solid state disk.
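The 27-day/3-day split and the resulting label can be sketched as follows. The sketch makes several assumptions that the source does not state: the access log has already been reduced to one access count per day, Q is the total number of accesses over the last 3 days, and the labels 0, 1 and 2 stand for cold, warm and hot. The names `split_window` and `heat_label` are hypothetical.

```python
from typing import List, Tuple


def split_window(daily_counts: List[int],
                 input_days: int = 27,
                 label_days: int = 3) -> Tuple[List[int], int]:
    """Split a 30-day series into model input and the label frequency Q."""
    if len(daily_counts) != input_days + label_days:
        raise ValueError("expected exactly input_days + label_days values")
    features = daily_counts[:input_days]   # first 27 days: model input
    q = sum(daily_counts[input_days:])     # following 3 days: frequency Q
    return features, q


def heat_label(q: int, gamma: int) -> int:
    """Return 0 (cold, Q == 0), 1 (warm, Q <= gamma) or 2 (hot, Q > gamma)."""
    if q == 0:
        return 0
    return 1 if q <= gamma else 2
```

A series whose last three days contain 2, 3 and 1 accesses gives Q = 6, which is labeled hot for `gamma = 5`.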

The archive storage system maintains and persistently stores a historical access log for each archive, keyed by archive name. The log records file creation, opening and closing operations, each file pointer movement, data reads and writes, and the like. The mean and variance of each type of file operation are calculated to measure how its dispersion changes along the time axis, the temporal characteristics of archive access are mined, and these are organized, according to a suitable time window, into a time-series access feature sequence for the archive.
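One way to read the mean-and-variance step is as a per-window summary of an operation count series. The sketch below assumes non-overlapping windows and the population variance; the source fixes neither choice, and the name `window_features` is hypothetical.

```python
from statistics import mean, pvariance
from typing import Dict, List


def window_features(counts: List[int], window: int) -> List[Dict[str, float]]:
    """Summarize an operation count series by mean and variance per window.

    Windows do not overlap; a trailing partial window is dropped.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    features = []
    for start in range(0, len(counts) - window + 1, window):
        chunk = counts[start:start + window]
        features.append({"mean": mean(chunk), "variance": pvariance(chunk)})
    return features
```

Applying this to the read, write, open and close counts of an archive and concatenating the results per window would yield one feature vector per time step.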

The archive I/O access record data structure is defined as a 24-byte string. Byte 0 is the file operation type field (for example, open, close, read, or write); bytes 1 to 16 are the file name hash field, the hashed file names having a uniform length to improve query efficiency; bytes 17 to 20 are the file operation time field; bytes 21 to 23 are extension fields that record the user name, file operation permissions, and the like. When model training data is prepared, the start and end times of the collected archive I/O access records are denoted $t_s$ and $t_e$ respectively, and the time span is $\Delta t = t_e - t_s$.
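The 24-byte layout can be sketched with Python's `struct` module. Only the field widths come from the text; everything else here is an assumption: MD5 is used because it produces a 16-byte digest (the source does not name the hash), the time field is treated as an unsigned 32-bit big-endian Unix timestamp, and the operation codes in `OP_CODES` are hypothetical.

```python
import hashlib
import struct

# Byte 0: operation type, bytes 1-16: name hash, bytes 17-20: time,
# bytes 21-23: extension. ">" disables padding: 1 + 16 + 4 + 3 = 24 bytes.
RECORD = struct.Struct(">B16sI3s")

# Hypothetical operation codes; the source does not define the values.
OP_CODES = {"open": 0, "close": 1, "read": 2, "write": 3}


def pack_record(op: str, file_name: str, timestamp: int,
                extension: bytes = b"\x00\x00\x00") -> bytes:
    """Build one 24-byte access record."""
    name_hash = hashlib.md5(file_name.encode("utf-8")).digest()  # 16 bytes
    return RECORD.pack(OP_CODES[op], name_hash, timestamp, extension)


def unpack_record(raw: bytes) -> dict:
    """Decode a 24-byte access record into its four fields."""
    op_code, name_hash, timestamp, extension = RECORD.unpack(raw)
    return {"op_code": op_code, "name_hash": name_hash,
            "timestamp": timestamp, "extension": extension}
```

A fixed-width record like this can be read from a log file in 24-byte slices without any delimiter parsing.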

Minimizing the loss function is set as the training objective of the model, and the weights and biases in the LSTM network are randomly initialized from a given random seed. Model training uses the gradient back-propagation algorithm, and the parameters in the network are updated with the Adam stochastic optimization algorithm.
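Seeded initialization can be sketched in plain Python as follows. The source only says that weights and biases are randomized from a given seed; the uniform range of plus or minus 1/sqrt(S), the four gate names `f`, `i`, `o`, `g`, and the dictionary layout are assumptions made for illustration.

```python
import math
import random
from typing import Dict


def init_lstm_params(input_size: int, state_size: int, seed: int) -> Dict[str, list]:
    """Randomly initialize LSTM weights and biases from a fixed seed.

    Each gate gets a weight matrix of shape
    state_size x (input_size + state_size) and a bias of length state_size.
    """
    rng = random.Random(seed)  # private generator, so runs are reproducible
    bound = 1.0 / math.sqrt(state_size)
    params: Dict[str, list] = {}
    for gate in ("f", "i", "o", "g"):
        params["W_" + gate] = [
            [rng.uniform(-bound, bound) for _ in range(input_size + state_size)]
            for _ in range(state_size)
        ]
        params["b_" + gate] = [rng.uniform(-bound, bound) for _ in range(state_size)]
    return params
```

Calling the function twice with the same seed returns identical parameters, which is what makes a training run repeatable.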

The original archive access feature time series is defined as $F_o = \{f_1, \ldots, f_n\}$, where $n$ is the total number of archives and $f_t$ is the time series of the $t$-th archive, $t \in [1, n]$.

The training set and the test set are divided randomly and standardized with the standard z-score method. The standardized training set can be expressed as:

$F'_{train} = \{f'_1, \ldots, f'_n\}$

where $1 \le t \le L$, $t$ is the archive sequence number, and $L$ is the unrolling step length of the model, that is, the hidden layer comprises $L$ connected LSTM neurons. The segmented model input is $X = \{X_1, X_2, \ldots, X_L\}$, where $X$ is the archive access I/O record extracted in the second step, and the corresponding output $Y$ is the archive access heat label defined in the second step.
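The random split and z-score standardization can be sketched with the standard library. The 20% test ratio, the use of the population standard deviation, and the function names are assumptions; the source specifies only that the split is random and that z-score standardization is applied.

```python
import random
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def z_score(values: Sequence[float]) -> List[float]:
    """Standardize values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant series: nothing to scale
    return [(v - mu) / sigma for v in values]


def random_split(samples: Sequence, test_ratio: float = 0.2,
                 seed: int = 0) -> Tuple[list, list]:
    """Shuffle the samples with a fixed seed and split into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

In practice the mean and standard deviation would usually be computed on the training set only and then reused for the test set, so that no test information leaks into training.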

The model input layer passes the archive access I/O record $X$ to the hidden layer, and the output after the hidden layer is:

$O = \{O_1, O_2, \ldots, O_L\}$

$O_p = \mathrm{LSTM}_{forward}(X_p, C_{p-1}, H_{p-1})$

where $C_{p-1}$ and $H_{p-1}$ are the state and output of the previous LSTM neuron, respectively, and the function $\mathrm{LSTM}_{forward}$ represents the forward propagation of information in an LSTM neuron. Assuming the neuron state vector has size $S$, the vectors $C_{p-1}$ and $H_{p-1}$ also have size $S$.
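The source does not spell out the internals of the forward function. The sketch below assumes the standard LSTM cell, with forget, input and output gates plus a candidate state, and takes the step output $O_p$ to be the new hidden output $H_p$. The parameter dictionary keys (`W_f`, `b_f`, and so on) are hypothetical; each weight matrix has S rows and (input size + S) columns.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _affine(weights: List[Vector], inputs: Vector, bias: Vector) -> Vector:
    """Compute weights @ inputs + bias using plain Python lists."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def lstm_forward(x_p: Vector, c_prev: Vector, h_prev: Vector,
                 params: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One step of a standard LSTM cell; returns (H_p, C_p)."""
    z = list(x_p) + list(h_prev)  # concatenated input [X_p, H_{p-1}]
    f = [_sigmoid(a) for a in _affine(params["W_f"], z, params["b_f"])]   # forget gate
    i = [_sigmoid(a) for a in _affine(params["W_i"], z, params["b_i"])]   # input gate
    o = [_sigmoid(a) for a in _affine(params["W_o"], z, params["b_o"])]   # output gate
    g = [math.tanh(a) for a in _affine(params["W_g"], z, params["b_g"])]  # candidate
    c_p = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h_p = [ok * math.tanh(ck) for ok, ck in zip(o, c_p)]
    return h_p, c_p
```

Unrolling this step L times, feeding each returned pair into the next call, produces the sequence of outputs for the L time steps.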

A softmax layer is connected after the LSTM hidden layer output to produce the probability of each access heat level. During prediction, the class label with the maximum probability is output, namely:
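The formula itself is not reproduced in this text. Under the usual reading, the predicted label is the index of the largest softmax probability, which can be sketched as below. How the hidden output is mapped to one score per heat level (typically a linear layer) is not specified in the source, so the sketch takes the per-class scores as given; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(scores: Sequence[float]) -> int:
    """Return the class index with the highest softmax probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)
```

Because softmax preserves the ordering of the scores, the predicted label is the same as the index of the largest raw score.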

Model training uses a cross-entropy loss function as the loss function, defined as follows:
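The definition is not reproduced in this text. Assuming the standard categorical cross-entropy over one-hot heat labels, it would take the form below. The symbols are notation introduced here rather than taken from the source: $N$ is the number of training samples, $n$ the number of heat levels, $y_{i,k}$ the one-hot label of sample $i$, and $\hat{y}_{i,k}$ the softmax probability that sample $i$ belongs to level $k$.

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=0}^{n-1} y_{i,k} \, \log \hat{y}_{i,k}
```

Since each label vector contains a single 1, the inner sum reduces to the negative log-probability assigned to the true heat level.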

The output of the model is the predicted access heat of an archive, that is, the interval into which its access frequency falls, and prediction accuracy is an important index of model performance. The invention also seeks to minimize frequent migration of archives between storage categories in order to reduce resource consumption. Typically, a slight fluctuation in the access frequency of an archive does not change its storage category, so no migration is required.

The Kappa coefficient is used to evaluate the consistency of the model. Its value range is taken to be [0, 1]; the higher the value, the higher the prediction confidence for each archive category. Conversely, a value approaching 0 indicates that the model's classification result is close to random classification. The Kappa coefficient is calculated as follows:

$\kappa = \dfrac{p_o - p_e}{1 - p_e}$

where $p_o$ is the overall accuracy and $p_e$ is the agreement expected by chance.
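Both quantities can be computed directly from true and predicted labels. The sketch below uses the standard Cohen's Kappa, (p_o - p_e) / (1 - p_e), with p_e estimated from the per-class label counts; the function name `accuracy_and_kappa` is hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def accuracy_and_kappa(y_true: Sequence[int],
                       y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (overall accuracy p_o, Cohen's Kappa)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the class marginals, summed over classes.
    p_e = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if p_e == 1.0:
        return p_o, float("nan")  # Kappa is undefined when only one class occurs
    return p_o, (p_o - p_e) / (1.0 - p_e)
```

For true labels `[0, 0, 1, 1]` and predictions `[0, 0, 1, 0]`, the accuracy is 0.75, the chance agreement is 0.5, and Kappa is 0.5.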

With the model accuracy and the Kappa coefficient as indicators (for example, a model accuracy greater than 80% and a Kappa coefficient greater than 0.75), training of the model continues until the indicators are met.
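The stopping rule can be sketched as a loop around two callables. Everything about the interface is hypothetical: `train_one_round` stands for whatever performs one more round of training, `evaluate` returns the current (accuracy, Kappa) pair on the test set, and `max_rounds` is a safety cap that the source does not mention.

```python
from typing import Callable, Tuple


def train_until(train_one_round: Callable[[], None],
                evaluate: Callable[[], Tuple[float, float]],
                min_accuracy: float = 0.80,
                min_kappa: float = 0.75,
                max_rounds: int = 100) -> Tuple[int, float, float]:
    """Keep training until accuracy and Kappa both exceed their thresholds."""
    for round_no in range(1, max_rounds + 1):
        train_one_round()
        accuracy, kappa = evaluate()
        if accuracy > min_accuracy and kappa > min_kappa:
            return round_no, accuracy, kappa
    raise RuntimeError("thresholds not reached within max_rounds")
```

The default thresholds mirror the example values in the text; a deployment would choose its own.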

The first step is invoked to preprocess the archives that need classified storage, and on that basis the second step is invoked to generate input that meets the model's requirements.

Heat prediction is then performed with the model trained in the previous step, and archives are stored and migrated according to the heat prediction result and the archive storage specification, thereby realizing heat-based classified storage.

Fig. 2 is a schematic structural diagram of a file management apparatus based on file heat according to an embodiment of the present invention, and as shown in fig. 2, the file management apparatus based on file heat includes an obtaining module 21, a predicting module 22, a determining module 23 and an executing module 24. Wherein:

an acquisition module 21, configured to acquire access record data of a file to be managed in a preset past period;

the prediction module 22 is configured to input the access record data into a pre-trained LSTM model to perform access frequency prediction, and obtain an access frequency prediction result;

a determining module 23, configured to determine a predicted heat level of the file to be managed based on the access frequency prediction result and a preset access heat level;

the execution module 24 is configured to move the file to be managed to a corresponding solid state disk, mechanical disk, or tape based on the predicted heat level.

Optionally, after obtaining the access record data of the file to be managed in the preset past time period, the method includes:

and quantizing the processed access record data using one-hot encoding to obtain the target access record data.

Optionally, the pre-trained LSTM model includes:

Optionally, processing the sample archives used to train the LSTM model for access frequency prediction to obtain a training set and a test set includes:

taking historical access record data of the archives within a first preset time length as sample data, and taking archives access frequency of a second preset time length after the first preset time length as sample labels of the sample data to obtain a training set and a test set which are composed of the sample data and the sample labels.

randomly dividing the sample set into a training set and a test set, and standardizing the data with the standard z-score method.

Optionally, a cross entropy loss function is selected in the LSTM model as the loss function in the training process.

calculating Kappa coefficient and model accuracy of the target LSTM model;

updating the target LSTM model based on a preset Kappa threshold and an accuracy threshold.

The file management device based on the file heat provided by the embodiment of the invention can execute the file management method based on the file heat provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Fig. 3 is a schematic structural diagram of a file management apparatus based on file heat according to an embodiment of the present invention. As shown in fig. 3, the apparatus includes a processor 30, a memory 31, a communication module 32, an input device 33 and an output device 34. The number of processors 30 in the device may be one or more, one processor 30 being taken as an example in fig. 3. The processor 30, the memory 31, the communication module 32, the input means 33 and the output means 34 in the device may be connected by a bus or other means, connection by a bus being taken as an example in fig. 3.

The memory 31 is a computer readable storage medium, and may be used to store a software program, a computer executable program, and modules, such as program instructions/modules corresponding to the archive management method based on archive heat in the embodiment of the present invention (for example, the acquisition module 21, the prediction module 22, the determination module 23, and the execution module 24 in the archive management device based on archive heat). The processor 30 executes various functional applications of the device and data processing by running software programs, instructions and modules stored in the memory 31, i.e. implements the above-described archive management method based on archive heat.

The memory 31 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for functions; the storage data area may store data created according to the use of the terminal, etc. In addition, the memory 31 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 31 may further include memory located remotely from processor 30, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The communication module 32 is used to establish a connection with the display screen and to exchange data with it. The input means 33 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device, and the output means 34 may comprise a display device such as a display screen.

The file management device based on the file heat provided by the embodiment of the invention can execute the file management method based on the file heat provided by any embodiment of the invention, and particularly has corresponding functions and beneficial effects.

Embodiments of the present invention also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform an archive management method based on archive heat, the method comprising:

inputting the access record data into a pre-trained LSTM model to perform access frequency prediction, and obtaining an access frequency prediction result;

determining a predicted heat level of the file to be managed based on the access frequency prediction result and a preset access heat level;

and moving the files to be managed to the corresponding solid state disk, mechanical hard disk or magnetic tape based on the predicted heat level.

Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the above-mentioned method operations, and may also perform the related operations in the file management method based on file hotness provided in any embodiment of the present invention.

From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and the necessary general purpose hardware, and of course also by means of hardware alone, although in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing over the prior art, may be embodied in the form of a software product. The software product may be stored in a computer readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method according to the embodiments of the present invention.

It should be noted that, in the above embodiment of the archive management device based on archive heat, each unit and module included are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.

Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims

1. An archive management method based on archive heat, comprising:

2. The archive management method based on archive heat according to claim 1, further comprising, after the access record data of the archive to be managed for the preset past time period is acquired:

3. The archive management method based on archive heat of claim 1, wherein the pre-trained LSTM model comprises:

4. The archive management method based on archive heat according to claim 3, wherein processing the sample archives used to train the LSTM model for access frequency prediction to obtain a training set and a test set comprises:

5. The archive management method based on archive heat according to claim 3, wherein processing the sample archives used to train the LSTM model for access frequency prediction to obtain a sample set further comprises:

6. The archive management method based on archive heat according to claim 3, wherein a cross-entropy loss function is selected as the loss function in the training process of the LSTM model.

7. The archive management method based on archive heat according to claim 3, further comprising, after training and testing the LSTM model with the sample set to obtain a target LSTM model that meets the consistency requirement and takes the access frequency of archives as its output target:

calculating Kappa coefficient and model accuracy of the target LSTM model;

8. An archive management device based on archive heat, comprising:

9. An archive management device based on archive heat, the device comprising:

one or more processors;

a storage means for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the archive management method based on archive heat of any one of claims 1-7.

10. A storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the archive management method based on archive heat of any one of claims 1 to 7.