WO2023169188A1

WO2023169188A1 - Popularity identification method and apparatus in file system, and computer device

Info

Publication number: WO2023169188A1
Application number: PCT/CN2023/077025
Authority: WO
Inventors: 杨伦; 付克博; 沈建强; 李亚飞; 魏展
Original assignee: 华为技术有限公司
Priority date: 2022-03-09
Filing date: 2023-02-18
Publication date: 2023-09-14
Also published as: CN116775580A

Abstract

Disclosed in the present application are a popularity identification method and apparatus in a file system, and a computer device, which are applied to the technical field of computers. The method comprises: acquiring an access request from an application program, and determining access objects and access types for the access request; counting the access frequency for the same access type of each access object; and according to a storage path of each access object and the access frequency for the same access type of each access object, synchronously updating the popularity of the access object and the popularity of each node in the direction of a parent node in the storage path where the access object is located, wherein the popularity of the parent node is the sum of the popularity of each child node under the parent node. In the embodiments of the present application, not only the popularity of an access object is counted and updated, but the popularity of a directory is also counted and updated, thereby facilitating the realization of storage analysis and optimization at a directory level; and the popularity of the access object and the popularity of a parent node of the access object are also synchronously updated, thereby solving the problem of update lag of the popularity of an upper-layer directory.

Description

Heat identification method, device and computer equipment in file system

Technical field

The present application relates to the field of computer technology, and in particular to a heat identification method, device and computer equipment in a file system.

Background art

According to the principle of temporal locality, data that is being accessed is likely to be accessed again in the near future, forming a data hotspot. Identifying hot and cold data is very important for optimizing a storage system: it is the basis for hot/cold data classification, data placement and migration, and is indispensable for a storage system to be cost-effective.

The identification of hot and cold data in traditional storage systems is mainly based on data blocks. However, as the scale of unstructured data becomes larger and larger, the number and size of files that need to be identified in the storage system have reached billions and gigabytes respectively. In addition to identifying the popularity of files, sometimes it is also necessary to record the popularity of directories. Building directory popularity helps users more comprehensively perceive the hotness and coldness of data, conduct data analysis and mining, and perform directory-level data flow. The popularity of a directory is the sum of the popularity of the files and subdirectories under the directory. According to the traditional solution, when the user requests to obtain the popularity of a certain directory, the files and subdirectories in the entire directory are recursively traversed and summed to obtain the directory popularity.

However, when the number of files and directories is large, recursive traversal becomes very time-consuming. Moreover, when file and directory popularity are updated, the update proceeds layer by layer from bottom to top, which causes the popularity updates of upper directories to lag behind and fail to meet real-time requirements.

Summary of the invention

This application provides a heat identification method, device and computer equipment in a file system, which are used to realize synchronous updates of files and directories and avoid the problem of lag in upper directory update.

In a first aspect, this application provides a method for identifying popularity in a file system. The method includes: obtaining an access request from an application and determining the access object of the access request; counting the access frequency of each access object; and, according to the storage path of each access object and the access frequency of each access object, synchronously updating the popularity of the access object and the popularity of each node in the parent-node direction along the storage path where the access object is located, where the popularity of a parent node is the sum of the popularity of each child node under that parent node. In the embodiments of this application, not only is the popularity of access objects counted and updated, but the popularity of directories is also counted and updated, which helps to achieve directory-level storage analysis and optimization. In addition, the popularity of the access object and of each node in its parent-node direction is updated synchronously. When the popularity of any node in the parent-node direction is subsequently queried, the already-updated popularity can be returned to the user directly instead of being computed on demand. This avoids lagging updates of upper-directory popularity in the directory tree and avoids file and directory popularity being out of sync in time. When directory popularity needs to be output, stale popularity information is not output because of a lagging directory update, and the directory popularity does not need to be updated on the spot. For large file systems, this can meet real-time requirements.

In a possible implementation, the method further includes: determining the access type for the access object according to the access request. Counting the access frequency of each access object then includes: counting, for each access object, the access frequency of each access type separately. For the same access object, the popularity of different access types can thus be distinguished. For example, by separately counting the popularity of read operations and write operations on the same access object, the user's needs for the access object can be understood in more detail, allowing for more reasonable storage optimization.

In a possible implementation, the access object includes a file directory, a file, or a data block in a file.

In a possible implementation, determining the access object of the access request includes: determining the file requested to be accessed based on the object identifier in the access request, and determining, based on the offset and length in the access request, that the access object is one or more blocks in the file. Synchronously updating the popularity of the access object and the popularity of each node in the parent-node direction along the storage path, according to the storage path and access frequency of each access object, then includes: according to the storage path of the one or more blocks and the access frequency of the one or more blocks, synchronously updating the popularity of the one or more blocks, the popularity of the file, and the popularity of each node in the parent-node direction along the storage path of the file. In this implementation, larger files can be stored in blocks. Popularity is calculated separately for each block, and the storage of different blocks is optimized based on their popularity without having to cache the entire file.

In a possible implementation, the method further includes: periodically attenuating the popularity of each access object; and, if the popularity of a first access object decays to less than or equal to a preset threshold, deleting the popularity of the first access object. In the embodiments of this application, because no separate metadata is set up and there is no dedicated storage space for metadata, the size of the popularity information relative to the available storage space has to be considered. As time passes, the number of access objects gradually increases and so does the popularity information, which may lead to insufficient storage space for the popularity information. Deleting the popularity information of less popular access objects keeps the storage space occupied by popularity information under control.
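The decay-and-delete step described above can be illustrated with a minimal sketch. It assumes a flat dictionary of popularity values keyed by storage path, a fixed decay factor, and float-valued popularity; the function name `decay_and_prune`, the paths, and the numbers are hypothetical and are not taken from the application.

```python
def decay_and_prune(heat, factor, threshold):
    """Decay every stored popularity value and drop entries at or below the threshold.

    Assumption: `heat` is a flat {path: popularity} mapping with float values,
    so no rounding is applied here.
    """
    survivors = {}
    for path, value in heat.items():
        decayed = value * factor
        # An entry whose decayed popularity is <= threshold is deleted.
        if decayed > threshold:
            survivors[path] = decayed
    return survivors


heat = {"/D0/D1/F1": 40.0, "/D0/D1/F2": 2.0, "/D0/D2/F3": 10.0}
heat = decay_and_prune(heat, factor=0.5, threshold=1.0)
print(heat)  # {'/D0/D1/F1': 20.0, '/D0/D2/F3': 5.0}
```

In this sketch `/D0/D1/F2` decays to exactly 1.0, which is equal to the threshold, so its popularity record is removed and no longer occupies storage.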

In a possible implementation, attenuating the popularity of each access object includes: multiplying the popularity of each access object by an attenuation coefficient; and, if the product is not an integer, rounding it down with a probability of 1 minus the attenuation coefficient and rounding it up with a probability equal to the attenuation coefficient. If the product were always rounded up or always rounded down, the sum of the heat of the child nodes would differ from the heat of the parent node, which is not conducive to subsequent heat-based storage optimization of access objects; the above method makes the sum of the heat of the child nodes equal or approximately equal to the heat of the parent node.
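The probabilistic rounding described above might look like the following sketch. It assumes integer popularity values and an attenuation coefficient between 0 and 1; the function name `decay_with_random_rounding` and the use of a seeded `random.Random` instance are illustrative choices, not part of the application.

```python
import math
import random


def decay_with_random_rounding(heat, coefficient, rng):
    """Multiply `heat` by `coefficient` and round a non-integer result at random.

    Following the text: round up with probability `coefficient`,
    round down with probability 1 - `coefficient`.
    """
    scaled = heat * coefficient
    if float(scaled).is_integer():
        return int(scaled)
    if rng.random() < coefficient:
        return math.ceil(scaled)
    return math.floor(scaled)


rng = random.Random(0)  # seeded only to make the sketch repeatable
print(decay_with_random_rounding(4, 0.5, rng))  # exact product, always 2
print(decay_with_random_rounding(5, 0.5, rng))  # either 2 or 3
```

Because the rounding direction varies from node to node rather than always going the same way, the rounding errors of the child nodes tend to cancel instead of accumulating against the parent.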

In a possible implementation, counting the access frequency of each access object includes counting the access frequency of each access object within a preset interval, where the preset interval is any of the following: a preset time interval, a preset amount of traffic, or a preset number of access requests. If a popularity update were performed for every access request, frequent update operations would occupy too much bandwidth when the number of accesses is large. Counting the accesses within the preset interval and then updating the popularity helps save bandwidth resources.

In a possible implementation, the method further includes: determining whether the access object is hot data according to the popularity of the access object and a first popularity threshold; and/or determining whether the access object is cold data according to the popularity of the access object and a second popularity threshold. Classifying hot and cold data helps optimize data storage subsequently.

In a possible implementation, the method further includes: sorting the popularity of all stored access objects in descending order; and using the popularity of the Nth access object as the first popularity threshold, where N satisfies one of the following conditions: N divided by the number of all access objects satisfies a preset proportion condition; or the sum of the popularity of the first N access objects divided by the sum of the popularity of all access objects satisfies a preset proportion condition.
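A minimal sketch of the two ways of choosing N is given below. The text does not spell out the "preset proportion condition", so this sketch assumes it means "N is the largest count not exceeding the given fraction of objects" in the first case and "the smallest N whose cumulative popularity reaches the given fraction of the total" in the second. The function names and sample values are hypothetical.

```python
def threshold_by_count(heats, proportion):
    """Popularity of the Nth hottest object, where N is about proportion * count."""
    if not heats:
        raise ValueError("no popularity values stored")
    ranked = sorted(heats, reverse=True)
    n = max(1, int(len(ranked) * proportion))
    return ranked[n - 1]


def threshold_by_heat_share(heats, proportion):
    """Popularity of the Nth hottest object, where the top N hold `proportion` of all heat."""
    if not heats:
        raise ValueError("no popularity values stored")
    ranked = sorted(heats, reverse=True)
    target = sum(ranked) * proportion
    running = 0
    for value in ranked:
        running += value
        if running >= target:
            return value
    return ranked[-1]


heats = [5, 50, 2, 30, 3, 10]
print(threshold_by_count(heats, 0.5))        # 10: third of six objects
print(threshold_by_heat_share(heats, 0.75))  # 30: 50 + 30 covers 75% of 100
```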

In a possible implementation, the method further includes: receiving a popularity query request, the request being used to request to query the popularity of the target access object; and outputting the popularity of the target access object.

In a second aspect, this application provides a heat identification device in a file system. The device includes modules/units that execute the method of the first aspect and any possible implementation of the first aspect; these modules/units can be implemented in hardware, or by hardware executing corresponding software.

Exemplarily, the device includes: a collection module, used to obtain an access request from an application program, determine the access object of the access request, and count the access frequency of each access object; and a popularity update module, used to synchronously update, according to the storage path of each access object and the access frequency of each access object, the popularity of the access object and the popularity of each node in the parent-node direction along the storage path where the access object is located, where the popularity of a parent node is the sum of the popularity of each child node under that parent node.

In a third aspect, the present application provides a computer device, the computer device including a memory and a processor; the memory stores a computer program; the processor is used to call the computer program stored in the memory to execute the first aspect and the method described in any implementation of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium. Instructions are stored in the computer-readable storage medium. When the instructions are run on a computer, the computer is caused to execute the method described in the first aspect and any implementation of the first aspect.

In a fifth aspect, the present application provides a computer program product containing instructions that, when run on a computer, causes the method described in the first aspect and any implementation manner of the first aspect to be executed.

For the technical effects that can be achieved by any of the possible implementation methods in any of the above-mentioned second to fifth aspects, please refer to the description of the technical effects that can be achieved by the corresponding implementation scheme in the above-mentioned first aspect, and repeated points will not be discussed.

Description of the drawings

Figure 1 is a schematic diagram of Qumulo file directory metadata provided by the embodiment of this application;

Figure 2 is a schematic diagram of the Qumulo file directory metadata update process provided by the embodiment of this application;

Figure 3 is a schematic flow chart of the heat identification method in the file system provided by the embodiment of the present application;

Figure 4 is a schematic diagram of the growth of heat information during the heat update process provided by the embodiment of the present application;

Figure 5 is a schematic diagram of cold and hot data classification provided by the embodiment of the present application;

Figure 6 is a schematic diagram of heat information pruning during the heat update process provided by the embodiment of the present application;

Figure 7 is a schematic structural diagram of a heat identification device provided by an embodiment of the present application;

Figure 8 is a schematic structural diagram of a computer device provided by an embodiment of the present application.

Detailed description

The General Parallel File System (GPFS) is configured with a file heat identification method, including a file heat calculation method and a heat update method. Specifically, the heat is calculated as an exponential moving average of the number of file accesses. The heat value of a file that has not been accessed within a period decays by a percentage. The valid range of the attenuation percentage is 0 to 100%, and the default value is 10%; it can also be customized by the user. When a user accesses a file, the access time (atime) of the file is automatically modified, and the access heat of the file increases. If updates of atime are suppressed, the heat calculation for file access may be adversely affected; that is, there is the following dependency: file is accessed -> atime is updated -> file heat increases.

However, the above method only calculates the popularity of files and does not count the popularity of file directories. Because building directory popularity helps users perceive the hotness and coldness of data more comprehensively, conduct data analysis and mining, and perform directory-level data flow, the distributed file system developed by Qumulo (Qumulo file fabric, QF2) has built-in real-time aggregation and real-time analysis of file metadata.

Qumulo file directory metadata is shown in Figure 1, where rank indicates the level of a directory or file, size indicates the size of the directory or file itself, and the two values in square brackets are the uncoordinated value and the coordinated value, respectively. The uncoordinated value and the coordinated value of a directory are equal to the sum of the uncoordinated values and the sum of the coordinated values, respectively, of the directory itself and all files and subdirectories under it. When the size of file F1 is updated, its uncoordinated value is modified first, as shown in step 1 in Figure 2. Its storage path is then added to the dirty list, as shown in step 2 in Figure 2. Next, a background task asynchronously propagates the update upwards along the storage path of F1 and modifies the coordinated value of F1, as shown in step 3 in Figure 2. It then modifies the uncoordinated value of directory D1, where F1 is located (D1 can also be called the parent node of F1), as shown in step 4 in Figure 2, deletes the storage path of F1 from the dirty list, and adds the storage path of its parent node D1, as shown in step 5 in Figure 2. Objects in the dirty list are sorted by rank. During an update, objects with larger rank values are processed first; that is, the uncoordinated and coordinated values of objects with larger rank values are updated first, and objects with smaller rank values are processed afterwards. The reasons why QF2 has high-performance real-time analysis capabilities include: (1) the analysis module is built into the QF2 file system and is fully integrated with the file system itself; (2) the QF2 file system is implemented based on a multi-way search tree (B-tree). Real-time aggregation of metadata is achieved through the following two techniques: (1) timely updating of aggregated metadata without waiting for a request to trigger a traversal; (2) bottom-up update and top-down traversal.

However, the dirty list updates objects with larger rank values first. If users frequently access objects with larger rank values, popularity updates of upper-level directories in the directory tree lag behind, and file popularity and directory popularity fall out of sync in time. For large file systems, this cannot meet real-time requirements.

In view of this, embodiments of the present application provide a method for identifying hotness in a file system, which is used to realize synchronous updates of files and directories and avoid the problem of lagging updates of upper-level directories.

This method can be applied in a heat identification device; that is, the heat identification device executes the heat identification method. The heat identification device may be deployed on an independent server, or on the same server as other systems; for example, it may be deployed together with a storage system. Further, the heat identification device can include a collection module and a heat update module. The collection module can be set on the client to collect data from the client and send the collected data, or information obtained by processing it, to the heat update module on the server; the heat update module is used to update the popularity of access objects. Alternatively, no collection module is set up in the client and the entire heat identification device is set up in the server. In that case, after receiving the user's access request, the client sends the access request to the heat identification device in the server, so that the heat identification device performs heat statistics and updates.

Refer to Figure 3, which is a schematic flow chart of a heat identification method in a file system provided by an embodiment of the present application. As shown in the figure, the method may include the following steps:

Step 301: Obtain the access request from the application program and determine the access object of the access request.

In a possible implementation, the above step 301 can be performed by a collection module provided on the client in the heat identification device. Specifically, the collection module can use an application programming interface (API). Further, a cache space can be configured for the collection module, and the collection module stores the collected access requests in the cache space.

In another possible implementation, the client does not need to set up a collection module. Instead, the client sends the access request to the heat identification device, so that the heat identification device can perform heat statistics and updates based on the access request. In other words, in step 301 the collection module in the heat identification device receives the access request and determines the access object based on the received access request.

The access request can include information such as a data object identifier, an operation word, an offset and a length. The data object identifier represents the file or file directory requested to be accessed. The operation word indicates the type of operation the user performs on the access object, such as a read operation or a write operation. When a user accesses a file, the user may access only part of the data in the file rather than all of it. The offset represents the starting position of the data the user requests to access, and the length represents the size of that data. For example, if the size of a file is 1GB, and the data that the user needs to access is the data from 0.5GB to 0.6GB in the middle of the file, then the offset in the access request is 0.5GB and the length is 0.1GB.

The collection module can determine the access object based on the data object identifier in the access request, or determine the access object based on the data object identifier, offset, and length in the access request. Among them, the access objects may include file directories, files, or data blocks in files.

In this embodiment of the present application, the data of each file may or may not be stored in blocks. For example, files whose size reaches a preset threshold are divided into blocks, and files whose size does not reach the preset threshold are not. For files that are not stored in blocks, the accessed file or file directory can be determined from the data object identifier alone when determining the access object based on the access request. For files stored in blocks, so that the popularity of each block can be counted accurately later, the block requested to be accessed can be determined from the access request in the above step. For example, file A is 1GB in size and is divided into 10 blocks for storage, with the blocks covering 0~0.1GB, 0.1GB~0.2GB, 0.2GB~0.3GB, ..., 0.9GB~1GB. If the offset in the access request is 0.55GB and the length is 0.1GB, the user requests the data between 0.55GB and 0.65GB. That data is located in the 6th and 7th blocks, so the access objects are determined to be the 6th and 7th blocks of file A.
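The mapping from offset and length to block numbers in the example above can be sketched as follows. The sketch assumes fixed-size blocks, 1-based block numbering as in the text, and integer units (MB here, treating 1GB as 1000MB for round numbers); the function name `blocks_for_range` is hypothetical.

```python
def blocks_for_range(offset, length, block_size):
    """Return the 1-based numbers of the blocks touched by [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // block_size                 # 0-based index of the first block
    last = (offset + length - 1) // block_size   # 0-based index of the last block
    return list(range(first + 1, last + 2))


# File A: 1000 MB split into ten 100 MB blocks; request offset 550 MB, length 100 MB.
print(blocks_for_range(offset=550, length=100, block_size=100))  # [6, 7]
```

Integer units are used on purpose: with fractional gigabytes, floating-point division could place a boundary offset in the wrong block.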

Furthermore, the time of the access request can also be obtained, so that subsequent heat statistics can be analyzed more accurately from a time perspective. Specifically, if the access request carries time information, the time information can be obtained directly from the access request; if it does not, the time at which the access request is obtained can be recorded as the request time.

Step 302: Count the access frequency of each access object.

For example, if access request 1 requests to read the 6th and 7th blocks of file A, the number of read operations on the 6th block of file A is increased by 1 and the number of read operations on the 7th block of file A is increased by 1; if access request 2 requests to write the 6th block of file A, the number of write operations on the 6th block of file A is increased by 1; if access request 3 requests to read file B, the number of read operations on file B is increased by 1; if access request 4 requests to read directory C, the number of read operations on directory C is increased by 1.
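The counting in this example can be reproduced with a short sketch. It assumes each request has already been resolved to a list of access objects plus an access type; the labels such as `"A#6"` for "6th block of file A" and the function name `count_accesses` are hypothetical.

```python
from collections import Counter


def count_accesses(requests):
    """Count accesses per (access object, access type) pair."""
    counts = Counter()
    for objects, access_type in requests:
        for obj in objects:
            counts[(obj, access_type)] += 1
    return counts


requests = [
    (["A#6", "A#7"], "read"),   # access request 1
    (["A#6"], "write"),         # access request 2
    (["B"], "read"),            # access request 3
    (["C"], "read"),            # access request 4
]
counts = count_accesses(requests)
print(counts[("A#6", "read")], counts[("A#6", "write")], counts[("B", "read")])  # 1 1 1
```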

The above step 302 can be executed by the collection module, either by the collection module provided in the client or by the collection module provided in the server.

Optionally, when the collection module performs the above step 302, it can count the access frequency of each access object over the access requests within a preset interval, where the preset interval can be a preset time period, a preset traffic segment, a preset number of access requests, and so on. After the collection module has counted the access frequency of each access object within the preset interval, it sends the counted frequencies to the heat update module used to update the heat of access objects, then clears the counted frequency data and starts counting the access frequency for the next interval.

For example, if the preset time period is 10 minutes, the number of accesses to each access object is counted over all access requests within 10 minutes; when 10 minutes is reached, the counted numbers of accesses are sent to the popularity update module used to update the popularity of access objects, the counts are then cleared to zero, and the number of accesses in the next 10 minutes is counted afresh. For another example, if the preset traffic segment is 1MB, the number of accesses to each access object is counted over multiple access requests whose total size does not exceed 1MB. When the traffic of a newly obtained access request plus the traffic of the access requests already obtained exceeds 1MB, the counted numbers of accesses are sent to the popularity update module, the counts are then cleared, and the access requests within a new preset traffic segment are counted. For another example, if the preset number of access requests is 50, the number of accesses to the objects accessed by 50 access requests is counted and sent to the popularity update module; the counts are then cleared to zero, and the number of accesses to the objects accessed by the next 50 access requests is counted afresh.
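The "preset number of access requests" variant can be sketched as a small collector that reports and resets after every batch. The class name `Collector`, the `send` callback standing in for the transfer to the popularity update module, and the batch size of 3 are all assumptions made for illustration.

```python
from collections import Counter


class Collector:
    """Accumulates access counts and reports them once per `batch_size` requests."""

    def __init__(self, batch_size, send):
        self.batch_size = batch_size
        self.send = send          # hypothetical hook to the popularity update module
        self.counts = Counter()
        self.seen = 0

    def record(self, obj, access_type):
        self.counts[(obj, access_type)] += 1
        self.seen += 1
        if self.seen >= self.batch_size:
            self.flush()

    def flush(self):
        if self.counts:
            self.send(dict(self.counts))
        # Clear the statistics so the next interval starts from zero.
        self.counts.clear()
        self.seen = 0


sent = []
collector = Collector(batch_size=3, send=sent.append)
for obj in ["F1", "F1", "F2", "F1"]:
    collector.record(obj, "read")
print(sent)  # [{('F1', 'read'): 2, ('F2', 'read'): 1}]
```

The fourth request stays in the collector until two more requests arrive, which is the trade-off the text describes: fewer transmissions in exchange for a bounded reporting delay.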

The collection module counts the access requests within the preset interval and then sends the result to the popularity update module, which reduces the number of transmissions and helps reduce the bandwidth resources occupied by popularity updates. Unlike GPFS, it does not have to perform popularity statistics and an update, and send once, every time an access request is obtained, which would make updates too frequent and occupy too much bandwidth.

However, in an extreme case of the embodiments of the present application, when the preset interval is a preset number of access requests and the preset number is 1, the collection module reports to the popularity update module once for each access request.

Step 303: According to the storage path of each access object and the access frequency of each access object, synchronously update the popularity of the access object and the popularity of each node in the direction of the parent node in the storage path where the access object is located.

The above step can be performed by a popularity update module set in the server. Specifically, the popularity update module updates the popularity it stores for each access object according to the received access frequency of each access object. For example, the number of accesses to an access object can be used as its popularity value. Then, after receiving the number of accesses to each access object, the popularity update module adds the newly received number of accesses to the stored number of accesses for each access object to obtain the updated number of accesses, that is, the updated popularity value.

If the popularity update module receives access frequency information for a new access object, that is, an access object whose access frequency it has not previously stored, the popularity update module can generate popularity information for that access object. Furthermore, if the popularity update module has not stored popularity information for the parent node of the access object, it also needs to generate popularity information for the parent node; if it has not stored popularity information for the parent node's parent node, it also needs to generate that, and so on up to the root directory where the access object is located. As shown in Figure 4, before the update the heat information stored by the heat update module is as shown in (a) of Figure 4: it stores the read-operation heat information of directory 00, of subdirectory 11 and subdirectory 12 under directory 00, of file 21 under subdirectory 11, and of file 22 and file 23 under subdirectory 12. After receiving the heat information of each access object sent by the collection module, the heat update module determines that read-operation heat information for file 24 under subdirectory 21 needs to be added, as shown in (b) of Figure 4.

In the embodiment of this application, not only is the popularity of each access object recorded, but the popularity of directories is recorded at the same time, which facilitates subsequent directory-level storage analysis and optimization. Specifically, when the popularity of an access object is updated, the storage path of the access object also needs to be obtained, and the popularity of each node in the parent-node direction along the storage path is updated, where the popularity of a parent node is the sum of the popularity values of its child nodes. For example, the access object is file F1 shown in Figure 2, and its storage path /D0/D1/F1 is obtained. Folder D1 is the parent node of file F1, and folder D0 is the parent node of folder D1. When the popularity of file F1 is updated, the popularity values of folder D1 and folder D0 also need to be updated. The popularity of folder D1 is equal to the sum of the popularity of all files under folder D1, and the popularity of folder D0 is equal to the sum of the popularity of all files under folder D0. In addition, for a file stored in blocks where the access object is a block, the parent node of the access object is the file, and the popularity value of the file is the sum of the popularity of all blocks under the file.
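A minimal sketch of this synchronous update follows. It assumes popularity is simply an access count, that nodes are keyed by their full storage path, and that popularity changes only through this function; the names `ancestors_and_self` and `apply_update`, and the extra paths beyond /D0/D1/F1, are hypothetical.

```python
from collections import defaultdict


def ancestors_and_self(path):
    """'/D0/D1/F1' -> ['/D0', '/D0/D1', '/D0/D1/F1']"""
    parts = [p for p in path.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def apply_update(heat, path, delta):
    """Add `delta` to the access object and to every node above it in one pass."""
    for node in ancestors_and_self(path):
        heat[node] += delta   # defaultdict creates records for nodes not seen before


heat = defaultdict(int)
apply_update(heat, "/D0/D1/F1", 3)
apply_update(heat, "/D0/D1/F2", 2)
apply_update(heat, "/D0/D3/F4", 5)
print(heat["/D0/D1"], heat["/D0"])  # 5 10
```

Because every node on the path receives the same increment at the same time, the value stored for a directory already equals the sum over its children, so a later query for /D0 or /D0/D1 is a plain lookup with no recursive traversal.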

In the embodiment of this application, not only is the popularity of access objects counted and updated, but the popularity of directories is also counted and updated. Compared with the heat identification method used by GPFS, this helps to achieve directory-level storage analysis and optimization. Although QF2 can also record the popularity of a directory, when QF2 updates popularity it always updates objects with larger rank values first. If users frequently access objects with larger rank values, the popularity updates of upper directories in the directory tree lag, and file and directory popularity are out of sync in time. When directory popularity needs to be output, stale popularity information may be output because the directory update lags, or the directory popularity has to be updated on the spot at output time. For large file systems, neither can meet real-time requirements. In the embodiment of this application, the popularity of the access object and of its parent nodes is updated synchronously, which solves the problem of lagging updates of upper-directory popularity, so that when directory popularity needs to be output, the latest popularity information can be output in a timely manner.

In order to analyze users' access needs in more detail, heat statistics can also be kept separately for different access types of the same access object. Specifically, the collection module not only determines the access object based on the access request, but also determines the access type of the access request, such as a read operation or a write operation, based on the operation word in the access request, and counts the access frequency separately for each access type of each access object. The popularity update module updates the popularity separately for each access type of each access object. Popularity updates that distinguish access types provide a more detailed understanding of users' needs for access objects, allowing for more reasonable storage optimization.
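A minimal sketch of counting per access type follows. It assumes that each access request has already been reduced to a pair of access object and operation word; this request representation and the function name are illustrative assumptions only.

```python
from collections import Counter

def count_by_type(requests):
    """Count accesses separately for each (access object, access type) pair."""
    counts = Counter()
    for access_object, access_type in requests:
        counts[(access_object, access_type)] += 1
    return counts

# Hypothetical request stream: two reads and one write of the same file.
requests = [
    ("/D0/D1/F1", "read"),
    ("/D0/D1/F1", "read"),
    ("/D0/D1/F1", "write"),
]
counts = count_by_type(requests)
```

Keeping the access type in the key means that read popularity and write popularity of the same object never mix, so each can be updated and decayed independently.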

In order to provide reference information and optimization basis for storage, push and other services, the heat identification device can also output heat information. For example, the popularity identification device can periodically send the latest popularity of each access object to the storage server, so that the storage server can perform storage optimization based on the popularity of each access object. In addition, the popularity identification device can also receive a popularity query request for querying the popularity of a target access object, and the popularity identification device can determine the target access object according to the request and output the current popularity information of the target access object.

Furthermore, after the popularity of the access object and its parent nodes is determined, hot and cold data can be further distinguished according to the popularity value, thereby providing clearer reference information and an optimization basis for storage, push and other services, and simplifying the operation of the storage server and the push server. Specifically, the popularity value of each access object can be compared with the first popularity threshold; if it is greater than or equal to the first popularity threshold, the access object is regarded as hot data. Similarly, the popularity value of each access object can also be compared with the second popularity threshold; if it is less than or equal to the second popularity threshold, the access object is regarded as cold data. The first popularity threshold and the second popularity threshold may be equal or unequal; if they are unequal, the first popularity threshold is greater than the second popularity threshold.
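The comparison against the two thresholds can be sketched as follows. The function name and the label "neither" are hypothetical, and resolving a value that meets both conditions (possible only when the two thresholds are equal) in favor of "hot" is an assumption of this sketch.

```python
def classify(popularity, first_threshold, second_threshold):
    """Classify a popularity value as hot, cold, or neither."""
    if first_threshold < second_threshold:
        raise ValueError("the first threshold must not be smaller than the second")
    if popularity >= first_threshold:
        return "hot"   # ties with equal thresholds are treated as hot (assumption)
    if popularity <= second_threshold:
        return "cold"
    return "neither"
```

With unequal thresholds, values strictly between the second and the first threshold fall into neither category.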

Optionally, the first heat threshold and the second heat threshold may be preset, may be obtained by machine learning by the heat identification device, or may be obtained according to a preset strategy.

In a possible implementation, the first popularity threshold can be determined according to the following method: first, based on the total number of access objects and a preset proportion value, determine the number of access objects that makes up the preset proportion of the total number; this number is denoted by N. Then, the popularity values of the access objects are sorted from large to small, and the Nth popularity value is used as the first popularity threshold.
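A minimal sketch of this count-based method is given below. Rounding N up to a whole number and forcing it to be at least 1 are assumptions of the sketch, since the text does not state how a fractional N is handled; the function name is hypothetical.

```python
import math

def threshold_by_count(values, proportion):
    """Return the Nth largest popularity value, where N is the preset proportion of the total count."""
    if not values:
        raise ValueError("no popularity values")
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    n = max(1, math.ceil(len(values) * proportion))  # rounding up is an assumption
    ranked = sorted(values, reverse=True)
    return ranked[n - 1]

first_threshold = threshold_by_count([5, 40, 20, 30], 0.5)
```

Here four objects and a proportion of 0.5 give N = 2, so the second largest value, 30, becomes the first popularity threshold.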

In another possible implementation, the first popularity threshold can also be determined in the following manner: first, the popularity values of the access objects are sorted from large to small. Then, the sum of all popularity values is calculated and recorded as K. After that, the sorted popularity values are accumulated from front to back: the first popularity value alone gives L_1, the sum of the first and second popularity values gives L_2, and the sum of the first i popularity values gives L_i. When L_(N-1) does not satisfy the preset ratio condition but L_N does, the Nth popularity value is used as the first popularity threshold. Here, satisfying the preset ratio condition may mean that the ratio of L_i to K is greater than or equal to the preset ratio.
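The cumulative method can be sketched as follows, taking the preset ratio condition to be L_i / K >= preset ratio. The function name is hypothetical, and the sketch assumes that K is positive.

```python
def threshold_by_cumulative_share(values, ratio):
    """Return the Nth largest value, where N is the first index with L_N / K >= ratio."""
    ranked = sorted(values, reverse=True)
    total = sum(ranked)  # K: sum of all popularity values
    if total <= 0:
        raise ValueError("total popularity must be positive")
    running = 0          # L_i: sum of the first i sorted values
    for value in ranked:
        running += value
        if running / total >= ratio:
            return value
    return ranked[-1]    # reached only if ratio is greater than 1

first_threshold = threshold_by_cumulative_share([15, 50, 5, 30], 0.75)
```

In this example K = 100, L_1 = 50 and L_2 = 80, so the condition is first met at N = 2 and the threshold is the second largest value, 30.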

The first heat threshold determined according to either of the above two methods can also be used as the second heat threshold at the same time. Alternatively, when the first heat threshold and the second heat threshold are not equal, the two thresholds can each be determined by either of the above two methods, with a different preset proportion value or preset ratio set for each threshold. According to the first heat threshold and/or the second heat threshold, the access objects can be classified as hot or cold.

Since the heat of a parent node is the sum of the heat of all its child nodes, the heat value of the parent node is greater than or equal to the heat value of each of its child nodes. Therefore, if a leaf node is popular data, then the parent node of the leaf node, the parent node of that parent node, and so on up to the root directory are all popular data. Optionally, in order to simplify the process of determining popular data, only the popularity values of leaf nodes are sorted, and it is determined whether each leaf node is popular data. For each non-leaf node, it is determined whether it contains a child node that is popular data; if it does, the non-leaf node is determined to be popular data.

For example, in the embodiment shown in Figure 5, directory 00 contains subdirectory 10, subdirectory 11, and subdirectory 12; subdirectory 10 contains subdirectory 20, and subdirectory 20 contains file 30, file 31 and file 32; subdirectory 11 contains file 21 and subdirectory 22, and subdirectory 22 contains file 33 and file 34. Among them, file 33 can be regarded as a data block 40; the data block 40 is split into sub-block 50 and sub-block 51, sub-block 50 is divided into sub-block 60 and sub-block 61, and sub-block 51 is divided into sub-block 62 and sub-block 63. Subdirectory 12 contains subdirectory 23 and subdirectory 24, and subdirectory 24 contains file 35 and file 36. In the directory tree shown in Figure 5, file 30, file 31, file 32, sub-block 60, sub-block 61, sub-block 62, sub-block 63, file 34, file 35 and file 36 are leaf nodes. For these 10 leaf nodes, the first popularity threshold can first be determined according to the aforementioned method and used to determine whether each leaf node is popular data; then it is determined whether the non-leaf nodes are popular data. Specifically, if among these 10 leaf nodes sub-block 60 and file 35 are popular data, then the nodes on the storage paths of sub-block 60 and file 35 are all popular data. The storage path of sub-block 60 is: directory 00 - subdirectory 11 - subdirectory 22 - file 33 - block 40 - sub-block 50 - sub-block 60, so directory 00, subdirectory 11, subdirectory 22, file 33, block 40 and sub-block 50 in the storage path are also popular data. The storage path of file 35 is: directory 00 - subdirectory 12 - subdirectory 24 - file 35, so directory 00, subdirectory 12 and subdirectory 24 in the storage path are also popular data.
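The propagation from popular leaf nodes to the nodes on their storage paths can be sketched with the two paths of this example. Representing a storage path as a tuple of node names from the root to the leaf, and the function name, are assumptions made for illustration.

```python
def hot_nodes(hot_leaf_paths):
    """Mark every node on the storage path of a popular leaf node as popular data."""
    hot = set()
    for path in hot_leaf_paths:
        hot.update(path)  # the leaf itself and all of its ancestors
    return hot

# Storage paths of the two popular leaf nodes in the Figure 5 example.
path_sub_block_60 = ("directory 00", "subdirectory 11", "subdirectory 22",
                     "file 33", "block 40", "sub-block 50", "sub-block 60")
path_file_35 = ("directory 00", "subdirectory 12", "subdirectory 24", "file 35")

hot = hot_nodes([path_sub_block_60, path_file_35])
```

Only the leaf nodes need to be ranked; non-leaf nodes such as subdirectory 10, which lies on neither path, are left out of the popular set without any comparison.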

If the popularity value of an access object keeps accumulating, then even for data that is not frequently accessed, the number of accesses gradually increases over time; that is, the popularity value keeps increasing, which is not conducive to distinguishing cold data from hot data. Therefore, the heat value of each access object can be periodically attenuated to prevent the heat value of cold data from increasing indefinitely. For example, the popularity value of each access type of each access object can be periodically multiplied by an attenuation coefficient α, where 0<α<1, to reduce its popularity value. For example, assume that the attenuation coefficient α is 0.5 and that the heat value of each access type of each access object is multiplied by the attenuation coefficient every 30 minutes to complete the heat attenuation. Suppose the heat value of the read operation of folder 1 is 30, and folder 1 contains file A and file B. The read operation heat value of file A is 20, of which the read operation heat value of block 1 of file A is 15 and that of block 2 of file A is 5, and the read operation heat value of file B is 10. When the decay time is reached, the heat value of the read operation of folder 1 becomes 30*0.5=15, the heat value of the read operation of file A becomes 20*0.5=10, the heat value of the read operation of block 1 of file A becomes 15*0.5=7.5, the heat value of the read operation of block 2 of file A becomes 5*0.5=2.5, and the heat value of the read operation of file B becomes 10*0.5=5.
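The arithmetic of this example can be reproduced with a short sketch. The dictionary layout and the function name are hypothetical; the periodic trigger (every 30 minutes in the example) is left out, and only a single attenuation step is shown.

```python
def decay(popularity, alpha):
    """Multiply every heat value by the attenuation coefficient alpha (0 < alpha < 1)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    return {node: value * alpha for node, value in popularity.items()}

# Read-operation heat values before attenuation, as in the example.
read_heat = {
    "folder 1": 30,
    "file A": 20,
    "file A/block 1": 15,
    "file A/block 2": 5,
    "file B": 10,
}
decayed = decay(read_heat, 0.5)
```

Because every node is scaled by the same coefficient, the heat value of a parent node remains equal to the sum of its child nodes after attenuation, as long as no rounding is applied.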

After multiplying by the attenuation coefficient, the heat value of block 1 of file A becomes 7.5, and the heat value of block 2 of file A becomes 2.5. In order to facilitate calculation, in a possible implementation, these values can be rounded. However, if all values are uniformly rounded up or uniformly rounded down, the sum of the heat value of block 1 of file A and the heat value of block 2 of file A may not be equal to the heat value of file A. In order to make the heat value of a parent node equal to or approximately equal to the sum of the heat values of all child nodes contained in the parent node, a non-integer value can be rounded up with probability α and rounded down with probability 1-α.
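A minimal sketch of the probabilistic rounding is shown below. The injectable random source is an assumption added so that the behavior can be checked deterministically; the function name is hypothetical.

```python
import math
import random

def round_after_decay(value, alpha, rand=random.random):
    """Round a decayed heat value up with probability alpha, down with probability 1 - alpha."""
    if value == math.floor(value):
        return int(value)          # already an integer, nothing to round
    if rand() < alpha:             # rand() is uniform in [0, 1)
        return math.ceil(value)
    return math.floor(value)

rounded_up = round_after_decay(7.5, 0.5, rand=lambda: 0.0)    # forced "round up" branch
rounded_down = round_after_decay(7.5, 0.5, rand=lambda: 0.9)  # forced "round down" branch
```

Since each child node is rounded independently, the sum over the child nodes matches the parent node only approximately, which is what the text above allows for.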

In QF2's heat identification method, data heat information is recorded in the module that stores metadata, and the metadata has independent storage space and sufficient storage resources, so there is no need to consider the size of the storage resources occupied by the heat information. In the embodiment of the present application, since there is no need to set separate metadata and there is no independent storage space for metadata, the size of the popularity information is an issue that needs to be considered in the embodiment of the present application. As time accumulates, the number of access objects gradually increases, and so does the popularity information, which may lead to insufficient storage space for the popularity information.

In order to solve the problem of limited storage resources for heat information, in a possible implementation, when the heat value of an access object decays to or below the preset threshold, the heat value of the access object is deleted. That is, the embodiment of the present application provides a pruning solution to control the storage space occupied by the heat information and to avoid the problem of insufficient storage space caused by heat information that only grows and never shrinks.

Figure 6 exemplarily provides a schematic diagram of pruning. In the embodiment shown in Figure 6, the preset threshold is set to 0; that is, when a heat value decays to 0, the corresponding heat information is deleted. Before attenuation, the heat information stored by the heat update module is as shown in (a) in Figure 6: the heat value of the read operation of directory 00 is 6, the heat values of the read operations of subdirectory 11 and subdirectory 12 under directory 00 are 2 and 4 respectively, the heat value of the read operation of file 21 under subdirectory 11 is 2, and the heat values of the read operations of file 22, file 23, file 24, and file 25 under subdirectory 12 are all 1. Assume the attenuation coefficient α is 0.5. After attenuation, the heat value of the read operation of directory 00 is 3, the heat values of the read operations of subdirectory 11 and subdirectory 12 under directory 00 are 1 and 2 respectively, and the heat value of the read operation of file 21 under subdirectory 11 is 1. For each of file 22, file 23, file 24, and file 25 under subdirectory 12, the heat value of the read operation is 1 with probability α (i.e. 50%) and 0 with probability 1-α; in this example, the result is that the heat value of the read operation of file 22 is 1, that of file 23 is 0, that of file 24 is 1, and that of file 25 is 0. Since the heat values of the read operations of file 23 and file 25 have decayed to 0, their heat information needs to be deleted; that is, the heat information of the read operations of file 23 and file 25 is pruned, as shown in (b) in Figure 6.
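The pruning step of the Figure 6 example can be sketched as follows, starting from the heat values after attenuation and rounding. The dictionary layout and the function name are hypothetical.

```python
def prune(heat, threshold=0):
    """Delete heat information whose value is less than or equal to the preset threshold."""
    return {node: value for node, value in heat.items() if value > threshold}

# Read-operation heat values after attenuation and rounding, as in the example.
after_decay = {
    "directory 00": 3,
    "subdirectory 11": 1,
    "subdirectory 12": 2,
    "file 21": 1,
    "file 22": 1,
    "file 23": 0,
    "file 24": 1,
    "file 25": 0,
}
pruned = prune(after_decay)
```

The entries for file 23 and file 25 disappear, and the heat value of subdirectory 12 still equals the sum of its remaining child nodes, file 22 and file 24.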

Through the above pruning process, the growth of heat information can be suppressed, which helps to avoid the problem of insufficient storage space caused by excessive heat information. However, if the storage space occupied by the heat information has reached the maximum allowed storage space, heat information whose heat value is greater than or equal to the preset threshold can also be deleted. In one possible design, when the storage space occupied by the heat information reaches the maximum allowed storage space, pruning is performed immediately to delete the one or more pieces of heat information with the lowest heat values; alternatively, all heat information whose heat value differs from the preset threshold by an amount within a preset range can be deleted. For example, when the storage space occupied by the heat information has not reached the maximum allowed storage space, heat information with a heat value of 0 is deleted; when the storage space occupied by the heat information has reached the maximum allowed storage space, all heat information with a heat value less than or equal to 1 is deleted. In another possible design, when the storage space occupied by the heat information has reached the maximum allowed storage space and new heat information needs to be added, pruning is performed to delete the heat information with the lowest heat value, or to delete all heat information whose heat value differs from the preset threshold by an amount within the preset range.
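The example in which the deletion bound is raised once the space limit is reached can be sketched as follows. Using the number of entries as a stand-in for occupied storage space is an assumption of this sketch, as are the function and parameter names.

```python
def prune_when_full(heat, max_entries, raised_threshold):
    """Delete entries at or below raised_threshold, but only once the table is full.

    The entry count stands in for the occupied storage space (an assumption).
    """
    if len(heat) < max_entries:
        return dict(heat)  # limit not reached: keep everything
    return {node: value for node, value in heat.items() if value > raised_threshold}

table = {
    "directory 00": 3,
    "subdirectory 11": 1,
    "subdirectory 12": 2,
    "file 21": 1,
    "file 22": 1,
    "file 24": 1,
}
not_full = prune_when_full(table, max_entries=10, raised_threshold=1)
full = prune_when_full(table, max_entries=6, raised_threshold=1)
```

With a limit of 6 entries the table is considered full, so every entry with a heat value less than or equal to 1 is deleted and only directory 00 and subdirectory 12 remain.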

Based on the same technical concept, embodiments of the present application also provide a heat identification device for implementing the above method embodiments. The device may include modules/units that execute any of the possible implementation methods in the above method embodiments; these modules/units may be implemented by hardware, or may be implemented by hardware executing corresponding software.

For example, as shown in Figure 7, the device may include: a collection module 701 and a popularity update module 702.

The collection module 701 is used to obtain access requests from applications, determine the access objects of the access requests, and count the access frequency of each access object.

The popularity update module 702 is used to synchronously update the popularity of the access object and each node in the direction of the parent node of the access object in the storage path according to the storage path of each access object and the access frequency of each access object. The popularity of the parent node is the sum of the popularity of each child node under the parent node.

In a possible implementation, the collection module 701 is also configured to: determine the access type for the access object according to the access request. When counting the access frequency of each access object, the collection module 701 is specifically configured to: count the access frequency of the same access type for each access object.

In a possible implementation, when determining the access object of the access request, the collection module 701 is specifically configured to: determine the file requested to be accessed according to the object identifier in the access request, and determine, according to the offset and length in the access request, the one or more blocks in the file where the access object is located. The popularity update module 702 is specifically configured to: synchronously update the popularity of the one or more blocks, the popularity of the file, and the popularity of each node in the direction of the parent node in the storage path of the file.
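Mapping an offset and length to blocks can be sketched as below. The sketch assumes that a file is divided into fixed-size blocks numbered from 0; the block size and the function name are assumptions, since this passage does not specify the block layout.

```python
def blocks_for_range(offset, length, block_size):
    """Return the indices of the fixed-size blocks touched by [offset, offset + length)."""
    if offset < 0 or length <= 0 or block_size <= 0:
        raise ValueError("offset must be >= 0; length and block_size must be > 0")
    first = offset // block_size
    last = (offset + length - 1) // block_size  # block holding the last accessed byte
    return list(range(first, last + 1))

touched = blocks_for_range(offset=4000, length=200, block_size=4096)
```

A request that starts near the end of block 0 and extends past byte 4095 touches blocks 0 and 1, so the popularity of both blocks, of the file, and of the parent nodes of the file would be updated.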

In a possible implementation, the device may also include: a heat attenuation module 703, configured to periodically attenuate the heat of each access object and, if the heat of a first access object attenuates to less than or equal to the preset threshold, delete the heat of the first access object.

In a possible implementation, when attenuating the popularity of each access object, the popularity attenuation module 703 is specifically configured to: multiply the popularity of each access object by an attenuation coefficient; and, if the value after multiplying by the attenuation coefficient is a non-integer, round the value down with a probability of 1 minus the attenuation coefficient and round it up with a probability equal to the attenuation coefficient.

In a possible implementation, when counting the access frequency of each access object, the collection module 701 is specifically configured to: count the access frequency of each access object within a preset interval, where the preset interval includes any one of the following: a preset time interval, a preset traffic volume, or a preset number of access requests.

In a possible implementation, the device may further include: a classification module 704, configured to determine whether the accessed object is hot data based on the popularity of the accessed object and the first popularity threshold; and/or, based on the popularity of the accessed object and The second hotness threshold determines whether the access object is cold data.

In a possible implementation, the classification module 704 is also configured to: sort the popularity of all stored access objects from large to small; and use the popularity of the Nth access object as the first popularity threshold, where N satisfies one of the following conditions: N divided by the number of all access objects satisfies the preset proportion condition; or the sum of the popularity of the first N access objects divided by the sum of the popularity of all access objects satisfies the preset proportion condition.

In a possible implementation, the device may also include a transceiver module (not shown in the figure), configured to receive a popularity query request, where the request is used to query the popularity of a target access object, and to output the popularity of the target access object.

Based on the same technical concept, embodiments of the present application also provide a computer device. The computer device includes a processor 801 as shown in Figure 8, and a communication interface 802 connected to the processor 801.

The processor 801 may be a general processor, a microprocessor, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or one or more integrated circuits used to control the execution of the program of this application, etc. A general-purpose processor may be a microprocessor or any conventional processor, etc. The steps of the methods disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware processor for execution, or can be executed by a combination of hardware and software modules in the processor.

Communication interface 802 is used to communicate with other devices, for example through a PCI bus interface, Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).

In this embodiment of the present application, the processor 801 is configured to call the communication interface 802 to perform receiving and/or sending functions, and to perform the method as described in any of the previous possible implementations.

Further, the computer device may also include a memory 803 and a communication bus 804.

The memory 803 is used to store program instructions and/or data, so that the processor 801 calls the instructions and/or data stored in the memory 803 to implement the above functions of the processor 801. Memory 803 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions. It may also be an electrically erasable programmable read-only memory (EEPROM) or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation. The memory 803 may exist independently, such as an off-chip memory, and be connected to the processor 801 through the communication bus 804. Memory 803 may also be integrated with processor 801.

Communication bus 804 may include a path for communicating information between the above-described components.

The computer device may communicate with the storage structure through a network, or the computer device may also include a storage structure (not shown in the figure). The storage structure includes one or more memories, and a memory in the storage structure can be a disk, a solid state disk or solid state drive (SSD), a storage-class memory (SCM), etc., and is used to store the object accessed by the access request.

Exemplarily, the processor 801 can perform the following steps through the communication interface 802: obtain an access request from the application program and determine the access object of the access request; count the access frequency of each access object; and, according to the storage path of each access object and the access frequency of each access object, synchronously update the popularity of the access object and the popularity of each node in the direction of the parent node of the access object in the storage path, where the popularity of a parent node is the sum of the popularity of the child nodes under the parent node.

In a possible implementation, the processor 801 is further configured to: determine the access type for the access object according to the access request. When counting the access frequency of each access object, the processor 801 is specifically configured to: count the access frequency of the same access type for each access object.

In a possible implementation, when determining the access object of the access request, the processor 801 is specifically configured to: determine the file requested to be accessed according to the object identifier in the access request, and determine, according to the offset and length in the access request, the one or more blocks in the file where the access object is located. When synchronously updating the popularity of the access object and the popularity of each node in the direction of the parent node of the access object in the storage path according to the storage path of each access object and the access frequency of each access object, the processor 801 is specifically configured to: according to the storage path of the one or more blocks and the access frequency of the one or more blocks, synchronously update the popularity of the one or more blocks, the popularity of the file, and the popularity of each node in the direction of the parent node in the storage path of the file.

In a possible implementation, the processor 801 may also be configured to: periodically attenuate the popularity of each access object; and, if the popularity of a first access object decays to less than or equal to a preset threshold, delete the popularity of the first access object.

In a possible implementation, when attenuating the popularity of each access object, the processor 801 is specifically configured to: multiply the popularity of each access object by the attenuation coefficient; and, if the value after multiplying by the attenuation coefficient is a non-integer, round the value down with a probability of 1 minus the attenuation coefficient and round it up with a probability equal to the attenuation coefficient.

In one possible implementation, when counting the access frequency of each access object, the processor 801 is specifically configured to: count the access frequency of each access object within a preset interval, where the preset interval includes any one of the following: a preset time interval, a preset traffic volume, or a preset number of access requests.

In a possible implementation, the processor 801 may also be configured to: determine whether the access object is hot data based on the popularity of the access object and the first popularity threshold; and/or determine whether the access object is cold data based on the popularity of the access object and the second popularity threshold.

In a possible implementation, the processor 801 may also be configured to: sort the popularity of all stored access objects from large to small; and use the popularity of the Nth access object as the first popularity threshold, where N satisfies one of the following conditions: N divided by the number of all access objects satisfies the preset proportion condition; or the sum of the popularity of the first N access objects divided by the sum of the popularity of all access objects satisfies the preset proportion condition.

In a possible implementation, the processor 801 can also execute through the communication interface 802: receive a popularity query request, where the request is used to request to query the popularity of the target access object; and output the popularity of the target access object.

Based on the same technical concept, embodiments of the present application also provide a computer-readable storage medium. Computer-readable instructions are stored in the computer-readable storage medium. When the computer-readable instructions are run on a computer, the steps in the above method are executed.

Based on the same technical concept, embodiments of the present application provide a computer program product containing instructions, which when run on a computer causes the steps in the above method to be executed.

It should be understood that in the description of this application, words such as "first" and "second" are only used for the purpose of distinguishing the description, and cannot be understood as indicating or implying relative importance, nor can they be understood as indicating or implying order. Reference in this specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Therefore, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", etc. appearing in different places in this specification do not necessarily all refer to the same embodiment, but rather mean "one or more but not all embodiments" unless specifically stated otherwise. The terms "including," "includes," "having," and variations thereof all mean "including but not limited to," unless otherwise specifically emphasized.

Those skilled in the art will understand that embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each process and/or block in the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory that causes a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Although the embodiments of the present application have been described, additional changes and modifications may be made to these embodiments. Therefore, it is intended that the appended claims be construed to include the above-described embodiments and all changes and modifications that fall within the scope of this application.

Obviously, those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. In this way, if these modifications and variations of the embodiments of the present application fall within the scope of the claims of this application and equivalent technologies, then this application is also intended to include these modifications and variations.

Claims

A heat identification method in a file system, which is characterized by including:

Obtain the access request from the application and determine the access object of the access request;

Count the access frequency of each access object;

According to the storage path of each access object and the access frequency of each access object, synchronously update the popularity of the access object and the popularity of each node in the direction of the parent node of the access object in the storage path, where the popularity of a parent node is the sum of the popularity of the child nodes under the parent node.
The method according to claim 1, characterized in that the method further includes: determining the access type for the access object according to the access request;

Counting the access frequency of each access object includes:

Count the access frequency of the same access type for each access object.
The method according to claim 1 or 2, characterized in that the access object includes a file directory, a file, or a data block in a file.
The method according to any one of claims 1-3, characterized in that the method further includes:

Periodically attenuate the popularity of each visited object;

If the popularity of the first access object decreases to less than or equal to the preset threshold, the popularity of the first access object is deleted.
The method according to any one of claims 1-4, characterized in that counting the access frequency of each access object comprises:

counting the access frequency of each access object within a preset interval, wherein the preset interval comprises any one of the following: a preset time interval, a preset traffic volume, or a preset number of access requests.
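
The interval-based counting above can be sketched, under stated assumptions, for the case where the preset interval is a preset number of access requests. The class `IntervalCounter` and its `add` method are hypothetical names; a time-based or traffic-based interval would close the window on a different condition.

```python
from collections import Counter


class IntervalCounter:
    """Counts accesses per object and closes the window after N requests."""

    def __init__(self, max_requests: int):
        self.max_requests = max_requests  # assumed preset number of requests
        self.counts = Counter()
        self.seen = 0

    def add(self, path: str):
        """Count one request.

        Returns the per-object frequencies of the finished interval when
        the window closes, otherwise None.
        """
        self.counts[path] += 1
        self.seen += 1
        if self.seen >= self.max_requests:
            finished = dict(self.counts)
            self.counts = Counter()
            self.seen = 0
            return finished
        return None


counter = IntervalCounter(max_requests=3)
results = [counter.add(p) for p in ["/a", "/b", "/a", "/a"]]
```

The third request closes the first interval and yields its frequencies; the fourth request starts a new interval.
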
The method according to any one of claims 1-5, characterized in that the method further comprises:

determining whether the access object is hot data based on the popularity of the access object and a first popularity threshold; and/or

determining whether the access object is cold data based on the popularity of the access object and a second popularity threshold.
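
A minimal sketch of the hot/cold determination follows. The comparison directions (at or above the first threshold is hot, at or below the second threshold is cold), the intermediate label `"warm"`, and the function name `classify` are assumptions; the claim only states that each determination is based on the popularity and the corresponding threshold.

```python
def classify(popularity: float, hot_threshold: float, cold_threshold: float) -> str:
    """Label an access object by comparing its popularity with two thresholds."""
    if popularity >= hot_threshold:
        return "hot"   # assumed: at or above the first popularity threshold
    if popularity <= cold_threshold:
        return "cold"  # assumed: at or below the second popularity threshold
    return "warm"      # hypothetical label for everything in between


labels = [classify(p, hot_threshold=100, cold_threshold=5) for p in (250, 40, 2)]
```
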
The method according to any one of claims 1-6, characterized in that the method further comprises:

receiving a popularity query request, wherein the request is used to query the popularity of a target access object; and

outputting the popularity of the target access object.
A heat identification device in a file system, characterized by comprising:

a collection module, configured to obtain an access request from an application, determine an access object of the access request, and count an access frequency of each access object; and

a popularity update module, configured to synchronously update, according to a storage path of each access object and the access frequency of each access object, the popularity of the access object and the popularity of each node in the parent-node direction along the storage path in which the access object is located, wherein the popularity of a parent node is the sum of the popularity of each child node under the parent node.
The device according to claim 8, wherein the collection module is further configured to: determine an access type for the access object according to the access request;

when counting the access frequency of each access object, the collection module is specifically configured to: count, for each access object, the access frequency of the same access type.
The device according to claim 8 or 9, characterized in that the access object includes a file directory, a file, or a data block in a file.
The device according to any one of claims 8-10, further comprising:

a popularity attenuation module, configured to periodically attenuate the popularity of each access object, and, if the popularity of a first access object decays to less than or equal to a preset threshold, delete the popularity of the first access object.
The device according to any one of claims 8-11, characterized in that, when counting the access frequency of each access object, the collection module is specifically configured to:

count the access frequency of each access object within a preset interval, wherein the preset interval comprises any one of the following: a preset time interval, a preset traffic volume, or a preset number of access requests.
The device according to any one of claims 8-12, further comprising:

a classification module, configured to determine whether the access object is hot data based on the popularity of the access object and a first popularity threshold; and/or determine whether the access object is cold data based on the popularity of the access object and a second popularity threshold.
The device according to any one of claims 8-13, further comprising:

a transceiver module, configured to receive a popularity query request, wherein the request is used to query the popularity of a target access object, and to output the popularity of the target access object.
A computer device, characterized in that the computer device includes a memory and a processor;

The memory stores a computer program;

The processor is configured to call a computer program stored in the memory to execute the method described in any one of claims 1-7.
A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, they cause the computer to execute the method according to any one of claims 1-7.