CN118051478A

CN118051478A - Distributed block storage small file aggregation index management method

Info

Publication number: CN118051478A
Application number: CN202311667919.1A
Authority: CN
Inventors: 李佳徐; 刘伟锋; 李博奇; 牛鹏举; 吴伟
Original assignee: China Telecom Digital Intelligence Technology Co Ltd
Current assignee: China Telecom Digital Intelligence Technology Co Ltd
Priority date: 2023-12-07
Filing date: 2023-12-07
Publication date: 2024-05-17

Abstract

The invention discloses a distributed block storage small file aggregation index management method, which comprises: dividing the original data of the index information produced by small file aggregation into virtual groups, and calculating a keyword for each virtual group; assigning the index information of each small file to its corresponding virtual group as sub-index information; calculating the relative address space offset of each sub-index, sorting the sub-indexes within each virtual group, and storing the groups in a database; calculating the keyword of a target small file and retrieving the virtual group that matches it; comparing and screening the sub-indexes in the retrieved virtual group; splitting the target small file into sub-files; and, for each sub-file, reading the aggregated space address from the corresponding sub-index and combining the index retrieval results so that the file can be read from the bottom storage layer. By dividing the original structured data of the index information produced by small file aggregation into a plurality of virtual groups, the invention can reduce the index information in the database by one or even several orders of magnitude; by optimizing index information retrieval, fast querying of unstructured indexes can be achieved.

Description

Distributed block storage small file aggregation index management method

Technical Field

The invention relates to the technical field of distributed storage, in particular to a distributed block storage small file aggregation index management method.

Background

Distributed storage can provide three storage interfaces: distributed block storage, distributed object storage, and distributed file storage. Object storage and file storage are used to store unstructured data, while distributed block storage is used to store structured data. Unstructured data in an object/file store can be identified at the storage layer by file name, whereas structured data in a block store no longer contains the original file name and instead carries information such as file space and file size. See Fig. 3.

The persistent storage media at the bottom layer of distributed storage are mostly mechanical disks. Limited by the physical structure of mechanical disks, random read/write performance for small files is often poor, whereas read/write performance for large files can meet requirements. To improve the read/write performance of small files in a storage system, small files are generally aggregated into large files and then flushed together to the bottom layer for persistent storage.

When object storage and file storage aggregate small files, the small files are usually appended to a large file, which is then flushed to the bottom-layer persistent storage. An index relation between each small file and the large file can be established through file names and used for file reading. The principle of object/file storage small file aggregation is shown in Fig. 4.
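The name-based aggregation described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names aggregate and read_small_file, and the use of an in-memory byte buffer in place of a real large file, are hypothetical and not part of the original text.

```python
from typing import Dict, Tuple


def aggregate(small_files: Dict[str, bytes]) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Append each small file to one large buffer and index it by file name."""
    large = bytearray()
    index: Dict[str, Tuple[int, int]] = {}
    for name, data in small_files.items():
        # Record where this small file starts inside the large file, and its size.
        index[name] = (len(large), len(data))
        large += data
    return bytes(large), index


def read_small_file(large: bytes, index: Dict[str, Tuple[int, int]], name: str) -> bytes:
    """Look up the small file by name and slice it out of the large file."""
    offset, size = index[name]
    return large[offset:offset + size]


large, index = aggregate({"a.txt": b"hello", "b.txt": b"world!"})
print(read_small_file(large, index, "b.txt"))  # b'world!'
```

Because the file name serves as the lookup key, one index entry per small file is sufficient in this case. Block storage has no such name to use, which is the problem the following paragraphs describe.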

Because of the unstructured data characteristics, block storage cannot directly establish a simple index between an original file and an aggregated file during small file aggregation; instead, the space address of the original data block and the space address of the aggregated data block must be associated and combined into index information. Because the address space is continuous, the number of index information entries based on space addresses is in theory virtually unlimited, which makes the index information difficult to store. The massive volume of index information also makes index retrieval inefficient, so that reading the original small files from the aggregated file becomes extremely difficult. In addition, because space addresses are continuous, an index cannot be located quickly through a keyword during index retrieval, and space address segments often have to be compared, which further complicates index information retrieval.

Therefore, a distributed block storage small file aggregation index management method is needed to solve the problems that the aggregation index information for distributed block storage small files has a large number of entries, is inconvenient to store, and is inefficient to retrieve.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a distributed block storage small file aggregation index management method to solve the problems that the aggregation index information for distributed block storage small files has a large number of entries, is inconvenient to store, and is inefficient to retrieve.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

The distributed block storage small file aggregation index management method is characterized by comprising the following steps of:

Dividing the original data of the index information after the small file aggregation into virtual groups according to a preset fixed space size based on an original space address, and calculating keywords of each virtual group according to the original space address and the fixed space size; taking index information of each small file as sub-index information, and dividing each sub-index into a corresponding virtual group according to the keywords; calculating the relative address space offset of each sub-index, sorting in each virtual group according to the relative address space offset of each sub-index, and storing the sorted virtual groups in a database;

Calculating the keywords of the target small file, traversing the database, and searching the virtual group meeting the keywords; in the retrieved virtual group, screening all sub-indexes meeting the conditions according to the comparison of the original space address and the original file size, and recording; splitting the target small file according to the recorded sub-index to obtain a split sub-file; and correspondingly reading the aggregated space address in the sub-index meeting the condition according to the original space address of each split sub-file, and combining the index retrieval result into the aggregated space address/file size of the target small file so as to read the file from the storage bottom layer.

In order to optimize the technical scheme, the specific measures adopted further comprise:

Further, dividing the original data of the index information after small file aggregation into virtual groups according to a preset fixed space size based on the original space address includes: presetting a fixed space size as the segmentation unit SEG_SIZE, aggregating the small files to obtain the original data of the index information, and segmenting the original data into a plurality of virtual groups according to the original space addresses in the original data and the preset segmentation unit.

Further, calculating the keyword of each virtual group according to the original space address and the fixed space size includes: dividing the original space address OFFSET by the fixed space size SEG_SIZE to obtain the keyword KEY of each virtual group.

Further, calculating the relative address space offset of each sub-index includes: calculating the remainder of the original space address OFFSET of each sub-index divided by the fixed space size SEG_SIZE; the remainder is the relative address space offset of that sub-index.

Further, storing in the database includes: storing the virtual groups in the database in KEY/VALUE form, where KEY is the keyword and VALUE is the sub-index.

Further, screening and recording, in the retrieved virtual group, all sub-indexes meeting the conditions according to the comparison of the original space address and the original file size includes: in the retrieved virtual group, using binary search to find the first sub-index whose original space address is greater than the original space address of the target small file, and recording it; and comparing the original file size of the target small file with the original file size of that sub-index. If the original file size of the target small file is smaller than the original file size of the sub-index, the sub-index is retained and recorded. If it is larger, the original file size of the sub-index is accumulated with the original file sizes of the sub-indexes before and after it, and the accumulated result is compared with the original file size of the target small file; if the target small file is still larger, the original file sizes of further following sub-indexes continue to be accumulated until the target small file is smaller than the accumulated size, at which point all sub-indexes meeting the conditions are retained and recorded.

Further, splitting the target small file according to the recorded sub-indexes to obtain split sub-files includes: splitting the target small file according to the number of recorded sub-indexes and the original space address and original file size of each sub-index, to obtain the split sub-files.

Further, when the target small file is split according to the recorded sub-indexes to obtain split sub-files, the number of sub-files is equal to the number of recorded sub-indexes.

Further, a computer-readable storage medium storing a computer program is provided, characterized in that the computer program causes a computer to perform the distributed block storage small file aggregation index management method described above.

Further, an electronic device is provided, characterized by comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the distributed block storage small file aggregation index management method described above when executing the computer program.

The beneficial effects of the invention are as follows:

By dividing the original structured data of the index information produced by small file aggregation into a plurality of virtual groups, the invention can reduce the index information in the database by one or even several orders of magnitude, which greatly improves index information storage and retrieval efficiency. By optimizing index information retrieval, fast querying of unstructured indexes can be achieved: index query efficiency is improved from the original O(n) to O(log n), and the read latency of aggregated files is greatly reduced.

Drawings

FIG. 1 is a flow chart of the retrieval process of a distributed block storage small file aggregation index management method according to the present invention;

FIG. 2 is a diagram illustrating a virtual group of a distributed block storage small file aggregate index management method according to the present invention;

FIG. 3 is a schematic diagram of distributed storage;

FIG. 4 is a schematic diagram of the object/file storage small file aggregation principle.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Index information simplification scheme: dividing the original structured data of the index information after the small file aggregation into virtual groups according to a preset fixed space size based on an original space address, and calculating keywords of each virtual group according to the original space address and the fixed space size; taking index information of each small file as sub-index information, and dividing each sub-index into a corresponding virtual group according to the keywords; calculating the relative address space offset of each sub-index, sorting in each virtual group according to the relative address space offset of each sub-index, and storing the sorted virtual groups in a database;

Index retrieval optimization algorithm: according to the original space address OFFSET_O of the target small file and the fixed space SIZE SEG_SIZE, calculating a keyword KEY_O of the target small file, traversing the database, and searching a virtual group meeting the keyword; in the retrieved virtual group, screening all sub-indexes meeting the conditions according to the comparison of the original space address and the original file size, and recording; splitting the target small file according to the recorded sub-index to obtain a split sub-file; and correspondingly reading the aggregated space address in the sub-index meeting the condition according to the original space address of each split sub-file, and combining the index retrieval result into the form of the aggregated space address/file size of the target small file so as to read the file from the storage bottom layer.

In the index information simplification scheme, dividing the original structured data of the index information after small file aggregation into virtual groups according to a preset fixed space size based on the original space address specifically includes: presetting a fixed space size as the segmentation unit SEG_SIZE, aggregating the small files to obtain the original structured data of the index information, and segmenting the original structured data into a plurality of virtual groups according to the original space addresses in the original structured data and the preset segmentation unit.

In this scheme, calculating the keyword of each virtual group according to the original space address and the fixed space size includes: dividing the original space address OFFSET by the fixed space size SEG_SIZE to obtain the keyword KEY of each virtual group. For example, when OFFSET = 1001 and SEG_SIZE = 8, OFFSET / SEG_SIZE = 125 (125.125 rounded down to an integer). In this embodiment, the keyword KEY_O of the target small file is calculated in the same way.

Calculating the relative address space offset of each sub-index includes: calculating the remainder of the original space address OFFSET of each sub-index divided by the fixed space size SEG_SIZE; the remainder is the relative address space offset of that sub-index. In the example above, 1001 divided by 8 is 125 with a remainder of 1, so the relative address space offset is 1.
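The keyword and relative offset calculations can be reproduced with a short Python sketch. The function name locate is hypothetical; the values OFFSET = 1001 and SEG_SIZE = 8 are those of the worked example, and integer (floor) division is assumed for the keyword.

```python
from typing import Tuple

SEG_SIZE = 8  # segmentation unit used in the worked example


def locate(offset: int, seg_size: int = SEG_SIZE) -> Tuple[int, int]:
    """Return (KEY, relative address space offset) for an original space address."""
    # divmod performs the integer division and the remainder in one step.
    return divmod(offset, seg_size)


key, rel_offset = locate(1001)
print(key, rel_offset)  # 125 1
```

The same function would serve for computing KEY_O of a target small file, since the text states that it is calculated in the same way.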

In this scheme, storing in the database includes: storing the virtual groups in the database in KEY/VALUE form, where KEY is the keyword and VALUE may be the sub-indexes.
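A minimal sketch of such a KEY/VALUE layout follows. It assumes a plain in-memory dictionary as a stand-in for the database and a JSON-encoded list of sub-indexes as the VALUE; the field names rel_offset, length, and agg_offset and the functions put_group and get_group are hypothetical and are not specified in the text.

```python
import json
from typing import Dict, List

# Stand-in for a real key/value database (assumption for illustration only).
db: Dict[str, str] = {}


def put_group(key: int, sub_indexes: List[dict]) -> None:
    """Store one virtual group: KEY is the keyword, VALUE the sorted sub-indexes."""
    ordered = sorted(sub_indexes, key=lambda s: s["rel_offset"])
    db[str(key)] = json.dumps(ordered)


def get_group(key: int) -> List[dict]:
    """Fetch the sub-indexes of a virtual group, or an empty list if it is absent."""
    raw = db.get(str(key))
    return json.loads(raw) if raw is not None else []


put_group(125, [
    {"rel_offset": 5, "length": 2, "agg_offset": 40},
    {"rel_offset": 1, "length": 3, "agg_offset": 10},
])
print([s["rel_offset"] for s in get_group(125)])  # [1, 5]
```

Storing the whole group under one key is what keeps the number of database entries at one per virtual group rather than one per small file.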

FIG. 2 shows the index information structure of a virtual group; in this example, one virtual group contains three sub-indexes. The three sub-indexes are sorted in ascending order of "intra-group relative offset", which corresponds to the relative address space offset described above.
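The construction of such a virtual group can be sketched in Python as follows. This is an illustration under assumptions: the SubIndex fields, the function build_groups, the value SEG_SIZE = 1024, and the sample addresses are all hypothetical, and each index entry is assumed to lie entirely within one virtual group.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

SEG_SIZE = 1024  # hypothetical segmentation unit


@dataclass(frozen=True)
class SubIndex:
    rel_offset: int  # intra-group relative offset (OFFSET % SEG_SIZE)
    length: int      # original file size
    agg_offset: int  # space address after aggregation


def build_groups(entries: Iterable[Tuple[int, int, int]],
                 seg_size: int = SEG_SIZE) -> Dict[int, List[SubIndex]]:
    """Group (original offset, size, aggregated offset) entries by KEY."""
    groups: Dict[int, List[SubIndex]] = defaultdict(list)
    for offset, length, agg_offset in entries:
        key, rel = divmod(offset, seg_size)
        groups[key].append(SubIndex(rel, length, agg_offset))
    # Keep every group sorted by relative offset so it can be binary-searched.
    for subs in groups.values():
        subs.sort(key=lambda s: s.rel_offset)
    return dict(groups)


groups = build_groups([(2100, 50, 9000), (2048, 16, 500),
                       (2064, 36, 7000), (5000, 10, 42)])
print([s.rel_offset for s in groups[2]])  # [0, 16, 52]
```

In this sample, three of the four entries fall into the virtual group with KEY 2, mirroring the three-sub-index group of the example.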

Implementing the index information simplification scheme greatly reduces the number of index information entries and, to a certain extent, speeds up index information retrieval; it is therefore an implicit optimization of index information retrieval. The explicit optimization of the index information retrieval algorithm, in turn, is based on the characteristics of the simplified index information data structure; that is, the index information simplification scheme is the implementation basis of the index retrieval optimization algorithm.

As shown in FIG. 1, in the index retrieval optimization algorithm of this scheme, screening and recording, in the retrieved virtual group, all sub-indexes meeting the conditions according to the comparison of the original space address and the original file size includes: in the retrieved virtual group, using binary search to find the first sub-index whose original space address is greater than the original space address OFFSET_O of the target small file, and recording it; and comparing the original file size of the target small file with the original file size of that sub-index. If the original file size of the target small file is smaller than the original file size of the sub-index, the sub-index is retained and recorded. If it is larger, the original file size of the sub-index is accumulated with the original file sizes of the sub-indexes before and after it, and the accumulated result is compared with the original file size of the target small file; if the target small file is still larger, the original file sizes of further following sub-indexes continue to be accumulated until the target small file is smaller than the accumulated size, at which point all sub-indexes meeting the conditions are retained and recorded.
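One possible reading of this screening step is sketched below in Python. It is not the patent's definitive procedure: the sketch binary-searches for the first sub-index whose relative offset is greater than the target's, steps back one entry because the preceding sub-index may contain the start of the target, and then accumulates following sub-indexes until the target range is covered. The names SubIndex, first_greater, and select_sub_indexes are hypothetical, and the target range is assumed to lie within a single virtual group.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubIndex:
    rel_offset: int  # intra-group relative offset
    length: int      # original file size
    agg_offset: int  # space address after aggregation


def first_greater(group: List[SubIndex], target_rel: int) -> int:
    """Binary search: position of the first sub-index starting after target_rel."""
    lo, hi = 0, len(group)
    while lo < hi:
        mid = (lo + hi) // 2
        if group[mid].rel_offset > target_rel:
            hi = mid
        else:
            lo = mid + 1
    return lo


def select_sub_indexes(group: List[SubIndex], target_rel: int,
                       target_len: int) -> List[SubIndex]:
    """Record the sub-indexes covering [target_rel, target_rel + target_len)."""
    # Step back one entry: the previous sub-index may contain the target start.
    i = max(first_greater(group, target_rel) - 1, 0)
    end = target_rel + target_len
    covered_to = target_rel
    selected: List[SubIndex] = []
    while i < len(group) and covered_to < end:
        sub = group[i]
        if sub.rel_offset > covered_to:
            raise LookupError("gap in index: target range is not fully covered")
        sub_end = sub.rel_offset + sub.length
        if sub_end > covered_to:
            selected.append(sub)   # this sub-index holds part of the target
            covered_to = sub_end   # accumulate coverage and move on
        i += 1
    if covered_to < end:
        raise LookupError("target range extends past this virtual group")
    return selected


group = [SubIndex(0, 16, 500), SubIndex(16, 36, 7000), SubIndex(52, 50, 9000)]
print(len(select_sub_indexes(group, 20, 10)))  # 1
print(len(select_sub_indexes(group, 10, 50)))  # 3
```

The first call fits inside a single sub-index, while the second spans all three, which corresponds to the case in which sizes have to be accumulated across neighbouring sub-indexes.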

In this scheme, splitting the target small file according to the recorded sub-indexes to obtain split sub-files includes: splitting the target small file according to the number of recorded sub-indexes and the original space address and original file size of each sub-index, to obtain the split sub-files. In this embodiment, if necessary, the original space address OFFSET_O of the target small file is split into multiple addresses OFFSET_O1, OFFSET_O2, and so on, and the sizes of the split sub-files, LENGTH1, LENGTH2, and so on, are calculated by comparison with the original space address OFFSET and original file size of each sub-index.

In the above embodiment, the aggregated space address in each sub-index meeting the conditions is read according to the original space address of each split sub-file. Specifically, according to the split addresses OFFSET_O1, OFFSET_O2, and so on, the corresponding aggregated space addresses, such as OFFSET_O1_N and OFFSET_O2_N, are read from the sub-indexes, where OFFSET_O1_N and OFFSET_O2_N are the aggregated space addresses in the sub-indexes.
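The splitting and address mapping can be sketched in Python as follows. This is a hedged illustration: the function split_target and the tuple layout (OFFSET, LENGTH, aggregated address) are assumptions, and the sketch additionally assumes that a sub-file starting partway through a sub-index maps to the sub-index's aggregated address plus the same displacement, which the text does not state explicitly.

```python
from typing import List, Tuple

# Each recorded sub-index: (original OFFSET, original LENGTH, aggregated address).
Sub = Tuple[int, int, int]


def split_target(selected: List[Sub], offset_o: int,
                 length_o: int) -> List[Tuple[int, int, int]]:
    """Split the target into (OFFSET_Oi, LENGTHi, OFFSET_Oi_N) pieces."""
    end_o = offset_o + length_o
    pieces = []
    for offset, length, offset_n in selected:
        # Overlap between the target range and this sub-index's original range.
        start = max(offset_o, offset)
        stop = min(end_o, offset + length)
        if start < stop:
            # Assumed mapping: same displacement inside the aggregated extent.
            pieces.append((start, stop - start, offset_n + (start - offset)))
    return pieces


selected = [(2048, 16, 500), (2064, 36, 7000), (2100, 50, 9000)]
for piece in split_target(selected, offset_o=2058, length_o=50):
    print(piece)
# (2058, 6, 510)
# (2064, 36, 7000)
# (2100, 8, 9000)
```

Three recorded sub-indexes yield three sub-files whose sizes sum to the target size of 50, and the resulting list of aggregated address and size pairs is what would be used to read from the bottom storage layer.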

In this scheme, when the target small file is split according to the recorded sub-indexes to obtain split sub-files, the number of sub-files is equal to the number of recorded sub-indexes.

The principle of the index retrieval optimization algorithm in the invention is to quickly locate a virtual group through the keyword KEY of the target small file, and then to determine the sub-indexes involved according to the original space address, the original file size, and the relative offsets within the group. When searching the sub-indexes within a group, binary search is used to find the first sub-index in the group that meets the condition, and the original file size of the target small file is then compared with the original file size in the sub-index to determine whether multiple sub-indexes are involved.

In the invention, the index information simplification scheme makes the storage and retrieval of index information for unstructured data practical through the proposed conversion between the space address of structured data and the keyword KEY of a virtual group. At the same time, by merging and managing multiple sub-indexes together, the number of index information entries is reduced by one or even several orders of magnitude, which greatly improves index information storage and retrieval efficiency.

The index retrieval optimization algorithm provides a fast indexing algorithm for unstructured file indexes. Through combined arithmetic on information such as the original space address, the original space address of the target small file, and the file size of the target small file, the index retrieval efficiency for unstructured files is improved from O(n) to O(log n), which greatly reduces the read latency of aggregated files. The index retrieval optimization algorithm is the core protection point of this patent, and the index information simplification scheme serves the index retrieval optimization algorithm.
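The difference between O(n) and O(log n) lookup can be made concrete by counting comparisons in a sorted list of relative offsets. The following Python sketch is purely illustrative; the functions linear_steps and binary_steps and the sample data (1024 offsets spaced 16 apart) are hypothetical and not taken from the patent.

```python
from typing import List


def linear_steps(starts: List[int], target: int) -> int:
    """Comparisons made by a linear scan for the first start greater than target."""
    steps = 0
    for start in starts:
        steps += 1
        if start > target:
            break
    return steps


def binary_steps(starts: List[int], target: int) -> int:
    """Comparisons made by binary search for the first start greater than target."""
    lo, hi, steps = 0, len(starts), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if starts[mid] > target:
            hi = mid
        else:
            lo = mid + 1
    return steps


starts = list(range(0, 1024 * 16, 16))  # 1024 sorted relative offsets
print(linear_steps(starts, 16000), binary_steps(starts, 16000))
```

For 1024 entries the linear scan needs about a thousand comparisons for a target near the end of the group, whereas the binary search needs at most 11.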

In another embodiment, the present invention provides a computer-readable storage medium storing a computer program for causing a computer to execute a distributed block storage small file aggregate index management method as described above.

In another embodiment, the present invention provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the distributed block storage small file aggregation index management method described above when executing the computer program.

In the disclosed embodiments, a computer storage medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a computer storage medium would include one or more wire-based electrical connections, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.

Claims

1. The distributed block storage small file aggregation index management method is characterized by comprising the following steps of:

2. The distributed block storage small file aggregation index management method according to claim 1, wherein dividing the original data of the index information after small file aggregation into virtual groups according to a preset fixed space size based on the original space address comprises: presetting a fixed space size as the segmentation unit SEG_SIZE, aggregating the small files to obtain the original data of the index information, and segmenting the original data into a plurality of virtual groups according to the original space addresses in the original data and the preset segmentation unit.

3. The distributed block storage small file aggregation index management method according to claim 1, wherein calculating the keyword of each virtual group according to the original space address and the fixed space size comprises: dividing the original space address OFFSET by the fixed space size SEG_SIZE to obtain the keyword KEY of each virtual group.

4. The distributed block storage small file aggregation index management method according to claim 3, wherein calculating the relative address space offset of each sub-index comprises: calculating the remainder of the original space address OFFSET of each sub-index divided by the fixed space size SEG_SIZE; the remainder is the relative address space offset of that sub-index.

5. The distributed block storage small file aggregation index management method according to claim 1, wherein storing in the database comprises: storing the virtual groups in the database in KEY/VALUE form, where KEY is the keyword and VALUE is the sub-index.

6. The distributed block storage small file aggregation index management method according to claim 1, wherein screening and recording, in the retrieved virtual group, all sub-indexes meeting the conditions according to the comparison of the original space address and the original file size comprises: in the retrieved virtual group, using binary search to find the first sub-index whose original space address is greater than the original space address of the target small file, and recording it; and comparing the original file size of the target small file with the original file size of that sub-index; if the original file size of the target small file is smaller than the original file size of the sub-index, retaining and recording the sub-index; if it is larger, accumulating the original file size of the sub-index with the original file sizes of the sub-indexes before and after it, and comparing the accumulated result with the original file size of the target small file; and if the target small file is still larger, continuing to accumulate the original file sizes of further following sub-indexes until the target small file is smaller than the accumulated size, at which point all sub-indexes meeting the conditions are retained and recorded.

7. The distributed block storage small file aggregation index management method according to claim 1, wherein splitting the target small file according to the recorded sub-indexes to obtain split sub-files comprises: splitting the target small file according to the number of recorded sub-indexes and the original space address and original file size of each sub-index, to obtain the split sub-files.

8. The distributed block storage small file aggregation index management method according to claim 1, wherein, when the target small file is split according to the recorded sub-indexes to obtain split sub-files, the number of sub-files is equal to the number of recorded sub-indexes.

9. A computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute a distributed block storage small file aggregation index management method according to any one of claims 1 to 8.

10. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the distributed block storage small file aggregation index management method according to any one of claims 1 to 8 when executing the computer program.