CN118051478A - Distributed block storage small file aggregation index management method - Google Patents

Info

Publication number
CN118051478A
CN118051478A CN202311667919.1A CN202311667919A CN118051478A CN 118051478 A CN118051478 A CN 118051478A CN 202311667919 A CN202311667919 A CN 202311667919A CN 118051478 A CN118051478 A CN 118051478A
Authority
CN
China
Prior art keywords
sub
index
file
original
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311667919.1A
Other languages
Chinese (zh)
Inventor
李佳徐
刘伟锋
李博奇
牛鹏举
吴伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Digital Intelligence Technology Co Ltd
Original Assignee
China Telecom Digital Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-12-07
Filing date
2023-12-07
Publication date
2024-05-17
Application filed by China Telecom Digital Intelligence Technology Co Ltd
Priority to CN202311667919.1A
Publication of CN118051478A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed block storage small file aggregation index management method, comprising: dividing the original data of the index information produced by small file aggregation into virtual groups, and calculating the keyword of each virtual group; assigning the index information of each small file, as sub-index information, to the corresponding virtual group; calculating the relative address space offset of each sub-index, sorting the sub-indexes within each virtual group, and storing the groups in a database; calculating the keyword of a target small file and retrieving the virtual group that matches it; comparing and screening the sub-indexes in the retrieved virtual group; splitting the target small file into sub-files; and reading the aggregated space address from the sub-index corresponding to each sub-file and combining the index retrieval results so that the file can be read from the storage bottom layer. By dividing the original structured data of the index information produced by small file aggregation into a plurality of virtual groups, the invention can reduce the number of index entries in the database by one or even several orders of magnitude; by optimizing index information retrieval, it enables fast querying of unstructured indexes.

Description

Distributed block storage small file aggregation index management method
Technical Field
The invention relates to the technical field of distributed storage, in particular to a distributed block storage small file aggregation index management method.
Background
Distributed storage can provide three storage interfaces: distributed block storage, distributed object storage, and distributed file storage. Object storage and file storage are used to store unstructured data, while distributed block storage is used to store structured data. Unstructured data in object/file storage can be identified at the storage layer by file name, whereas structured data in block storage no longer contains the original file name and carries only information such as file space and file size. See fig. 3.
The persistent storage media at the bottom layer of distributed storage are mostly mechanical disks. Limited by the physical structure of mechanical disks, random read-write performance for small files is often poor, while read-write performance for large files can meet requirements. To improve the read-write performance of small files in a storage system, small files are generally aggregated into large files and then flushed to the bottom layer in a batch for persistent storage.
For small file aggregation in object storage and file storage, small files are usually appended to a large file, which is then flushed to the bottom-layer persistent storage; an index relation between each small file and the large file can be established through file names for file reading. The object/file storage small file aggregation principle is shown in fig. 4.
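The name-based aggregation described above can be sketched as follows. This is only an illustrative sketch, not code from the patent: the function names `aggregate_by_name` and `read_by_name`, the in-memory buffer standing in for the large file, and the `(offset, length)` index layout are all assumptions.

```python
from typing import Dict, Tuple

def aggregate_by_name(small_files: Dict[str, bytes]) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Append each small file to one large buffer; map file name -> (offset, length)."""
    large = bytearray()
    name_index: Dict[str, Tuple[int, int]] = {}
    for name, data in small_files.items():
        name_index[name] = (len(large), len(data))  # position inside the large file
        large.extend(data)                          # append-only write
    return bytes(large), name_index

def read_by_name(large: bytes, name_index: Dict[str, Tuple[int, int]], name: str) -> bytes:
    """Look up a small file by name and slice it out of the large file."""
    offset, length = name_index[name]
    return large[offset:offset + length]

large, idx = aggregate_by_name({"a.txt": b"hello", "b.txt": b"world!"})
print(idx["b.txt"])  # (5, 6)
```

Because the file name is a ready-made key, the lookup is a single dictionary access; block storage has no such key, which is the gap the method addresses.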
Because block storage data is structured and carries no file name, block storage cannot directly establish a simple index between an original file and an aggregated file during small file aggregation; instead, the space address of the original data block and the space address of the aggregated data block must be associated and integrated into index information. Because the address space is continuous, the number of index entries based on space addresses is theoretically almost unlimited, which makes the index information difficult to store. The massive volume of index information also makes index retrieval inefficient, so that reading the original small files from the aggregated file becomes extremely difficult. In addition, because space addresses are continuous, an index cannot be located quickly by keyword during retrieval, and space address segments often have to be compared, which further complicates index retrieval.
Therefore, a distributed block storage small file aggregation index management method is needed to solve the problems that distributed block storage small file aggregation index information has a large number of entries, is inconvenient to store, and is inefficient to retrieve.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a distributed block storage small file aggregation index management method to solve the problems that distributed block storage small file aggregation index information has a large number of entries, is inconvenient to store, and is inefficient to retrieve.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The distributed block storage small file aggregation index management method is characterized by comprising the following steps of:
Dividing the original data of the index information after the small file aggregation into virtual groups according to a preset fixed space size based on an original space address, and calculating keywords of each virtual group according to the original space address and the fixed space size; taking index information of each small file as sub-index information, and dividing each sub-index into a corresponding virtual group according to the keywords; calculating the relative address space offset of each sub-index, sorting in each virtual group according to the relative address space offset of each sub-index, and storing the sorted virtual groups in a database;
Calculating the keywords of the target small file, traversing the database, and searching the virtual group meeting the keywords; in the retrieved virtual group, screening all sub-indexes meeting the conditions according to the comparison of the original space address and the original file size, and recording; splitting the target small file according to the recorded sub-index to obtain a split sub-file; and correspondingly reading the aggregated space address in the sub-index meeting the condition according to the original space address of each split sub-file, and combining the index retrieval result into the aggregated space address/file size of the target small file so as to read the file from the storage bottom layer.
In order to optimize the technical scheme, the specific measures adopted further comprise:
Further, dividing the original data of the index information after the small file aggregation into virtual groups according to a preset fixed space size based on the original space address includes: presetting a fixed space size as the segmentation unit SEG_SIZE; aggregating the small files to obtain the original data of the index information; and dividing the original data into a plurality of virtual groups according to the original space address in the original data and the preset segmentation unit.
Further, calculating the keyword of each virtual group according to the original space address and the fixed space size includes: dividing the original space address OFFSET by the fixed space size SEG_SIZE to obtain the keyword KEY of each virtual group.
Further, calculating the relative address space offset of each sub-index includes: calculating the remainder of dividing the original space address OFFSET of each sub-index by the fixed space size SEG_SIZE; the remainder is the relative address space offset of that sub-index.
Further, storing in the database includes: storing the virtual groups in the database in KEY/VALUE form, where KEY is the keyword and VALUE is the sub-index.
Further, screening and recording all sub-indexes meeting the conditions in the retrieved virtual group, according to the comparison of the original space address and the original file size, includes: in the retrieved virtual group, using binary search to find the first sub-index whose original space address is greater than the original space address of the target small file, and recording it; then comparing the original file size of the target small file with the original file size of the sub-index. If the original file size of the target small file is smaller than that of the sub-index, the sub-index is retained and recorded. If it is larger, the original file size of the sub-index is accumulated with the original file sizes of the adjacent sub-indexes before and after it, and the accumulated result is compared with the original file size of the target small file. If the target small file is still larger, the original file sizes of further sub-indexes are accumulated in the backward direction until the original file size of the target small file is smaller than the accumulated size, at which point all sub-indexes meeting the conditions are retained and recorded.
Further, splitting the target small file according to the recorded sub-index to obtain a split sub-file, including: and correspondingly splitting the target small file according to the number of the recorded sub-indexes, the original space address of each sub-index and the original file size, and obtaining a split sub-file.
Further, splitting the target small file according to the recorded sub-index to obtain a split sub-file; wherein the number of subfiles is consistent with the number of recorded sub-indexes.
Further, a computer-readable storage medium storing a computer program, characterized in that: the computer program causes a computer to perform the distributed block storage small file aggregation index management method as described above.
Further, an electronic device is characterized by comprising: the system comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the distributed block storage small file aggregation index management method when executing the computer program.
The beneficial effects of the invention are as follows:
According to the invention, dividing the original structured data of the index information produced by small file aggregation into a plurality of virtual groups can reduce the number of index entries in the database by one or even several orders of magnitude, greatly improving the efficiency of index information storage and retrieval; by optimizing index information retrieval, unstructured indexes can be queried quickly, the index query complexity improves from O(n) to O(log n), and the read latency of aggregated files is greatly reduced.
Drawings
FIG. 1 is a flow chart of the retrieval process of the distributed block storage small file aggregation index management method according to the present invention;
FIG. 2 is a schematic diagram of a virtual group in the distributed block storage small file aggregation index management method according to the present invention;
FIG. 3 is a schematic diagram of distributed storage;
FIG. 4 is a schematic diagram of the object/file storage small file aggregation principle.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The distributed block storage small file aggregation index management method is characterized by comprising the following steps of:
Index information simplification scheme: dividing the original structured data of the index information after the small file aggregation into virtual groups according to a preset fixed space size based on an original space address, and calculating keywords of each virtual group according to the original space address and the fixed space size; taking index information of each small file as sub-index information, and dividing each sub-index into a corresponding virtual group according to the keywords; calculating the relative address space offset of each sub-index, sorting in each virtual group according to the relative address space offset of each sub-index, and storing the sorted virtual groups in a database;
Index retrieval optimization algorithm: according to the original space address OFFSET_O of the target small file and the fixed space SIZE SEG_SIZE, calculating a keyword KEY_O of the target small file, traversing the database, and searching a virtual group meeting the keyword; in the retrieved virtual group, screening all sub-indexes meeting the conditions according to the comparison of the original space address and the original file size, and recording; splitting the target small file according to the recorded sub-index to obtain a split sub-file; and correspondingly reading the aggregated space address in the sub-index meeting the condition according to the original space address of each split sub-file, and combining the index retrieval result into the form of the aggregated space address/file size of the target small file so as to read the file from the storage bottom layer.
In the index information simplification scheme, dividing the original structured data of the index information after the small file aggregation into virtual groups according to a preset fixed space size based on the original space address specifically includes: presetting a fixed space size as the segmentation unit SEG_SIZE; aggregating the small files to obtain the original structured data of the index information; and dividing the original structured data into a plurality of virtual groups according to the original space address in the original structured data and the preset segmentation unit.
In this scheme, calculating the keyword of each virtual group according to the original space address and the fixed space size includes: dividing the original space address OFFSET by the fixed space size SEG_SIZE to obtain the keyword KEY of each virtual group. For example, when OFFSET = 1001 and SEG_SIZE = 8, KEY = OFFSET / SEG_SIZE = 125 (125.125 rounded down). In this embodiment, the keyword KEY_O of the target small file is calculated in the same way.
Calculating the relative address space offset of each sub-index includes: calculating the remainder of dividing the original space address OFFSET of each sub-index by the fixed space size SEG_SIZE; the remainder is the relative address space offset of that sub-index. In the example above, 1001 divided by 8 gives 125 with a remainder of 1, so the relative address space offset is 1.
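The keyword and relative offset calculation from the example above can be reproduced in a few lines. This is a minimal sketch; the function name `locate` is hypothetical, and integer (floor) division is assumed because the description takes a quotient and a remainder.

```python
from typing import Tuple

SEG_SIZE = 8  # segmentation unit used in the worked example

def locate(offset: int, seg_size: int = SEG_SIZE) -> Tuple[int, int]:
    """Return (KEY, relative address space offset) for an original space address."""
    key, rel = divmod(offset, seg_size)  # integer quotient and remainder
    return key, rel

key, rel = locate(1001)
print(key, rel)  # 125 1
```

The original address can always be recovered as KEY * SEG_SIZE + relative offset, so only the small relative offset needs to be kept inside each group.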
In this scheme, storing in the database includes: storing the virtual groups in the database in KEY/VALUE form, where KEY is the keyword and VALUE may be the sub-indexes.
Fig. 2 shows the index information structure of a virtual group; in this example, one virtual group contains three sub-indexes. The three sub-indexes are sorted in increasing order of "intra-group relative offset", which corresponds to the relative address space offset described above.
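A virtual group of this shape might be built as sketched below. The `SubIndex` field names, the `build_groups` helper, the sample addresses, and the use of a plain dictionary in place of the KEY/VALUE database are all assumptions made for illustration; the sketch also assumes no small file straddles a group boundary.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class SubIndex:
    rel_offset: int   # intra-group relative offset (OFFSET % SEG_SIZE)
    length: int       # original file size
    agg_offset: int   # space address inside the aggregated file

def build_groups(entries: List[Tuple[int, int, int]], seg_size: int) -> Dict[int, List[SubIndex]]:
    """entries: (original offset, original size, aggregated offset) per small file."""
    groups: Dict[int, List[SubIndex]] = defaultdict(list)
    for offset, length, agg_offset in entries:
        key, rel = divmod(offset, seg_size)        # KEY selects the virtual group
        groups[key].append(SubIndex(rel, length, agg_offset))
    for subs in groups.values():
        subs.sort(key=lambda s: s.rel_offset)      # sequentially increasing order
    return dict(groups)                            # stand-in for the KEY/VALUE database

db = build_groups(
    [(4100, 8, 16), (4096, 4, 0), (4200, 16, 24), (9000, 8, 40)],
    seg_size=4096,
)
print([s.rel_offset for s in db[1]])  # [0, 4, 104]
```

Here the group with KEY 1 holds three sub-indexes under a single database entry, matching the three-sub-index example.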
Implementing the index information simplification scheme greatly reduces the number of index entries and speeds up index retrieval to a certain extent; it is therefore an implicit optimization of index information retrieval. In addition, the explicit index retrieval optimization algorithm relies on the characteristics of the simplified index data structure; that is, the index information simplification scheme is the implementation basis of the index retrieval optimization algorithm.
As shown in fig. 1, in the index retrieval optimization algorithm of this scheme, screening and recording all sub-indexes meeting the conditions in the retrieved virtual group, according to the comparison of the original space address and the original file size, includes: in the retrieved virtual group, using binary search to find the first sub-index whose original space address is greater than the original space address OFFSET_O of the target small file, and recording it; then comparing the original file size of the target small file with the original file size of the sub-index. If the original file size of the target small file is smaller than that of the sub-index, the sub-index is retained and recorded. If it is larger, the original file size of the sub-index is accumulated with the original file sizes of the adjacent sub-indexes before and after it, and the accumulated result is compared with the original file size of the target small file. If the target small file is still larger, the original file sizes of further sub-indexes are accumulated in the backward direction until the original file size of the target small file is smaller than the accumulated size, at which point all sub-indexes meeting the conditions are retained and recorded.
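One possible reading of this screening step is sketched below. It is an interpretation, not the patent's exact procedure: the sketch binary-searches for the first sub-index that starts after the target address, steps back to the sub-index that contains the target's start, and then walks forward until the requested size is covered. The tuple layout and the name `select_sub_indexes` are assumptions.

```python
import math
from bisect import bisect_right
from typing import List, Tuple

# Each sub-index is (rel_offset, length, agg_offset); a group is sorted by rel_offset.
SubIndex = Tuple[int, int, int]

def select_sub_indexes(group: List[SubIndex], rel: int, size: int) -> List[SubIndex]:
    """Return the sub-indexes that together cover [rel, rel + size) within one group."""
    # (rel, inf) sorts after every tuple whose first field equals rel, so this
    # is the index of the first sub-index that starts strictly after `rel`.
    i = bisect_right(group, (rel, math.inf)) - 1   # O(log n) lookup
    if i < 0:
        return []                                  # nothing starts at or before rel
    selected: List[SubIndex] = []
    pos, end = rel, rel + size
    while i < len(group) and pos < end:
        start, length, _ = group[i]
        if not (start <= pos < start + length):
            return []                              # gap: range is not fully indexed
        selected.append(group[i])
        pos = start + length                       # accumulate the covered size
        i += 1
    return selected if pos >= end else []

group = [(0, 4, 0), (4, 8, 16), (104, 16, 24)]
print(select_sub_indexes(group, 2, 8))  # [(0, 4, 0), (4, 8, 16)]
```

The binary search costs O(log n) in the group size; the forward walk touches only the sub-indexes that the target actually spans.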
In this scheme, splitting the target small file according to the recorded sub-indexes to obtain split sub-files includes: splitting the target small file according to the number of recorded sub-indexes and the original space address and original file size of each sub-index. In this embodiment, if necessary, the original space address OFFSET_O of the target small file is split into OFFSET_O1, OFFSET_O2, and so on, and the sizes of the split sub-files, LENGTH1, LENGTH2, and so on, are calculated by comparison with the original space address OFFSET and original file size of each sub-index.
According to the above embodiment, the aggregated space address in each sub-index meeting the condition is read according to the original space address of the corresponding split sub-file. Specifically, for the split addresses OFFSET_O1, OFFSET_O2, and so on, the corresponding aggregated space addresses OFFSET_O1_N, OFFSET_O2_N, and so on are read from the sub-indexes, where OFFSET_O1_N and OFFSET_O2_N are the aggregated space addresses in the sub-indexes.
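The split and the mapping to aggregated addresses might look like the sketch below. It is a hedged illustration only: the name `split_and_map`, the tuple layout, and the use of absolute original addresses (rather than intra-group relative offsets) are assumptions chosen to keep the example short.

```python
from typing import List, Tuple

# (original offset, original size, aggregated offset) of each recorded sub-index
SubIndex = Tuple[int, int, int]

def split_and_map(offset_o: int, length_o: int, selected: List[SubIndex]) -> List[Tuple[int, int]]:
    """Split [offset_o, offset_o + length_o) along the recorded sub-indexes.

    Returns one (aggregated offset, length) pair per split sub-file.
    """
    result: List[Tuple[int, int]] = []
    end_o = offset_o + length_o
    for orig, size, agg in selected:
        start = max(offset_o, orig)        # OFFSET_O1, OFFSET_O2, ...
        stop = min(end_o, orig + size)
        if start < stop:
            # aggregated address of this piece and its LENGTH
            result.append((agg + (start - orig), stop - start))
    return result

selected = [(4096, 4, 0), (4100, 8, 16)]
print(split_and_map(4098, 8, selected))  # [(2, 2), (16, 6)]
```

In this example the 8-byte read is served by two pieces, one per recorded sub-index, and the piece lengths sum to the requested size.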
In the scheme, splitting a target small file according to the recorded sub-index to obtain a split sub-file; wherein the number of subfiles is consistent with the number of recorded sub-indexes.
The principle of the index retrieval optimization algorithm in the invention is to locate a virtual group quickly through the keyword KEY of the target small file, and then to determine the sub-indexes involved according to the original space address, the original file size, and the intra-group relative offset. When searching the sub-indexes in a group, the first sub-index meeting the condition is found by binary search, and the original file size of the target small file is compared with the original file size in the sub-index to determine whether multiple sub-indexes are involved.
In the invention, the index information simplification scheme makes the storage and retrieval of index information for unstructured data practical by converting between the space addresses of structured data and the keyword KEY of the virtual groups. Meanwhile, managing multiple sub-indexes together in one group reduces the number of index entries by one or even several orders of magnitude, greatly improving the efficiency of index information storage and retrieval.
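The effect of grouping on the number of database entries can be illustrated with a small count. The workload below (1000 contiguous 8-byte small files) and the helper `count_entries` are hypothetical; the actual reduction depends on how densely the small files fill each segment.

```python
from typing import Iterable, Tuple

def count_entries(offsets: Iterable[int], seg_size: int) -> Tuple[int, int]:
    """Return (flat entry count, grouped entry count) for the given original offsets."""
    offsets = list(offsets)
    flat = len(offsets)                                   # one entry per sub-index
    grouped = len({off // seg_size for off in offsets})   # one entry per virtual group
    return flat, grouped

offsets = [i * 8 for i in range(1000)]   # hypothetical workload
print(count_entries(offsets, seg_size=1024))  # (1000, 8)
```

With these assumed numbers the database holds 8 entries instead of 1000, roughly two orders of magnitude fewer.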
The index retrieval optimization algorithm provides a fast indexing algorithm for unstructured file indexes. Through combined arithmetic on information such as the original space address in the index, the original space address of the target small file, and the file size of the target small file, the index retrieval complexity for unstructured files improves from O(n) to O(log n), greatly reducing the read latency of aggregated files. The index retrieval optimization algorithm is the core point of protection of this patent, and the index information simplification scheme serves it.
In another embodiment, the present invention provides a computer-readable storage medium storing a computer program for causing a computer to execute a distributed block storage small file aggregate index management method as described above.
In another embodiment, the present invention provides an electronic device, including: the system comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the distributed block storage small file aggregation index management method when executing the computer program.
In the disclosed embodiments, a computer storage medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a computer storage medium would include one or more wire-based electrical connections, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.

Claims (10)

1. The distributed block storage small file aggregation index management method is characterized by comprising the following steps of:
Dividing the original data of the index information after the small file aggregation into virtual groups according to a preset fixed space size based on an original space address, and calculating keywords of each virtual group according to the original space address and the fixed space size; taking index information of each small file as sub-index information, and dividing each sub-index into a corresponding virtual group according to the keywords; calculating the relative address space offset of each sub-index, sorting in each virtual group according to the relative address space offset of each sub-index, and storing the sorted virtual groups in a database;
Calculating the keywords of the target small file, traversing the database, and searching the virtual group meeting the keywords; in the retrieved virtual group, screening all sub-indexes meeting the conditions according to the comparison of the original space address and the original file size, and recording; splitting the target small file according to the recorded sub-index to obtain a split sub-file; and correspondingly reading the aggregated space address in the sub-index meeting the condition according to the original space address of each split sub-file, and combining the index retrieval result into the aggregated space address/file size of the target small file so as to read the file from the storage bottom layer.
2. The distributed block storage small file aggregation index management method according to claim 1, wherein dividing the original data of the index information after the small file aggregation into virtual groups according to a preset fixed space size based on the original space address comprises: presetting a fixed space size as the segmentation unit SEG_SIZE; aggregating the small files to obtain the original data of the index information; and dividing the original data into a plurality of virtual groups according to the original space address in the original data and the preset segmentation unit.
3. The distributed block storage small file aggregation index management method according to claim 1, wherein calculating the keyword of each virtual group according to the original space address and the fixed space size comprises: dividing the original space address OFFSET by the fixed space size SEG_SIZE to obtain the keyword KEY of each virtual group.
4. The distributed block storage small file aggregation index management method according to claim 3, wherein calculating the relative address space offset of each sub-index comprises: calculating the remainder of dividing the original space address OFFSET of each sub-index by the fixed space size SEG_SIZE, the remainder being the relative address space offset of that sub-index.
5. The distributed block storage small file aggregation index management method according to claim 1, wherein storing in the database comprises: storing the virtual groups in the database in KEY/VALUE form, where KEY is the keyword and VALUE is the sub-index.
6. The distributed block storage small file aggregation index management method according to claim 1, wherein screening and recording all sub-indexes meeting the conditions in the retrieved virtual group, according to the comparison of the original space address and the original file size, comprises: in the retrieved virtual group, using binary search to find the first sub-index whose original space address is greater than the original space address of the target small file, and recording it; comparing the original file size of the target small file with the original file size of the sub-index; if the original file size of the target small file is smaller than that of the sub-index, retaining and recording the sub-index; if it is larger, accumulating the original file size of the sub-index with the original file sizes of the adjacent sub-indexes before and after it, and comparing the accumulated result with the original file size of the target small file; and if the target small file is still larger, continuing to accumulate the original file sizes of further sub-indexes in the backward direction until the original file size of the target small file is smaller than the accumulated size, and then retaining and recording all sub-indexes meeting the conditions.
7. The method for managing the aggregate index of distributed block storage small files according to claim 1, wherein splitting the target small file according to the recorded sub-index to obtain the split sub-file comprises: and correspondingly splitting the target small file according to the number of the recorded sub-indexes, the original space address of each sub-index and the original file size, and obtaining a split sub-file.
8. The distributed block storage small file aggregation index management method according to claim 1, wherein: splitting the target small file according to the recorded sub-index to obtain a split sub-file; wherein the number of subfiles is consistent with the number of recorded sub-indexes.
9. A computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute a distributed block storage small file aggregation index management method according to any one of claims 1 to 8.
10. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the distributed block storage small file aggregation index management method according to any one of claims 1-8 when executing the computer program.
CN202311667919.1A 2023-12-07 2023-12-07 Distributed block storage small file aggregation index management method Pending CN118051478A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311667919.1A CN118051478A (en) 2023-12-07 2023-12-07 Distributed block storage small file aggregation index management method

Publications (1)

Publication Number Publication Date
CN118051478A true CN118051478A (en) 2024-05-17

Family

ID=91043853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311667919.1A Pending CN118051478A (en) 2023-12-07 2023-12-07 Distributed block storage small file aggregation index management method

Country Status (1)

Country Link
CN (1) CN118051478A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination