CN114610952A

CN114610952A - Effective data indexing method, system, device and storage medium

Info

Publication number: CN114610952A
Application number: CN202210185915.9A
Authority: CN
Inventors: 马立珂; 王贤达; 黄律棋; 王子骏
Original assignee: Guangzhou Dingjia Computer Technology Co ltd
Current assignee: Guangzhou Dingjia Computer Technology Co ltd
Priority date: 2022-02-28
Filing date: 2022-02-28
Publication date: 2022-06-10
Anticipated expiration: 2042-02-28
Also published as: CN114610952B

Abstract

The invention discloses a valid data indexing method, system, device and storage medium. The valid data indexing method comprises the following steps: setting the value of a first block number to zero; decompressing a first tuple to obtain a second block number and a first block count; calculating the difference between the second block number and the first block number; setting the value of the first block number to the value of the second block number; generating a second tuple from the difference and the first block count; if the first tuples have not all been decompressed, returning to the decompressing step to obtain the next second block number and first block count; and if the first tuples have all been decompressed, completing the indexing of the valid data according to the second tuples. Starting from the original first tuples, taking the difference reduces the numerical value of the block number and therefore the number of bits required to represent it; generating each second tuple from the block count of the consecutive valid data blocks and the corresponding block number further reduces the number of bits each second tuple requires to index its valid data blocks.

Description

Effective data indexing method, system, device and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular to a valid data indexing method, system, apparatus, and storage medium.

Background

Database data, operating system backup data, and virtual machine data are typically stored sparsely in files or block devices, such as the data files of a database, the block devices of an operating system partition, and virtual machine image files. To save storage space, only the data blocks that are in use, i.e. the valid data blocks, are usually backed up. Backing up such data therefore requires locating the valid data blocks within the original data, i.e. indexing the valid data. Conventional valid data indexing methods record the position of the valid data in the original data with a bitmap or a position vector. Each bit in a bitmap represents one fixed-length data block, and a bit set to 1 marks the corresponding block as a valid data block, so a group of valid data blocks can be represented by a bitmap composed of several bits. A position vector is an array of two-tuples in which each run of consecutive valid data blocks is represented as one tuple consisting of a starting block address and the length of the run.
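The two conventional representations can be illustrated with a minimal sketch. The function names `to_bitmap` and `to_position_vector` and the sample block numbers are hypothetical, and the valid block numbers are assumed to be distinct.

```python
def to_bitmap(valid_blocks, total_blocks):
    """Return one 0/1 flag per fixed-length block; 1 marks a valid block."""
    bitmap = [0] * total_blocks
    for block in valid_blocks:
        bitmap[block] = 1
    return bitmap


def to_position_vector(valid_blocks):
    """Collapse valid block numbers into (start_block, length) runs."""
    runs = []
    for block in sorted(valid_blocks):
        if runs and runs[-1][0] + runs[-1][1] == block:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)  # extend the current run
        else:
            runs.append((block, 1))         # start a new run
    return runs


valid = [2, 3, 4, 9, 10, 15]
print(to_bitmap(valid, 16))       # [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
print(to_position_vector(valid))  # [(2, 3), (9, 2), (15, 1)]
```

The bitmap always has one entry per block of original data, whereas the position vector grows with the number of runs.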

However, on the one hand, when the original data is very large and the valid data is very small, the bitmap is filled with a large number of 0 bits and its effective utilization is low, because the size of the bitmap depends only on the size of the original data and does not change with the distribution or amount of valid data. On the other hand, when the original data is very large and the valid data blocks are very scattered, i.e. there are few consecutive valid data blocks or fragmentation is severe, the number of tuples in the position vector becomes very large; in the extreme case, each valid data block corresponds to its own tuple. The conventional valid data indexing methods therefore need many bits to index each valid data block and consume considerable resources.

Disclosure of Invention

The present invention aims to solve at least to some extent one of the technical problems existing in the prior art.

Therefore, an object of the embodiments of the present invention is to provide a method, a system, an apparatus and a storage medium for indexing valid data, so as to reduce the number of bits required for indexing each valid data block.

In order to achieve the technical purpose, the technical scheme adopted by the embodiment of the invention comprises the following steps:

in a first aspect, an embodiment of the present invention provides an effective data indexing method, including the following steps:

setting the value of a first block number to zero;

decompressing a first tuple to obtain a second block number and a first block count, wherein the second block number is the starting block address of the first tuple, and the first block count is the number of consecutive valid data blocks in the first tuple;

calculating the difference between the second block number and the first block number;

setting the value of the first block number to the value of the second block number;

generating a second tuple from the difference and the first block count;

if the first tuples have not all been decompressed, returning to the step of decompressing a first tuple to obtain a second block number and a first block count;

if the first tuples have all been decompressed, completing the indexing of the valid data according to the second tuples.

According to the valid data indexing method, starting from the original first tuples, the difference between the block number of each first tuple and the block number of the preceding first tuple is used as the new block number, which reduces the value of the block number and thus the number of bits required to represent it; the block count of the consecutive valid data blocks in each first tuple is then combined with the corresponding new block number to generate a second tuple, which further reduces the number of bits each second tuple requires to index its valid data blocks.
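The difference calculation can be sketched as follows. This is an illustration only, not the patented implementation: the function names `to_second_tuples` and `to_first_tuples` and the sample values are hypothetical, and variable-length packing is left out so that only the difference step is shown.

```python
def to_second_tuples(first_tuples):
    """Replace each start block number with its difference from the previous one."""
    prev_block_num = 0  # the first block number starts at zero
    second_tuples = []
    for block_num, count in first_tuples:
        delta = block_num - prev_block_num
        prev_block_num = block_num  # remember this start for the next tuple
        second_tuples.append((delta, count))
    return second_tuples


def to_first_tuples(second_tuples):
    """Invert the transformation by accumulating the differences."""
    block_num = 0
    first_tuples = []
    for delta, count in second_tuples:
        block_num += delta
        first_tuples.append((block_num, count))
    return first_tuples


first = [(1000, 3), (1010, 2), (5000, 1)]
second = to_second_tuples(first)
print(second)  # [(1000, 3), (10, 2), (3990, 1)]
assert to_first_tuples(second) == first
```

Only the first difference stays as large as the original block number; later ones shrink to the gap between neighbouring runs.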

In addition, the effective data indexing method according to the above embodiment of the present invention may further have the following additional technical features:

further, in a valid data indexing method according to an embodiment of the present invention, the generating a second tuple from the difference and the first block count includes:

passing the difference into a pack function to generate a third block number;

passing the first block count into a pack function to generate a second block count;

and generating the second tuple from the third block number and the second block count.

Further, in an embodiment of the present invention, the passing the difference into a pack function to generate a third block number includes:

passing the difference into the pack function, which returns a first encoding buffer;

and writing the first encoding buffer into a target buffer to generate the third block number.

Further, in an embodiment of the present invention, the passing the first block count into a pack function to generate a second block count includes:

passing the first block count into the pack function, which returns a second encoding buffer;

and writing the second encoding buffer into the target buffer to generate the second block count.
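A pack function of this kind might look like the sketch below. It assumes the byte layout given in the detailed description (7 payload bits per byte, low-order bits first, high bit set when another byte follows); the name `append_second_tuple` and the use of a `bytearray` as the target buffer are illustrative assumptions.

```python
def pack(value):
    """Encode a non-negative integer; return (encoding buffer, length)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    buf = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            buf.append(low7 | 0x80)  # high bit 1: more bytes follow
        else:
            buf.append(low7)         # high bit 0: last byte
            return bytes(buf), len(buf)


def append_second_tuple(target, delta, count):
    """Write the packed difference, then the packed block count, to target."""
    for value in (delta, count):
        encoded, _length = pack(value)
        target.extend(encoded)


target = bytearray()
append_second_tuple(target, 300, 2)
print(target.hex())  # ac0202
```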

In a second aspect, an embodiment of the present invention provides an effective data indexing system, including:

a first block number assignment module, used for setting the value of a first block number to zero and for setting the value of the first block number to the value of the second block number;

a first tuple decompression module, used for decompressing a first tuple to obtain a second block number and a first block count;

a difference calculating module, used for calculating the difference between the second block number and the first block number;

a second tuple generating module, used for generating a second tuple from the difference and the first block count;

and a judging module, used for returning to the step of decompressing a first tuple to obtain a second block number and a first block count when the first tuples have not all been decompressed, and for completing the indexing of the valid data according to the second tuples when the first tuples have all been decompressed.

Further, in an embodiment of the present invention, the second tuple generating module includes:

a third block number generating module, used for passing the difference into a pack function to generate a third block number;

and a second block count generating module, used for passing the first block count into a pack function to generate a second block count.

Further, in an embodiment of the present invention, the third block number generating module includes:

a first encoding buffer returning module, used for passing the difference into the pack function, which returns a first encoding buffer;

and a first writing module, used for writing the first encoding buffer into a target buffer to generate the third block number.

Further, in an embodiment of the present invention, the second block count generating module includes:

a second encoding buffer returning module, used for passing the first block count into the pack function, which returns a second encoding buffer;

and a second writing module, used for writing the second encoding buffer into the target buffer to generate the second block count.

In a third aspect, an embodiment of the present invention provides an effective data indexing apparatus, including:

at least one processor;

at least one memory for storing at least one program;

when executed by the at least one processor, the at least one program causes the at least one processor to implement the valid data indexing method.

In a fourth aspect, an embodiment of the present invention provides a storage medium, in which a processor-executable program is stored, the processor-executable program being configured to implement the effective data indexing method when executed by a processor.

Advantages and benefits of the present invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application:

starting from the original first tuples, the embodiment of the invention uses the difference between the block number of each first tuple and the block number of the preceding first tuple as the new block number, which reduces the value of the block number and thus the number of bits required to represent it; the block count of the consecutive valid data blocks in each first tuple is then combined with the corresponding new block number to generate a second tuple, which further reduces the number of bits each second tuple requires to index its valid data blocks.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following description is made on the drawings of the embodiments of the present application or the related technical solutions in the prior art, and it should be understood that the drawings in the following description are only for convenience and clarity of describing some embodiments in the technical solutions of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flowchart of an embodiment of a valid data indexing method according to the present invention;

FIG. 2 is a schematic illustration of an unpack flow chart of an embodiment of a method for indexing valid data according to the present invention;

FIG. 3 is a schematic pack flow chart of an embodiment of a valid data indexing method according to the present invention;

FIG. 4 is a schematic structural diagram of an embodiment of a valid data indexing system according to the present invention;

FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for indexing valid data according to the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative and are only for the purpose of explaining the present application and are not to be construed as limiting the present application. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.

The terms "first," "second," "third," and "fourth," etc. in the description and claims of the invention and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

To address the problems of the conventional valid data indexing methods, namely that many bits are required to index each valid data block and that resource usage is high, the invention provides a valid data indexing method and system. Starting from the original first tuples, the difference between the block number of each first tuple and the block number of the preceding first tuple is used as the new block number, which reduces the value of the block number and thus the number of bits required to represent it; the block count of the consecutive valid data blocks in each first tuple is then combined with the corresponding new block number to generate a second tuple, which further reduces the number of bits each second tuple requires to index its valid data blocks.

Hereinafter, a valid data indexing method and system according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings, and first, a valid data indexing method according to an embodiment of the present invention will be described with reference to the accompanying drawings.

Referring to fig. 1, an embodiment of the present invention provides an effective data indexing method, and the effective data indexing method in the embodiment of the present invention may be applied to a terminal, a server, software running in the terminal or the server, or the like. The terminal may be, but is not limited to, a tablet computer, a notebook computer, a desktop computer, and the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The effective data indexing method in the embodiment of the invention mainly comprises the following steps:

s101, setting the value of a first block number to be zero;

specifically, a block number variable (prev_block_num), i.e. the first block number, is set to 0.

S102, decompressing a first tuple to obtain a second block number and a first block count;

the second block number is the starting block address of the first tuple, and the first block count is the number of consecutive valid data blocks in the first tuple.

Specifically, referring to fig. 2, in the conventional valid data indexing method a position vector is an array of two-tuples (the first tuples), and each run of consecutive valid data blocks is represented as one tuple consisting of a starting block address and the length of the run. In an embodiment of the invention, the block number (block_num) and the block count (count) of a first tuple are returned by an unpack function.

In an embodiment of the present invention, the starting block address is expressed as a block number. If the block size is 2ⁿ (n>9), the block number occupies at most 64-n bits. The length of a run of consecutive adjacent valid data blocks is expressed as a block count, which likewise occupies at most 64-n bits.
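An unpack function of this kind might be sketched as follows. The text does not spell out the byte layout that unpack consumes, so the sketch assumes the same layout the description gives for stored block numbers (7 payload bits per byte, low-order bits first, high bit set when another byte follows); the names `unpack_first_tuple` and `offset` are hypothetical.

```python
def unpack(buf, offset=0):
    """Decode one integer starting at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated input")
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift  # store the low 7 bits at the current shift
        shift += 7
        if not byte & 0x80:               # high bit 0: this was the last byte
            return result, offset


def unpack_first_tuple(buf, offset=0):
    """Return (block_num, count, next_offset) for one stored tuple."""
    block_num, offset = unpack(buf, offset)
    count, offset = unpack(buf, offset)
    return block_num, count, offset


data = bytes([0xAC, 0x02, 0x05])
print(unpack_first_tuple(data))  # (300, 5, 3)
```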

S103, calculating the difference between the second block number and the first block number;

specifically, the difference between the second block number and the first block number is calculated (delta = block_num - prev_block_num).

S104, setting the value of the first block number to the value of the second block number;

specifically, after the difference delta between the second block number and the first block number has been calculated, the first block number is updated: the value of the second block number (block_num) is assigned to the first block number (prev_block_num).

S105, generating a second tuple from the difference and the first block count;

specifically, the difference calculated in step S103 is used as the block number of the second tuple, i.e. the third block number. Compared with the first tuple, the second tuple holds a smaller block number value, so fewer bits are required to represent it. The other component of the second tuple is its block count, i.e. the second block count, which is obtained by passing the first block count into the pack function. In the embodiment of the invention, the block count of the consecutive valid data blocks in each first tuple is combined with the corresponding block number to generate the second tuple, which further reduces the number of bits each second tuple requires to index its valid data blocks. When the original data is very large and the valid data blocks are very scattered, the block count in each second tuple is small, and the number of bits it requires is correspondingly small.

In the embodiment of the invention, the tuples are variable-length encoded, which further reduces the number of bits occupied by the position vector.

S105 may be further divided into the following steps S1051-S1053:

step S1051, passing the difference into a pack function to generate a third block number;

in the embodiment of the invention, the difference is passed into the pack function, which returns a first encoding buffer together with its length n; the first encoding buffer is then written into a target buffer to generate the third block number.

Specifically, referring to fig. 3, one byte is read, the low 7 bits of the byte are stored into the result at the current bit shift, and the bit shift is increased by 7. If the highest bit of the byte is 0, decoding of the difference is complete and the third block number is obtained; otherwise, the next byte is read and the operation is repeated.

Step S1052, passing the first block count into a pack function to generate a second block count;

in the embodiment of the invention, the first block count is passed into the pack function, which returns a second encoding buffer together with its length n; the second encoding buffer is then written into the target buffer to generate the second block count.

Specifically, referring to fig. 3, one byte is read, the low 7 bits of the byte are stored into the result at the current bit shift, and the bit shift is increased by 7. If the highest bit of the byte is 0, decoding of the first block count is complete and the second block count is obtained; otherwise, the next byte is read and the operation is repeated.

Step S1053, generating the second tuple from the third block number and the second block count.

In an embodiment of the invention, the second tuple is stored in little-endian byte order, with the third block number in the low-order bytes followed by the second block count in the high-order bytes.

Specifically, the third block number and the second block count are each encoded in little-endian binary form and stored in a variable number of bytes. When the highest bit of the current byte is 1, a further byte follows that records the remaining bits of the value; otherwise, the current byte is the last byte of the value. In other words, only the low 7 bits of each byte store information, and the high bit serves as a flag. For example:

when the third block number is in the range [0, 2⁷-1], 1B is needed, and the corresponding binary values run from 00000000 to 01111111;

when the third block number is in the range [128, 2¹⁴-1], 2B are needed, and the corresponding binary values run from 0000000110000000 to 0111111111111111.
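The byte counts and bit patterns in these ranges can be checked with a short sketch. It assumes the convention stated above (high bit 1 means a further byte follows); the helper `as_bits` is hypothetical and prints the last, high-order byte first, which is how the 2B patterns above appear to be written.

```python
def pack(value):
    """Encode a non-negative integer with 7 payload bits per byte, low bits first."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit 1: another byte follows
        value >>= 7
    out.append(value)                      # high bit 0: last byte
    return bytes(out)


def as_bits(encoded):
    """Render the bytes as a bit string with the last (high-order) byte first."""
    return "".join(f"{byte:08b}" for byte in reversed(encoded))


for value in (0, 2**7 - 1, 128, 2**14 - 1, 2**14):
    encoded = pack(value)
    print(value, len(encoded), as_bits(encoded))
```

For 128 this prints a length of 2 and the pattern 0000000110000000, and 2¹⁴ is the first value that needs 3B.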

S106, if the first tuples have not all been decompressed, returning to the step of decompressing a first tuple to obtain a second block number and a first block count;

specifically, if the first tuples have not all been decompressed, decompression continues with the remaining first tuples to obtain the starting block address of each, i.e. the second block number, and its number of valid data blocks, i.e. the first block count.

S107, if the first tuples have all been decompressed, completing the indexing of the valid data according to the second tuples.

Specifically, once all first tuples have been decompressed, the indexing of the valid data is completed according to the second tuples.
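Steps S101 to S107 can be put together in one minimal sketch. It is not the patented implementation: the first tuples are assumed to be already available as in-memory `(block_num, count)` pairs sorted by block number, the 7-bit layout is the one assumed from the description, and `encode_index` and `decode_index` are hypothetical names.

```python
def pack(value):
    """Encode a non-negative integer, 7 bits per byte, low-order bits first."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def unpack(buf, offset):
    """Decode one integer at offset; return (value, next_offset)."""
    result = shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if byte < 0x80:
            return result, offset


def encode_index(first_tuples):
    prev_block_num = 0                        # S101
    target = bytearray()
    for block_num, count in first_tuples:     # S102, repeated per S106
        delta = block_num - prev_block_num    # S103
        prev_block_num = block_num            # S104
        target += pack(delta) + pack(count)   # S105
    return bytes(target)                      # S107


def decode_index(buf):
    """Recover the (block_num, count) pairs from the packed second tuples."""
    tuples, offset, block_num = [], 0, 0
    while offset < len(buf):
        delta, offset = unpack(buf, offset)
        count, offset = unpack(buf, offset)
        block_num += delta
        tuples.append((block_num, count))
    return tuples


first = [(8, 4), (200, 1), (100000, 2048)]
index = encode_index(first)
print(len(index), "bytes")  # 10 bytes
assert decode_index(index) == first
```

With fixed 8B fields the same three tuples would take 48 bytes.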

In one embodiment of the present invention, the block size of the valid data blocks is 512B. When the original data is very large and the valid data blocks are very scattered or severely fragmented, the valid data is indexed by the second tuples as follows:

the number of bits required to index each valid data block in the worst case is:

(8+1)×8＝72bit

where 64-log₂512＝55 is the maximum number of bits occupied by the third block number, and packing a 55-bit value occupies 8B; the second block count is 1, indicating that there is only one valid data block, and packing it occupies 1B.
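The 8B and 1B figures follow from storing 7 payload bits per byte, as the sketch below shows. It is a worked check under that assumption; `packed_bytes` is a hypothetical helper.

```python
import math


def packed_bytes(bits):
    """Bytes needed when each byte carries 7 payload bits (at least one byte)."""
    return max(1, math.ceil(bits / 7))


block_size = 512
n = block_size.bit_length() - 1          # log2(512) = 9
block_num_bits = 64 - n                  # widest possible block number
count_bits = (1).bit_length()            # a block count of 1 needs one bit

worst_bytes = packed_bytes(block_num_bits) + packed_bytes(count_bits)
worst_bits = worst_bytes * 8
print(block_num_bits, packed_bytes(block_num_bits), packed_bytes(count_bits), worst_bits)
# 55 8 1 72
```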

If a conventional valid data indexing method is used instead:

(1) with a bitmap, if the size of the original data is 1TB (the maximum number of bits required by the valid data indexing method of the embodiment of the present invention is independent of the size of the original data), the size of the index bitmap is:

1TB/512B＝2³¹bit＝256MB

if only 1% of the original data is valid data, the utilization of the index bitmap is only 1%;

thus, indexing each valid data block requires 100 bits on average.
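The bitmap figures can be reproduced with a few lines of arithmetic. This is a worked check of the stated premises (1TB of original data, 512B blocks, 1% valid), taking 1TB as 2⁴⁰ bytes; the variable names are illustrative.

```python
TB = 2**40
MB = 2**20
block_size = 512
valid_percent = 1                         # 1% of the blocks are valid

total_blocks = TB // block_size           # the bitmap holds one bit per block
bitmap_bytes = total_blocks // 8
bits_per_valid_block = 100 // valid_percent

print(total_blocks == 2**31)              # True
print(bitmap_bytes // MB, "MB")           # 256 MB
print(bits_per_valid_block, "bits per valid block")  # 100 bits per valid block
```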

(2) with a position vector composed of first tuples:

in the worst case, each valid data block corresponds to one first tuple. If the start address of the first tuple occupies 8B and the length occupies 8B (the maximum number of bits required by the valid data indexing method of the embodiment of the present invention is independent of the field length of the tuple, and the fewer valid data blocks a second tuple covers, i.e. the smaller the second block count, the fewer bits are required), the number of bits required to index each valid data block in the worst case is:

(8+8)×8＝128bit

therefore, in the worst case the valid data indexing method of the embodiment of the invention uses 28% fewer bits than the bitmap and about 43% fewer bits than the position vector composed of first tuples.
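The two savings can be recomputed as below, taking 8B + 1B = 9B as the worst-case size of a second tuple (an assumption derived from the 8B and 1B figures given earlier). The helper `saving` is hypothetical.

```python
def saving(new_bits, old_bits):
    """Fraction of bits saved when old_bits is replaced by new_bits."""
    return 1 - new_bits / old_bits


second_tuple_bits = (8 + 1) * 8      # assumed worst case: 8B block number + 1B count
bitmap_bits = 100                    # bitmap with 1% valid data
position_vector_bits = (8 + 8) * 8   # fixed 8B start + 8B length

print(f"{saving(second_tuple_bits, bitmap_bits):.2%}")           # 28.00%
print(f"{saving(second_tuple_bits, position_vector_bits):.2%}")  # 43.75%
```

The second value, 43.75%, corresponds to the 43% quoted in the text.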

In an embodiment of the invention, the more consecutive adjacent valid data blocks there are, the fewer second tuples are needed. Taking a block size of 512B, an average run of consecutive adjacent blocks of 1MB, and original data no larger than 1TB as an example, the average number of bits required to index each valid data block with this method is:

(5+2)×8/2048≈0.027bit

where, since the original data is no larger than 1TB, there are at most 1TB/512B＝2³¹ blocks, so a block number occupies at most 31 bits, and packing it occupies 5B; the average run of consecutive blocks is 1MB, i.e. 1MB/512B＝2048 blocks, and packing this block count occupies 2B.
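The average case can be worked through from the stated premises (512B blocks, 1MB average run, at most 1TB of original data). The sketch assumes the 7-bit-per-byte packing described earlier and deliberately uses the largest possible block number rather than a typical difference; `pack_len` is a hypothetical helper.

```python
def pack_len(value):
    """Number of bytes the 7-bit-per-byte encoding needs for value."""
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length


block_size = 512
max_block_num = 2**40 // block_size - 1   # largest block number within 1TB
run_blocks = 2**20 // block_size          # blocks in an average 1MB run

tuple_bits = (pack_len(max_block_num) + pack_len(run_blocks)) * 8
print(pack_len(max_block_num), pack_len(run_blocks), tuple_bits)  # 5 2 56
print(tuple_bits / run_blocks)                                    # 0.02734375
```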

Next, a valid data indexing system proposed according to an embodiment of the present application is described with reference to the accompanying drawings.

FIG. 4 is a schematic structural diagram of a valid data indexing system according to an embodiment of the present application.

The system specifically comprises:

a first block number assignment module 401, configured to set the value of a first block number to zero, and to set the value of the first block number to the value of the second block number;

a first tuple decompression module 402, configured to decompress a first tuple to obtain a second block number and a first block count;

a difference calculation module 403, configured to calculate the difference between the second block number and the first block number;

a second tuple generating module 404, configured to generate a second tuple from the difference and the first block count;

a determining module 405, configured to return to the step of decompressing a first tuple to obtain a second block number and a first block count when the first tuples have not all been decompressed, and to complete the indexing of the valid data according to the second tuples when the first tuples have all been decompressed.

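In an embodiment of the present invention, the second tuple generating module includes:

a third block number generating module, configured to pass the difference into a pack function to generate a third block number;

a second block count generating module, configured to pass the first block count into a pack function to generate a second block count.

In an embodiment of the present invention, the third block number generating module includes:

a first encoding buffer returning module, configured to pass the difference into the pack function, which returns a first encoding buffer;

a first writing module, configured to write the first encoding buffer into a target buffer to generate the third block number.

In an embodiment of the present invention, the second block count generating module includes:

a second encoding buffer returning module, configured to pass the first block count into the pack function, which returns a second encoding buffer;

a second writing module, configured to write the second encoding buffer into the target buffer to generate the second block count.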

It can be seen that the contents in the foregoing method embodiments are all applicable to this system embodiment, the functions specifically implemented by this system embodiment are the same as those in the foregoing method embodiment, and the advantageous effects achieved by this system embodiment are also the same as those achieved by the foregoing method embodiment.

Referring to fig. 5, an embodiment of the present application provides an effective data indexing apparatus, including:

at least one processor 501;

at least one memory 502 for storing at least one program;

the at least one program, when executed by the at least one processor 501, causes the at least one processor 501 to implement the valid data indexing method.

Similarly, the contents of the method embodiments are all applicable to the apparatus embodiments, the functions specifically implemented by the apparatus embodiments are the same as the method embodiments, and the beneficial effects achieved by the apparatus embodiments are also the same as the beneficial effects achieved by the method embodiments.

In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present application are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.

Furthermore, although the present application is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion regarding the actual implementation of each module is not necessary for an understanding of the present application. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the present application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the application, which is defined by the appended claims and their full scope of equivalents.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium, which includes programs for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable programs that can be considered for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with a program execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the programs from the program execution system, apparatus, or device and execute the programs. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the program execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be captured electronically, for instance by optically scanning the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable program execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
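As one illustration of a software implementation, the following minimal Python sketch follows the steps summarized in the abstract: start from a previous block number of zero, take the difference between each run's block number and the previous one, and emit a second tuple built from that difference and the run's block count. It rests on assumptions not confirmed by this section: each first tuple is read as (starting block number, number of blocks), the tuples are taken as already decompressed and sorted by ascending block number, and `pack` is a hypothetical stand-in (an unsigned variable-length integer encoding) for the pack function named in the claims, whose actual definition is not reproduced here. The name `build_second_tuples` is likewise illustrative.

```python
from typing import Iterable, List, Tuple


def pack(value: int) -> bytes:
    """Hypothetical stand-in for the 'pack function' (assumption):
    encode a non-negative integer as an unsigned LEB128 varint,
    so that small values occupy fewer bytes."""
    if value < 0:
        raise ValueError("pack expects a non-negative integer")
    out = bytearray()
    while True:
        byte = value & 0x7F   # low 7 bits of the remaining value
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte
            return bytes(out)


def build_second_tuples(
    first_tuples: Iterable[Tuple[int, int]],
) -> List[Tuple[bytes, bytes]]:
    """Turn (start_block, block_count) pairs into (packed_diff, packed_count).

    Assumes the input is sorted by ascending start_block, so every
    difference is non-negative.
    """
    previous = 0  # the block number carried between iterations starts at zero
    second: List[Tuple[bytes, bytes]] = []
    for start, count in first_tuples:
        diff = start - previous   # distance from the previous run's block number
        previous = start          # remember this block number for the next run
        second.append((pack(diff), pack(count)))
    return second


if __name__ == "__main__":
    runs = [(1000, 3), (1010, 2)]
    for packed_diff, packed_count in build_second_tuples(runs):
        print(packed_diff.hex(), packed_count.hex())
```

In this sketch the second run's block number 1010 is stored as the difference 10, which fits in one byte under the assumed encoding, whereas 1010 itself would need two. This mirrors the stated aim of reducing the number of bits needed per indexed block.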

In the foregoing description of the specification, reference to the description of "one embodiment/example," "another embodiment/example," or "certain embodiments/examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: numerous changes, modifications, substitutions and variations can be made to the embodiments without departing from the principles and spirit of the application, the scope of which is defined by the claims and their equivalents.

While the present application has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for efficient data indexing, comprising the steps of:

setting the value of the first block number to zero;

2. The method of claim 1, wherein the generating the second tuple according to the difference value and the first block number comprises:

and generating the second tuple according to the third block number and the second block number.

3. The method according to claim 2, wherein said passing said difference value into a pack function to generate a third block number comprises:

4. The method of claim 1, wherein the passing the first block number into a pack function to generate a second block number comprises:

5. A valid data indexing system, comprising:

6. The valid data indexing system of claim 5, wherein the second tuple generation module comprises:

7. The valid data indexing system of claim 6, wherein the third block number generation module comprises:

8. The valid data indexing system of claim 6, wherein the second block number generation module comprises:

9. An apparatus for indexing valid data, comprising:

at least one processor;

at least one memory for storing at least one program;

wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method for efficient data indexing as claimed in any one of claims 1 to 4.

10. A storage medium having stored therein a program executable by a processor, wherein the program, when executed by the processor, implements the method for efficient data indexing as claimed in any one of claims 1 to 4.