WO2023108360A1 - Method and apparatus for managing data in storage system - Google Patents

Method and apparatus for managing data in storage system

Publication number
WO2023108360A1
Authority
WO
WIPO (PCT)
Prior art keywords
target data
data
data blocks
length
window
Prior art date
Application number
PCT/CN2021/137522
Other languages
French (fr)
Chinese (zh)
Inventor
张海波
郭小东
唐飞龙
李旭
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2021/137522
Publication of WO2023108360A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation

Definitions

  • the embodiments of the present application relate to the field of data storage, and in particular, to a data management method and device in a storage system.
  • Merkle-DAG: Merkle directed acyclic graph
  • Merkle-DAG is composed of at least one tree, that is, data management is based on the tree structure.
  • a method for managing data in a storage system based on Merkle-DAG is: divide the target data (that is, the object to be managed) into multiple data blocks according to a fixed size (such as a fixed number of bytes), and calculate the hash value of each data block; then, store the multiple data blocks and their hash values according to the correspondence between the data blocks and the hash values of the data blocks.
  • the Merkle-DAG of the target data (called an index tree) is then generated, and this index tree is saved.
  • when the application layer updates the target data (such as data insertion, data modification, and data deletion), the total data volume of the target data may change. If the target data is divided into blocks according to the above-mentioned fixed size, the number of divided data blocks may change, which in turn leads to a large change in the grouping of the data blocks, so that the index tree needs to be rebuilt. Therefore, more resources are consumed in the process of data management.
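The fixed-size scheme described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: SHA-256 as the hash function, the dictionary used as the "preset table", and the names `fixed_size_blocks` and `build_block_table` are all assumptions made for the example.

```python
import hashlib

def fixed_size_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Split data into consecutive blocks of block_size bytes (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def build_block_table(blocks: list[bytes]) -> dict[str, bytes]:
    """Store each block under its hash value (SHA-256 is an assumed choice)."""
    return {hashlib.sha256(block).hexdigest(): block for block in blocks}

# 1024 bytes of sample data whose four 256-byte blocks all differ.
target_data = bytes((i * 7 + i // 256) % 256 for i in range(1024))
blocks = fixed_size_blocks(target_data, 256)
block_table = build_block_table(blocks)
```

Because block boundaries depend only on byte offsets, any insertion that is not a multiple of the block size shifts the content of every later block, which is the weakness the following paragraphs address.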
  • Embodiments of the present application provide a data management method and device in a storage system, which can save resources required for data management.
  • an embodiment of the present application provides a data management method and device in a storage system, the method comprising: dividing the target data into M candidate data blocks based on the content of the target data, where M is an integer greater than or equal to 2;
  • the M candidate data blocks are divided into N target data blocks, where N is a positive integer less than or equal to M and each target data block includes at least one candidate data block; the N target data blocks and the fingerprint features of the N target data blocks are stored, the target data blocks and their fingerprint features being in one-to-one correspondence; and an index tree of the target data is generated according to the N target data blocks, the index tree being used to address the content of the target data.
  • the data management device divides the target data into M candidate data blocks based on the content of the target data, then divides the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks, and finally generates an index tree for addressing the target data according to the N target data blocks.
  • the target data is first divided into multiple candidate data blocks based on the content of the target data, and then the candidate data blocks are divided into target data blocks according to the fingerprint features of the candidate data blocks, instead of dividing the target data into multiple target data blocks according to a fixed size, so the size of the target data block in the embodiment of the present application is not specifically limited.
  • after data is inserted into the target data and the updated target data is divided into target data blocks, the number of target data blocks does not necessarily change. Therefore, to a certain extent, the resources consumed by data management can be saved.
  • the above-mentioned division of the target data into M candidate data blocks based on the content of the target data specifically includes: determining M-1 division points of the target data according to the fingerprint features of the target data; and dividing the target data into M candidate data blocks according to the M-1 division points.
  • the determination of the M-1 division points of the target data according to the fingerprint features of the target data includes: for any one of the above-mentioned M-1 division points, when the fingerprint feature of the first data in the sliding window satisfies the first preset condition, the end position of the sliding window is determined as the division point; the first data is part of the target data, and the first preset condition is that the modulo value of the fingerprint feature of the target data in the sliding window with respect to the first threshold is equal to the second threshold.
  • the determination of the M-1 division points of the target data according to the fingerprint features of the target data includes: for any one of the above-mentioned M-1 division points, when the fingerprint feature of the first data in the sliding window does not satisfy the first preset condition, the sliding window is slid along the preset direction by a preset length, and when the fingerprint feature of the second data in the sliding window satisfies the first preset condition, the end position of the sliding window is determined as a division point; the second data is part of the target data and is different from the above-mentioned first data; the preset length is less than or equal to the length of the sliding window, and the first preset condition is that the modulo value of the fingerprint feature of the target data in the sliding window with respect to the first threshold is equal to the second threshold.
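The sliding-window rule above can be sketched as follows. This is a hedged illustration only: the fingerprint (first 8 bytes of SHA-256), the default window length, step, and thresholds, and the choice to restart the window at each division point are assumptions; the names `window_fingerprint`, `find_division_points`, and `split_at` are hypothetical.

```python
import hashlib

def window_fingerprint(window: bytes) -> int:
    """Fingerprint of the data in the window (assumed: first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(window).digest()[:8], "big")

def find_division_points(data: bytes, window_len: int = 16, step: int = 1,
                         first_threshold: int = 64, second_threshold: int = 0) -> list[int]:
    """Return window end positions where fingerprint % first_threshold == second_threshold."""
    if not 1 <= step <= window_len:
        raise ValueError("preset length must be between 1 and the window length")
    points = []
    start = 0
    while start + window_len <= len(data):
        end = start + window_len
        if window_fingerprint(data[start:end]) % first_threshold == second_threshold:
            if end < len(data):      # the end of the data is not a division point
                points.append(end)
            start = end              # assumption: next window starts at the division point
        else:
            start += step            # slide by the preset length
    return points

def split_at(data: bytes, points: list[int]) -> list[bytes]:
    """Cut data into candidate blocks at the given division points."""
    bounds = [0, *points, len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]
```

Since a division point depends only on the bytes inside the window, inserting data elsewhere in the target data tends to leave most other division points where they were, which is the property the method relies on.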
  • the above-mentioned division of the target data into M candidate data blocks based on the content of the target data includes: determining M-1 division points of the target data according to the transformation values of the target data, where a transformation value is a numerical value obtained by converting each data in the preset window based on a preset rule; and dividing the target data into M candidate data blocks according to the M-1 division points.
  • the above-mentioned determination of the M-1 division points of the target data according to the transformation values of the target data includes: the preset window includes a first fixed-length window, a variable-length window, and a second fixed-length window; the length of the second fixed-length window is the length of one data in the target data or the length of the transformation value of one data in the target data; for any one of the above-mentioned M-1 division points, when the transformation value of the data in the second fixed-length window satisfies the second preset condition, the end position of the second fixed-length window is determined as the division point, where the second preset condition is that the transformation value of the data in the second fixed-length window is greater than the maximum of the transformation values of the data in the first fixed-length window and greater than the maximum of the transformation values of the data in the variable-length window.
  • the above-mentioned determination of the M-1 division points of the target data according to the transformation values of the target data includes: the above-mentioned preset window includes the first fixed-length window, the variable-length window, and the second fixed-length window; the length of the second fixed-length window is the length of one data in the above-mentioned target data or the length of the transformation value of one data in the target data; for any division point, when the transformation value of the data in the second fixed-length window does not satisfy the second preset condition, the length of the variable-length window in the preset window is increased, and when the transformation value of the data in the second fixed-length window satisfies the second preset condition, the end position of the second fixed-length window is determined as a division point, where the second preset condition is that the transformation value of the data in the second fixed-length window is greater than the maximum of the transformation values of the data in the first fixed-length window and greater than the maximum of the transformation values of the data in the variable-length window.
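A minimal sketch of the three-window rule above, under stated assumptions: each "data" is taken to be one byte, the transformation value is taken to be the byte's integer value, the three windows are adjacent in the order first fixed-length, variable-length, second fixed-length, and the next search starts at the division point just found. The name `find_extremum_points` and the default window length are hypothetical.

```python
def find_extremum_points(data: bytes, first_len: int = 4) -> list[int]:
    """Return division points found with the first-fixed / variable / second-fixed window rule."""
    points = []
    start = 0
    n = len(data)
    while start + first_len < n:
        first_max = max(data[start:start + first_len])  # max over the first fixed-length window
        var_max = -1                                    # variable-length window starts empty
        pos = start + first_len                         # the one-byte second fixed-length window
        while pos < n:
            value = data[pos]
            if value > first_max and value > var_max:   # second preset condition
                break
            var_max = max(var_max, value)               # grow the variable-length window by one
            pos += 1
        if pos + 1 >= n:                                # no point found, or it is the end of data
            break
        points.append(pos + 1)                          # end position of the second window
        start = pos + 1
    return points

sample = bytes([5, 3, 4, 1, 2, 9, 1, 1, 1, 1, 0, 2, 7])
points = find_extremum_points(sample)   # [6, 12]
```

In the sample, 9 exceeds every value in the first window `[5, 3, 4, 1]` and in the variable window `[2]`, so the first division point falls just after it; the same reasoning places the second one after the byte 2 at offset 11.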
  • the division of the M candidate data blocks into N target data blocks specifically includes: determining the end positions of those candidate data blocks, among the M candidate data blocks, whose fingerprint features satisfy the third preset condition as N-1 division points of the candidate data blocks, where the third preset condition is that the modulo value of the fingerprint feature of the data of the candidate data block with respect to the third threshold is within a preset range; and dividing the M candidate data blocks into N target data blocks according to the N-1 division points.
  • the fingerprint feature of the data in the candidate data block is the fingerprint feature of all the data in the candidate data block; or, the fingerprint feature of the data in the candidate data block is the fingerprint feature of some data in the candidate data block.
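The second-level division, in which candidate data blocks are merged into target data blocks, can be sketched as below. The fingerprint function, the third threshold value, and the preset range are assumed values chosen for illustration, and `merge_candidates` is a hypothetical name.

```python
import hashlib

def fingerprint(block: bytes) -> int:
    """Fingerprint of all the data in a block (assumed: first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(block).digest()[:8], "big")

def merge_candidates(candidates: list[bytes], third_threshold: int = 4,
                     preset_range: range = range(0, 1)) -> list[bytes]:
    """Merge consecutive candidate blocks; a target block ends at each candidate
    whose fingerprint modulo third_threshold falls inside preset_range."""
    targets = []
    current = []
    for block in candidates:
        current.append(block)
        if fingerprint(block) % third_threshold in preset_range:
            targets.append(b"".join(current))
            current = []
    if current:                      # trailing candidates form the last target block
        targets.append(b"".join(current))
    return targets

candidates = [bytes([i]) * (8 + i) for i in range(20)]
targets = merge_candidates(candidates)
```

Each target block consists of one or more whole candidate blocks, so N never exceeds M and the concatenation of the target blocks reproduces the original data.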
  • the above-mentioned generation of the index tree of the target data based on the N target data blocks specifically includes: dividing the N target data blocks into at least one data group according to the respective fingerprint features of the N target data blocks; and generating an index tree of the target data based on the respective fingerprint features of the data groups.
  • the division of the N target data blocks into at least one data group specifically includes: determining the target data blocks, among the N target data blocks, whose fingerprint features satisfy the fourth preset condition as at least one division point of the N target data blocks; and dividing the N target data blocks into multiple data groups according to the at least one division point.
  • the fourth preset condition is: a modulo value between the fingerprint feature of the target data block and the fourth threshold is greater than or equal to the fifth threshold.
  • the above fourth preset condition is: the modulo value between the fingerprint feature of the target data block and the fourth threshold is greater than or equal to the fifth threshold, and the number of target data blocks in the data group is between the first number threshold and the second number threshold, where the first number threshold is greater than the second number threshold.
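One way to read the grouping rule above is sketched here. The threshold values are assumed, and closing a group once it reaches the first (upper) number threshold even without a fingerprint hit is an interpretation of the count constraint rather than something the text states; `group_blocks` is a hypothetical name.

```python
def group_blocks(fingerprints: list[int], fourth_threshold: int = 8,
                 fifth_threshold: int = 6, min_count: int = 2,
                 max_count: int = 4) -> list[list[int]]:
    """Group target-block fingerprints. A group is closed at a block whose
    fingerprint % fourth_threshold >= fifth_threshold, provided the group holds at
    least min_count blocks (second number threshold); a group is also closed when
    it reaches max_count blocks (first number threshold)."""
    groups = []
    current = []
    for fp in fingerprints:
        current.append(fp)
        hit = fp % fourth_threshold >= fifth_threshold
        if (hit and len(current) >= min_count) or len(current) >= max_count:
            groups.append(current)
            current = []
    if current:                      # remaining blocks form the final group
        groups.append(current)
    return groups

groups = group_blocks([1, 7, 2, 3, 6, 0, 0, 0, 0, 15, 14])
# [[1, 7], [2, 3, 6], [0, 0, 0, 0], [15, 14]]
```

The fingerprint 15 satisfies the modulo condition but cannot close a group on its own, because that group would hold fewer than `min_count` blocks.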
  • the fingerprint feature is a hash value.
  • the embodiment of the present application provides a data management device in a storage system
  • the data management device includes: a processing module, a storage module, and a generation module; the processing module is used to divide the target data into M candidate data blocks based on the content of the target data, where M is an integer greater than or equal to 2; the processing module is also used to divide the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks, where N is a positive integer less than or equal to M and each target data block includes at least one candidate data block; the storage module is used to store the N target data blocks and the fingerprint features of the N target data blocks, the target data blocks and their fingerprint features being in one-to-one correspondence; the generation module is used to generate an index tree of the target data according to the N target data blocks, and the index tree is used to address the content of the target data.
  • the determination module is configured to determine M-1 division points of the target data according to the fingerprint features of the target data; the processing module is specifically configured to divide the target data into M candidate data blocks according to the M-1 division points.
  • the data management device in the above-mentioned storage system further includes: a determination module; the determination module is configured to, when the fingerprint feature of the first data in the sliding window satisfies the first preset condition, determine the end position of the sliding window as a division point; the first data is part of the target data, and the first preset condition is that the modulo value between the fingerprint feature of the target data in the sliding window and the first threshold is equal to the second threshold.
  • the data management device in the above-mentioned storage system further includes: a sliding module; the sliding module is used to, for any one of the above-mentioned M-1 division points, slide the sliding window along the preset direction by a preset length when the fingerprint feature of the first data in the sliding window does not satisfy the first preset condition; when the fingerprint feature of the second data in the sliding window satisfies the first preset condition, the determination module determines the end position of the sliding window as a division point; the second data is part of the target data and is different from the above-mentioned first data; the preset length is less than or equal to the length of the sliding window, and the first preset condition is that the modulo value between the fingerprint feature of the target data in the sliding window and the first threshold is equal to the second threshold.
  • the data management device in the above-mentioned storage system further includes: a determination module; the determination module is configured to determine M-1 division points of the target data according to the transformation values of the above-mentioned target data, where a transformation value is a numerical value obtained by converting each data in the preset window based on a preset rule; the processing module is used to divide the target data into M candidate data blocks according to the M-1 division points.
  • the preset window includes a first fixed-length window, a variable-length window, and a second fixed-length window that are sequentially adjacent;
  • the length of the second fixed-length window is the length of one data in the target data or the length of the transformation value of one data in the target data;
  • the above-mentioned processing module is used to determine the end position of the second fixed-length window as the division point when the transformation value of the data in the second fixed-length window satisfies the second preset condition, where the second preset condition is that the transformation value of the data in the second fixed-length window is greater than the maximum of the transformation values of the data in the first fixed-length window and greater than the maximum of the transformation values of the data in the variable-length window.
  • the preset window includes a first fixed-length window, a variable-length window, and a second fixed-length window that are sequentially adjacent;
  • the length of the second fixed-length window is the length of one data in the above-mentioned target data or the length of the transformation value of one data in the target data;
  • the processing module is used to increase the length of the variable-length window in the preset window when the transformation value of the data in the second fixed-length window does not satisfy the second preset condition, and to determine the end position of the second fixed-length window as the division point when the transformation value of the data in the second fixed-length window satisfies the second preset condition, where the second preset condition is that the transformation value of the data in the second fixed-length window is greater than the maximum of the transformation values of the data in the first fixed-length window and greater than the maximum of the transformation values of the data in the variable-length window.
  • the determination module is used to determine the end positions of the candidate data blocks whose fingerprint features satisfy the third preset condition among the M candidate data blocks as the N-1 division points of the candidate data blocks, where the third preset condition is that the modulo value of the fingerprint feature of the data of the candidate data block with respect to the third threshold is within the preset range; the processing module is used to divide the M candidate data blocks into N target data blocks according to the N-1 division points.
  • the above-mentioned processing module is used to divide the N target data blocks into at least one data group according to the respective fingerprint features of the above-mentioned N target data blocks; the generation module is used to generate an index tree of the target data based on the respective fingerprint features of the data groups.
  • the determination module is configured to determine the target data blocks, among the N target data blocks, whose fingerprint features satisfy the fourth preset condition as at least one division point of the N target data blocks; the processing module is used for dividing the N target data blocks into multiple data groups according to the at least one division point.
  • the fourth preset condition is: a modulo value between the fingerprint feature of the target data block and the fourth threshold is greater than or equal to the fifth threshold.
  • the above fourth preset condition is: the modulo value between the fingerprint feature of the target data block and the fourth threshold is greater than or equal to the fifth threshold, and the number of target data blocks in the data group is between the first number threshold and the second number threshold, where the first number threshold is greater than the second number threshold.
  • the fingerprint feature is a hash value.
  • an embodiment of the present application provides a data management device in a storage system, including a memory and a processor, wherein the memory is coupled to the processor; the memory is used to store computer program code, and the computer program code includes computer instructions; when the computer instructions are executed by the processor, the data management device in the storage system is caused to execute the method described in any one of the first aspect and its possible implementation manners.
  • an embodiment of the present application provides a computer storage medium, including computer instructions; when the computer instructions run on a computing device, the computing device is caused to execute the method described in any one of the first aspect and its possible implementations.
  • the embodiments of the present application provide a computer program product, which, when run on a computer, causes the computer to execute the method described in any one of the above first aspect and possible implementations thereof.
  • FIG. 1 is a schematic diagram of a blocking and grouping process of target data provided by an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of an index tree provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a construction process of an index tree provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a hardware structure of a storage system provided by an embodiment of the present application.
  • FIG. 5 is a first schematic flowchart of a data management method in a storage system provided by an embodiment of the present application
  • FIG. 6 is a second schematic flow diagram of a data management method in a storage system provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a division process of a candidate data block provided in an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a method for dividing candidate data blocks provided by an embodiment of the present application.
  • FIG. 9 is a third schematic flowchart of a data management method in a storage system provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of another candidate data block division process provided by the embodiment of the present application.
  • FIG. 11 is a schematic flowchart of another method for dividing candidate data blocks provided by the embodiment of the present application.
  • FIG. 12 is a fourth schematic flowchart of a data management method in a storage system provided by an embodiment of the present application.
  • FIG. 13 is a fifth schematic flowchart of a data management method in a storage system provided by an embodiment of the present application.
  • FIG. 14 is a sixth schematic flow diagram of a data management method in a storage system provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a data management device in a storage system provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of another data management device in a storage system provided by an embodiment of the present application.
  • the terms "first" and "second" in the description and claims of the embodiments of the present application are used to distinguish different objects, rather than to describe a specific order of objects.
  • for example, the first threshold and the second threshold are used to distinguish different thresholds, not to describe a specific order of the thresholds.
  • words such as “exemplary” or “for example” are used to present examples, instances, or illustrations. Any embodiment or design scheme described as “exemplary” or “for example” in the embodiments of the present application shall not be interpreted as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of words such as “exemplary” or “for example” is intended to present related concepts in a concrete manner.
  • plurality means two or more.
  • a plurality of data groups refers to two or more data groups.
  • Merkle-DAG is composed of at least one tree, that is, data management is based on the tree structure.
  • the method of managing data in the storage system based on Merkle-DAG is as follows: first, divide the target data into multiple data blocks according to a fixed size. Assuming that the size of the target data is 1024 bytes, as shown in (A) in FIG. 1, the target data is divided into 4 data blocks of 256 bytes each, that is, data block 1 to data block 4 shown in (B) in FIG. 1. Secondly, calculate the hash value of each data block separately; for example, the hash values of data block 1 to data block 4 are hash value 1 to hash value 4. The data blocks and their hash values are stored in the preset table according to the correspondence between the data blocks and the hash values, and Table 1 below is an example of the correspondence between data block 1 to data block 4 and their hash values.
  • the first index tree of the above target data is generated as follows: hash value 1 to hash value 4, corresponding to data block 1 to data block 4 into which the target data is divided, are used as the leaf nodes of the first index tree, which can be denoted leaf node 1, leaf node 2, leaf node 3, and leaf node 4; then, the hash values of the leaf nodes in the two data groups into which these 4 leaf nodes are divided are calculated, that is, the hash value of hash value 1 and hash value 2, and the hash value of hash value 3 and hash value 4;
  • the hash value of hash value 1 and hash value 2 is used as the parent node of leaf node 1 and leaf node 2 (called the first parent node; the first parent node is a child node in the first index tree),
  • and the hash value of hash value 3 and hash value 4 is used as the parent node of leaf node 3 and leaf node 4 (called the second parent node; the second parent node is another child node in the first index tree)
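The construction of the first index tree can be sketched as follows. SHA-256, hashing the concatenation of the child hashes, the dictionary node layout, and deriving the root node from the parent hashes are assumptions made for illustration; `build_index_tree` is a hypothetical name.

```python
import hashlib

def h(data: bytes) -> bytes:
    """Hash helper (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).digest()

def build_index_tree(groups: list[list[bytes]]) -> dict:
    """Build a two-level index tree. Each group is a list of leaf hash values;
    each parent node holds the hash of its children's hashes, and the root
    holds the hash of the parents' hashes."""
    parents = []
    for group in groups:
        leaves = [{"hash": leaf_hash, "children": []} for leaf_hash in group]
        parents.append({"hash": h(b"".join(group)), "children": leaves})
    root_hash = h(b"".join(parent["hash"] for parent in parents))
    return {"hash": root_hash, "children": parents}

# Hash value 1 to hash value 4 of data block 1 to data block 4 (sample content).
leaf_hashes = [h(b"data block %d" % i) for i in range(1, 5)]
tree = build_index_tree([leaf_hashes[:2], leaf_hashes[2:]])
```

Here `tree["children"][0]` plays the role of the first parent node and `tree["children"][1]` the second parent node.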
  • when the target data is updated, the total data volume of the target data may change. For example, with reference to FIG.
  • the data volume of the target data is increased from 1024 bytes to 1280 bytes. If the target data is divided into blocks with the size of each data block fixed at 256 bytes, 5 data blocks are obtained; after the 5 data blocks are grouped according to the above grouping method, 3 data groups are obtained.
  • Data group 1 of the 3 data groups includes data block 1 and data block 5
  • data group 2 includes data block 2 and data block 3.
  • Data group 3 includes data block 4. Before the new data is inserted, the target data is divided into two data groups: data group 1 includes data block 1 and data block 2, and data group 2 includes data block 3 and data block 4. It can be seen that the change in the total data volume of the target data results in a large change in the grouping of the target data. In this way, the second index tree shown in Figure 3 is generated based on the hash values of data blocks 1-5.
  • compared with the first index tree, the second index tree changes considerably (the shaded part in Figure 3 is the part that has changed). Therefore, generating the index tree for the target data after the data is inserted takes a large amount of time and computation, which leads to the consumption of more resources in the process of data management in the above-mentioned prior art.
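The effect described above can be checked with a short sketch. It assumes SHA-256, groups of two blocks, and that the new 256-byte block (data block 5) is inserted right after data block 1, which matches the grouping listed above; the helper names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 256   # fixed block size from the example above
GROUP_SIZE = 2     # two data blocks per data group, as in the example above

def block_hashes(data: bytes) -> list[str]:
    """Hash every fixed-size block of the data."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

def group_hashes(hashes: list[str]) -> list[str]:
    """Hash every group of GROUP_SIZE consecutive block hashes."""
    groups = [hashes[i:i + GROUP_SIZE] for i in range(0, len(hashes), GROUP_SIZE)]
    return [hashlib.sha256("".join(group).encode()).hexdigest() for group in groups]

old_data = bytes((i * 7 + i // 256) % 256 for i in range(1024))   # 4 distinct blocks
inserted = b"\xff" * 256                                           # data block 5
new_data = old_data[:256] + inserted + old_data[256:]

old_groups = group_hashes(block_hashes(old_data))   # groups (1,2) and (3,4)
new_groups = group_hashes(block_hashes(new_data))   # groups (1,5), (2,3) and (4)
reused_groups = set(old_groups) & set(new_groups)   # empty: no parent node survives
```

Even though the four original blocks are byte-for-byte unchanged, none of the group-level hashes can be reused, so every parent node and the root of the index tree must be recomputed.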
  • the embodiment of the present application provides a data management method and device in a storage system.
  • the data management device divides the target data into M candidate data blocks based on the content of the target data, where M is an integer greater than or equal to 2; according to the respective fingerprint features of the M candidate data blocks, the M candidate data blocks are divided into N target data blocks, where N is a positive integer less than or equal to M and each target data block includes at least one candidate data block; the N target data blocks and the fingerprint features of the N target data blocks are stored, the target data blocks and their fingerprint features being in one-to-one correspondence; an index tree of the target data is generated according to the N target data blocks, and the index tree is used to address the content of the target data.
  • the data management method and device in the storage system provided by the embodiment of the present application can be applied to the storage system shown in Figure 4.
  • the storage system can be a storage system composed of solid-state drives, or a storage system composed of other types of storage media.
  • the storage system includes a controller (abbreviated as main control) 401 and a plurality of hard disks 405, wherein the main control 401 includes a processor 402; optionally, the main control 401 also includes a host interface 404 and n (n>0) channel controllers 403.
  • the above-mentioned master control 401 is used to issue executable commands to multiple hard disks 405 , so as to read or update data on the hard disks 405 .
  • the above-mentioned host interface 404 is used to communicate with the host, receive the command request sent by the host, and forward the command request to the processor 402, wherein the host may be, but is not limited to, a device such as a server, a personal computer, or an array controller.
  • the above-mentioned processor 402 sends executable commands to the above-mentioned multiple hard disks 405 according to the command request forwarded by the host interface 404.
  • the above-mentioned processor 402 is used to execute the data management method in the storage system provided by the embodiment of the present application; for example, the processor 402 is used to divide the target data into blocks, group the data blocks, and generate an index tree.
  • the processor 402 may include one or more CPUs, and the CPUs may be single-core CPUs (single-CPU) or multi-core CPUs (multi-CPU).
  • the channel controller 403 is used to carry the executable commands issued by the processor 402 to the hard disk 405 .
  • the storage system further includes a bus 406, and the processor 402, the channel controller 403, the host interface 404, and the hard disk 405 are generally connected to each other through the bus 406, or are connected to each other in other ways.
  • the host interface 404 in the main control 401 forwards the target data to the processor 402 in the main control 401; the processor 402 divides the target data into M candidate data blocks and, according to the respective fingerprint features of the M candidate data blocks, divides the M candidate data blocks into N target data blocks; then, according to the correspondence between the target data blocks and the fingerprint features of the target data blocks, the processor 402 sends the N target data blocks and the fingerprint features of the N target data blocks to the hard disks 405 through the channel controllers 403 to be stored in the hard disks 405; finally, the processor 402 generates the index tree of the target data according to the N target data blocks, and stores the index tree in the hard disk 405.
  • the device for executing the data management method in the storage system may be the processor 402 in the controller in the storage system shown in FIG. 4 above.
  • the data management method in the storage system may include S510-S540.
  • the data management device divides the target data into M candidate data blocks based on the content of the target data.
  • M is an integer greater than or equal to 2.
  • the above M candidate data blocks may be M data blocks obtained when the data management device determines M-1 division points based on the content of the target data and then cuts the target data into M data blocks according to the M-1 division points;
  • alternatively, the data management device determines M-1 division points in the content of the target data, and the M-1 division points divide the target data into M intervals without the target data actually being cut at the M-1 division points; each interval is a candidate data block. The embodiment of the present application does not limit the division method of the above M candidate data blocks.
  • the content of the above-mentioned target data refers to all elements that make up the target data, and the sizes of the above-mentioned M candidate data blocks may be all the same, may be partly the same, or may be different from each other.
  • the data management device divides the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks.
  • N is a positive integer less than or equal to M.
  • the fingerprint feature of the above candidate data block may be the hash value of the candidate data block, or another feature that can uniquely identify a data block; it is determined according to actual needs and is not limited in the embodiments of the present application.
  • the above S520 is specifically: the data management device determines N-1 division points according to the respective fingerprint features of the M candidate data blocks, and then divides the M candidate data blocks into N target data blocks according to the N-1 division points, where each target data block includes at least one candidate data block.
  • the data management device stores N target data blocks and fingerprint features of the N target data blocks.
  • the above step is specifically: storing the fingerprint feature of a target data block and the content of that target data block in the same row of a preset table; or storing the fingerprint feature of the target data block and the index of the target data block in preset table 1, and storing the index of the target data block and the content of the target data block in preset table 2. The present application does not limit the manner in which the N target data blocks and the fingerprint features of the N target data blocks are stored.
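The two storage layouts described above can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 stands in for the unspecified fingerprint feature, Python dictionaries stand in for the preset tables, and the function names are hypothetical.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    # Assumption: SHA-256 stands in for the unspecified fingerprint feature.
    return hashlib.sha256(block).hexdigest()


def store_single_table(blocks):
    """Layout 1: fingerprint and block content share one row of one table."""
    return {fingerprint(block): block for block in blocks}


def store_two_tables(blocks):
    """Layout 2: table 1 maps fingerprint -> index, table 2 maps index -> content."""
    table1, table2 = {}, {}
    for block in blocks:
        fp = fingerprint(block)
        if fp not in table1:          # identical blocks are stored only once
            index = len(table2)
            table1[fp] = index
            table2[index] = block
    return table1, table2


blocks = [b"ABCDEF", b"GHIJK"]
single = store_single_table(blocks)
table1, table2 = store_two_tables(blocks)
```

In both layouts the fingerprint is the lookup key, so the content of a block can be retrieved later from the hash value held in a leaf node of the index tree.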
  • the data management device generates an index tree of the target data according to the N target data blocks, and the index tree is used for addressing content of the target data.
  • the corresponding index tree of the target data needs to be determined first; the leaf nodes of the index tree are then searched recursively starting from the root node of the index tree, and the content of the data block corresponding to each leaf node is looked up according to the hash value in that leaf node, so as to obtain the target data.
  • the data management device searches the child nodes (the first parent node and the second parent node) of the index tree according to the root node of the index tree, then finds the 4 leaf nodes according to the child nodes, and finally queries the preset table for the content of the data blocks corresponding to the 4 hash values in the 4 leaf nodes, so as to obtain the target data.
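The lookup just described (root node, two parent nodes, four leaf nodes, preset table) can be sketched as below. The `Node` class, the use of SHA-256, the rule that a parent hashes the concatenation of its children's hash values, and the sample blocks are all assumptions made for illustration; the source does not fix any of them.

```python
import hashlib
from dataclasses import dataclass, field


def h(data: bytes) -> str:
    # Assumption: SHA-256 is used as the hash function.
    return hashlib.sha256(data).hexdigest()


@dataclass
class Node:
    hash_value: str
    children: list = field(default_factory=list)


def make_parent(children):
    # Assumption: a parent's hash covers the concatenated hashes of its children.
    joined = "".join(child.hash_value for child in children).encode()
    return Node(h(joined), list(children))


def read_target_data(node, table):
    """Walk from the given node down to the leaves and fetch block contents."""
    if not node.children:                      # leaf: look up content by hash
        return table[node.hash_value]
    return b"".join(read_target_data(child, table) for child in node.children)


blocks = [b"ABCDEF", b"GHIJK", b"LMNOP", b"QRSTUVWXYZ"]
table = {h(block): block for block in blocks}  # preset table: hash -> content
leaves = [Node(h(block)) for block in blocks]
first_parent = make_parent(leaves[:2])
second_parent = make_parent(leaves[2:])
root = make_parent([first_parent, second_parent])
restored = read_target_data(root, table)
```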
  • the data management device divides the target data into M candidate data blocks based on the content of the target data, then divides the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks, and finally generates, according to the N target data blocks, an index tree for addressing the target data.
  • the target data is first divided into multiple candidate data blocks based on the content of the target data, and the candidate data blocks are then divided into target data blocks according to the fingerprint features of the candidate data blocks, instead of the target data being divided into multiple target data blocks of a fixed size; the size of the target data block in the embodiment of the present application is therefore not specifically limited.
  • the number of target data blocks does not necessarily change when the target data into which data has been inserted is divided into target data blocks. Therefore, to a certain extent, the resources consumed by data management can be saved.
  • the method for dividing the target data into M candidate data blocks based on the content of the target data may specifically include: S610-S620.
  • the data management device determines M-1 division points of the target data according to the fingerprint feature of the target data.
  • the method of determining each division point is the same. As shown in FIG. 8, the method of determining a division point of the target data includes S810-S830.
  • the data management device judges whether the fingerprint feature of the data in the sliding window satisfies a first preset condition.
  • the size of the above-mentioned sliding window is fixed, the sliding step of the sliding window is a preset length, and the sliding step of the sliding window can be set according to actual needs, for example, the sliding step is 1 byte or 2 bytes, etc. This embodiment of the present application does not limit it.
  • the above-mentioned first preset condition is that the result of taking the fingerprint feature of the data in the sliding window modulo the first threshold is equal to the second threshold, where the first threshold and the second threshold may be pre-configured.
  • the hash value of the data in the sliding window is 2113 and the first threshold is 20, so the modulo value is 13, while the second threshold is 9; that is, the modulo value of the hash value of the data in the sliding window and the first threshold is not equal to the second threshold. Therefore, the hash value of the data in the sliding window does not satisfy the first preset condition.
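The arithmetic of this example can be checked with a few lines. The function name is hypothetical; the values 2113, 20 and 9 come from the example itself.

```python
def satisfies_first_condition(fingerprint: int, first_threshold: int,
                              second_threshold: int) -> bool:
    # First preset condition: fingerprint mod first threshold == second threshold.
    return fingerprint % first_threshold == second_threshold


remainder = 2113 % 20                                  # 13
satisfied = satisfies_first_condition(2113, 20, 9)     # 13 != 9
```

A hash value such as 2109 would satisfy the condition with the same thresholds, since 2109 modulo 20 is 9.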
  • the data management device determines an end position of the sliding window as a division point.
  • the above-mentioned first data is part of the target data.
  • the end position of the sliding window is the edge of the sliding window that leads in the sliding direction.
  • the end position of the sliding window at this time is the boundary position between the letters "C" and "D".
  • the data management device slides the sliding window along a preset direction for a preset length.
  • the preset length is the sliding step of the sliding window, and the preset length is less than or equal to the length of the sliding window.
  • the sliding direction of the above-mentioned sliding window is pre-configured, and the sliding direction can be from right to left, or from left to right. The embodiments of this application are described by taking a rightward sliding direction as an example, and details will not be described later.
  • after the data management device executes S830, it continues to execute the above S810.
  • the end position of the sliding window is determined as the division point, where the second data is part of the data in the target data; that is, after the data management device executes S830, it continues to execute the above S810 until the fingerprint feature of the data in the sliding window satisfies the first preset condition, and then determines the end position of the sliding window as the division point.
  • if the hash value of the data in the sliding window (that is, "BC") does not satisfy the first preset condition, the sliding window is slid to the right by one letter, so that the data in the sliding window becomes "CD", as shown in (B) in FIG. 7; it is then determined whether the hash value of the data "CD" in the sliding window satisfies the first preset condition, and if so, the boundary position between "C" and "D" is determined as the division point.
  • the data management device divides the target data into M candidate data blocks according to the M-1 division points.
  • taking the target data "ABCDEF...XYZ" as an example, assume that 4 division points of the target data are determined according to the above S610 and that the target data is divided into 5 intervals according to these 4 division points; the 5 intervals then correspond to 5 candidate data blocks, which are {ABCDEF}, {GHIJK}, {LMNOP}, {QRSTU} and {VWXYZ} respectively.
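A minimal sketch of S810-S830 followed by S620 is given below. Several points are assumptions rather than statements of the source: the fingerprint is a toy function (the sum of the byte values) chosen so that the result can be checked by hand, the end position is taken to be the leading edge of the window, and after a division point is found the window restarts at that point. The thresholds used in the demo are likewise hypothetical.

```python
def toy_fingerprint(window: bytes) -> int:
    # Hypothetical stand-in for the fingerprint feature: sum of byte values.
    return sum(window)


def find_division_points(data, window_size, step, first_threshold,
                         second_threshold, fingerprint=toy_fingerprint):
    points = []
    start = 0
    while start + window_size <= len(data):
        end = start + window_size
        # S810: test the first preset condition on the data in the window.
        if fingerprint(data[start:end]) % first_threshold == second_threshold:
            if end < len(data):
                points.append(end)   # S820: window end becomes a division point
            start = end              # assumption: restart after the division point
        else:
            start += step            # S830: slide by the preset length
    return points


def split_at(data, points):
    # S620: the division points delimit the candidate data blocks.
    bounds = [0, *points, len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


data = b"ABCDEF"
points = find_division_points(data, window_size=2, step=1,
                              first_threshold=20, second_threshold=13)
candidate_blocks = split_at(data, points)
```

With these toy settings the window "AB" gives 131 mod 20 = 11 and fails, the window "BC" gives 133 mod 20 = 13 and succeeds, so the single division point lies on the boundary between "C" and "D".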
  • when the target data is updated (for example, data is inserted into the target data), the content of the target data changes, but the positions in the updated target data that satisfy the first preset condition may not change; therefore, after the updated target data is divided into candidate data blocks, the number of candidate data blocks may remain unchanged, and the content of most candidate data blocks may also remain unchanged.
  • the first threshold is 20 and the second threshold is 9; when the data "AALMXXWX" is inserted into the target data as shown in (A) or (B), assuming that the insertion position is between "C" and "D" in the above candidate data block {ABCDEF}, the above S610-S620 are executed for the new target data. Because the positions in the target data that satisfy the first preset condition have not changed after the data is inserted, the number and positions of the division points determined by the above S810-S830 are unchanged, and the target data is divided into 5 candidate data blocks according to the 4 division points; the 5 candidate data blocks are {ABCAALMXXWXDEF}, {GHIJK}, {LMNOP}, {QRSTU} and {VWXYZ}.
  • the above-mentioned method of dividing the target data into M candidate data blocks (that is, S510) based on the content of the target data may include: S910-S920.
  • the data management device determines M-1 division points of the target data according to the conversion value of the target data.
  • the above-mentioned conversion value is the numeric value obtained by converting each data item in the preset window according to a preset rule.
  • the conversion value (also referred to below as the transformation value) of each data item is the value obtained after transforming that data item in the target data by using a transformation method.
  • the conversion value of a letter can be the American Standard Code for Information Interchange (ASCII) code corresponding to the letter, the hash value corresponding to the letter, or another value represented in numeric form.
  • as shown in FIG. 10, the conversion value text of the target data is obtained by converting the content of the target data into numbers according to preset rules.
  • each division point of the target data is determined by using a preset window; the method is specifically shown in FIG. 11 and includes S1110-S1130.
  • the data management device judges whether the conversion value of the data in the second fixed-length window in the preset window satisfies a second preset condition.
  • the above-mentioned preset window includes a first fixed-length window, a variable-length window and a second fixed-length window that are adjacent in sequence. The length of the first fixed-length window is an integer greater than 0. The initial length of the variable-length window is a preset value, and the preset value is an integer greater than or equal to 0. The length of the second fixed-length window is the length of one data item in the target data or the length of the conversion value of one data item in the target data; that is to say, when the length of the variable-length window is greater than 0, the length of the second fixed-length window is the smallest unit of data corresponding to the variable-length window and the first fixed-length window. For example, when the length of the first fixed-length window is 4 bytes and the length of the variable-length window is 2 bytes, the length of the second fixed-length window is 1 byte.
  • the above-mentioned second preset condition is that the conversion value of the data in the second fixed-length window is greater than the maximum conversion value of the data in the first fixed-length window, and is also greater than the maximum conversion value of the data in the variable-length window.
  • the data management device determines the maximum value of the transformation values in the first fixed-length window, which is called the first maximum value; the data management device then determines the maximum value of the transformation values in the variable-length window, that is, the second maximum value; then, the data management device judges whether the transformation value in the second fixed-length window is greater than the first maximum value and also greater than the second maximum value.
  • the data management device determines an end position of the second fixed-length window as a division point.
  • the data management device increases the length of the variable-length window in the preset window.
  • the above-mentioned data management device may increase the length of the variable-length window in the preset window according to a pre-configured length, and the pre-configured length may be determined according to actual conditions, for example, configured as 2 bytes.
  • after the variable-length window is increased by the pre-configured length, because the second fixed-length window is adjacent to the variable-length window, the second fixed-length window moves backward by the same length.
  • the data management device judges whether the transformation value of the data in the current second fixed-length window satisfies the second preset condition, and if so, determines the end position of the current second fixed-length window as the division point; if not, it continues to increase the length of the variable-length window in the preset window until the conversion value of the data in the second fixed-length window satisfies the second preset condition, and then determines the end position of the second fixed-length window as the division point.
  • the data management device determines the end position of the second fixed-length window as the division point.
  • the initial position of the preset window for determining the next division point is the conversion value 5: the transformation values in the first fixed-length window of the preset window include 5 and 9, the transformation values in the variable-length window include 36, and the transformation values in the second fixed-length window include 5.
  • the data management device divides the target data into M candidate data blocks according to the M-1 division points.
  • the conversion value text of the target data can be divided into four candidate data blocks, for example, respectively: {12,18,2,6,45}, {5,9,36,5,5,65}, {56,5,9,7,62}, and {8,8,432,9,81,20}.
  • when the target data is updated (for example, data is inserted into the target data), the content of the target data changes, but the positions in the updated target data that satisfy the second preset condition may not change; therefore, after the updated target data is divided into candidate data blocks, the number of candidate data blocks may remain unchanged, and the content of most candidate data blocks may also remain unchanged.
  • the conversion value set of the inserted data is {9,10,12,1,-40}; assuming that the inserted data is placed between 18 and 2 in the candidate data block {12,18,2,6,45}, the positions in the target data that satisfy the second preset condition have not changed after the data is inserted. The number and positions of the division points are therefore determined based on the above S1110-S1130, and the target data is divided into 4 candidate data blocks according to the 3 division points; the 4 candidate data blocks are {12,18,9,10,12,1,-40,2,6,45}, {5,9,36,5,5,65}, {56,5,9,7,62}, and {8,8,432,9,81,20}.
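The preset-window procedure of S1110-S1130 and the insertion example can be traced with the sketch below. The window sizes are read off the example above (a first fixed-length window of two values, an initial variable-length window of one value, a second fixed-length window of one value); the growth step of one value per iteration, the restart of the window right after each division point, and the function names are assumptions.

```python
def find_window_division_points(values, first_len, initial_var_len, grow_by=1):
    points, start, n = [], 0, len(values)
    while True:
        var_len = initial_var_len
        while True:
            second = start + first_len + var_len   # index of the second fixed-length window
            if second >= n:
                return points                      # ran past the end of the data
            first_max = max(values[start:start + first_len])
            var_items = values[start + first_len:second]
            # S1110: second preset condition, value must exceed both maxima.
            if values[second] > max([first_max, *var_items]):
                if second + 1 < n:
                    points.append(second + 1)      # S1120: end of second window
                start = second + 1                 # assumption: restart after the point
                break
            var_len += grow_by                     # S1130: grow the variable-length window


def split_at(values, points):
    bounds = [0, *points, len(values)]
    return [values[a:b] for a, b in zip(bounds, bounds[1:])]


original = [12, 18, 2, 6, 45, 5, 9, 36, 5, 5, 65,
            56, 5, 9, 7, 62, 8, 8, 432, 9, 81, 20]
updated = original[:2] + [9, 10, 12, 1, -40] + original[2:]

blocks_before = split_at(original, find_window_division_points(original, 2, 1))
blocks_after = split_at(updated, find_window_division_points(updated, 2, 1))
```

Under these assumptions the sketch reproduces the four candidate data blocks of the example, and after the insertion only the first block differs while the other three are identical.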
  • dividing the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks (that is, S520) includes: S1210-S1220.
  • the data management device determines, among the M candidate data blocks, the end positions of the candidate data blocks whose fingerprint features meet the third preset condition as N-1 division points of the candidate data blocks.
  • the fingerprint feature of the above candidate data block can be a fingerprint feature corresponding to all the data in the candidate data block. For example, if the data in the candidate data block is "WANHH", then the fingerprint feature of the candidate data block is the hash value of "WANHH" as a whole.
  • the fingerprint feature can also be the fingerprint feature corresponding to part of the data in the candidate data block; for example, if the data in the candidate data block is "WANHH", then the fingerprint feature of the candidate data block is the hash value corresponding to "NHH".
  • the fingerprint features of the candidate data blocks are described by taking the fingerprint features corresponding to all the data in the candidate data blocks as an example, and will not be described in detail later.
  • the above-mentioned third preset condition is that the result of taking the fingerprint feature of the candidate data block modulo the third threshold is within a preset range.
  • the data management device divides the M candidate data blocks into N target data blocks according to the N-1 division points.
  • the target data is divided into 5 candidate data blocks, which are {ABCAALMXXWXDEF}, {GHIJK}, {LMNOP}, {QRSTU}, and {VWXYZ} respectively; the calculated hash values of the five candidate data blocks are -1130721247, 67787465, 72558990, 77330515, and 82102040. The modulo values of the hash values of the five candidate data blocks and the third threshold are then calculated to be -7, 5, 30, 55 and 20 respectively. The candidate data block {LMNOP} satisfies the third preset condition, so the end position of the candidate data block {LMNOP} is used as the division point of the above 5 candidate data blocks, which divides the candidate data blocks into 2 target data blocks, namely {ABCAALMXXWXDEFGHIJKLMNOP} and {QRSTUVWXYZ}.
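The merge of candidate data blocks into target data blocks (S1210-S1220) can be sketched with the hash values listed above. Two parameters are not stated in the text and are assumptions here: a third threshold of 60, which is consistent with the listed modulo values, and a hypothetical preset range of 25 to 35. Because one listed modulo value is negative, the sketch also assumes a remainder that keeps the sign of the hash value rather than Python's default `%` behaviour.

```python
def signed_remainder(value: int, modulus: int) -> int:
    # Assumption: the remainder keeps the sign of the dividend (truncating division).
    r = abs(value) % modulus
    return -r if value < 0 else r


def merge_candidates(blocks, fingerprints, third_threshold, low, high):
    targets, current = [], []
    for block, fp in zip(blocks, fingerprints):
        current.append(block)
        # S1210: the end of a candidate block meeting the third preset
        # condition is a division point.
        if low <= signed_remainder(fp, third_threshold) <= high:
            targets.append("".join(current))
            current = []
    if current:                       # remaining candidate blocks form the last target block
        targets.append("".join(current))
    return targets                    # S1220: the N target data blocks


candidates = ["ABCAALMXXWXDEF", "GHIJK", "LMNOP", "QRSTU", "VWXYZ"]
hashes = [-1130721247, 67787465, 72558990, 77330515, 82102040]
remainders = [signed_remainder(value, 60) for value in hashes]
targets = merge_candidates(candidates, hashes, 60, 25, 35)
```

Only {LMNOP} falls inside the assumed range, so its end position is the single division point and the candidate data blocks after it are collected into the second target data block.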
  • the method for dividing the target data blocks in the above S520 may be similar to the methods for dividing candidate data blocks in S610-S620 and S910-S920. For details, refer to the relevant descriptions of the above-mentioned S610-S620 and S910-S920, which will not be repeated here.
  • the data management device divides the target data into M candidate data blocks, then determines the end positions of the candidate data blocks whose fingerprint features satisfy the third preset condition among the M candidate data blocks as the N-1 division points of the candidate data blocks, and thereby determines the N target data blocks.
  • compared with the index tree of the target data before the data was inserted, only the hash values of individual nodes change in the index tree to be constructed, thereby saving the resources consumed by data management.
  • the above S540 includes: S1310-S1320.
  • the data management device divides the N target data blocks into at least one data group according to the respective fingerprint features of the N target data blocks.
  • the above-mentioned fingerprint feature of the target data block can be a hash value of all data in the target data block, or a hash value corresponding to part of the data in the target data block. In the embodiments of the present application, the fingerprint feature of the target data block is described by taking the hash value corresponding to all data in the target data block as an example.
  • if the data in the target data block is {ABCAALMXXWXDEFGHIJKLMNOP}, the fingerprint feature of the target data block is the hash value of "ABCAALMXXWXDEFGHIJKLMNOP".
  • the foregoing S1310 specifically includes: S1410-S1420.
  • the data management device determines, among the N target data blocks, the target data blocks whose fingerprint features satisfy a fourth preset condition as at least one division point of the N target data blocks.
  • the fourth preset condition above is:
  • a modulo value between the fingerprint feature of the target data block and the fourth threshold is greater than or equal to the fifth threshold.
  • the number of target data blocks in the data group is between the first number threshold and the second number threshold, and the modulo value of the fingerprint feature of the target data block and the fourth threshold is greater than or equal to the fifth threshold, wherein the first number threshold is greater than the second number threshold.
  • the data management device divides the N target data blocks into multiple data groups according to at least one division point of the target data blocks.
  • the target data is divided into 4 target data blocks, namely {ALMXXWXDEFGHIJK}, {ABRFGRTGRTRGE}, {DEWFRTNEBJ} and {JDIEOFJDEJFOEW}; the fourth threshold is 80, and the fifth threshold is 70. The modulo values of the hash values of the above four target data blocks and the fourth threshold 80 are calculated to be 25, 75, 33 and 55 respectively, so the target data block {ABRFGRTGRTRGE} satisfies the fourth preset condition. The target data blocks are therefore divided into two groups: {ALMXXWXDEFGHIJK} and {ABRFGRTGRTRGE} form one group, and {DEWFRTNEBJ} and {JDIEOFJDEJFOEW} form the other.
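The grouping of S1410-S1420 can be sketched with the numbers of this example. The text gives only the modulo values, not the hash values themselves, so the fingerprints below are hypothetical integers chosen solely so that taking them modulo 80 yields 25, 75, 33 and 55. Only the first form of the fourth preset condition (modulo value greater than or equal to the fifth threshold) is illustrated.

```python
def group_target_blocks(blocks, fingerprints, fourth_threshold, fifth_threshold):
    groups, current = [], []
    for block, fp in zip(blocks, fingerprints):
        current.append(block)
        # S1410: a block meeting the fourth preset condition closes its group.
        if fp % fourth_threshold >= fifth_threshold:
            groups.append(current)
            current = []
    if current:                       # blocks after the last division point
        groups.append(current)
    return groups                     # S1420: the data groups


target_blocks = ["ALMXXWXDEFGHIJK", "ABRFGRTGRTRGE", "DEWFRTNEBJ", "JDIEOFJDEJFOEW"]
# Hypothetical fingerprints: 825, 875, 833, 855 give 25, 75, 33, 55 modulo 80.
fingerprints = [825, 875, 833, 855]
groups = group_target_blocks(target_blocks, fingerprints, 80, 70)
```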
  • the above S1310 may also use the above methods S610-S620 and/or S910-S920 for dividing candidate data blocks. For details, refer to the relevant descriptions of S610-S620 and/or S910-S920, which will not be repeated here.
  • the data management device generates an index tree of the target data based on the respective fingerprint features of the multiple data groups.
  • the data management device determines, among the N target data blocks, the target data blocks whose fingerprint features satisfy the fourth preset condition as the at least one division point of the target data blocks, and divides the N target data blocks into multiple data groups according to the at least one division point.
  • when the target data is updated, even if the number of target data blocks of the target data changes, as long as the updated data of the target data (such as inserted data) does not satisfy the fourth preset condition, the division points for grouping the target data blocks remain unchanged, so the number of groups of target data blocks is exactly the same as before the target data was updated.
  • in the index tree of the updated target data, only some of the leaf nodes, the hash values of some child nodes, and the hash value of the root node of the index tree of the target data before the data was inserted need to be modified, thereby saving the resources required for data management.
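The effect described above, that an update touches only a few nodes of the index tree, can be illustrated with a two-level tree built from data groups. SHA-256, the rule that a parent hashes the concatenated hash values of its children, and the modified sample block are assumptions; the sketch also assumes, as in the scenario above, that the grouping itself is unchanged by the insertion.

```python
import hashlib


def h(data: bytes) -> str:
    # Assumption: SHA-256 is used as the hash function.
    return hashlib.sha256(data).hexdigest()


def build_index_tree(groups):
    """Return (root hash, one hash per data group, leaf hashes per group)."""
    leaf_hashes = [[h(block) for block in group] for group in groups]
    group_hashes = [h("".join(leaves).encode()) for leaves in leaf_hashes]
    root_hash = h("".join(group_hashes).encode())
    return root_hash, group_hashes, leaf_hashes


before = [[b"ALMXXWXDEFGHIJK", b"ABRFGRTGRTRGE"],
          [b"DEWFRTNEBJ", b"JDIEOFJDEJFOEW"]]
# Hypothetical update: two bytes inserted into the first target data block only.
after = [[b"ALMXXAAWXDEFGHIJK", b"ABRFGRTGRTRGE"],
         [b"DEWFRTNEBJ", b"JDIEOFJDEJFOEW"]]

root_a, groups_a, leaves_a = build_index_tree(before)
root_b, groups_b, leaves_b = build_index_tree(after)
```

Comparing the two results shows that one leaf hash, the hash of its group node and the root hash differ, while the second group and its leaves keep their hash values and need not be rewritten.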
  • the embodiment of the present application provides a data management device in the storage system, and the data management device in the storage system is used to execute each step in the above-mentioned data management method in the storage system. The embodiment of the present application may divide the data management device in the storage system into functional modules according to the above method examples.
  • each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules.
  • the division of modules in the embodiment of the present application is schematic, and is only a logical function division, and there may be other division methods in actual implementation.
  • FIG. 15 shows a possible structural diagram of the data management device in the storage system involved in the above embodiment.
  • the data management device in the storage system includes: a processing module 1501 , a storage module 1502 and a generating module 1503 .
  • the processing module 1501 is configured to divide the target data into M candidate data blocks based on the content of the target data, for example, execute step S510 in the above method embodiment.
  • the processing module 1501 is further configured to divide the M candidate data blocks into N target data blocks according to their respective fingerprint features, for example, execute step S520 in the above method embodiment.
  • the storage module 1502 is configured to store N target data blocks and fingerprint features of the N target data blocks, for example, execute step S530 in the above method embodiment.
  • the generating module 1503 is configured to generate an index tree of the target data according to the N target data blocks, and the index tree is used to address the content of the target data, for example, execute step S540 in the above method embodiment.
  • the data management device in the storage system provided in the embodiment of the present application further includes a determination module 1504;
  • the determining module 1504 is configured to determine M-1 division points of the target data according to the fingerprint feature of the target data, for example, execute step S610 in the above method embodiment.
  • the processing module 1501 is specifically configured to divide the target data into M candidate data blocks according to the M-1 division points, for example, execute step S620 in the above method embodiment.
  • the data management device in the storage system provided in the embodiment of the present application further includes a sliding module 1505 .
  • the determining module 1504 is configured to determine the end position of the sliding window as the division point when the fingerprint feature of the first data in the sliding window satisfies the first preset condition, for example, execute step S820 in the above method embodiment.
  • the sliding module 1505 is configured to slide the sliding window along a preset direction for a preset length when the fingerprint feature of the first data in the sliding window does not satisfy the first preset condition, for example, execute step S830 in the above method embodiment.
  • the determination module 1504 is further configured to determine M-1 division points of the target data according to the transformation value of the target data, for example, execute step S910 in the above method embodiment.
  • the processing module 1501 is specifically configured to divide the target data into M candidate data blocks according to the M-1 division points, for example, execute step S920 in the above method embodiment.
  • the determination module 1504 is further configured to determine the end position of the second fixed-length window as the division point when the transformation value of the data included in the second fixed-length window satisfies the second preset condition, for example, execute step S1120 in the above method embodiment.
  • the above-mentioned processing module 1501 is also used to increase the length of the variable-length window in the preset window when the conversion value of the data in the second fixed-length window does not meet the second preset condition, and to determine the end position of the second fixed-length window as the division point when the conversion value of the data in the second fixed-length window satisfies the second preset condition, for example, execute step S1130 in the above method embodiment.
  • the determination module 1504 is configured to determine the end positions of the candidate data blocks whose fingerprint features meet the third preset condition among the M candidate data blocks as the N-1 division points of the candidate data blocks, for example, execute step S1210 in the above method embodiment.
  • the processing module 1501 is configured to divide the M candidate data blocks into N target data blocks according to the N-1 division points, for example, execute step S1220 in the above method embodiment.
  • the above processing module 1501 is further configured to divide the N target data blocks into at least one data group according to their respective fingerprint features, for example, perform step S1310 in the above method embodiment.
  • the generating module 1503 is further configured to generate an index tree of target data based on the respective fingerprint features of multiple data groups, for example, execute step S1320 in the above method embodiment.
  • the determination module 1504 is configured to determine, among the N target data blocks, the target data blocks whose fingerprint features satisfy the fourth preset condition as at least one division point of the N target data blocks, for example, execute step S1410 in the above method embodiment.
  • the processing module 1501 is configured to divide the N target data blocks into multiple data groups according to at least one division point of the target data blocks, for example, execute step S1420 in the above method embodiment.
  • each module of the data management device in the above-mentioned storage system can also be used to perform other actions in the above-mentioned method embodiment. For all relevant content of each step involved in the above-mentioned method embodiment, refer to the function description of the corresponding functional module, which is not repeated here.
  • a schematic structural diagram of a data management device in a storage system provided by an embodiment of the present application is shown in FIG. 16.
  • the electronic device includes: a processing module 1601 and a communication module 1602 .
  • the processing module 1601 is used to control and manage the actions of the data management device in the storage system, for example, to execute the steps performed by the processing module 1501, the generation module 1503, the determination module 1504, and the sliding module 1505, and/or to execute other processes of the technology described herein.
  • the communication module 1602 is used to support the interaction between the data management device and other devices in the storage system.
  • the data management device in the storage system may also include a storage module 1603, which is used to store the program code of the data management device in the storage system, corresponding relationships, and the like.
  • the processing module 1601 may be a processor or a controller, for example, the controller 401 or the processor 402 in FIG. 4 .
  • the communication module 1602 may be a transceiver, an RF circuit, or a communication interface, etc., such as the bus 406 and/or the channel controller 403 in FIG. 4 .
  • the storage module 1603 may be a memory, such as the hard disk 405 in FIG. 4 .
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • when implemented using a software program, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on the computer, all or part of the processes or functions according to the embodiments of the present application will be generated.
  • the computer can be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, e.g.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device including a server, a data center, and the like integrated with one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a magnetic disk, a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state drive (SSD)), etc.
  • the disclosed system, device and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the modules or units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage medium includes: flash memory, mobile hard disk, read-only memory, random access memory, magnetic disk or optical disk, and other various media capable of storing program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a method and apparatus for managing data in a storage system and relate to the field of data storage. The method and apparatus can save resources for data management. The method comprises: on the basis of content of target data, dividing the target data into M candidate data blocks, M being an integer greater than or equal to 2; then, according to respective fingerprint features of the M candidate data blocks, dividing the M candidate data blocks into N target data blocks, N being a positive integer less than or equal to M, wherein each target data block comprises at least one candidate data block; storing the N target data blocks and fingerprint features of the N target data blocks, the target data blocks and the fingerprint features of the target data blocks being in a one-to-one correspondence; and finally, generating an index tree of the target data according to the N target data blocks, the index tree being used for addressing the content of the target data.

Description

Data management method and device in a storage system
Technical field
The embodiments of the present application relate to the field of data storage, and in particular to a data management method and device in a storage system.
Background
In the field of data storage, managing data with a Merkle directed acyclic graph (Merkle-DAG) is favored by many major enterprises.
As is well known, a Merkle-DAG is composed of at least one tree; that is, data is managed on the basis of a tree structure. At present, one method for managing data in a storage system based on a Merkle-DAG is as follows: the target data (that is, the object to be managed) is divided into multiple data blocks of a fixed size (for example, a fixed number of bytes), and the hash value of each data block is calculated; the multiple data blocks and their hash values are then stored according to the correspondence between each data block and its hash value. Next, the multiple data blocks are grouped, with a fixed number of data blocks per group; finally, a Merkle-DAG for managing the target data is generated based on the multiple groups and the hash value of each data block in the groups (hereinafter, the Merkle-DAG of the target data is referred to as the index tree), and the index tree is saved.
However, when the application layer updates the target data (for example, by inserting, modifying, or deleting data), the total volume of the target data may change. If the target data is divided into blocks of the fixed size described above, the number of data blocks may change, which in turn causes a large change in the number of groups of data blocks, so the index tree has to be rebuilt. Data management therefore consumes considerable resources.
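The effect of an insertion on fixed-size blocks can be illustrated with a minimal sketch. The 8-byte block size, the use of SHA-256 as the hash, and the helper names `fixed_size_blocks` and `block_hashes` are assumptions chosen for illustration; they are not taken from the application.

```python
import hashlib


def fixed_size_blocks(data: bytes, size: int) -> list[bytes]:
    """Split data into blocks of `size` bytes; the last block may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def block_hashes(blocks: list[bytes]) -> list[str]:
    """Hash every block (SHA-256 is an illustrative choice)."""
    return [hashlib.sha256(block).hexdigest() for block in blocks]


original = bytes(range(64))          # 64 distinct bytes, 8 blocks of 8 bytes
updated = b"\xff" + original         # insert a single byte at the front

old_hashes = block_hashes(fixed_size_blocks(original, 8))
new_hashes = block_hashes(fixed_size_blocks(updated, 8))

# Count the blocks whose hash survives the update.
unchanged = len(set(old_hashes) & set(new_hashes))
print(len(old_hashes), len(new_hashes), unchanged)
```

In this sketch a one-byte insertion shifts every block boundary, so the block count changes and no block hash is reused, which is the situation that forces the index tree to be rebuilt.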
Summary of the invention
Embodiments of the present application provide a data management method and device in a storage system, which can save the resources consumed by data management.
To achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
In a first aspect, an embodiment of the present application provides a data management method and device in a storage system. The method includes: dividing target data into M candidate data blocks based on the content of the target data, where M is an integer greater than or equal to 2;
dividing the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks, where N is a positive integer less than or equal to M and each target data block includes at least one candidate data block; storing the N target data blocks and the fingerprint features of the N target data blocks, where the target data blocks and their fingerprint features are in one-to-one correspondence; and generating an index tree of the target data according to the N target data blocks, where the index tree is used to address the content of the target data.
In the data management method in a storage system provided by the embodiments of the present application, the data management device divides the target data into M candidate data blocks based on the content of the target data, then divides the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks, and finally generates, according to the N target data blocks, an index tree for addressing the target data. In the embodiments of the present application, the target data is first divided into multiple candidate data blocks based on its content, and the candidate data blocks are then divided into target data blocks according to their fingerprint features; the target data is not divided into target data blocks of a fixed size, so the size of a target data block is not specifically limited. When data is inserted into the target data, the size of the target data changes, but the number of target data blocks obtained by dividing the updated target data does not necessarily change. To a certain extent, this saves the resources consumed by data management.
In a possible implementation, dividing the target data into M candidate data blocks based on the content of the target data specifically includes: determining M-1 division points of the target data according to the fingerprint features of the target data; and dividing the target data into M candidate data blocks according to the M-1 division points.
In a possible implementation, determining the M-1 division points of the target data according to the fingerprint features of the target data includes: for any one of the M-1 division points, when the fingerprint feature of first data in a sliding window satisfies a first preset condition, determining the end position of the sliding window as the division point, where the first data is part of the target data, and the first preset condition is that the fingerprint feature of the target data in the sliding window, taken modulo a first threshold, equals a second threshold.
In a possible implementation, determining the M-1 division points of the target data according to the fingerprint features of the target data includes: for any one of the M-1 division points, when the fingerprint feature of the first data in the sliding window does not satisfy the first preset condition, sliding the sliding window by a preset length in a preset direction, and, when the fingerprint feature of second data in the sliding window satisfies the first preset condition, determining the end position of the sliding window as the division point, where the second data is part of the target data and differs from the first data; the preset length is less than or equal to the length of the sliding window, and the first preset condition is that the fingerprint feature of the target data in the sliding window, taken modulo the first threshold, equals the second threshold.
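The sliding-window rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the fingerprint is taken to be the first 8 bytes of a SHA-256 digest, the window keeps sliding one byte at a time after a division point is found, and the names `fingerprint`, `division_points`, and `split_at` as well as the default parameter values are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> int:
    """Fingerprint of the bytes in the window (truncated SHA-256, an assumption)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def division_points(data: bytes, window: int = 4, step: int = 1,
                    first_threshold: int = 8, second_threshold: int = 0) -> list[int]:
    """Return the end positions of all windows that satisfy
    fingerprint(window) % first_threshold == second_threshold."""
    points = []
    start = 0
    # Stop before the window end reaches the end of the data, so that
    # every division point leaves a non-empty final block.
    while start + window < len(data):
        end = start + window
        if fingerprint(data[start:end]) % first_threshold == second_threshold:
            points.append(end)
        start += step  # slide by the preset length (step <= window)
    return points


def split_at(data: bytes, points: list[int]) -> list[bytes]:
    """Cut data at the division points into candidate data blocks."""
    bounds = [0, *points, len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


data = hashlib.sha256(b"seed").digest() * 4   # 128 bytes of sample data
points = division_points(data)
blocks = split_at(data, points)

# Insert three bytes at the front: each old division point reappears,
# shifted by exactly the length of the inserted data.
updated = b"XYZ" + data
shifted = {p + 3 for p in points}
print(shifted <= set(division_points(updated)))
```

Because each division point depends only on the bytes inside the window, an insertion moves the existing boundaries without changing which content triggers them. That is the property the method relies on to keep the number of blocks stable.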
In a possible implementation, dividing the target data into M candidate data blocks based on the content of the target data specifically includes: determining M-1 division points of the target data according to transformation values of the target data, where a transformation value is a value obtained by converting each piece of data in a preset window into numeric form based on a preset rule; and dividing the target data into M candidate data blocks according to the M-1 division points.
In a possible implementation, determining the M-1 division points of the target data according to the transformation values of the target data includes: the preset window includes a first fixed-length window, a variable-length window, and a second fixed-length window that are adjacent in sequence; the length of the second fixed-length window is the length of one piece of data in the target data or the length of the transformation value of one piece of data in the target data; for any one of the M-1 division points, when the transformation value of the data in the second fixed-length window satisfies a second preset condition, determining the end position of the second fixed-length window as the division point, where the second preset condition is that the transformation value of the data in the second fixed-length window is greater than the maximum of the transformation values of the data in the first fixed-length window and greater than the maximum of the transformation values of the data in the variable-length window.
In a possible implementation, determining the M-1 division points of the target data according to the transformation values of the target data includes: the preset window includes a first fixed-length window, a variable-length window, and a second fixed-length window that are adjacent in sequence; the length of the second fixed-length window is the length of one piece of data in the target data or the length of the transformation value of one piece of data in the target data; for any division point, when the transformation value of the data in the second fixed-length window does not satisfy the second preset condition, increasing the length of the variable-length window in the preset window, and, when the transformation value of the data in the second fixed-length window satisfies the second preset condition, determining the end position of the second fixed-length window as the division point, where the second preset condition is that the transformation value of the data in the second fixed-length window is greater than the maximum of the transformation values of the data in the first fixed-length window and greater than the maximum of the transformation values of the data in the variable-length window.
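The three-part preset window can be sketched as follows. The sketch assumes that the preset rule simply takes the integer value of each byte as its transformation value, that the second fixed-length window is one byte long, and that the next preset window starts at the division point just found. The function name `extremum_division_points` and the default window lengths are hypothetical.

```python
def extremum_division_points(data: bytes, fixed_len: int = 2,
                             initial_var_len: int = 1) -> list[int]:
    """Find division points using a first fixed-length window, a growing
    variable-length window, and a one-byte second fixed-length window."""
    points: list[int] = []
    start, n = 0, len(data)
    while start + fixed_len + initial_var_len < n:
        # `pos` is the index of the single byte in the second fixed-length window.
        pos = start + fixed_len + initial_var_len
        # Maximum transformation value over the first fixed-length window
        # and the variable-length window.
        window_max = max(data[start:pos])
        # Condition not met: grow the variable-length window by one byte.
        # The maximum is unchanged because every skipped byte is <= window_max.
        while pos < n and data[pos] <= window_max:
            pos += 1
        if pos + 1 >= n:
            break  # no interior division point remains
        points.append(pos + 1)  # end position of the second fixed-length window
        start = pos + 1
    return points


sample = bytes([1, 2, 3, 9, 2, 1, 0, 5, 4, 4, 4, 8, 1])
print(extremum_division_points(sample))  # [4, 8, 12]
```

In the sample, the bytes 9, 5, and 8 each exceed every value in the windows before them, so the data is cut immediately after each of them.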
In a possible implementation, dividing the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks specifically includes: determining, as the N-1 division points of the candidate data blocks, the end positions of those candidate data blocks among the M candidate data blocks whose fingerprint features satisfy a third preset condition, where the third preset condition is that the fingerprint feature of the data of the candidate data block, taken modulo a third threshold, falls within a preset range; and dividing the M candidate data blocks into N target data blocks according to the N-1 division points.
In a possible implementation, the fingerprint feature of the data of a candidate data block is the fingerprint feature of all the data in the candidate data block; or, the fingerprint feature of the data of a candidate data block is the fingerprint feature of part of the data in the candidate data block.
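Merging candidate data blocks into target data blocks under the third preset condition might look like the sketch below. The names `sha256_fingerprint` and `merge_candidates`, the default threshold of 4, and the default preset range are illustrative assumptions; the fingerprint function is passed as a parameter so that it can cover all or only part of a block.

```python
import hashlib
from typing import Callable


def sha256_fingerprint(block: bytes) -> int:
    """Fingerprint over all data in the block (truncated SHA-256, an assumption)."""
    return int.from_bytes(hashlib.sha256(block).digest()[:8], "big")


def merge_candidates(candidates: list[bytes], third_threshold: int = 4,
                     preset_range: range = range(0, 1),
                     fingerprint: Callable[[bytes], int] = sha256_fingerprint
                     ) -> list[bytes]:
    """Concatenate consecutive candidate blocks; a target block ends at every
    candidate whose fingerprint modulo third_threshold lies in preset_range."""
    targets: list[bytes] = []
    current: list[bytes] = []
    for block in candidates:
        current.append(block)
        if fingerprint(block) % third_threshold in preset_range:
            targets.append(b"".join(current))
            current = []
    if current:  # trailing candidates form the last target block
        targets.append(b"".join(current))
    return targets


# Toy fingerprint for a predictable demonstration: the first byte of the block.
candidates = [b"\x01a", b"\x04b", b"\x02c", b"\x08d", b"\x03e"]
targets = merge_candidates(candidates, fingerprint=lambda b: b[0])
print(len(candidates), len(targets))  # 5 3
```

With the toy fingerprint, only the blocks starting with bytes 4 and 8 satisfy the condition, so M = 5 candidate blocks become N = 3 target blocks.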
In a possible implementation, generating the index tree of the target data according to the N target data blocks specifically includes: dividing the N target data blocks into at least one data group according to the respective fingerprint features of the N target data blocks; and generating the index tree of the target data based on the respective fingerprint features of the multiple data groups.
In a possible implementation, dividing the N target data blocks into at least one data group according to the respective fingerprint features of the N target data blocks specifically includes: determining, as at least one division point of the N target data blocks, the target data blocks whose fingerprint features satisfy a fourth preset condition; and dividing the N target data blocks into multiple data groups according to the at least one division point of the target data blocks.
In a possible implementation, the fourth preset condition is that the fingerprint feature of the target data block, taken modulo a fourth threshold, is greater than or equal to a fifth threshold.
In a possible implementation, the fourth preset condition is that the fingerprint feature of the target data block, taken modulo the fourth threshold, is greater than or equal to the fifth threshold, and the number of target data blocks in the data group is between a first quantity threshold and a second quantity threshold, where the first quantity threshold is greater than the second quantity threshold.
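The grouping rule with quantity thresholds can be sketched as follows. This is one possible reading, not a definitive one: `max_count` stands for the first quantity threshold and `min_count` for the second. A group is closed once it holds at least `min_count` blocks and the current block's fingerprint satisfies the modulo condition. Forcing a group to close when it reaches `max_count` is an assumption added to keep group sizes bounded. The function name `group_blocks` is hypothetical.

```python
def group_blocks(fingerprints: list[int], fourth_threshold: int,
                 fifth_threshold: int, min_count: int,
                 max_count: int) -> list[list[int]]:
    """Group target-block fingerprints into data groups.

    A block closes the current group when
    fingerprint % fourth_threshold >= fifth_threshold and the group already
    holds at least min_count blocks; a group is also closed when it reaches
    max_count blocks (an assumption of this sketch).
    """
    groups: list[list[int]] = []
    current: list[int] = []
    for fp in fingerprints:
        current.append(fp)
        hit = fp % fourth_threshold >= fifth_threshold
        if (hit and len(current) >= min_count) or len(current) >= max_count:
            groups.append(current)
            current = []
    if current:  # remaining blocks form the last group
        groups.append(current)
    return groups


groups = group_blocks([1, 7, 2, 3, 6, 1, 1, 1, 1, 5],
                      fourth_threshold=8, fifth_threshold=6,
                      min_count=2, max_count=4)
print(groups)  # [[1, 7], [2, 3, 6], [1, 1, 1, 1], [5]]
```

The first two groups close on a fingerprint that meets the modulo condition, the third closes because it reaches the upper quantity threshold, and the last holds the remaining block.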
In a possible implementation, the fingerprint feature is a hash value.
In a second aspect, an embodiment of the present application provides a data management device in a storage system. The data management device includes a processing module, a storage module, and a generation module. The processing module is configured to divide target data into M candidate data blocks based on the content of the target data, where M is an integer greater than or equal to 2; the processing module is further configured to divide the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks, where N is a positive integer less than or equal to M and each target data block includes at least one candidate data block; the storage module is configured to store the N target data blocks and the fingerprint features of the N target data blocks, where the target data blocks and their fingerprint features are in one-to-one correspondence; and the generation module is configured to generate an index tree of the target data according to the N target data blocks, where the index tree is used to address the content of the target data.
In a possible implementation, the determination module is configured to determine M-1 division points of the target data according to the fingerprint features of the target data; the processing module is specifically configured to divide the target data into M candidate data blocks according to the M-1 division points.
In a possible implementation, the data management device in the storage system further includes a determination module; the determination module is configured to determine the end position of a sliding window as a division point when the fingerprint feature of first data in the sliding window satisfies a first preset condition, where the first data is part of the target data, and the first preset condition is that the fingerprint feature of the target data in the sliding window, taken modulo a first threshold, equals a second threshold.
In a possible implementation, the data management device in the storage system further includes a sliding module; the sliding module is configured to, for any one of the M-1 division points, slide the sliding window by a preset length in a preset direction when the fingerprint feature of the first data in the sliding window does not satisfy the first preset condition; the determination module is configured to determine the end position of the sliding window as the division point when the fingerprint feature of second data in the sliding window satisfies the first preset condition, where the second data is part of the target data and differs from the first data; the preset length is less than or equal to the length of the sliding window, and the first preset condition is that the fingerprint feature of the target data in the sliding window, taken modulo the first threshold, equals the second threshold.
In a possible implementation, the data management device in the storage system further includes a determination module; the determination module is configured to determine M-1 division points of the target data according to transformation values of the target data, where a transformation value is a value obtained by converting each piece of data in a preset window into numeric form based on a preset rule; the processing module is configured to divide the target data into M candidate data blocks according to the M-1 division points.
In a possible implementation, the preset window includes a first fixed-length window, a variable-length window, and a second fixed-length window that are adjacent in sequence; the length of the second fixed-length window is the length of one piece of data in the target data or the length of the transformation value of one piece of data in the target data; the processing module is configured to determine the end position of the second fixed-length window as a division point when the transformation value of the data in the second fixed-length window satisfies a second preset condition, where the second preset condition is that the transformation value of the data in the second fixed-length window is greater than the maximum of the transformation values of the data in the first fixed-length window and greater than the maximum of the transformation values of the data in the variable-length window.
In a possible implementation, the preset window includes a first fixed-length window, a variable-length window, and a second fixed-length window that are adjacent in sequence; the length of the second fixed-length window is the length of one piece of data in the target data or the length of the transformation value of one piece of data in the target data; the processing module is configured to increase the length of the variable-length window in the preset window when the transformation value of the data in the second fixed-length window does not satisfy the second preset condition, and to determine the end position of the second fixed-length window as a division point when the transformation value of the data in the second fixed-length window satisfies the second preset condition, where the second preset condition is that the transformation value of the data in the second fixed-length window is greater than the maximum of the transformation values of the data in the first fixed-length window and greater than the maximum of the transformation values of the data in the variable-length window.
In a possible implementation, the determination module is configured to determine, as the N-1 division points of the candidate data blocks, the end positions of those candidate data blocks among the M candidate data blocks whose fingerprint features satisfy a third preset condition, where the third preset condition is that the fingerprint feature of the data of the candidate data block, taken modulo a third threshold, falls within a preset range; the processing module is configured to divide the M candidate data blocks into N target data blocks according to the N-1 division points.
In a possible implementation, the processing module is configured to divide the N target data blocks into at least one data group according to the respective fingerprint features of the N target data blocks; the generation module is configured to generate the index tree of the target data based on the respective fingerprint features of the multiple data groups.
In a possible implementation, the determination module is configured to determine, as at least one division point of the N target data blocks, the target data blocks whose fingerprint features satisfy a fourth preset condition; the processing module is configured to divide the N target data blocks into multiple data groups according to the at least one division point of the target data blocks.
In a possible implementation, the fourth preset condition is that the fingerprint feature of the target data block, taken modulo a fourth threshold, is greater than or equal to a fifth threshold.
In a possible implementation, the fourth preset condition is that the fingerprint feature of the target data block, taken modulo the fourth threshold, is greater than or equal to the fifth threshold, and the number of target data blocks in the data group is between a first quantity threshold and a second quantity threshold, where the first quantity threshold is greater than the second quantity threshold.
In a possible implementation, the fingerprint feature is a hash value.
In a third aspect, an embodiment of the present application provides a data management device in a storage system, in which a memory is coupled to a processor; the memory is configured to store computer program code, and the computer program code includes computer instructions; when the computer instructions are executed by the processor, the data management device in the storage system performs the method according to the first aspect or any one of its possible implementations.
In a fourth aspect, an embodiment of the present application provides a computer storage medium including computer instructions; when the computer instructions are run on a computing device, the computing device performs the method according to the first aspect or any one of its possible implementations.
In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on a computer, causes the computer to perform the method according to the first aspect or any one of its possible implementations.
It should be understood that, for the beneficial effects of the technical solutions of the second to fifth aspects of the embodiments of the present application and their corresponding possible implementations, reference may be made to the technical effects of the first aspect and its corresponding possible implementations described above; details are not repeated here.
Description of the drawings
FIG. 1 is a schematic flowchart of dividing target data into blocks and groups according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an index tree according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of constructing an index tree according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the hardware structure of a storage system according to an embodiment of the present application;
FIG. 5 is a first schematic flowchart of a data management method in a storage system according to an embodiment of the present application;
FIG. 6 is a second schematic flowchart of a data management method in a storage system according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a process of dividing candidate data blocks according to an embodiment of the present application;
FIG. 8 is a schematic flowchart of a method for dividing candidate data blocks according to an embodiment of the present application;
FIG. 9 is a third schematic flowchart of a data management method in a storage system according to an embodiment of the present application;
FIG. 10 is a schematic diagram of another process of dividing candidate data blocks according to an embodiment of the present application;
FIG. 11 is a schematic flowchart of another method for dividing candidate data blocks according to an embodiment of the present application;
FIG. 12 is a fourth schematic flowchart of a data management method in a storage system according to an embodiment of the present application;
FIG. 13 is a fifth schematic flowchart of a data management method in a storage system according to an embodiment of the present application;
FIG. 14 is a sixth schematic flowchart of a data management method in a storage system according to an embodiment of the present application;
FIG. 15 is a schematic diagram of a data management device in a storage system according to an embodiment of the present application;
FIG. 16 is a schematic diagram of another data management device in a storage system according to an embodiment of the present application.
Detailed Description
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate three cases: A exists alone, both A and B exist, and B exists alone.
The terms "first", "second", and the like in the description and claims of the embodiments of the present application are used to distinguish between different objects, not to describe a specific order of the objects. For example, a first threshold and a second threshold are used to distinguish between different thresholds, not to describe a specific order of the thresholds.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to give an example, an illustration, or a description. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present application shall not be construed as being preferred over, or more advantageous than, other embodiments or designs. Rather, such words are intended to present a related concept in a concrete manner.
In the description of the embodiments of the present application, unless otherwise specified, "multiple" means two or more. For example, multiple data groups means two or more data groups.
With the development of Internet technologies, managing data by using a Merkle-DAG is increasingly common in storage systems.
It is well known that a Merkle-DAG consists of at least one tree; that is, data is managed on the basis of a tree structure. In the prior art, data in a storage system is managed based on the Merkle-DAG as follows. First, the target data is divided into multiple data blocks of a fixed size. Assuming that the size of the target data is 1024 bytes, as shown in (A) of FIG. 1, the target data is divided into 4 data blocks of 256 bytes each, namely data block 1 to data block 4 shown in (B) of FIG. 1. Second, the hash value of each data block is calculated; for example, the hash values of data block 1 to data block 4 are hash value 1 to hash value 4. The data blocks and their hash values are stored in a preset table according to the correspondence between the data blocks and the hash values; Table 1 below is an example of the correspondence between data block 1 to data block 4 and their hash values. Then, the data blocks are grouped by a fixed number. For example, with 2 data blocks per group, the 4 data blocks are divided into 2 data groups, namely data group 1 and data group 2 shown in (C) of FIG. 1, where data group 1 includes data block 1 and data block 2, and data group 2 includes data block 3 and data block 4. After the data blocks are grouped, a first index tree for managing the target data is generated and saved based on the hash values of the data blocks.
Table 1
Data block index | Data block content | Hash value | Hash value (numeric)
Data block 1 | AASDECC | Hash value 1 | -502161580
Data block 2 | FWEQFWE | Hash value 2 | 257690423
Data block 3 | FWEQFFT | Hash value 3 | 257689911
Data block 4 | JYTEWQC | Hash value 4 | -416492375
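The fixed-size blocking, hashing, and grouping steps of the prior art described above can be sketched as follows. This is a minimal illustration only: SHA-256 is an assumption (the text does not name a hash function), and the helper names `split_fixed` and `group_fixed` as well as the sample data are hypothetical.

```python
import hashlib


def split_fixed(data: bytes, block_size: int) -> list[bytes]:
    """Split data into consecutive blocks of block_size bytes (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def group_fixed(items: list, group_size: int) -> list[list]:
    """Group consecutive items, group_size items per group."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


# 1024 bytes of sample data, chosen so that the four blocks differ from each other.
target = bytes((7 * i + i // 256) % 256 for i in range(1024))

blocks = split_fixed(target, 256)                         # data block 1 .. data block 4
hashes = [hashlib.sha256(b).hexdigest() for b in blocks]  # assumed hash function
preset_table = dict(zip(hashes, blocks))                  # hash value -> data block
groups = group_fixed(blocks, 2)                           # data group 1, data group 2

print(len(blocks), len(groups))  # 4 2
```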
Still taking the 1024-byte target data as an example, as shown in FIG. 2, the first index tree of the target data is generated as follows. Hash value 1 to hash value 4, which correspond to data block 1 to data block 4 into which the target data is divided, serve as the leaf nodes of the first index tree and may be denoted leaf node 1, leaf node 2, leaf node 3, and leaf node 4. Then, a hash value is calculated over the leaf nodes in each of the two data groups into which the 4 leaf nodes are divided, that is, a hash value of hash value 1 and hash value 2, and a hash value of hash value 3 and hash value 4. The hash value of hash value 1 and hash value 2 serves as the parent node of leaf node 1 and leaf node 2 (referred to as the first parent node, which is a child node in the first index tree), and the hash value of hash value 3 and hash value 4 serves as the parent node of leaf node 3 and leaf node 4 (referred to as the second parent node, which is another child node in the first index tree). Finally, a hash value of the first parent node and the second parent node is calculated and serves as the root node.
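The bottom-up construction just described can be sketched as below. The sketch assumes SHA-256 and assumes that a parent is the hash of its children's hash values concatenated in order; neither detail is stated in the text, and `build_levels` is a hypothetical helper name.

```python
import hashlib


def h(data: bytes) -> bytes:
    # SHA-256 is an assumption; the text only says "hash value".
    return hashlib.sha256(data).digest()


def build_levels(leaf_hashes: list[bytes], fanout: int = 2) -> list[list[bytes]]:
    """Return every level of the tree, leaves first and the root level last."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        # Each parent is the hash of the concatenated hashes of one group of children.
        levels.append([h(b"".join(below[i:i + fanout]))
                       for i in range(0, len(below), fanout)])
    return levels


# Leaf nodes 1 to 4: hashes of the block contents listed in Table 1.
leaves = [h(c) for c in (b"AASDECC", b"FWEQFWE", b"FWEQFFT", b"JYTEWQC")]
levels = build_levels(leaves)
first_parent, second_parent = levels[1]
root = levels[2][0]
```

With four leaves and two children per parent, the tree has three levels: four leaf nodes, two parent nodes, and one root node.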
When the target data is updated, the total data amount of the target data may change. For example, referring to FIG. 3, if 256 bytes of data are inserted at the end position of data block 1 of the target data, the data amount of the target data increases from 1024 bytes to 1280 bytes. If the target data is again divided into blocks of the fixed size of 256 bytes, 5 data blocks are obtained, and grouping these 5 data blocks in the above manner yields 3 data groups: data group 1 includes data block 1 and data block 5, data group 2 includes data block 2 and data block 3, and data group 3 includes data block 4. Before the new data was inserted, the target data was divided into two data groups: data group 1 included data block 1 and data block 2, and data group 2 included data block 3 and data block 4. It can be seen that the change in the total data amount of the target data causes a large change in the grouping of the target data. Accordingly, the second index tree shown in FIG. 3 is generated based on the hash values of data blocks 1 to 5. Compared with the first index tree, the second index tree changes considerably; the shaded part in FIG. 3 is the part that has changed. Therefore, generating the index tree for the target data after the insertion requires a large amount of computation, so the above prior art consumes considerable resources during data management.
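The effect described above can be reproduced with a small experiment. The sketch below assumes SHA-256 and hypothetical sample data; it inserts 256 bytes after the first block and counts how many leaf hashes and parent hashes of the old tree can be reused.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()  # assumed hash function


def leaves_and_parents(data: bytes, block_size: int = 256, group_size: int = 2):
    """Fixed-size blocking followed by fixed-count grouping, as in the prior art."""
    leaves = [h(data[i:i + block_size]) for i in range(0, len(data), block_size)]
    parents = [h(b"".join(leaves[i:i + group_size]))
               for i in range(0, len(leaves), group_size)]
    return leaves, parents


old = bytes((7 * i + i // 256) % 256 for i in range(1024))   # 4 distinct blocks
new = old[:256] + b"\xff" * 256 + old[256:]                   # insert after block 1

old_leaves, old_parents = leaves_and_parents(old)
new_leaves, new_parents = leaves_and_parents(new)

reused_leaves = set(old_leaves) & set(new_leaves)
reused_parents = set(old_parents) & set(new_parents)
print(len(reused_leaves), len(reused_parents))  # 4 0
```

All four original leaf hashes survive, but every parent node (and therefore the root) must be recomputed because the grouping has shifted.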
In the above solution, inserting data into the target data changes the blocking and grouping of the target data, so that a large amount of computation is required to generate the index tree for the target data after the insertion, and considerable resources are consumed during data management. To address this problem, the embodiments of the present application provide a data management method and apparatus in a storage system. The data management apparatus divides the target data into M candidate data blocks based on the content of the target data, where M is an integer greater than or equal to 2; divides the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks, where N is a positive integer less than or equal to M and each target data block includes at least one candidate data block; stores the N target data blocks and the fingerprint features of the N target data blocks, where the target data blocks are in one-to-one correspondence with their fingerprint features; and generates an index tree of the target data according to the N target data blocks, where the index tree is used to address the content of the target data.
The technical solutions provided in the embodiments of the present application can reduce resource consumption during data management.
The data management method and apparatus in a storage system provided in the embodiments of the present application can be applied to the storage system shown in FIG. 4. The storage system may be composed of solid-state drives or of other types of storage media. As shown in FIG. 4, the storage system includes a controller 401 (also referred to as the main controller) and multiple hard disks 405. The controller 401 includes a processor 402 and, optionally, a host interface 404 and n (n>0) channel controllers 403.
The controller 401 is configured to issue executable commands to the multiple hard disks 405, so that data on the hard disks 405 is read or updated.
The host interface 404 is configured to communicate with a host, receive a command request sent by the host, and forward the command request to the processor 402. The host may be, but is not limited to, a device such as a server, a personal computer, or an array controller.
The processor 402 sends executable commands to the multiple hard disks 405 according to the command request forwarded by the host interface 404. Specifically, the processor 402 is configured to perform the data management method in a storage system provided in the embodiments of the present application; for example, the processor 402 is configured to divide the target data into blocks, group the data blocks, and generate the index tree. Optionally, the processor 402 may include one or more CPUs, and each CPU may be a single-core CPU (single-CPU) or a multi-core CPU (multi-CPU).
The channel controller 403 is configured to carry the executable commands issued by the processor 402 to the hard disks 405.
Optionally, the storage system further includes a bus 406. The processor 402, the channel controller 403, the host interface 404, and the hard disks 405 are generally connected to each other through the bus 406 or in other manners.
When the storage system receives target data transmitted by the host, the host interface 404 in the controller 401 forwards the target data to the processor 402 in the controller 401. The processor 402 divides the target data into M candidate data blocks and then divides the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks. Next, according to the correspondence between each target data block and its fingerprint feature, the processor 402 sends the N target data blocks and their fingerprint features to the n hard disks 405 through the channel controller 403 for storage in the hard disks 405. Finally, the processor 402 generates the index tree of the target data according to the N target data blocks and stores the index tree in the hard disks 405.
Optionally, the apparatus that performs the data management method in a storage system provided in the embodiments of the present application may be the processor 402 in the controller of the storage system shown in FIG. 4.
With reference to the schematic architecture of the storage system shown in FIG. 4, and as shown in FIG. 5, the data management method in a storage system provided in the embodiments of the present application may include S510 to S540.
S510. The data management apparatus divides the target data into M candidate data blocks based on the content of the target data.
Here, M is an integer greater than or equal to 2.
The M candidate data blocks may be M data blocks obtained when the data management apparatus determines M-1 division points based on the content of the target data and then cuts the target data at the M-1 division points. Alternatively, the M-1 division points determined in the content of the target data merely separate the target data into M intervals, each interval being one candidate data block, without the target data being cut at the division points. The embodiments of the present application do not limit the manner in which the M candidate data blocks are obtained.
It should be understood that the content of the target data refers to all the elements that make up the target data, and the sizes of the M candidate data blocks may be all the same, partly the same, or all different.
S520. The data management apparatus divides the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks.
Here, N is a positive integer less than or equal to M.
Optionally, the fingerprint feature of a candidate data block may be the hash value of the candidate data block or another feature that uniquely identifies a data block, as determined by actual requirements. The embodiments of the present application do not limit the fingerprint feature of a data block.
Specifically, in S520 the data management apparatus determines N-1 division points according to the respective fingerprint features of the M candidate data blocks, and then divides the M candidate data blocks into N target data blocks at the N-1 division points, where each target data block includes at least one candidate data block.
S530. The data management apparatus stores the N target data blocks and the fingerprint features of the N target data blocks.
It should be noted that the target data blocks are in one-to-one correspondence with their fingerprint features.
Specifically, the fingerprint feature of a target data block and the content of that target data block may be stored in the same row of a preset table. Alternatively, the correspondence between the fingerprint feature of a target data block and the target data block (that is, the index of the target data block) may be stored in preset table 1, and the index of the target data block and the content of the target data block may be stored in preset table 2. The present application does not limit the manner in which the N target data blocks and their fingerprint features are stored.
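The two storage layouts mentioned above can be sketched with in-memory dictionaries standing in for the preset tables. This is only an illustration: SHA-256 as the fingerprint feature, the sample block contents, and all variable and function names are assumptions.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()  # assumed fingerprint feature


target_blocks = [b"ABCDEF", b"GHIJK", b"LMNOP"]

# Layout 1: one table whose rows hold a fingerprint and the block content together.
single_table = {fingerprint(b): b for b in target_blocks}

# Layout 2: table 1 maps fingerprint -> block index, table 2 maps index -> content.
fingerprint_to_index = {fingerprint(b): i
                        for i, b in enumerate(target_blocks, start=1)}
index_to_content = dict(enumerate(target_blocks, start=1))


def read_two_tables(fp: str) -> bytes:
    """Resolve a fingerprint to block content through both tables."""
    return index_to_content[fingerprint_to_index[fp]]
```

The second layout adds one level of indirection, which keeps the fingerprint table small when block contents are large.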
S540. The data management apparatus generates an index tree of the target data according to the N target data blocks, where the index tree is used to address the content of the target data.
It should be noted that, when the target data needs to be queried, the index tree corresponding to the target data is determined first. Then, starting from the root node of the index tree, the leaf nodes of the index tree are found recursively, and the content of the data block corresponding to each leaf node is looked up according to the hash value in that leaf node, thereby obtaining the target data.
For example, when the data management apparatus needs to obtain the target data corresponding to the index tree shown in FIG. 2, the data management apparatus finds the child nodes of the index tree (the first parent node and the second parent node) from the root node, then finds the 4 leaf nodes from the child nodes, and finally looks up, in the preset table, the content of the data blocks corresponding to the 4 hash values in the 4 leaf nodes, thereby obtaining the target data.
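The recursive lookup described above can be sketched as follows. The sketch assumes SHA-256, a single dictionary `store` that plays the role of both the saved index tree and the preset table, and at least one data block; `build_tree` and `read_target` are hypothetical helper names.

```python
import hashlib


def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()  # assumed hash function


def build_tree(blocks: list[bytes], store: dict, fanout: int = 2) -> str:
    """Fill store and return the root hash.

    store maps a leaf hash to block content (bytes) and an inner-node hash
    to the ordered list of its children's hashes.
    """
    level = []
    for block in blocks:
        key = h(block)
        store[key] = block
        level.append(key)
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            children = level[i:i + fanout]
            key = h("".join(children).encode())
            store[key] = children
            parents.append(key)
        level = parents
    return level[0]


def read_target(node_hash: str, store: dict) -> bytes:
    """Walk from a node down to the leaves and concatenate the block contents."""
    node = store[node_hash]
    if isinstance(node, bytes):          # leaf: the hash resolves to block content
        return node
    return b"".join(read_target(child, store) for child in node)


store = {}
blocks = [b"AASDECC", b"FWEQFWE", b"FWEQFFT", b"JYTEWQC"]   # contents from Table 1
root = build_tree(blocks, store)
print(read_target(root, store))  # b'AASDECCFWEQFWEFWEQFFTJYTEWQC'
```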
In the data management method in a storage system provided in the embodiments of the present application, the data management apparatus divides the target data into M candidate data blocks based on the content of the target data, then divides the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks, and finally generates, according to the N target data blocks, an index tree for addressing the target data. In the embodiments of the present application, the target data is first divided into multiple candidate data blocks based on its content, and the candidate data blocks are then divided into target data blocks according to their fingerprint features. The target data is not divided into target data blocks of a fixed size, so the size of a target data block is not specifically limited. When data is inserted into the target data, the size of the target data changes, but the number of target data blocks obtained by dividing the target data after the insertion does not necessarily change. Therefore, the resources consumed by data management can be reduced to a certain extent.
With reference to FIG. 5, and as shown in FIG. 6, in one implementation, the method of dividing the target data into M candidate data blocks based on the content of the target data (that is, S510) may specifically include S610 and S620.
S610. The data management apparatus determines M-1 division points of the target data according to the fingerprint feature of the target data.
It can be understood that, in determining the M-1 division points of the target data, each division point is determined in the same way. As shown in FIG. 8, the method for determining one division point of the target data includes S810 to S830.
S810. The data management apparatus determines whether the fingerprint feature of the data in a sliding window satisfies a first preset condition.
The size of the sliding window is fixed, and the sliding step of the sliding window is a preset length. The sliding step may be set according to actual requirements, for example, 1 byte or 2 bytes, which is not limited in the embodiments of the present application.
The first preset condition is that the fingerprint feature of the data in the sliding window modulo a first threshold equals a second threshold, where the first threshold and the second threshold may be preconfigured.
For example, assume that the first threshold is 20 and the second threshold is 9. As shown in (A) of FIG. 7, the content of the target data is ABCDEF......XYZ, and the size of the sliding window corresponds to two letters. If the data currently in the sliding window is "BC" and the hash value of "BC" is 2113, the hash value of the data in the sliding window modulo the first threshold is first calculated as 2113 mod 20 = 13, where mod denotes the modulo operation. Then, it is determined whether this result equals the second threshold.
From the above calculation, the hash value 2113 of the data in the sliding window modulo the first threshold 20 is 13, whereas the second threshold is 9. That is, the hash value of the data in the sliding window modulo the first threshold does not equal the second threshold, so the hash value of the data in the sliding window does not satisfy the first preset condition.
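The arithmetic of this check can be written out directly. The function name `meets_first_condition` is hypothetical; the values 2113, 20, and 9 come from the example above, and 2109 is an extra value added only to show a passing case.

```python
def meets_first_condition(fingerprint: int, first_threshold: int,
                          second_threshold: int) -> bool:
    """First preset condition: fingerprint mod first threshold == second threshold."""
    return fingerprint % first_threshold == second_threshold


print(2113 % 20)                           # 13
print(meets_first_condition(2113, 20, 9))  # False: 13 != 9
print(meets_first_condition(2109, 20, 9))  # True: 2109 = 20 * 105 + 9
```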
S820. When the fingerprint feature of first data in the sliding window satisfies the first preset condition, the data management apparatus determines the end position of the sliding window as a division point.
The first data is part of the target data.
It should be noted that the end position of the sliding window is the edge of the sliding window that leads in the sliding direction. For example, as shown in (A) of FIG. 7, the end position of the sliding window at this time is the boundary between the letters "C" and "D".
S830. When the fingerprint feature of the first data in the sliding window does not satisfy the first preset condition, the data management apparatus slides the sliding window by the preset length in a preset direction.
The preset length is the sliding step of the sliding window and is less than or equal to the length of the sliding window.
It should be noted that the sliding direction of the sliding window is preconfigured and may be from right to left or from left to right. In the embodiments of the present application, sliding to the right is used as the example throughout, and this is not repeated below.
It should be noted that, after performing S830, the data management apparatus continues to perform S810, and when the fingerprint feature of second data in the sliding window satisfies the first preset condition, determines the end position of the sliding window as a division point, where the second data is part of the target data. In other words, after performing S830, the data management apparatus repeats S810 until the fingerprint feature of the data in the sliding window satisfies the first preset condition, and then determines the end position of the sliding window as a division point.
For example, based on the example of S810, assume that the preset length is the size of one letter. Referring to (A) of FIG. 7, when the hash value of the data in the sliding window (that is, "BC") does not satisfy the first preset condition, the sliding window is slid one letter to the right, so that the data in the sliding window is now "CD", as shown in (B) of FIG. 7. It is then determined whether the hash value of the data "CD" in the sliding window satisfies the first preset condition. If it does, the end position of the sliding window, that is, the boundary between "D" and "E", is determined as a division point.
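Steps S810 to S830 can be sketched as a loop over window positions. The sketch below makes several assumptions that the text does not fix: the window keeps sliding by the same step after a division point is found, a match at the very end of the data is ignored, and the fingerprint function is passed in as a parameter. The toy fingerprint values (2113 for "BC" from the example, 2109 for "CD" chosen so that the condition holds) are hypothetical.

```python
from typing import Callable


def find_division_points(data: bytes, window_size: int, step: int,
                         first_threshold: int, second_threshold: int,
                         fingerprint: Callable[[bytes], int]) -> list[int]:
    """Return the offsets (window end positions) that are division points."""
    points = []
    start = 0
    while start + window_size <= len(data):
        end = start + window_size
        matched = fingerprint(data[start:end]) % first_threshold == second_threshold
        if matched and end < len(data):   # S820: the window end becomes a division point
            points.append(end)
        start += step                     # S830: slide by the preset length
    return points


def toy_fingerprint(window: bytes) -> int:
    # Hypothetical values: only "CD" satisfies fingerprint mod 20 == 9.
    return {b"BC": 2113, b"CD": 2109}.get(window, 0)


points = find_division_points(b"ABCDEFGH", window_size=2, step=1,
                              first_threshold=20, second_threshold=9,
                              fingerprint=toy_fingerprint)
print(points)  # [4]
```

Offset 4 is the end of the window "CD", so the data would be separated into "ABCD" and "EFGH".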
S620. The data management apparatus divides the target data into M candidate data blocks at the M-1 division points.
Still taking the target data "ABCDEF......XYZ" as an example, assume that 4 division points of the target data are determined in S610 and that the target data is divided into 5 intervals at these 4 division points. The 5 intervals correspond to 5 candidate data blocks, namely {ABCDEF}, {GHIJK}, {LMNOP}, {QRSTU}, and {VWXYZ}.
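The example above can be sketched in two forms: as intervals that only mark where each candidate data block begins and ends, and as the blocks obtained by actually cutting the data. The division point offsets 6, 11, 16, and 21 are inferred from the five blocks listed above, and `to_intervals` is a hypothetical helper name.

```python
def to_intervals(length: int, points: list[int]) -> list[tuple[int, int]]:
    """Turn M-1 division points into M half-open intervals [start, end)."""
    bounds = [0, *points, length]
    return list(zip(bounds, bounds[1:]))


target = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
division_points = [6, 11, 16, 21]

# Form 1: intervals only; the target data itself is left uncut.
intervals = to_intervals(len(target), division_points)

# Form 2: the same intervals used to cut the data into candidate data blocks.
candidate_blocks = [target[start:end] for start, end in intervals]
print(candidate_blocks)  # ['ABCDEF', 'GHIJK', 'LMNOP', 'QRSTU', 'VWXYZ']
```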
It should be noted that, after the target data is updated (for example, data is inserted into the target data), the content of the target data changes, but the positions in the updated target data that satisfy the first preset condition may not have changed. In that case, after the updated target data is divided into candidate data blocks, the number of candidate data blocks does not change, and the content of most candidate data blocks may also remain unchanged.
For example, based on the example of S620, assume that the first threshold is 20 and the second threshold is 9. When the data "AALMXXWX" is inserted into the target data shown in (A) or (B) of FIG. 7, assume that the insertion position is between "C" and "D" in the candidate data block {ABCDEF}. S610 and S620 are then performed on the new target data. Because the positions that satisfy the first preset condition have not changed in the target data after the insertion, the number and positions of the division points determined by S810 to S830 remain the same. The target data is divided into 5 candidate data blocks at the 4 division points, namely {ABCAALMXXWXDEF}, {GHIJK}, {LMNOP}, {QRSTU}, and {VWXYZ}.
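Why division points tend to survive an insertion can be checked with a small experiment. The sketch assumes a 4-byte window, a step of one byte, a fingerprint taken from the first four bytes of SHA-256, and hypothetical sample data; none of these choices comes from the text. Because each decision depends only on the bytes inside the window, division points that lie entirely before the insertion stay where they are, and those that lie entirely after it are shifted by the length of the inserted data.

```python
import hashlib

WINDOW, FIRST_THRESHOLD, SECOND_THRESHOLD = 4, 20, 9   # assumed parameters


def fingerprint(window: bytes) -> int:
    # Assumed fingerprint: first 4 bytes of SHA-256 read as an integer.
    return int.from_bytes(hashlib.sha256(window).digest()[:4], "big")


def division_points(data: bytes) -> set[int]:
    """Window end positions whose window content meets the first preset condition."""
    return {end for end in range(WINDOW, len(data))
            if fingerprint(data[end - WINDOW:end]) % FIRST_THRESHOLD == SECOND_THRESHOLD}


# 256 bytes of hypothetical sample data.
old = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(8))
insert_at, inserted = 100, b"AALMXXWX"
new = old[:insert_at] + inserted + old[insert_at:]

old_points = division_points(old)
new_points = division_points(new)

# Points whose window lies entirely before the insertion are unchanged.
kept_before = {p for p in old_points if p <= insert_at}
# Points whose window lies entirely after the insertion are shifted.
shifted_after = {p + len(inserted) for p in old_points if p - WINDOW >= insert_at}
```

Only windows that overlap the inserted bytes can gain or lose a division point, so the candidate data blocks away from the edit are reproduced exactly.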
可选的,结合图5,如图9所示,在另一种实现方式中,上述基于目标数据的内容,将目标数据划分为M个候选数据块(即S510)的方法可以包括:S910-S920。Optionally, referring to FIG. 5, as shown in FIG. 9, in another implementation manner, the above-mentioned method of dividing the target data into M candidate data blocks (that is, S510) based on the content of the target data may include: S910- S920.
S910、数据管理装置根据目标数据的变换值,确定目标数据的M-1个划分点。S910. The data management device determines M-1 division points of the target data according to the conversion value of the target data.
上述变换值是基于预设规则将预设窗内的各个数据转换为数字形式的值。The above-mentioned conversion value is based on a preset rule to convert each data in the preset window into a value in digital form.
上述各个数据的变换值是采用一种变换方法对目标数据中的各个数据进行变换后得到的值,本申请实施例中,变换值可以是将数据转换为数字后得到的值,如当目标数据为字母组成的文件时,一个字母的转换值可以是该字母对应的阿斯克码(american standard code for information interchange,ASCII),也可以是该字母对应的哈希值或其他以数字形式表征的值。如图10中的(A)图所示,是将目标数据的内容按照预设规则转换为数字而得到的目标数据的变换值文本。The transformation value of each data above is the value obtained after transforming each data in the target data by using a transformation method. When the file is composed of letters, the conversion value of a letter can be the ASCII (American standard code for information interchange, ASCII) corresponding to the letter, or the hash value corresponding to the letter or other values represented in digital form . As shown in (A) in FIG. 10 , it is the conversion value text of the target data obtained by converting the content of the target data into numbers according to preset rules.
It can be understood that, in determining the M-1 division points of the target data, every division point is determined in the same way. In one implementation, each division point of the target data is determined by using a preset window, as shown in FIG. 11, which includes S1110-S1130.
S1110. The data management device determines whether the transform value of the data in the second fixed-length window of the preset window satisfies a second preset condition.
It should be noted that the preset window consists of a first fixed-length window, a variable-length window, and a second fixed-length window that are adjacent in that order. The length of the first fixed-length window is an integer greater than 0. The initial length of the variable-length window is a preset value, which is an integer greater than or equal to 0. The length of the second fixed-length window is the length of one piece of data in the target data, or the length of the transform value of one piece of data in the target data. That is, when the length of the variable-length window is greater than 0, the length of the second fixed-length window is the smallest unit of the data covered by the variable-length window and the first fixed-length window. For example, when the first fixed-length window is 4 bytes long and the variable-length window is 2 bytes long, the second fixed-length window is 1 byte long.
The second preset condition is that the transform value of the data in the second fixed-length window is greater than the maximum of the transform values in the first fixed-length window and also greater than the maximum of the transform values in the variable-length window.
Specifically, the data management device determines the maximum of the transform values in the first fixed-length window, referred to as the first maximum value, and then determines the maximum of the transform values in the variable-length window, referred to as the second maximum value. The data management device then determines whether the transform value in the second fixed-length window is greater than both the first maximum value and the second maximum value.
S1120. When the transform value of the data in the second fixed-length window satisfies the second preset condition, the data management device determines the end position of the second fixed-length window as a division point.
S1130. When the transform value of the data in the second fixed-length window does not satisfy the second preset condition, the data management device increases the length of the variable-length window in the preset window.
The data management device may increase the length of the variable-length window by a preconfigured length, which may be chosen according to the actual situation, for example, 2 bytes.
It can be understood that, because the second fixed-length window is adjacent to the variable-length window, the second fixed-length window moves backward by the preset length when the variable-length window grows by that length.
It should be noted that, after performing S1130, the data management device determines whether the transform value of the data in the current second fixed-length window satisfies the second preset condition. If it does, the end position of the current second fixed-length window is determined as the division point. If it does not, the data management device keeps increasing the length of the variable-length window until the transform value of the data in the second fixed-length window satisfies the second preset condition, and then determines the end position of the second fixed-length window as the division point.
For example, as shown in part (A) of FIG. 10, the first fixed-length window contains 12 and 18, the variable-length window contains 2, and the second fixed-length window contains 6. The transform value 6 in the second fixed-length window is less than the maximum value 18 in the first fixed-length window, so the second preset condition is not satisfied. The data management device then increases the length of the variable-length window by one unit, to 2, as shown in part (B) of FIG. 10. The first fixed-length window still contains 12 and 18, the variable-length window now contains 2 and 6, and the second fixed-length window contains 45. The transform value 45 in the second fixed-length window is greater than the maximum value 18 in the first fixed-length window and also greater than the maximum value 6 in the variable-length window, so the data management device determines the end position of the second fixed-length window as a division point.
It should be noted that, after one division point is determined, the preset window is applied again starting from the transform value that follows that division point, and the second preset condition is checked to determine the next division point. The variable-length window of the preset window used to determine the next division point starts again at its initial length. It should be understood that every division point is determined in a similar way. For example, referring to part (C) of FIG. 10, after a division point is determined (for example, at the position of transform value 45), the preset window used to determine the next division point starts at transform value 5: its first fixed-length window contains the transform values 5 and 9, its variable-length window contains 36, and its second fixed-length window contains 5.
S920. The data management device divides the target data into M candidate data blocks according to the M-1 division points.
For example, based on S1110-S1130, the transform-value text of the target data can be divided into four candidate data blocks: {12,18,2,6,45}, {5,9,36,5,5,65}, {56,5,9,7,62}, and {8,8,432,9,81,20}.
It should be noted that updating the target data (for example, inserting data into it) changes its content, but the positions in the updated target data that satisfy the second preset condition may remain the same. In that case, when the updated target data is divided into candidate blocks, the number of candidate blocks does not change, and the content of most candidate blocks may also remain unchanged.
For example, suppose that data whose transform values are {9,10,12,1,-40} is inserted into the target text shown in part (A) of FIG. 10, between 18 and 2 in the candidate data block {12,18,2,6,45}. Because the positions that satisfy the second preset condition do not change after the insertion, the number and positions of the division points determined by S1110-S1130 stay the same. The target data is divided by these three division points into four candidate data blocks: {12,18,9,10,12,1,-40,2,6,45}, {5,9,36,5,5,65}, {56,5,9,7,62}, and {8,8,432,9,81,20}.
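The window procedure of S1110-S1130 and the insertion example can be sketched in Python as follows. This is an illustrative sketch, not the claimed implementation: the function name `split_by_window` and its parameters are hypothetical, the defaults (a first fixed-length window of 2 values, a variable-length window that starts at 1 value and grows by 1, and a second fixed-length window of 1 value) are read off the FIG. 10 example, and treating the unmatched tail of the data as the last block is an assumption.

```python
def split_by_window(values, first_len=2, var_init=1, step=1):
    """Split a list of transform values into candidate blocks.

    A cut is placed after the single value in the second fixed-length
    window when that value exceeds every value in the first fixed-length
    window and every value in the variable-length window.
    """
    blocks, start, n = [], 0, len(values)
    while start < n:
        var_len = var_init          # each new window restarts at the initial length
        end = n                     # assumption: leftover data forms the final block
        while True:
            pos = start + first_len + var_len   # index of the second fixed-length window
            if pos >= n:
                break
            first_max = max(values[start:start + first_len])
            var_vals = values[start + first_len:pos]
            # An empty variable-length window imposes no constraint.
            if values[pos] > first_max and all(values[pos] > v for v in var_vals):
                end = pos + 1       # division point = end of the second window
                break
            var_len += step         # S1130: grow the variable-length window
        blocks.append(values[start:end])
        start = end
    return blocks


if __name__ == "__main__":
    data = [12, 18, 2, 6, 45, 5, 9, 36, 5, 5, 65,
            56, 5, 9, 7, 62, 8, 8, 432, 9, 81, 20]
    print(split_by_window(data))
    # Insert {9,10,12,1,-40} between 18 and 2, as in the example above.
    print(split_by_window(data[:2] + [9, 10, 12, 1, -40] + data[2:]))
```

Under these assumptions, the sketch reproduces the four candidate data blocks of the example, and after the insertion only the first block differs.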
Optionally, with reference to FIG. 6 or FIG. 9 and as shown in FIG. 12, dividing the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks (that is, S520) includes S1210-S1220.
S1210. The data management device determines the end positions of those candidate data blocks, among the M candidate data blocks, whose fingerprint features satisfy a third preset condition as the N-1 division points of the candidate data blocks.
The fingerprint feature of a candidate data block may be a single fingerprint feature computed over all the data in the block. For example, if the data in the candidate data block is "WANHH", the hash value of the block is the hash value of "WANHH" as a whole. The fingerprint feature may instead be computed over part of the data in the block. For example, if the data in the candidate data block is "WANHH", the hash value of the block is the hash value of "NHH". In the embodiments of this application, the fingerprint feature of a candidate data block is described using the fingerprint feature computed over all the data in the block as an example, and this is not repeated below.
The third preset condition is that the remainder of the fingerprint feature of the candidate data block modulo the third threshold falls within a preset range.
S1220. The data management device divides the M candidate data blocks into N target data blocks according to the N-1 division points.
For example, assume that the third threshold is 60 and the preset range is 50 to 60, and that the target data is divided into five candidate data blocks: {ABCAALMXXWXDEF}, {GHIJK}, {LMNOP}, {QRSTU}, and {VWXYZ}. The hash values computed for the five candidate data blocks are -1130721247, 67787465, 72558990, 77330515, and 82102040, respectively, and their remainders modulo the third threshold are -7, 5, 30, 55, and 20, respectively. The remainder 55, which belongs to candidate data block {QRSTU}, falls within the preset range, so {QRSTU} satisfies the third preset condition and its end position is taken as the division point of the five candidate data blocks. The candidate data blocks are thereby divided into two target data blocks: {ABCAALMXXWXDEFGHIJKLMNOPQRSTU} and {VWXYZ}.
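The merging step of S1210-S1220 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the method of the application: the function names are hypothetical, and the fingerprint is assumed to be a Java-style 32-bit string hash with a sign-preserving remainder, which is an assumption that happens to reproduce the hash values listed above for the five-letter blocks.

```python
def java_string_hash(s):
    """32-bit signed hash, h = 31*h + code, as in Java's String.hashCode."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def trunc_mod(value, modulus):
    """Remainder that keeps the sign of the dividend (Python's % does not)."""
    r = abs(value) % modulus
    return r if value >= 0 else -r


def merge_candidates(blocks, modulus, low, high):
    """Concatenate candidate blocks into target blocks.

    A target block ends at every candidate block whose hash remainder
    lies in [low, high]; any trailing blocks form the last target block.
    """
    targets, current = [], []
    for block in blocks:
        current.append(block)
        if low <= trunc_mod(java_string_hash(block), modulus) <= high:
            targets.append("".join(current))
            current = []
    if current:
        targets.append("".join(current))
    return targets


if __name__ == "__main__":
    candidates = ["ABCAALMXXWXDEF", "GHIJK", "LMNOP", "QRSTU", "VWXYZ"]
    print(merge_candidates(candidates, modulus=60, low=50, high=60))
```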
Optionally, the target data blocks in S520 may be divided in a way similar to the candidate-block division methods S610-S620 and S910-S920. For details, refer to the descriptions of S610-S620 and S910-S920, which are not repeated here.
In the method for managing data in a storage system provided by the embodiments of this application, the data management device divides the target data into M candidate data blocks, and then determines the end positions of those candidate data blocks whose fingerprint features satisfy the third preset condition as the N-1 division points of the candidate data blocks, thereby determining N target data blocks. In this way, after the target data is updated (for example, data is inserted into it), even if the fingerprint features or transform values of the inserted data change the number of candidate data blocks, the positions at which the resulting candidate data blocks satisfy the third preset condition do not necessarily change, and so the number of target data blocks does not necessarily change either. Therefore, when an index tree is built from the target data blocks, it differs from the index tree of the target data before the insertion only in the hash values of a few nodes, which saves the resources consumed by data management.
Optionally, with reference to FIG. 6 or FIG. 9 and as shown in FIG. 13, S540 specifically includes S1310-S1320.
S1310. The data management device divides the N target data blocks into at least one data group according to the respective fingerprint features of the N target data blocks.
It should be noted that the fingerprint feature of a target data block may be a hash value of all the data in the target data block, or a hash value of part of the data in the target data block. In the embodiments of this application, the fingerprint feature of a target data block is described using the hash value of all the data in the block as an example.
For example, if the data in the target data block is {ABCAALMXXWXDEFGHIJKLMNOP}, the fingerprint feature of the target data block is the hash value of "ABCAALMXXWXDEFGHIJKLMNOP".
Optionally, with reference to FIG. 13 and as shown in FIG. 14, in one implementation S1310 specifically includes S1410-S1420.
S1410. The data management device determines each target data block, among the N target data blocks, whose fingerprint feature satisfies a fourth preset condition as a division point of the N target data blocks, so that at least one division point is determined.
The fourth preset condition is either of the following:
The remainder of the fingerprint feature of the target data block modulo the fourth threshold is greater than or equal to the fifth threshold.
Or, the number of target data blocks in the data group is between the second number threshold and the first number threshold, and the remainder of the fingerprint feature of the target data block modulo the fourth threshold is greater than or equal to the fifth threshold, where the first number threshold is greater than the second number threshold.
It should be noted that the first number threshold (the maximum), the second number threshold (the minimum), the fourth threshold, and the fifth threshold are preconfigured.
S1420. The data management device divides the N target data blocks into multiple data groups according to the at least one division point of the target data blocks.
For example, assume that the target data is divided into four target data blocks, {ALMXXWXDEFGHIJK}, {ABRFGRTGRTRGE}, {DEWFRTNEBJ}, and {JDIEOFJDEJFOEW}, that the fourth threshold is 80, and that the fifth threshold is 70. The remainders of the hash values of the four target data blocks modulo the fourth threshold 80 are computed as 25, 75, 33, and 55, respectively. The target data block {ABRFGRTGRTRGE} therefore satisfies the fourth preset condition, and the target data blocks are divided into two groups: {ALMXXWXDEFGHIJK} and {ABRFGRTGRTRGE} form one group, and {DEWFRTNEBJ} and {JDIEOFJDEJFOEW} form the other.
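The grouping rule of S1410-S1420 can be sketched in Python as follows. The function name `group_target_blocks` and its parameters are hypothetical, and the fingerprints 105, 155, 113, and 135 are made-up non-negative values chosen only because their remainders modulo 80 are 25, 75, 33, and 55, as in the example; the application does not state the actual hash values.

```python
def group_target_blocks(fingerprints, modulus, floor,
                        min_count=None, max_count=None):
    """Group block indices; a group ends at a block meeting the condition.

    Basic form: fingerprint % modulus >= floor.
    Optional form: additionally, the current group size must lie in
    [min_count, max_count] for the block to act as a division point.
    Fingerprints are assumed to be non-negative integers.
    """
    groups, current = [], []
    for index, fp in enumerate(fingerprints):
        current.append(index)
        hit = fp % modulus >= floor
        if hit and min_count is not None and len(current) < min_count:
            hit = False
        if hit and max_count is not None and len(current) > max_count:
            hit = False
        if hit:
            groups.append(current)
            current = []
    if current:                      # trailing blocks form the last group
        groups.append(current)
    return groups


if __name__ == "__main__":
    blocks = ["ALMXXWXDEFGHIJK", "ABRFGRTGRTRGE", "DEWFRTNEBJ", "JDIEOFJDEJFOEW"]
    fingerprints = [105, 155, 113, 135]          # hypothetical values
    for group in group_target_blocks(fingerprints, modulus=80, floor=70):
        print([blocks[i] for i in group])
```

With a minimum group size larger than 2, the second block would no longer act as a division point in this sketch, which illustrates how the optional count thresholds bound the group size.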
Optionally, S1310 may also use the candidate-block division methods S610-S620 and/or S910-S920. For details, refer to the descriptions of S610-S620 and/or S910-S920, which are not repeated here.
S1320. The data management device generates the index tree of the target data based on the respective fingerprint features of the multiple data groups.
In the method for managing data in a storage system provided by the embodiments of this application, after dividing the target data blocks, the data management device determines each target data block whose fingerprint feature satisfies the fourth preset condition as a division point of the target data blocks, and divides the N target data blocks into multiple data groups according to the at least one division point. Compared with the prior art, after the target data is updated, even if the number of target data blocks changes, the division points used to group the target data blocks stay the same as long as the updated data (for example, the inserted data) does not satisfy the fourth preset condition. The number of groups of target data blocks is then exactly the same as before the update. When the index tree of the updated target data is built, only some leaf nodes, the hash values of some child nodes, and the hash value of the root node of the index tree of the target data before the insertion need to be modified, which saves the resources consumed by data management.
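The effect described above can be illustrated with a small Python sketch of a two-level hash tree. The layout (one leaf per target data block, one node per data group, one root), the use of SHA-256, and the name `build_index_tree` are all assumptions made for illustration; the application does not prescribe them.

```python
import hashlib


def digest(data):
    """Hex SHA-256 digest of a bytes object (assumed fingerprint function)."""
    return hashlib.sha256(data).hexdigest()


def build_index_tree(groups):
    """Build leaf, group, and root hashes from groups of byte-string blocks."""
    leaves = [[digest(block) for block in group] for group in groups]
    group_nodes = [digest("".join(hashes).encode()) for hashes in leaves]
    root = digest("".join(group_nodes).encode())
    return {"leaves": leaves, "groups": group_nodes, "root": root}


if __name__ == "__main__":
    old = build_index_tree([[b"ALMXXWXDEFGHIJK", b"ABRFGRTGRTRGE"],
                            [b"DEWFRTNEBJ", b"JDIEOFJDEJFOEW"]])
    # Update one block in the first group; the grouping itself is unchanged.
    new = build_index_tree([[b"ALMXX-new-WXDEFGHIJK", b"ABRFGRTGRTRGE"],
                            [b"DEWFRTNEBJ", b"JDIEOFJDEJFOEW"]])
    print("second group node unchanged:", old["groups"][1] == new["groups"][1])
    print("root changed:", old["root"] != new["root"])
```

In this sketch only the updated leaf, its group node, and the root are recomputed with new values; the nodes of the untouched group keep their hashes.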
Correspondingly, an embodiment of this application provides a data management device in a storage system, which is configured to perform the steps of the above data management method. In the embodiments of this application, the data management device may be divided into functional modules according to the above method examples. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The division of modules in the embodiments of this application is schematic and is only a division by logical function; other divisions are possible in an actual implementation.
When each functional module corresponds to one function, FIG. 15 shows a possible schematic structure of the data management device in the storage system involved in the above embodiments. As shown in FIG. 15, the data management device in the storage system includes a processing module 1501, a storage module 1502, and a generating module 1503.
The processing module 1501 is configured to divide the target data into M candidate data blocks based on the content of the target data, for example, by performing step S510 in the above method embodiment.
The processing module 1501 is further configured to divide the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks, for example, by performing step S520 in the above method embodiment.
The storage module 1502 is configured to store the N target data blocks and the fingerprint features of the N target data blocks, for example, by performing step S530 in the above method embodiment.
The generating module 1503 is configured to generate an index tree of the target data according to the N target data blocks, where the index tree is used to address the content of the target data, for example, by performing step S540 in the above method embodiment.
Optionally, the data management device in the storage system provided by the embodiments of this application further includes a determining module 1504.
The determining module 1504 is configured to determine M-1 division points of the target data according to the fingerprint features of the target data, for example, by performing step S610 in the above method embodiment.
The processing module 1501 is specifically configured to divide the target data into M candidate data blocks according to the M-1 division points, for example, by performing step S620 in the above method embodiment.
Optionally, the data management device in the storage system provided by the embodiments of this application further includes a sliding module 1505. The determining module 1504 is configured to determine the end position of the sliding window as a division point when the fingerprint feature of the first data in the sliding window satisfies the first preset condition, for example, by performing step S820 in the above method embodiment.
The sliding module 1505 is configured to slide the sliding window by a preset length in a preset direction when the fingerprint feature of the first data in the sliding window does not satisfy the first preset condition, for example, by performing step S830 in the above method embodiment.
Optionally, the determining module 1504 is further configured to determine M-1 division points of the target data according to the transform values of the target data, for example, by performing step S910 in the above method embodiment.
The processing module 1501 is specifically configured to divide the target data into M candidate data blocks according to the M-1 division points, for example, by performing step S920 in the above method embodiment.
Optionally, the determining module 1504 is further configured to determine the end position of the second fixed-length window as a division point when the transform value of the data in the second fixed-length window satisfies the second preset condition, for example, by performing step S1120 in the above method embodiment.
The processing module 1501 is further configured to increase the length of the variable-length window in the preset window when the transform value of the data in the second fixed-length window does not satisfy the second preset condition, and to determine the end position of the second fixed-length window as the division point once the transform value of the data in the second fixed-length window satisfies the second preset condition, for example, by performing step S1130 in the above method embodiment.
Optionally, the determining module 1504 is configured to determine the end positions of those candidate data blocks, among the M candidate data blocks, whose fingerprint features satisfy the third preset condition as the N-1 division points of the candidate data blocks, for example, by performing step S1210 in the above method embodiment.
The processing module 1501 is configured to divide the M candidate data blocks into N target data blocks according to the N-1 division points, for example, by performing step S1220 in the above method embodiment.
Optionally, the processing module 1501 is further configured to divide the N target data blocks into at least one data group according to the respective fingerprint features of the N target data blocks, for example, by performing step S1310 in the above method embodiment.
The generating module 1503 is further configured to generate the index tree of the target data based on the respective fingerprint features of the multiple data groups, for example, by performing step S1320 in the above method embodiment.
Optionally, the determining module 1504 is configured to determine each target data block, among the N target data blocks, whose fingerprint feature satisfies the fourth preset condition as a division point of the N target data blocks, so that at least one division point is determined, for example, by performing step S1410 in the above method embodiment.
The processing module 1501 is configured to divide the N target data blocks into multiple data groups according to the at least one division point of the target data blocks, for example, by performing step S1420 in the above method embodiment.
The modules of the data management device in the storage system may also be used to perform other actions in the above method embodiments. All related content of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules and is not repeated here.
When an integrated unit is used, FIG. 16 shows a schematic structure of the data management device in the storage system provided by the embodiments of this application. In FIG. 16, the data management device includes a processing module 1601 and a communication module 1602. The processing module 1601 is configured to control and manage the actions of the data management device in the storage system, for example, to perform the steps performed by the processing module 1501, the generating module 1503, the determining module 1504, and the sliding module 1505, and/or other processes of the technology described herein. The communication module 1602 is configured to support interaction between the data management device in the storage system and other devices. As shown in FIG. 16, the data management device in the storage system may further include a storage module 1603, which is configured to store the program code of the data management device and the correspondence between the target data blocks and the fingerprint features of the target data blocks, among other things.
The processing module 1601 may be a processor or a controller, for example, the controller 401 or the processor 402 in FIG. 4. The communication module 1602 may be a transceiver, an RF circuit, a communication interface, or the like, for example, the bus 406 and/or the channel controller 403 in FIG. 4. The storage module 1603 may be a memory, for example, the hard disk 405 in FIG. 4.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a magnetic disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), a semiconductor medium (for example, a solid state drive (SSD)), or the like.
From the description of the above implementations, those skilled in the art can clearly understand that, for convenience and brevity of description, the division into the functional modules described above is used only as an example. In practical applications, the above functions may be assigned to different functional modules as required; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. For the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a flash memory, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.
The above is only a specific implementation of the present invention, and the protection scope of the present invention is not limited thereto. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (21)

  1. A method for managing data in a storage system, characterized by comprising:
    dividing target data into M candidate data blocks based on the content of the target data, where M is an integer greater than or equal to 2;
    dividing the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks, where N is a positive integer less than or equal to M, and each target data block includes at least one candidate data block;
    storing the N target data blocks and the fingerprint features of the N target data blocks, wherein the target data blocks are in one-to-one correspondence with the fingerprint features of the target data blocks;
    generating an index tree of the target data according to the N target data blocks, wherein the index tree is used to address the content of the target data.
  2. The method according to claim 1, wherein dividing the target data into M candidate data blocks based on the content of the target data specifically comprises:
    determining M-1 division points of the target data according to a fingerprint feature of the target data;
    dividing the target data into the M candidate data blocks according to the M-1 division points.
  3. The method according to claim 2, wherein determining the M-1 division points of the target data according to the fingerprint feature of the target data comprises:
    for any one of the M-1 division points, when the fingerprint feature of first data within a sliding window satisfies a first preset condition, determining the end position of the sliding window as a division point, wherein the first data is a part of the target data, and the first preset condition is that the fingerprint feature of the target data within the sliding window, taken modulo a first threshold, is equal to a second threshold.
  4. The method according to claim 2, wherein determining the M-1 division points of the target data according to the fingerprint feature of the target data comprises:
    for any one of the M-1 division points, when the fingerprint feature of first data within a sliding window does not satisfy a first preset condition, sliding the sliding window by a preset length in a preset direction, and, when the fingerprint feature of second data within the sliding window satisfies the first preset condition, determining the end position of the sliding window as a division point, wherein the second data is a part of the target data and is different from the first data; the preset length is less than or equal to the length of the sliding window, and the first preset condition is that the fingerprint feature of the target data within the sliding window, taken modulo a first threshold, is equal to a second threshold.
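As an illustration only, the sliding-window rule of claims 3 and 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the fingerprint function (a truncated SHA-256), the window length, the slide step, both threshold values, and the choice to restart the window immediately after each division point are all hypothetical.

```python
import hashlib


def window_fingerprint(window: bytes) -> int:
    # Hypothetical fingerprint: the first 8 bytes of SHA-256, read as an integer.
    return int.from_bytes(hashlib.sha256(window).digest()[:8], "big")


def find_division_points(data: bytes, window_len: int = 16, step: int = 1,
                         first_threshold: int = 64,
                         second_threshold: int = 0) -> list[int]:
    """Return the offsets at which the target data is cut into candidate blocks."""
    if not 0 < step <= window_len:
        raise ValueError("the slide step must be in (0, window_len]")
    points = []
    start = 0
    while start + window_len <= len(data):
        end = start + window_len
        if window_fingerprint(data[start:end]) % first_threshold == second_threshold:
            # First preset condition met: the window's end position is a division point.
            if end < len(data):
                points.append(end)
            start = end  # assumption: the next window starts right after the cut
        else:
            start += step  # condition not met: slide the window by the preset length
    return points


def split_at(data: bytes, points: list[int]) -> list[bytes]:
    """Cut the data at the given offsets; M-1 points yield M candidate blocks."""
    bounds = [0, *points, len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]
```

Because a cut depends only on the bytes inside the window, inserting or deleting data early in a file tends to shift only nearby division points, which is the usual motivation for content-based division over fixed-size division.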
  5. The method according to claim 1, wherein dividing the target data into M candidate data blocks based on the content of the target data specifically comprises:
    determining M-1 division points of the target data according to transformation values of the target data, wherein a transformation value is a value obtained by converting each piece of data within a preset window into numeric form based on a preset rule;
    dividing the target data into the M candidate data blocks according to the M-1 division points.
  6. The method according to claim 5, wherein determining the M-1 division points of the target data according to the transformation values of the target data comprises:
    the preset window comprises a first fixed-length window, a variable-length window, and a second fixed-length window that are adjacent in sequence; the length of the second fixed-length window is the length of one piece of data in the target data or the length of the transformation value of one piece of data in the target data;
    for any one of the M-1 division points, when the transformation value of the data included in the second fixed-length window satisfies the second preset condition, determining the end position of the second fixed-length window as a division point, wherein the second preset condition is that the transformation value of the data within the second fixed-length window is greater than the maximum of the transformation values of the pieces of data within the first fixed-length window and greater than the maximum of the transformation values of the pieces of data within the variable-length window.
  7. The method according to claim 5, wherein determining the M-1 division points of the target data according to the transformation values of the target data comprises:
    the preset window comprises a first fixed-length window, a variable-length window, and a second fixed-length window that are adjacent in sequence; the length of the second fixed-length window is the length of one piece of data in the target data or the length of the transformation value of one piece of data in the target data;
    for any division point, when the transformation value of the data within the second fixed-length window does not satisfy the second preset condition, increasing the length of the variable-length window in the preset window, and, when the transformation value of the data within the second fixed-length window satisfies the second preset condition, determining the end position of the second fixed-length window as a division point, wherein the second preset condition is that the transformation value of the data within the second fixed-length window is greater than the maximum of the transformation values of the pieces of data within the first fixed-length window and greater than the maximum of the transformation values of the pieces of data within the variable-length window.
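The three-window rule of claims 6 and 7 can be illustrated with a short Python sketch. It rests on assumptions that the claims do not fix: one piece of data is taken to be one byte, the preset rule maps a byte to its integer value, the variable-length window starts empty, and the next preset window begins right after each division point. The function name and the `first_len` parameter are hypothetical.

```python
def find_division_points_extremum(data: bytes, first_len: int = 4) -> list[int]:
    """Cut where one byte exceeds every byte in the two windows before it."""
    if first_len < 1:
        raise ValueError("the first fixed-length window must hold at least one byte")
    points = []
    n = len(data)
    start = 0  # start of the current preset window
    while start + first_len < n:
        # Maximum transformation value over the first fixed-length window.
        window_max = max(data[start:start + first_len])
        # The second fixed-length window is the single byte at `pos`;
        # the variable-length window between the two is empty at first.
        pos = start + first_len
        while pos < n and data[pos] <= window_max:
            # Second preset condition not met: lengthen the variable-length window
            # by one byte. The absorbed byte is <= window_max, so the maximum over
            # both windows is unchanged.
            pos += 1
        if pos >= n:
            break  # no further division point in the remaining data
        end = pos + 1  # end position of the second fixed-length window
        if end < n:
            points.append(end)
        start = end  # assumption: the next preset window starts after the cut
    return points
```

For example, with `first_len=2` the byte sequence 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3 is cut after the bytes 4, 9, and 8, because each of those bytes is the first one larger than everything in the windows preceding it.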
  8. The method according to any one of claims 1-7, wherein dividing the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks specifically comprises:
    determining, as N-1 division points of the candidate data blocks, the end positions of those candidate data blocks among the M candidate data blocks whose fingerprint features satisfy a third preset condition, wherein the third preset condition is that the fingerprint feature of the data of the candidate data block, taken modulo a third threshold, falls within a preset range;
    dividing the M candidate data blocks into the N target data blocks according to the N-1 division points.
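The second-level division of claim 8 can be pictured as merging runs of candidate blocks. The following Python sketch is illustrative only: the fingerprint function (a truncated SHA-256), the value of the third threshold, and the preset range are assumptions, and `merge_candidates` is a hypothetical helper name.

```python
import hashlib


def fingerprint(block: bytes) -> int:
    # Hypothetical fingerprint: the first 8 bytes of SHA-256, read as an integer.
    return int.from_bytes(hashlib.sha256(block).digest()[:8], "big")


def merge_candidates(candidates: list[bytes], third_threshold: int = 4,
                     preset_range: range = range(0, 1)) -> list[bytes]:
    """Merge M candidate blocks into N <= M target blocks."""
    targets = []
    pending = []  # candidate blocks collected since the last division point
    for block in candidates:
        pending.append(block)
        if fingerprint(block) % third_threshold in preset_range:
            # Third preset condition met: this block's end is a division point.
            targets.append(b"".join(pending))
            pending = []
    if pending:
        # Whatever follows the last division point forms the final target block.
        targets.append(b"".join(pending))
    return targets
```

Widening the preset range makes division points more frequent and target blocks smaller; with the range covering every residue, each candidate block becomes its own target block, which is the N = M case allowed by claim 1.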
  9. The method according to any one of claims 1-8, wherein generating the index tree of the target data according to the N target data blocks specifically comprises:
    dividing the N target data blocks into at least one data group according to the respective fingerprint features of the N target data blocks;
    generating the index tree of the target data based on the respective fingerprint features of the plurality of data groups.
  10. The method according to claim 9, wherein dividing the N target data blocks into at least one data group according to the respective fingerprint features of the N target data blocks specifically comprises:
    determining, as at least one division point of the N target data blocks, a target data block whose fingerprint feature, among the respective fingerprint features of the N target data blocks, satisfies a fourth preset condition;
    dividing the N target data blocks into the plurality of data groups according to the at least one division point of the target data blocks.
  11. The method according to claim 10, wherein
    the fourth preset condition is that the fingerprint feature of the target data block, taken modulo a fourth threshold, is greater than or equal to a fifth threshold.
  12. The method according to claim 10, wherein
    the fourth preset condition is that the fingerprint feature of the target data block, taken modulo a fourth threshold, is greater than or equal to a fifth threshold, and the number of target data blocks in the data group is between the first number threshold and a second number threshold, the first number threshold being greater than the second number threshold.
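The grouping rule of claims 10 to 12 can be sketched in Python as below. This is an assumption-laden illustration rather than the claimed procedure: the threshold values are arbitrary, `min_count` and `max_count` stand in for the second and first number thresholds, and forcing a group boundary once `max_count` is reached is one possible reading of "between the thresholds" that the claims do not spell out.

```python
def group_blocks(fingerprints: list[int], fourth_threshold: int = 8,
                 fifth_threshold: int = 6, min_count: int = 2,
                 max_count: int = 8) -> list[list[int]]:
    """Split an ordered list of target-block fingerprints into data groups."""
    if not 0 < min_count <= max_count:
        raise ValueError("expected 0 < min_count <= max_count")
    groups = []
    current = []
    for fp in fingerprints:
        current.append(fp)
        # Fourth preset condition on the fingerprint itself (claim 11).
        hit = fp % fourth_threshold >= fifth_threshold
        # Claim 12 adds a bound on group size; a hit below min_count is ignored,
        # and (by assumption) a boundary is forced at max_count.
        if (hit and len(current) >= min_count) or len(current) >= max_count:
            groups.append(current)
            current = []
    if current:
        groups.append(current)  # trailing blocks form the last group
    return groups
```

Bounding the group size keeps the fan-out of each index-tree node within a known range, while the fingerprint test keeps group boundaries tied to content rather than to position.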
  13. The method according to any one of claims 1-12, wherein
    the fingerprint feature is a hash value.
  14. An apparatus for managing data in a storage system, characterized by comprising: a processing module, a storage module, and a generating module;
    the processing module is configured to divide target data into M candidate data blocks based on the content of the target data, where M is an integer greater than or equal to 2;
    the processing module is further configured to divide the M candidate data blocks into N target data blocks according to the respective fingerprint features of the M candidate data blocks, where N is a positive integer less than or equal to M, and each target data block includes at least one candidate data block;
    the storage module is configured to store the N target data blocks and the fingerprint features of the N target data blocks, wherein the target data blocks are in one-to-one correspondence with the fingerprint features of the target data blocks;
    the generating module is configured to generate an index tree of the target data according to the N target data blocks, wherein the index tree is used to address the content of the target data.
  15. The apparatus for managing data in a storage system according to claim 14, further comprising: a determining module;
    the determining module is configured to determine M-1 division points of the target data according to a fingerprint feature of the target data;
    the processing module is specifically configured to divide the target data into the M candidate data blocks according to the M-1 division points.
  16. The apparatus for managing data in a storage system according to claim 14, further comprising: a determining module;
    the determining module is configured to determine M-1 division points of the target data according to transformation values of the target data, wherein a transformation value is a value obtained by converting each piece of data within a preset window into numeric form based on a preset rule;
    the processing module is configured to divide the target data into the M candidate data blocks according to the M-1 division points.
  17. The apparatus for managing data in a storage system according to any one of claims 14-16, wherein
    the determining module is configured to determine, as N-1 division points of the candidate data blocks, the end positions of those candidate data blocks among the M candidate data blocks whose fingerprint features satisfy a third preset condition, wherein the third preset condition is that the fingerprint feature of the data of the candidate data block, taken modulo a third threshold, falls within a preset range;
    the processing module is configured to divide the M candidate data blocks into the N target data blocks according to the N-1 division points.
  18. The apparatus for managing data in a storage system according to any one of claims 14-16, wherein
    the processing module is configured to divide the N target data blocks into at least one data group according to the respective fingerprint features of the N target data blocks;
    the generating module is configured to generate the index tree of the target data based on the respective fingerprint features of the plurality of data groups.
  19. An apparatus for managing data in a storage system, characterized by comprising a memory and a processor, the memory being coupled to the processor; the memory is configured to store computer program code, the computer program code comprising computer instructions; when the computer instructions are executed by the processor, the processor is caused to perform the method according to any one of claims 1 to 13.
  20. A computer storage medium, characterized by comprising computer instructions which, when run on a computing device, cause the computing device to perform the method according to any one of claims 1 to 13.
  21. A computer program product, characterized by comprising computer-readable instructions which, when run on a computer device, cause the computer device to perform the method according to any one of claims 1 to 13.
PCT/CN2021/137522 2021-12-13 2021-12-13 Method and apparatus for managing data in storage system WO2023108360A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/137522 WO2023108360A1 (en) 2021-12-13 2021-12-13 Method and apparatus for managing data in storage system

Publications (1)

Publication Number Publication Date
WO2023108360A1 true WO2023108360A1 (en) 2023-06-22

Family

ID=86775264

Country Status (1)

Country Link
WO (1) WO2023108360A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011097887A1 (en) * 2010-02-10 2011-08-18 北京播思软件技术有限公司 Content-based file splitting method
CN111309523A (en) * 2020-02-16 2020-06-19 西安奥卡云数据科技有限公司 Data reading and writing method, data remote copying method and device and distributed storage system
CN113126879A (en) * 2019-12-30 2021-07-16 中国移动通信集团四川有限公司 Data storage method and device and electronic equipment
CN113495901A (en) * 2021-04-20 2021-10-12 河海大学 Variable-length data block oriented quick retrieval method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21967502

Country of ref document: EP

Kind code of ref document: A1