WO2024021491A1 - Data slicing method, apparatus and system - Google Patents

Data slicing method, apparatus and system

Info

Publication number
WO2024021491A1
WO2024021491A1 PCT/CN2022/141819 CN2022141819W WO2024021491A1 WO 2024021491 A1 WO2024021491 A1 WO 2024021491A1 CN 2022141819 W CN2022141819 W CN 2022141819W WO 2024021491 A1 WO2024021491 A1 WO 2024021491A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
slice
target
value
data stream
Prior art date
Application number
PCT/CN2022/141819
Other languages
French (fr)
Chinese (zh)
Inventor
刘利
姚栋
赵真
李丽
赵龙飞
杨思源
Original Assignee
天翼云科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 天翼云科技有限公司
Publication of WO2024021491A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 De-duplication implemented within the file system, e.g. based on file segments based on file chunks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Definitions

  • the invention relates to the field of performance testing, and in particular to a data slicing method, device and system.
  • data deduplication technology refers to slicing the data stream or file in some way (also known as chunking) and performing hash calculations on the resulting data blocks to find identical data blocks for deletion.
  • Data deduplication technology can compress duplicate data in the storage system and reduce storage capacity. This technology is currently widely used in backup systems.
  • fixed-length slicing means defining a fixed slice length (for example, 2 bytes is a data block), and then dividing the data stream into data blocks of the same length for storage according to the defined slice length.
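  • This slicing method has a low deduplication rate for data streams in some addition and deletion scenarios. For example, compared to data stream B, data stream A only adds one new character at the front end of data stream B.
  • However, when the two data streams are sliced with the fixed-length method, the resulting data blocks differ considerably, so the deduplication rate is not high.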
  • embodiments of the present invention provide a data slicing method, a data slicing device, a data slicing system and a computer-readable storage medium, which can improve the data deduplication rate.
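  • In one aspect, the present invention provides a data slicing method, which includes: obtaining the target data stream to be sliced; obtaining the target character at the slice starting position from the target data stream, and searching a preset array for the value corresponding to the target character, where at least some characters in the preset array correspond to different values; and
  • determining the slice length according to the value corresponding to the target character and, starting from the slice starting position, slicing the target data stream according to the slice length.
  • in some embodiments, searching the preset array for the value corresponding to the target character includes: looking up the character number of the target character in a preset character table, where the preset character table includes the correspondence between characters and character numbers and different characters have different character numbers; and using the character number of the target character as an index to find the corresponding value in the preset array.
  • in some embodiments, determining the slice length according to the value corresponding to the target character includes: performing an operation on the value corresponding to the target character to obtain the corresponding intermediate value;
  • if the obtained intermediate value is less than the slice length threshold, the following operations are performed in sequence on the characters after the slice starting position until the slice end position is determined, and the length between the slice end position and the slice starting position is taken as the slice length:
  • the value corresponding to the character is looked up in the preset array and an operation is performed on it to obtain the corresponding intermediate value; if that intermediate value is greater than or equal to the slice length threshold, the position of the character is determined as the slice end position.
  • in some embodiments, for any value found in the preset array, the corresponding intermediate value is determined as follows: an XOR operation is performed on the value and the value of a preset variable to obtain a first intermediate value; a preset value is subtracted from the first intermediate value to obtain a second intermediate value; and an XOR operation is performed on the first intermediate value and the second intermediate value to obtain the corresponding intermediate value.
  • in some embodiments, after each calculation of the first intermediate value, the method further includes: updating the value of the preset variable to the first intermediate value, so that the next XOR operation with a value found in the preset array uses the updated value of the preset variable.
  • in some embodiments, after the target data stream to be sliced is obtained, the method further includes: recording the position of the last character in the target data stream as the character end position; and
  • for any character in the target data stream, if the intermediate value corresponding to the character is less than the slice length threshold, determining whether the position of the character is the character end position and, if so, taking the length between the character end position and the slice starting position as the slice length.
  • in some embodiments, after the operation on the value corresponding to the target character yields the corresponding intermediate value, the method further includes:
  • if the obtained intermediate value is greater than or equal to the slice length threshold, the target character is used as the data block to be cut, and the target data stream is sliced.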
  • the present invention also provides a data slicing device, which includes:
  • a data acquisition module, configured to obtain the target data stream to be sliced;
  • a search module, configured to obtain the target character at the slice starting position from the target data stream and to search the preset array for the value corresponding to the target character, where at least some characters in the preset array correspond to different values; and
  • a slicing module configured to determine the slice length according to the numerical value corresponding to the target character, and slice the target data stream according to the slice length starting from the starting position of the slice.
  • Another aspect of the present invention also provides a computer-readable storage medium.
  • the computer-readable storage medium is used to store a computer program. When the computer program is executed by a processor, the method as described above is implemented.
  • the present invention also provides a data slicing system.
  • the data slicing system includes a processor and a memory.
  • the memory is used to store a computer program.
  • When the computer program is executed by the processor, the above-mentioned method is implemented.
  • the target character at the starting position of the slice is obtained from the target data stream, and the value corresponding to the target character is searched from the preset array, and then the slice length is determined based on the value corresponding to the target character, to slice the target data stream.
  • the target data stream can be sliced into variable lengths, which can reduce the impact of newly added and deleted characters on data slicing in scenarios such as data addition and deletion, effectively improving the deduplication rate.
  • Figure 1 shows a schematic diagram of data slicing in some technologies
  • Figure 2 shows a schematic flowchart of a data slicing method provided by an embodiment of the present application
  • Figure 3 shows a schematic diagram of a target data stream provided by an embodiment of the present application
  • Figure 4 shows a schematic diagram of data slicing provided by one embodiment of the present application
  • Figure 5 shows a flow chart of data slicing provided by an embodiment of the present application
  • Figure 6 shows a flow chart of data slicing provided by another embodiment of the present application.
  • Figure 7 shows a flow chart of data slicing provided by another embodiment of the present application.
  • Figure 8 shows a schematic module diagram of a data slicing device provided by an embodiment of the present application.
  • Figure 9 shows a schematic diagram of a data slicing system provided by an embodiment of the present application.
  • In FIG. 1, the characters in the gray box are the new characters that data stream B has compared to data stream A.
  • when data stream B and data stream A are sliced with the fixed-length method, the new characters in data stream B mean that few of the data blocks cut from the two data streams are identical.
  • the storage system needs to allocate storage space for data stream A and data stream B separately. Although data stream B and data stream A share many identical characters, the deduplication rate is low and storage space is not compressed.
  • this application provides a data slicing method that can improve the deduplication rate.
  • the data slicing method provided in this application can be applied to electronic devices.
  • Electronic devices include but are not limited to tablets, desktop computers, laptops, and servers.
  • Figure 2 is a schematic flow chart of a data slicing method according to an embodiment of the present application.
  • the data slicing method includes steps S21 to S23.
  • Step S21: Obtain the target data stream to be sliced.
  • the target data stream includes content to be stored in the storage system, such as a file data stream and a video image data stream.
  • Step S22: Obtain the target character at the slice starting position from the target data stream, and search the preset array for the value corresponding to the target character, where at least some of the characters in the preset array correspond to different values.
  • the slicing starting position refers to the starting position when slicing the target data stream.
  • the starting position of the slice can be the position of the first character of the target data stream.
  • the slicing starting position may be the position of the first character in the target data stream that has not yet been sliced after the last slicing is completed.
  • FIG. 3 is a schematic diagram of a target data stream provided by an embodiment of the present application. In Figure 3, it is assumed that the first character on the left is the starting position of the target data stream, and the last character on the right is the end position of the target data stream.
  • when slicing of the target data stream has not yet started, the slice starting position can be the position of the first character on the left; when slicing has already started, for example when the last slice was cut at the dotted line, the slice starting position for the next slice can be the position of the first character after the dotted line (that is, the character o).
  • the preset array may be a random array generated in a random manner.
  • a character is represented by 8 bits. 8 bits can represent 256 different characters, so the preset array can include 256 values, each corresponding to one of the 256 characters. That is, for each character in the target data stream, a corresponding value can be found in the preset array. For example, character A corresponds to the value 20 in the preset array, and character B corresponds to the value 15 in the preset array.
  • the preset array can be saved. For characters in different target data streams, the corresponding values can be found in the same preset array. That is, only one preset array needs to be generated, and there is no need to generate separate preset arrays for different target data streams.
  • when searching the preset array for the value corresponding to the target character, the character number of the target character can first be looked up in the preset character table, where the preset character table includes the correspondence between characters and character numbers and different characters have different character numbers. The character number of the target character can then be used as an index to find the value corresponding to the target character in the preset array.
  • the preset character table may be an ASCII code table.
  • the character number can be the ASCII code of the character in the ASCII code table.
  • the preset character table can also be established in advance by technical personnel. In such a character table, technicians can assign each character a character number that is different from its ASCII code.
  • the character number can be used as the position of the value when searching the preset array for the value corresponding to the target character. For example, assuming the character number is 20, the 20th value in the preset array is used as the value corresponding to the target character.
  • Step S23: Determine the slice length according to the numerical value corresponding to the target character, and slice the target data stream according to the slice length starting from the starting position of the slice.
  • the value corresponding to the target character can be directly used as the slice length, that is, starting from the starting position of the slice, the target data stream is sliced according to the length defined by the value corresponding to the target character.
  • the value corresponding to the target character can be used as the number of characters to be sliced.
  • For ease of understanding, please refer to Figure 3. For example, assuming that the target character in Figure 3 is the first character i on the left and the value corresponding to the target character is 5, then starting from the target character i, a data block containing 5 characters (i.e. itwea) can be cut out from the target data stream.
  • the value corresponding to the target character can also be used as the number of bits to be sliced.
  • a data block containing 7 characters (i.e. itweasc) can be cut out from the target data stream.
  • the lengths of the cut data blocks are therefore not all the same; that is, the target data stream is sliced with a variable-length method. Because the slices are variable-length, the problem described for Figure 1 can be addressed.
  • FIG. 4 is a schematic diagram of data slicing provided by an embodiment of the present application.
  • Data stream B in Figure 4 is similar to data stream A.
  • the characters in the gray box are new characters in data stream B compared to data stream A.
  • because this application uses variable-length slicing, for data stream A and data stream B it is possible to cut the newly added characters in data stream B into a data block of their own.
  • data stream A and data stream B can be cut into data blocks as shown in Figure 4.
  • data stream A and data stream B both include the data blocks itw, asc, old, ayi, and, so during data storage only one copy of each of these data blocks needs to be saved. In this way, storage space is compressed and the deduplication rate is improved.
  • the lengths of different target data streams may vary greatly. For example, some target data streams include a larger number of characters, and some target data streams include a smaller number of characters. If the value corresponding to the target character is large, it is more suitable to slice the target data stream with a longer length. However, for a target data stream with a shorter length, there may be a problem that it cannot be sliced. For example, assume that the target data stream includes 5 characters, but the value corresponding to the target character is 10 (representing 10 characters). In this case, it may not be possible to slice the target data stream with a short length.
  • the slice length may be determined based on the following method.
  • an operation can be performed on the value corresponding to the target character to obtain the corresponding intermediate value. If the obtained intermediate value is less than the slice length threshold, the following operations are performed in sequence on the characters after the slice starting position until the slice end position is determined, and the length between the slice end position and the slice starting position is used as the slice length:
  • the value corresponding to the character is looked up in the preset array and an operation is performed on it to obtain the corresponding intermediate value; if that intermediate value is greater than or equal to the slice length threshold, the position of the character is determined as the end position of the slice.
  • for any value found in the preset array, the corresponding intermediate value can be determined as follows: an XOR operation is performed on the value and the value of a preset variable to obtain a first intermediate value; a preset value is subtracted from the first intermediate value to obtain a second intermediate value; and an XOR operation is performed on the first intermediate value and the second intermediate value to obtain the corresponding intermediate value.
  • after each calculation of the first intermediate value, the value of the preset variable can also be updated to the first intermediate value, so that the next XOR operation with a value found in the preset array uses the updated value of the preset variable.
  • after the target data stream is obtained, the position of the last character in the target data stream may also be recorded as the character end position.
  • for any character in the target data stream, if the intermediate value corresponding to the character is less than the slice length threshold, it is determined whether the position of the character is the character end position. If so, the length between the character end position and the slice starting position is used as the slice length.
  • after the operation on the value corresponding to the target character yields the corresponding intermediate value, if that intermediate value is greater than or equal to the slice length threshold, the target character itself can be used as the data block to be cut, and the target data stream is sliced accordingly.
  • FIG. 5 is a flow chart of data slicing provided in an embodiment of the present application.
  • assume that the first character on the left is the target character i, the preset variable is x (its initial value is one of the values in the preset array, or a randomly generated or specified value), the slice length threshold is slen, and the position of the last character in the target data stream is max. The values corresponding to characters i, t, w and e are i’, t’, w’ and e’, and the positions of characters i, t, w and e are id, td, wd and ed, respectively.
  • the value corresponding to i can be XORed with the value of the preset variable x to obtain the first intermediate value corresponding to the character i.
  • the preset value can be subtracted from the first intermediate value corresponding to character i to obtain the second intermediate value.
  • the second intermediate value corresponding to character i is obtained.
  • the preset value can be subtracted from the first intermediate value corresponding to the character w to obtain the second intermediate value. An XOR operation is then performed on the first intermediate value and the second intermediate value to obtain the intermediate value corresponding to the character w, and this intermediate value is determined to be smaller than the slice length threshold slen.
  • This process corresponds to $x \oplus (x-1) < slen$ in step 3 of Figure 5.
  • the preset value can be subtracted from the first intermediate value corresponding to the character e to obtain the second intermediate value. An XOR operation is then performed on the first intermediate value and the second intermediate value to obtain the intermediate value corresponding to the character e, and this intermediate value is determined to be greater than the slice length threshold slen.
  • This process corresponds to $x \oplus (x-1) > slen$ in step 4 of Figure 5.
  • the position of character e is determined as the end position of the slice, and the length between character e and character i (including both characters) is used as the slice length to slice the target data stream.
  • the slice length should be 4 (meaning 4 characters). In this way, itwe can be used as a data block to slice the target data stream.
  • the position of character a after character e can be used as the starting position of slicing, and character a can be used as the target character, and the above steps can be performed to continue slicing the target data stream.
  • FIG. 6 is a flow chart of data slicing provided by another embodiment of the present application.
  • the obtained intermediate value is greater than the slice length threshold slen.
  • the target character i is treated as a separate data block and the target data stream is sliced.
  • FIG. 7 is a flow chart of data slicing provided by another embodiment of the present application.
  • the value corresponding to character d is d’, and the position of character d is dd.
  • an operation is performed on the value corresponding to character d, and the obtained intermediate value is less than the slice length threshold slen, but the position of character d is the position max of the last character of the target data stream. The position of character d can then be used as the character end position, and the characters cold can be used as a data block to cut the target data stream.
  • the target character at the starting position of the slice is obtained from the target data stream, and the value corresponding to the target character is searched from the preset array, and then the slice length is determined based on the value corresponding to the target character, to slice the target data stream.
  • the target data stream can be sliced into variable lengths, which can reduce the impact of added and deleted characters on data slicing in scenarios such as data addition and deletion, effectively improving the deduplication rate.
  • FIG. 8 is a schematic module diagram of a data slicing device according to an embodiment of the present application.
  • Data slicing devices include:
  • a data acquisition module, used to obtain the target data stream to be sliced;
  • a search module, used to obtain the target character at the slice starting position from the target data stream and to search the preset array for the value corresponding to the target character, where at least some of the characters in the preset array correspond to different values; and
  • the slicing module is used to determine the slice length according to the value corresponding to the target character, and slice the target data stream according to the slice length starting from the starting position of the slice.
  • FIG. 9 is a schematic diagram of a data slicing system provided by an embodiment of the present application.
  • the data slicing system includes a processor and a memory.
  • the memory is used to store computer programs.
  • When the computer program is executed by the processor, the above-mentioned data slicing method is implemented.
  • the processor may be a central processing unit (Central Processing Unit, CPU).
  • the processor can also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another chip such as a programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination of these types of chips.
  • the memory can be used to store non-transitory software programs, non-transitory computer executable programs and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention.
  • the processor executes various functional applications and data processing by running the non-transitory software programs, instructions and modules stored in the memory, thereby implementing the method in the above method embodiments.
  • the memory may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created by the processor, etc.
  • the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • the memory may optionally include memory located remotely from the processor, and these remote memories may be connected to the processor via a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
  • One embodiment of the present application also provides a computer-readable storage medium, which is used to store a computer program.
  • When the computer program is executed by a processor, the above-mentioned data slicing method is implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Character Input (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data slicing method, apparatus and system. The data slicing method comprises: acquiring a target data stream to be sliced (S21); acquiring from the target data stream a target character at a slicing starting position, and searching a preset array for a numerical value corresponding to the target character, wherein in the preset array, the numerical values corresponding to at least some characters are different (S22); and according to the numerical value corresponding to the target character, determining a slicing length, and starting from the slicing starting position, slicing the target data stream according to the slicing length (S23). The method can improve the deduplication rate.

Description

A data slicing method, device and system
Technical field
The invention relates to the field of performance testing, and in particular to a data slicing method, device and system.
Background technique
With the maturity and widespread application of data backup technology, the amount of backup data has grown explosively in recent years. Research shows that backup data in large enterprises grows at a rate of 40% to 60% every year, and the backup data of many companies doubles or more every year. This backup data contains many identical data blocks, which greatly increases enterprise costs and wastes storage space.
Currently, some duplicate data can be deleted through data deduplication technology to reduce storage capacity. Data deduplication technology refers to slicing a data stream or file in some way (also known as chunking), performing hash calculations on the resulting data blocks, and thereby finding identical data blocks for deletion. Data deduplication technology can compress duplicate data in the storage system and reduce storage capacity. This technology is currently widely used in backup systems.
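The hash-based lookup of identical data blocks described above can be sketched as follows. This is only an illustration: the text does not name a hash function, so SHA-256 is an assumption, and the function name `deduplicate` and the returned "store plus recipe" layout are hypothetical.

```python
import hashlib


def deduplicate(blocks: list[bytes]) -> tuple[dict[str, bytes], list[str]]:
    """Store each distinct block once and describe the stream as a list of digests.

    Assumption: SHA-256 stands in for the unspecified "hash calculation".
    """
    store: dict[str, bytes] = {}   # digest -> block content, one copy per distinct block
    recipe: list[str] = []         # ordered digests needed to rebuild the stream
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy of identical blocks
        recipe.append(digest)
    return store, recipe


# Three blocks, two of them identical: only two blocks are actually stored.
store, recipe = deduplicate([b"itw", b"asc", b"itw"])
rebuilt = b"".join(store[d] for d in recipe)
```

The more blocks two streams have in common, the fewer entries the store needs, which is why the way the stream is cut into blocks matters for the deduplication rate.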
Technical problem
Currently, some technologies slice data streams with a fixed-length method. Fixed-length slicing means defining a fixed slice length (for example, 2 bytes per data block) and then dividing the data stream into data blocks of that length for storage. This slicing method has a low deduplication rate for data streams in some addition and deletion scenarios. For example, data stream A differs from data stream B only by one new character added at the front of data stream B, yet when the two data streams are sliced with the fixed-length method, the resulting data blocks differ considerably and the deduplication rate is not high.
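A small sketch can make the fixed-length problem concrete. The stream contents below are hypothetical sample data, and the block size of 2 follows the "2 bytes per data block" example in the text.

```python
def fixed_length_slices(stream: bytes, size: int) -> list[bytes]:
    """Split the stream into consecutive blocks of `size` bytes (the last may be shorter)."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]


stream_b = b"itweascold"       # hypothetical sample stream
stream_a = b"x" + stream_b     # same content with one character added at the front

blocks_a = fixed_length_slices(stream_a, 2)
blocks_b = fixed_length_slices(stream_b, 2)

# The single inserted character shifts every block boundary,
# so the two streams share no blocks at all in this example.
shared = set(blocks_a) & set(blocks_b)
```

Here `blocks_b` is `it, we, as, co, ld` while `blocks_a` is `xi, tw, ea, sc, ol, d`, so nothing can be deduplicated even though the streams differ by one character.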
Technical solutions
In view of this, embodiments of the present invention provide a data slicing method, a data slicing device, a data slicing system and a computer-readable storage medium, which can improve the data deduplication rate.
In one aspect, the present invention provides a data slicing method, which includes:
obtaining the target data stream to be sliced;
obtaining the target character at the slice starting position from the target data stream, and searching a preset array for the value corresponding to the target character, where at least some characters in the preset array correspond to different values; and
determining the slice length according to the value corresponding to the target character and, starting from the slice starting position, slicing the target data stream according to the slice length.
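The three steps above can be sketched in their simplest form, where the value found for the target character is used directly as the number of characters to cut. The function names, the random seed and the value range 1 to 8 are assumptions made for illustration; the source only says the preset array may be randomly generated and holds one value per 8-bit character.

```python
import random


def make_preset_array(seed: int = 0, max_len: int = 8) -> list[int]:
    """Generate one value per possible 8-bit character (256 entries).

    Assumption: values lie in 1..max_len so every slice makes progress.
    """
    rng = random.Random(seed)
    return [rng.randint(1, max_len) for _ in range(256)]


def slice_by_lookup(stream: bytes, preset: list[int]) -> list[bytes]:
    """Cut the stream so each block's length is the preset value of its first character."""
    blocks = []
    start = 0                              # slice starting position
    while start < len(stream):
        length = preset[stream[start]]     # value of the target character
        if length < 1:
            raise ValueError("preset values must be at least 1")
        blocks.append(stream[start:start + length])
        start += length                    # next slice starts right after this block
    return blocks


# Hand-written preset reproducing the document's example: character i maps to 5.
demo_preset = [3] * 256
demo_preset[ord("i")] = 5
demo_blocks = slice_by_lookup(b"itweascold", demo_preset)
```

With this hand-written array the first block is `itwea`, matching the five-character example given for Figure 3; the remaining blocks depend entirely on the illustrative value 3 chosen for the other characters.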
In some embodiments, searching the preset array for the value corresponding to the target character includes:
looking up the character number of the target character in a preset character table, where the preset character table includes the correspondence between characters and character numbers, and different characters have different character numbers;
using the character number of the target character as an index to find the value corresponding to the target character in the preset array.
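The two-stage lookup (character to character number, character number to value) might look like the sketch below. The helper names are hypothetical, zero-based indexing is an assumption, and the values 20 and 15 for characters A and B are taken from the example in the text.

```python
def build_ascii_table() -> dict[str, int]:
    """Character table that uses the ASCII code of each character as its character number."""
    return {chr(code): code for code in range(128)}


def lookup_value(ch: str, char_table: dict[str, int], preset: list[int]) -> int:
    """Find the character number first, then use it as an index into the preset array."""
    number = char_table[ch]    # character -> character number
    return preset[number]      # character number -> value (zero-based index, an assumption)


ascii_table = build_ascii_table()
preset = [0] * 128
preset[ord("A")] = 20          # example from the text: A corresponds to 20
preset[ord("B")] = 15          # example from the text: B corresponds to 15

# A custom table may assign numbers that differ from the ASCII codes.
custom_table = {"A": 0, "B": 1}
custom_preset = [7, 9]
```

Keeping the character table separate from the preset array means the same array can be reused with different numbering schemes.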
In some embodiments, determining the slice length according to the value corresponding to the target character includes:
performing an operation on the value corresponding to the target character to obtain the corresponding intermediate value;
if the obtained intermediate value is less than the slice length threshold, performing the following operations in sequence on the characters after the slice starting position until the slice end position is determined, and taking the length between the slice end position and the slice starting position as the slice length:
looking up the value corresponding to the character in the preset array;
performing an operation on the value found to obtain the corresponding intermediate value;
if the obtained intermediate value is greater than or equal to the slice length threshold, determining the position of the character as the slice end position.
In some embodiments, for any value found in the preset array, the corresponding intermediate value is determined as follows:
performing an XOR operation on the value and the value of a preset variable to obtain a first intermediate value;
subtracting a preset value from the first intermediate value to obtain a second intermediate value;
performing an XOR operation on the first intermediate value and the second intermediate value to obtain the corresponding intermediate value.
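The three-step calculation above can be written as a short function. The preset value of 1 follows the expression x ^ (x-1) shown for Figure 5; the 32-bit mask is an assumption, added because the source does not state an integer width and Python integers would otherwise go negative when the first intermediate value is 0.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit unsigned arithmetic


def intermediate_value(value: int, x: int, preset_value: int = 1) -> tuple[int, int]:
    """Return (intermediate value, first intermediate value) for one looked-up value."""
    first = (value ^ x) & MASK                # XOR with the preset variable
    second = (first - preset_value) & MASK    # subtract the preset value
    return first ^ second, first              # XOR of first and second intermediate values


# value 12, preset variable 4: first = 8, second = 7, intermediate = 15
example = intermediate_value(12, 4)
```

With a preset value of 1, the result is always of the form 2^(k+1) - 1, where k is the number of trailing zero bits of the first intermediate value, so comparing it with the slice length threshold amounts to asking whether the first intermediate value ends in enough zero bits.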
In some embodiments, each time the first intermediate value is calculated, the method further includes:
updating the value of the preset variable to the first intermediate value, so that the next XOR operation between the preset variable and a value found in the preset array uses the updated value of the preset variable.
In some embodiments, after the target data stream to be sliced is obtained, the method further includes:
recording the position of the last character in the target data stream as the character end position;
for any character in the target data stream, when the intermediate value corresponding to that character is less than the slice length threshold, determining whether the position of that character is the character end position, and if so, taking the length between the character end position and the slice start position as the slice length.
In some embodiments, after the operation is performed on the value corresponding to the target character to obtain the corresponding intermediate value, the method further includes:
if the obtained intermediate value is greater than or equal to the slice length threshold, slicing the target data stream with the target character as the data block to be cut.
In another aspect, the present invention further provides a data slicing apparatus, the apparatus including:
a data acquisition module, configured to obtain a target data stream to be sliced;
a lookup module, configured to obtain the target character at the slice start position from the target data stream and to look up the value corresponding to the target character in a preset array, wherein, in the preset array, at least some characters correspond to different values; and
a slicing module, configured to determine a slice length according to the value corresponding to the target character and, starting from the slice start position, to slice the target data stream according to the slice length.
In another aspect, the present invention further provides a computer-readable storage medium for storing a computer program which, when executed by a processor, implements the method described above.
In another aspect, the present invention further provides a data slicing system including a processor and a memory, the memory being used to store a computer program which, when executed by the processor, implements the method described above.
Beneficial effects
In some embodiments of the present application, the target character at the slice start position is obtained from the target data stream, the value corresponding to the target character is looked up in the preset array, and the slice length is then determined according to that value in order to slice the target data stream. In this way, the target data stream can be sliced into variable-length blocks, which reduces the impact of added or deleted characters on data slicing in scenarios such as data insertion and deletion, and effectively improves the deduplication rate.
Description of drawings
The features and advantages of the present invention will be understood more clearly with reference to the accompanying drawings, which are schematic and should not be construed as limiting the invention in any way. In the drawings:
Figure 1 shows a schematic diagram of data slicing in some existing techniques;
Figure 2 shows a schematic flowchart of a data slicing method provided by an embodiment of the present application;
Figure 3 shows a schematic diagram of a target data stream provided by an embodiment of the present application;
Figure 4 shows a schematic diagram of data slicing provided by an embodiment of the present application;
Figure 5 shows a flowchart of data slicing provided by an embodiment of the present application;
Figure 6 shows a flowchart of data slicing provided by another embodiment of the present application;
Figure 7 shows a flowchart of data slicing provided by another embodiment of the present application;
Figure 8 shows a schematic module diagram of a data slicing apparatus provided by an embodiment of the present application;
Figure 9 shows a schematic diagram of a data slicing system provided by an embodiment of the present application.
Embodiments of the invention
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Refer to Figure 1, which is a schematic diagram of data slicing in some existing techniques. In Figure 1, data stream B contains, relative to data stream A, the additional characters shown in the gray box. When data stream A and data stream B are sliced with a fixed-length method (every 3 characters form one data block), the characters added to data stream B shift every subsequent block boundary, so few of the data blocks cut from data stream B are identical to those cut from data stream A. The storage system then has to allocate storage space for data stream A and data stream B separately. Although the two data streams share most of their characters, the deduplication rate is low, which is not conducive to saving storage space.
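The boundary-shift problem described above can be illustrated with a minimal sketch. The stream contents and the helper name `fixed_chunks` are assumptions made for illustration only; they are not the contents of Figure 1.

```python
def fixed_chunks(stream: str, size: int = 3) -> list[str]:
    """Cut the stream into consecutive blocks of `size` characters."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]


# Hypothetical streams: stream_b is stream_a with one character added in front.
stream_a = "itweascold"
stream_b = "x" + stream_a

blocks_a = fixed_chunks(stream_a)
blocks_b = fixed_chunks(stream_b)

# Every boundary in stream_b is shifted by one character,
# so the two streams no longer share any block.
shared = set(blocks_a) & set(blocks_b)
print(blocks_a)
print(blocks_b)
print(shared)
```

In this sketch a single added character is enough to leave the two streams without a common block, even though ten of the eleven characters are the same.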
To this end, the present application provides a data slicing method that can improve the deduplication rate. The data slicing method provided by the present application can be applied to an electronic device. Electronic devices include, but are not limited to, tablet computers, desktop computers, laptop computers and servers. Refer to Figure 2, which is a schematic flowchart of a data slicing method provided by an embodiment of the present application. The data slicing method includes steps S21 to S23.
Step S21: obtain the target data stream to be sliced.
In some embodiments, the target data stream includes content to be stored in a storage system, such as a file data stream or a video image data stream.
Step S22: obtain the target character at the slice start position from the target data stream, and look up the value corresponding to the target character in a preset array, wherein, in the preset array, at least some characters correspond to different values.
In some embodiments, the slice start position is the position at which slicing of the target data stream begins. For a target data stream that has not yet been sliced, the slice start position may be the position of the first character of the target data stream. For a target data stream that has already been partly sliced, the slice start position may be the position of the first character that has not yet been sliced after the most recent slice. For ease of understanding, refer to Figure 3, which is a schematic diagram of a target data stream provided by an embodiment of the present application. In Figure 3, assume that the first character on the left is the start of the target data stream and the last character on the right is its end. Before slicing begins, the slice start position may be the position of the first character on the left. Once slicing has begun, for example if the most recent cut was made at the dotted line, the slice start position for the next slice may be the position of the first character after the dotted line (that is, the character o).
In some embodiments, the preset array may be an array of randomly generated values. In a computer, a character is represented by 8 bits, and 8 bits can represent 256 different characters, so the preset array may include 256 values. Each value corresponds to one of these 256 characters. That is, for every character in the target data stream, a corresponding value can be found in the preset array. For example, character A corresponds to the value 20 in the preset array, and character B corresponds to the value 15.
In some embodiments, after the preset array has been generated randomly, it can be saved. The values for characters in different target data streams can then be looked up in the same preset array. That is, only one preset array needs to be generated, and there is no need to generate a separate preset array for each target data stream.
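One possible way to obtain such an array is sketched below. The function name `make_preset_array`, the fixed seed and the 32-bit value width are assumptions; the text only requires a randomly generated array of 256 values that is generated once and reused.

```python
import random


def make_preset_array(seed: int = 2024, bits: int = 32) -> list[int]:
    """Return 256 pseudo-random values, one per possible 8-bit character.

    The seed and the bit width are illustrative assumptions, not values
    taken from the application.
    """
    rng = random.Random(seed)
    return [rng.getrandbits(bits) for _ in range(256)]


# Generated once; the same array is then shared by every target data stream.
PRESET_ARRAY = make_preset_array()
print(len(PRESET_ARRAY))
```

Using a fixed seed is only one way to make the array reproducible; saving the generated array to persistent storage would serve the same purpose.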
In some embodiments, to look up the value corresponding to the target character in the preset array, the character number of the target character may first be looked up in a preset character table, wherein the preset character table includes a correspondence between characters and character numbers, and different characters have different character numbers. The character number of the target character can then be used as an index to look up the value corresponding to the target character in the preset array.
The preset character table may be the ASCII code table, in which case the character number is the ASCII code of the character. It will be understood that the character table only needs to contain a correspondence between characters and character numbers and does not have to be the ASCII code table. For example, the character table may be created in advance by a technician, who may assign each character a character number different from its ASCII code.
In some embodiments, after the character number of the target character has been found in the character table, the character number can be used as the position of the value, and the value corresponding to the target character is looked up in the preset array at that position. For example, if the character number is 20, the 20th value in the preset array is taken as the value corresponding to the target character.
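The lookup can be sketched as follows, assuming the ASCII code table is used as the preset character table. The function name `lookup_value` and the placeholder array contents are hypothetical. The text does not say whether positions are counted from 0 or from 1, so the sketch assumes the character number is used directly as a zero-based index.

```python
def lookup_value(char: str, preset_array: list[int]) -> int:
    """Return the preset value for `char`, using its code as the index."""
    char_number = ord(char)  # character number, here the ASCII code
    if char_number >= len(preset_array):
        raise ValueError(f"no preset value for character number {char_number}")
    return preset_array[char_number]


# Placeholder array for illustration only; a real one would be random.
preset_array = [(n * 37 + 11) % 251 for n in range(256)]

print(lookup_value("A", preset_array))  # value stored at index 65
```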
Step S23: determine the slice length according to the value corresponding to the target character and, starting from the slice start position, slice the target data stream according to the slice length.
In some embodiments, the value corresponding to the target character can be used directly as the slice length; that is, starting from the slice start position, the target data stream is sliced to the length defined by that value. Specifically, the value can be taken as the number of characters to be sliced. For ease of understanding, refer to Figure 3. For example, if the target character in Figure 3 is the first character on the left, i, and its corresponding value is 5, a data block of 5 characters (itwea) can be cut from the target data stream starting at the target character i. Alternatively, the value can be taken as the number of bits to be sliced. For example, if the target character is again the first character on the left, i, and its corresponding value is 56, a data block of 7 characters (56 bits at 8 bits per character, that is, itweasc) can be cut from the target data stream starting at the target character i.
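The two interpretations of the value can be sketched as follows. The function name `cut_block`, the `unit` parameter and the characters after `itweasc` in the stream are assumptions made for illustration.

```python
def cut_block(stream: str, start: int, value: int, unit: str = "chars") -> str:
    """Cut one block at `start`, reading `value` as a character or bit count."""
    if unit == "bits":
        length = value // 8  # 8 bits per character
    else:
        length = value
    return stream[start:start + length]


stream = "itweascold"  # hypothetical stream beginning as in the example

print(cut_block(stream, 0, 5))                # value read as 5 characters
print(cut_block(stream, 0, 56, unit="bits"))  # value read as 56 bits
```

With the values from the example, the first call yields the block itwea and the second yields itweasc.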
In some embodiments, since at least some characters correspond to different values in the preset array, and the characters in a target data stream are normally not all the same, the lengths of the data blocks cut from the stream will not all be the same either. In other words, the target data stream is sliced with a variable-length method, which makes it possible to address the problem described for Figure 1. Refer to Figure 4, which is a schematic diagram of data slicing provided by an embodiment of the present application. Data stream A and data stream B in Figure 4 are similar; the characters in the gray box are those added to data stream B relative to data stream A. Because the slicing is variable-length, the characters added to data stream B may end up in a data block of their own. For example, assume that characters i, a and o each correspond to the value 3 in the preset array, character e corresponds to the value 1, and character 0 corresponds to the value 2. With the method of the present application, data stream A and data stream B can then be cut into the data blocks shown in Figure 4. As Figure 4 shows, data stream A and data stream B both contain the data blocks itw, asc, old, ayi and and, so only one copy of each of these blocks needs to be stored. In this way, storage space is saved and the deduplication rate is improved.
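A minimal sketch of this variable-length slicing follows, using the values stated in the example (i, a and o map to 3, e maps to 1, character 0 maps to 2). The two stream contents are hypothetical and are not the streams of Figure 4; the function name `slice_by_first_char` is also an assumption.

```python
def slice_by_first_char(stream: str, lengths: dict[str, int]) -> list[str]:
    """Cut blocks whose length is the value of each block's first character."""
    blocks = []
    pos = 0
    while pos < len(stream):
        length = lengths[stream[pos]]  # value looked up for the target character
        blocks.append(stream[pos:pos + length])
        pos += length
    return blocks


lengths = {"i": 3, "a": 3, "o": 3, "e": 1, "0": 2}

stream_a = "itweascold"    # hypothetical stream A
stream_b = "itwe00ascold"  # stream A with "00" added after "itwe"

blocks_a = slice_by_first_char(stream_a, lengths)
blocks_b = slice_by_first_char(stream_b, lengths)
print(blocks_a)
print(blocks_b)
print(set(blocks_a) & set(blocks_b))
```

In this sketch the added characters form a block of their own, and every block of stream A also appears in stream B, so only the new block would need extra storage.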
Furthermore, in some scenarios the lengths of different target data streams may differ considerably: some target data streams contain many characters and others contain few. If the value corresponding to the target character is large, it suits a long target data stream, but a short target data stream may not be sliceable at all. For example, suppose the target data stream contains 5 characters but the value corresponding to the target character is 10 (meaning 10 characters); the short target data stream then cannot be sliced. Conversely, if the value corresponding to the target character is small, a short target data stream can be sliced, but a long target data stream may be cut into data blocks that are too small and too numerous. In view of this, in some embodiments of the present application the slice length can be determined as follows.
Specifically, an operation can be performed on the value corresponding to the target character to obtain a corresponding intermediate value. If the obtained intermediate value is less than the slice length threshold, the following operations are performed in sequence on the characters after the slice start position until the slice end position is determined, and the length between the slice end position and the slice start position is taken as the slice length:
look up the value corresponding to the character in the preset array;
perform an operation on the value found to obtain a corresponding intermediate value;
if the obtained intermediate value is greater than or equal to the slice length threshold, determine the position of the character as the slice end position.
In some embodiments, for any value found in the preset array, the corresponding intermediate value can be determined as follows:
perform an XOR operation on the value and the value of the preset variable to obtain a first intermediate value;
subtract the preset value from the first intermediate value to obtain a second intermediate value;
perform an XOR operation on the first intermediate value and the second intermediate value to obtain the corresponding intermediate value.
In some embodiments, each time the first intermediate value is calculated, the value of the preset variable can also be updated to the first intermediate value, so that the next XOR operation between the preset variable and a value found in the preset array uses the updated value of the preset variable.
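The three operations and the update of the preset variable can be sketched as a single step function. The function name `step`, the default preset value of 1 and the 32-bit mask are assumptions; the mask is added only so that subtracting from zero wraps around as it would for an unsigned integer.

```python
MASK = 0xFFFFFFFF  # assumption: values are treated as 32-bit unsigned integers


def step(x: int, value: int, preset: int = 1) -> tuple[int, int]:
    """Return (updated preset variable, intermediate value) for one character."""
    first = (x ^ value) & MASK        # first intermediate value
    second = (first - preset) & MASK  # second intermediate value
    intermediate = first ^ second     # value compared with the slice length threshold
    return first, intermediate        # the preset variable is updated to `first`


x, intermediate = step(0b1010, 0b0110)
print(x, intermediate)
```

When the preset value is 1, the intermediate value depends only on the number of trailing zero bits of the first intermediate value, so it becomes large only for first intermediate values that end in a run of zero bits.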
In some embodiments, after the target data stream to be sliced has been obtained, the position of the last character in the target data stream can also be recorded as the character end position. For any character in the target data stream, when the intermediate value corresponding to that character is less than the slice length threshold, it is determined whether the position of that character is the character end position; if so, the length between the character end position and the slice start position is taken as the slice length.
In some embodiments, after the operation has been performed on the value corresponding to the target character to obtain the corresponding intermediate value, if the obtained intermediate value is greater than or equal to the slice length threshold, the target data stream can be sliced with the target character alone as the data block to be cut.
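Taken together, the threshold test, the end-of-stream check and the single-character case can be sketched as one loop. This is an illustrative sketch under stated assumptions, not the applicant's implementation: the function name `slice_stream`, the 32-bit mask, the table values, the threshold of 8 and the choice to restart the preset variable at each new slice are all assumptions. The right shift by one bit before the XOR is one variant of the update, and the preset value is taken to be 1.

```python
MASK = 0xFFFFFFFF  # assumption: values are treated as 32-bit unsigned integers


def slice_stream(data: bytes, preset_array: list[int],
                 slen: int, x0: int = 0) -> list[bytes]:
    """Cut `data` into variable-length blocks using the XOR-based test."""
    blocks = []
    start = 0  # slice start position
    x = x0     # preset variable
    for pos, byte in enumerate(data):
        # First intermediate value; the preset variable is updated to it.
        x = ((x >> 1) ^ preset_array[byte]) & MASK
        # First intermediate value XOR (first intermediate value - 1).
        intermediate = x ^ ((x - 1) & MASK)
        at_end = pos == len(data) - 1  # character end position reached
        if intermediate >= slen or at_end:
            blocks.append(data[start:pos + 1])
            start = pos + 1
            x = x0  # assumption: restart the preset variable for the next slice
    return blocks


# Hand-picked illustrative values; a real preset array would be random.
table = [1] * 256
for ch, v in {"i": 1, "t": 2, "w": 4, "e": 10, "a": 3, "s": 9,
              "c": 1, "o": 5, "l": 1, "d": 2, "z": 8}.items():
    table[ord(ch)] = v

print(slice_stream(b"itweascold", table, slen=8))
print(slice_stream(b"zitwe", table, slen=8))
```

With these hand-picked values the first call cuts after e and after s and closes the last block at the end of the stream, while the second call shows a target character that forms a block on its own.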
For ease of understanding, refer to Figure 5, which is a flowchart of data slicing provided by an embodiment of the present application. In Figure 5, the first character on the left is the target character i. Assume that the preset variable is x, and that its initial value is either one of the values in the preset array or a randomly generated or specified value; the slice length threshold is slen; the position of the last character of the target data stream is max; the values corresponding to characters i, t, w and e are i', t', w' and e' respectively; and the positions of characters i, t, w and e are id, td, wd and ed respectively.
First, the operation is performed on the value corresponding to character i.
1) The value corresponding to i is XORed with the value of the preset variable x to obtain the first intermediate value corresponding to character i. In this embodiment, the value of the preset variable x is first divided by 2, the result is XORed with the value corresponding to i to obtain the first intermediate value, and the value of the preset variable x is updated to the first intermediate value. This corresponds to x=(x>>1)^i' in step 1 of Figure 5.
2) The preset value is subtracted from the first intermediate value corresponding to character i to obtain the second intermediate value. In this embodiment, 1 is subtracted from the first intermediate value corresponding to character i to obtain the second intermediate value corresponding to character i. The first intermediate value and the second intermediate value are then XORed to obtain the intermediate value corresponding to character i, and this intermediate value is found to be less than the slice length threshold slen. This corresponds to x^(x-1)<slen in step 1 of Figure 5.
3) It is determined that the position of character i is not max, the position of the last character of the target data stream. This corresponds to id!=max in step 1 of Figure 5.
Next, the operation is performed on the value corresponding to character t.
1) The value corresponding to t is XORed with the updated value of the preset variable x to obtain the first intermediate value corresponding to character t. This corresponds to x=(x>>1)^t' in step 2 of Figure 5.
2) The preset value is subtracted from the first intermediate value corresponding to character t to obtain the second intermediate value. The first intermediate value and the second intermediate value are then XORed to obtain the intermediate value corresponding to character t, and this intermediate value is found to be less than the slice length threshold slen. This corresponds to x^(x-1)<slen in step 2 of Figure 5.
3) It is determined that the position of character t is not max, the position of the last character of the target data stream. This corresponds to td!=max in step 2 of Figure 5.
Next, the operation is performed on the value corresponding to character w.
1) The value corresponding to w is XORed with the updated value of the preset variable x to obtain the first intermediate value corresponding to character w. This corresponds to x=(x>>1)^w' in step 3 of Figure 5.
2) The preset value is subtracted from the first intermediate value corresponding to character w to obtain the second intermediate value. The first intermediate value and the second intermediate value are then XORed to obtain the intermediate value corresponding to character w, and this intermediate value is found to be less than the slice length threshold slen. This corresponds to x^(x-1)<slen in step 3 of Figure 5.
3) It is determined that the position of character w is not max, the position of the last character of the target data stream. This corresponds to wd!=max in step 3 of Figure 5.
Next, the operation is performed on the value corresponding to character e.
1) The value corresponding to e is XORed with the updated value of the preset variable x to obtain the first intermediate value corresponding to character e. This corresponds to x=(x>>1)^e' in step 4 of Figure 5.
2) The preset value is subtracted from the first intermediate value corresponding to character e to obtain the second intermediate value. The first intermediate value and the second intermediate value are then XORed to obtain the intermediate value corresponding to character e, and this intermediate value is found to be greater than the slice length threshold slen. This corresponds to x^(x-1)>slen in step 4 of Figure 5.
3) It is determined that the position of character e is not max, the position of the last character of the target data stream. This corresponds to ed!=max in step 4 of Figure 5.
After the operation on the value of character e, the obtained intermediate value is greater than the slice length threshold slen, so the position of character e is determined as the slice end position, and the length from character i to character e (both inclusive) is taken as the slice length for slicing the target data stream. Here the slice length is 4 (that is, 4 characters), so itwe is cut from the target data stream as one data block. The position of character a, which follows character e, can then be taken as the new slice start position, with character a as the target character, and the above steps are repeated to continue slicing the target data stream.
Refer next to Figure 6, which is a flowchart of data slicing provided by another embodiment of the present application. In Figure 6, the intermediate value obtained from the value corresponding to the target character i is already greater than the slice length threshold slen. In this case, the target character i alone is cut from the target data stream as one data block.
Refer next to Figure 7, which is a flowchart of data slicing provided by another embodiment of the present application. In Figure 7, assume that the value corresponding to character d is d' and the position of character d is dd. As Figure 7 shows, slicing of the target data stream starts at character c and proceeds as in Figure 5 until character d is reached. The intermediate value obtained from the value corresponding to character d is less than the slice length threshold slen, but the position of character d is max, the position of the last character of the target data stream. The position of character d is therefore taken as the character end position, and the characters cold are cut from the target data stream as one data block.
In some embodiments of the present application, the target character at the slice start position is obtained from the target data stream, the value corresponding to the target character is looked up in the preset array, and the slice length is then determined according to that value in order to slice the target data stream. In this way, the target data stream can be sliced into variable-length blocks, which reduces the impact of added or deleted characters on data slicing in scenarios such as data insertion and deletion, and effectively improves the deduplication rate.
Referring to FIG. 8, which is a schematic module diagram of a data slicing apparatus according to an embodiment of the present application. The data slicing apparatus includes:
a data acquisition module, configured to acquire the target data stream to be sliced;
a search module, configured to obtain the target character at the slice start position from the target data stream and to look up the value corresponding to the target character in a preset array, wherein at least some characters in the preset array correspond to different values; and
a slicing module, configured to determine the slice length according to the value corresponding to the target character and, starting from the slice start position, to slice the target data stream according to the slice length.
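The three modules can be pictured as methods of one object, as in the sketch below. The class name `Slicer`, its method names, and the sample tables are hypothetical. In particular, `length_rule` is only a placeholder for whatever operation turns a looked-up value into a slice length; here the value is used directly as the length, which is an assumption made purely for illustration.

```python
from typing import Callable, Dict, List, Sequence


class Slicer:
    """Hypothetical grouping of the three modules of FIG. 8."""

    def __init__(self, char_table: Dict[str, int],
                 preset_array: Sequence[int],
                 length_rule: Callable[[int], int]):
        self.char_table = char_table      # character -> character number
        self.preset_array = preset_array  # character number -> value
        self.length_rule = length_rule    # value -> slice length (placeholder)

    def acquire(self, source: str) -> str:
        # Data acquisition module: here the source is already in memory.
        return source

    def lookup(self, stream: str, start: int) -> int:
        # Search module: target character -> character number -> value.
        return self.preset_array[self.char_table[stream[start]]]

    def cut(self, stream: str) -> List[str]:
        # Slicing module: determine the length, then cut from start.
        chunks, start = [], 0
        while start < len(stream):
            length = max(1, self.length_rule(self.lookup(stream, start)))
            chunks.append(stream[start:start + length])
            start += length
        return chunks


slicer = Slicer({"a": 0, "b": 1, "c": 2}, [2, 3, 1], lambda v: v)
parts = slicer.cut(slicer.acquire("abcabc"))  # ["ab", "c", "ab", "c"]
```

Because the slice length depends on the character found at each slice start, identical content tends to be cut at the same places, which is the property the variable-length scheme relies on.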
Referring to FIG. 9, which is a schematic diagram of a data slicing system according to an embodiment of the present application. The data slicing system includes a processor and a memory. The memory is used to store a computer program, and when the computer program is executed by the processor, the data slicing method described above is implemented.
The processor may be a central processing unit (CPU). The processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination of such chips.
As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions or modules corresponding to the methods in the embodiments of the present invention. By running the non-transitory software programs, instructions, and modules stored in the memory, the processor executes its various functional applications and data processing, that is, it implements the methods of the method embodiments above.
The memory may include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function; the data storage area may store data created by the processor. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the processor over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium is used to store a computer program, and when the computer program is executed by a processor, the data slicing method described above is implemented.
Although embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and all such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

  1. A data slicing method, characterized in that the method comprises:
    acquiring a target data stream to be sliced;
    obtaining a target character at a slice start position from the target data stream, and looking up a value corresponding to the target character in a preset array, wherein at least some characters in the preset array correspond to different values; and
    determining a slice length according to the value corresponding to the target character, and, starting from the slice start position, slicing the target data stream according to the slice length.
  2. The method according to claim 1, characterized in that looking up the value corresponding to the target character in the preset array comprises:
    looking up a character number of the target character in a preset character table, wherein the preset character table contains a correspondence between characters and character numbers, and different characters have different character numbers;
    using the character number of the target character as an index to look up the value corresponding to the target character in the preset array.
  3. The method according to claim 1, characterized in that determining the slice length according to the value corresponding to the target character comprises:
    performing an operation on the value corresponding to the target character to obtain a corresponding intermediate value;
    if the obtained intermediate value is less than a slice length threshold, performing the following operations in sequence on the characters after the slice start position until a slice end position is determined, and taking the length between the slice end position and the slice start position as the slice length:
    looking up the value corresponding to the character in the preset array;
    performing an operation on the value found to obtain a corresponding intermediate value;
    if the obtained intermediate value is greater than or equal to the slice length threshold, determining the position of the character as the slice end position.
  4. The method according to claim 3, characterized in that, for any value found in the preset array, the corresponding intermediate value is determined as follows:
    performing an XOR operation on the value and the value of a preset variable to obtain a first intermediate value;
    subtracting a preset value from the first intermediate value to obtain a second intermediate value;
    performing an XOR operation on the first intermediate value and the second intermediate value to obtain the corresponding intermediate value.
  5. The method according to claim 4, characterized in that, each time the first intermediate value is calculated, the method further comprises:
    updating the value of the preset variable to the first intermediate value, so that the next XOR operation between the preset variable and a value found in the preset array uses the updated value of the preset variable.
  6. The method according to claim 3, characterized in that, after the target data stream to be sliced is acquired, the method further comprises:
    recording the position of the last character in the target data stream as a character end position;
    for any character in the target data stream, when the intermediate value corresponding to the character is less than the slice length threshold, determining whether the position of the character is the character end position, and if so, taking the length between the character end position and the slice start position as the slice length.
  7. The method according to claim 3, characterized in that, after the operation is performed on the value corresponding to the target character and the corresponding intermediate value is obtained, the method further comprises:
    if the obtained intermediate value is greater than or equal to the slice length threshold, taking the target character as the data block to be cut, and slicing the target data stream accordingly.
  8. A data slicing apparatus, characterized in that the apparatus comprises:
    a data acquisition module, configured to acquire a target data stream to be sliced;
    a search module, configured to obtain a target character at a slice start position from the target data stream and to look up a value corresponding to the target character in a preset array, wherein at least some characters in the preset array correspond to different values; and
    a slicing module, configured to determine a slice length according to the value corresponding to the target character and, starting from the slice start position, to slice the target data stream according to the slice length.
  9. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program, and when the computer program is executed by a processor, the method according to any one of claims 1 to 7 is implemented.
  10. A data slicing system, characterized in that the data slicing system comprises a processor and a memory, the memory is used to store a computer program, and when the computer program is executed by the processor, the method according to any one of claims 1 to 7 is implemented.
PCT/CN2022/141819 2022-07-29 2022-12-26 Data slicing method, apparatus and system WO2024021491A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210910741.8A CN115470186A (en) 2022-07-29 2022-07-29 Data slicing method, device and system
CN202210910741.8 2022-07-29

Publications (1)

Publication Number Publication Date
WO2024021491A1 true WO2024021491A1 (en) 2024-02-01

Family

ID=84366324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141819 WO2024021491A1 (en) 2022-07-29 2022-12-26 Data slicing method, apparatus and system

Country Status (2)

Country Link
CN (1) CN115470186A (en)
WO (1) WO2024021491A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115470186A (en) * 2022-07-29 2022-12-13 天翼云科技有限公司 Data slicing method, device and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101158954A (en) * 2007-11-07 2008-04-09 上海爱数软件有限公司 Method for recognizing repeat data in computer storage
CN101320372A (en) * 2008-05-22 2008-12-10 上海爱数软件有限公司 Compression method for repeated data
CN102682086A (en) * 2012-04-23 2012-09-19 华为技术有限公司 Data segmentation method and data segmentation equipment
US20170242620A1 (en) * 2014-09-28 2017-08-24 Beijing Gupanchuangshi Science And Technology Development Co., Ltd. Data Block Storage Method, Data Query Method and Data Modification Method
CN111722787A (en) * 2019-03-22 2020-09-29 华为技术有限公司 Blocking method and device
WO2021174839A1 (en) * 2020-03-06 2021-09-10 平安科技(深圳)有限公司 Data compression method and apparatus, and computer-readable storage medium
CN115470186A (en) * 2022-07-29 2022-12-13 天翼云科技有限公司 Data slicing method, device and system

Also Published As

Publication number Publication date
CN115470186A (en) 2022-12-13

Similar Documents

Publication Publication Date Title
US20200150890A1 (en) Data Deduplication Method and Apparatus
US10089360B2 (en) Apparatus and method for single pass entropy detection on data transfer
CN108427539B (en) Offline de-duplication compression method and device for cache device data and readable storage medium
US11232073B2 (en) Method and apparatus for file compaction in key-value store system
US11797204B2 (en) Data compression processing method and apparatus, and computer-readable storage medium
US10152389B2 (en) Apparatus and method for inline compression and deduplication
US20240022648A1 (en) Systems and methods for data deduplication by generating similarity metrics using sketch computation
WO2014067063A1 (en) Duplicate data retrieval method and device
US10817474B2 (en) Adaptive rate compression hash processor
CN107704202B (en) Method and device for quickly reading and writing data
WO2014094479A1 (en) Method and device for deleting duplicate data
US11249987B2 (en) Data storage in blockchain-type ledger
CN111125033B (en) Space recycling method and system based on full flash memory array
US9843802B1 (en) Method and system for dynamic compression module selection
WO2024021491A1 (en) Data slicing method, apparatus and system
CN111274245B (en) Method and device for optimizing data storage
US11675768B2 (en) Compression/decompression using index correlating uncompressed/compressed content
WO2021226922A1 (en) Data compression method, apparatus and device, and readable storage medium
WO2020192012A1 (en) Data processing method and apparatus, and storage medium
CN109213972B (en) Method, device, equipment and computer storage medium for determining document similarity
Vikraman et al. A study on various data de-duplication systems
CN114461635A (en) MySQL database data storage method and device and electronic equipment
US8972360B2 (en) Position invariant compression of files within a multi-level compression scheme
US11748307B2 (en) Selective data compression based on data similarity
CN111625186B (en) Data processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22952920

Country of ref document: EP

Kind code of ref document: A1