WO2024021491A1 - Data slicing method, apparatus and system - Google Patents

Data slicing method, apparatus and system

Publication number
WO2024021491A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
slice
target
value
data stream
Prior art date
Application number
PCT/CN2022/141819
Other languages
English (en)
Chinese (zh)
Inventor
刘利
姚栋
赵真
李丽
赵龙飞
杨思源
Original Assignee
天翼云科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 天翼云科技有限公司
Publication of WO2024021491A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 De-duplication implemented within the file system, e.g. based on file segments based on file chunks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Definitions

  • the invention relates to the field of performance testing, and in particular to a data slicing method, device and system.
  • Data deduplication technology slices a data stream or file according to a certain method (also known as dicing) and performs hash calculations on the resulting data blocks to find identical data blocks for deletion.
  • Data deduplication technology can compress duplicate data in the storage system and reduce storage capacity. This technology is currently widely used in backup systems.
  • Fixed-length slicing means defining a fixed slice length (for example, 2 bytes per data block) and then dividing the data stream into data blocks of that length for storage.
  • This slicing method has a low deduplication rate for data streams in some insertion and deletion scenarios. For example, suppose data stream A differs from data stream B only by one new character added at the front end of data stream B.
  • In that case every block boundary shifts, so the deduplication rate is not high.
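The boundary-shift problem described above can be illustrated with a minimal sketch. The sample streams and the helper name `fixed_length_slices` are hypothetical and chosen only for illustration; the slice length of 2 follows the example in the text.

```python
def fixed_length_slices(data: bytes, size: int) -> list[bytes]:
    """Cut data into consecutive blocks of `size` bytes (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


stream_b = b"abcdefgh"        # hypothetical sample data
stream_a = b"x" + stream_b    # the same data with one character added at the front

blocks_b = fixed_length_slices(stream_b, 2)   # ab, cd, ef, gh
blocks_a = fixed_length_slices(stream_a, 2)   # xa, bc, de, fg, h

# Every boundary in stream_a is shifted by one character,
# so the two streams share no identical block.
shared = set(blocks_a) & set(blocks_b)
```

Although the two streams differ by a single character, no block of one stream can be reused for the other, which is the low deduplication rate the text refers to.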
  • embodiments of the present invention provide a data slicing method, a data slicing device, a data slicing system and a computer-readable storage medium, which can improve the data deduplication rate.
  • the present invention provides a data slicing method, which method includes:
  • the slice length is determined, and starting from the starting position of the slice, the target data stream is sliced according to the slice length.
  • searching for a value corresponding to the target character from a preset array includes:
  • determining the slice length according to the numerical value corresponding to the target character includes:
  • the following operations are performed sequentially for the characters after the slice starting position until the slice end position is determined, and the length between the slice end position and the slice start position is taken as the slice length:
  • the position of the character is determined as the end position of the slice.
  • the corresponding intermediate value is determined based on the following method:
  • the method further includes:
  • the method further includes:
  • if the intermediate value corresponding to the character is less than the slice length threshold, determine whether the position of the character is the character end position, and if so, use the length between the character end position and the slice start position as the slice length.
  • the method further includes:
  • the target character is used as a data block to be cut, and the target data stream is sliced.
  • the present invention also provides a data slicing device, which includes:
  • Data acquisition module used to obtain the target data stream to be sliced
  • a search module configured to obtain the target character at the starting position of the slice from the target data stream, and search for the numerical value corresponding to the target character from the preset array, wherein, in the preset array, at least some of the characters correspond to different numerical values;
  • a slicing module configured to determine the slice length according to the numerical value corresponding to the target character, and slice the target data stream according to the slice length starting from the starting position of the slice.
  • Another aspect of the present invention also provides a computer-readable storage medium.
  • the computer-readable storage medium is used to store a computer program. When the computer program is executed by a processor, the method as described above is implemented.
  • the present invention also provides a data slicing system.
  • the data slicing system includes a processor and a memory.
  • the memory is used to store a computer program.
  • when the computer program is executed by the processor, the above-mentioned method is implemented.
  • the target character at the starting position of the slice is obtained from the target data stream, and the value corresponding to the target character is searched from the preset array, and then the slice length is determined based on the value corresponding to the target character, to slice the target data stream.
  • the target data stream can be sliced into variable lengths, which can reduce the impact of newly added and deleted characters on data slicing in scenarios such as data addition and deletion, effectively improving the deduplication rate.
  • Figure 1 shows a schematic diagram of data slicing in some technologies
  • Figure 2 shows a schematic flowchart of a data slicing method provided by an embodiment of the present application
  • Figure 3 shows a schematic diagram of a target data flow provided by an embodiment of the present application
  • Figure 4 shows a schematic diagram of data slicing provided by one embodiment of the present application
  • Figure 5 shows a flow chart of data slicing provided by an embodiment of the present application
  • Figure 6 shows a flow chart of data slicing provided by another embodiment of the present application.
  • Figure 7 shows a flow chart of data slicing provided by another embodiment of the present application.
  • Figure 8 shows a schematic module diagram of a data slicing device provided by an embodiment of the present application.
  • Figure 9 shows a schematic diagram of a data slicing system provided by an embodiment of the present application.
  • In FIG. 1, compared to data stream A, data stream B has new characters, shown in the gray box.
  • When data stream B and data stream A are sliced using fixed-length slicing, the new characters in data stream B shift the slice boundaries, so the data blocks cut from data stream B and data stream A contain few identical data blocks.
  • The storage system therefore needs to allocate storage space for data stream A and data stream B separately. Although data stream B and data stream A have many characters in common, the deduplication rate is low and the storage space is not compressed.
  • this application provides a data slicing method that can improve the deduplication rate.
  • the data slicing method provided in this application can be applied to electronic devices.
  • Electronic devices include but are not limited to tablets, desktop computers, laptops, and servers.
  • Figure 2 is a schematic flow chart of a data slicing method according to an embodiment of the present application.
  • the data slicing method includes steps S21 to S23.
  • Step S21 Obtain the target data stream to be sliced.
  • the target data stream includes content to be stored in the storage system, such as a file data stream and a video image data stream.
  • Step S22 Obtain the target character at the starting position of the slice from the target data stream, and search for the value corresponding to the target character from the preset array, where at least some of the characters in the preset array have different values.
  • the slicing starting position refers to the starting position when slicing the target data stream.
  • the starting position of the slice can be the position of the first character of the target data stream.
  • the slicing starting position may be the position of the first character in the target data stream that has not yet been sliced after the last slicing is completed.
  • FIG. 3 is a schematic diagram of a target data flow provided by an embodiment of the present application. In Figure 3, it is assumed that the first character on the left is the starting position of the target data stream, and the last character on the right is the end position of the target data stream.
  • When slicing of the target data stream has not yet started, the slice starting position can be the position of the first character on the left. When slicing has already started, for example when the last slice was cut at the dotted line, the slice starting position for the next slice can be the position of the first character after the dotted line (that is, the character o).
  • the preset array may be a random array generated in a random manner.
  • A character is represented by 8 bits. Since 8 bits can represent 256 different characters, the preset array can include 256 values, each corresponding to one of these 256 characters. That is, for each character in the target data stream, a corresponding value can be found in the preset array. For example, character A corresponds to the value 20 in the preset array, and character B corresponds to the value 15 in the preset array.
  • the preset array can be saved. For characters in different target data streams, the values corresponding to the characters can be found in the same preset array. That is, only one preset array can be generated, and there is no need to generate separate preset arrays for different target data streams.
  • When searching for the numerical value corresponding to the target character from the preset array, the character number of the target character can first be looked up in a preset character table, where the preset character table includes the correspondence between characters and character numbers, and different characters have different character numbers. The character number of the target character can then be used as an index to find the value corresponding to the target character in the preset array.
  • the preset character table may be an ASCII code table.
  • the character number can be the ASCII code of the character in the ASCII code table.
  • The preset character table can also be pre-established by technical personnel. In such a character table, technicians can assign each character a character number that is different from its ASCII code.
  • The character number can be used as the position of the numerical value when searching for the value corresponding to the target character in the preset array. For example, assuming the character number is 20, the 20th value in the preset array is used as the value corresponding to the target character.
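The preset array and the lookup by character number can be sketched as follows. This is only an illustration under stated assumptions: the function names, the seed, the value range 1..255, and the use of zero-based indexing with the ASCII/byte code as the character number are not specified in the text.

```python
import random


def make_preset_array(seed: int = 0, max_value: int = 255) -> list[int]:
    """Generate one random value per possible 8-bit character (256 entries).

    A fixed seed keeps the array reproducible, so the same array can be
    saved and reused for every target data stream.
    """
    rng = random.Random(seed)
    return [rng.randint(1, max_value) for _ in range(256)]


def lookup_value(preset: list[int], char_number: int) -> int:
    """Use the character number as the index into the preset array."""
    return preset[char_number]


PRESET = make_preset_array(seed=42)          # hypothetical seed
value_for_A = lookup_value(PRESET, ord("A"))  # ASCII code 65 used as character number
```

Because the array is generated once and reused, the same character always maps to the same value regardless of which data stream it appears in.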
  • Step S23 Determine the slice length according to the numerical value corresponding to the target character, and slice the target data stream according to the slice length starting from the starting position of the slice.
  • the value corresponding to the target character can be directly used as the slice length, that is, starting from the starting position of the slice, the target data stream is sliced according to the length defined by the value corresponding to the target character.
  • the value corresponding to the target character can be used as the number of characters to be sliced.
  • For ease of understanding, refer to Figure 3. For example, assuming that the target character in Figure 3 is the first character i on the left and the value corresponding to the target character is 5, then starting from the target character i, a data block containing 5 characters (i.e., itwea) can be cut out from the target data stream.
  • the value corresponding to the target character can also be used as the number of bits to be sliced.
  • a data block containing 7 characters (i.e., itweasc) can be cut out from the target data stream.
  • the lengths of the cut data blocks also differ and need not be exactly the same; that is, when slicing the target data stream, a variable-length slicing method is used. Because the slices are variable-length, the problem described in Figure 1 can be addressed.
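Using the looked-up value directly as the number of characters to cut can be sketched as below. The sample stream `itweascold`, the table contents (5 for the character i as in the Figure 3 example, 3 for everything else), and the clamping of the last block at the end of the stream are assumptions made for illustration.

```python
def slice_by_value(data: bytes, preset: list[int]) -> list[bytes]:
    """Cut `data` into blocks whose length is the preset value of each block's first character."""
    blocks = []
    pos = 0
    while pos < len(data):
        # Value of the target character at the slice starting position;
        # max(1, ...) guards against a zero entry (assumption).
        length = max(1, preset[data[pos]])
        # Python slicing clamps at the end of the stream, so the last block may be shorter.
        blocks.append(data[pos:pos + length])
        pos += length
    return blocks


preset = [3] * 256          # hypothetical table: every character maps to 3 ...
preset[ord("i")] = 5        # ... except i, which maps to 5 as in the Figure 3 example

blocks = slice_by_value(b"itweascold", preset)   # hypothetical sample stream
```

Here the first block is `itwea` (5 characters starting at i), and the remaining blocks follow from the values of the characters s and l.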
  • FIG. 4 is a schematic diagram of data slicing provided by an embodiment of the present application.
  • Data stream B in Figure 4 is similar to data stream A.
  • the characters in the gray box are new characters in data stream B compared to data stream A.
  • Because this application uses variable-length slicing, for data stream A and data stream B it is possible to cut the newly added characters in data stream B into a separate data block.
  • data stream A and data stream B can be cut into data blocks as shown in Figure 4.
  • If data stream A and data stream B both include the data blocks itw, asc, old, ayi, and and, then during data storage only one copy of each of these data blocks needs to be saved. In this way, the storage space is compressed and the deduplication rate is improved.
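Storing each shared data block only once can be sketched with a hash-indexed store. The block lists below reuse the block names from the text; the extra block `xy` standing in for the newly added characters, the use of SHA-256, and the helper names are assumptions for illustration.

```python
import hashlib


def store_blocks(store: dict[str, bytes], blocks: list[bytes]) -> list[str]:
    """Save each block under its hash and return the list of hashes (the stream's recipe)."""
    recipe = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # an identical block is stored only once
        recipe.append(digest)
    return recipe


def rebuild(store: dict[str, bytes], recipe: list[str]) -> bytes:
    """Reassemble a stream from its recipe."""
    return b"".join(store[digest] for digest in recipe)


blocks_a = [b"itw", b"asc", b"old", b"ayi", b"and"]
blocks_b = [b"xy"] + blocks_a             # b"xy" is a hypothetical block of new characters

store: dict[str, bytes] = {}
recipe_a = store_blocks(store, blocks_a)
recipe_b = store_blocks(store, blocks_b)
```

Eleven blocks are written in total, but the store holds only six, because the five blocks common to both streams are kept once.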
  • the lengths of different target data streams may vary greatly. For example, some target data streams include a larger number of characters, and some target data streams include a smaller number of characters. If the value corresponding to the target character is large, it is more suitable to slice the target data stream with a longer length. However, for a target data stream with a shorter length, there may be a problem that it cannot be sliced. For example, assume that the target data stream includes 5 characters, but the value corresponding to the target character is 10 (representing 10 characters). In this case, it may not be possible to slice the target data stream with a short length.
  • the slice length may be determined based on the following method.
  • An operation can be performed on the numerical value corresponding to the target character to obtain the corresponding intermediate value. If the obtained intermediate value is less than the slice length threshold, the following operations are performed on the characters after the slice start position in sequence until the slice end position is determined, and the length between the slice end position and the slice start position is used as the slice length:
  • the position of the character is determined as the end position of the slice.
  • the corresponding intermediate value can be determined based on the following method:
  • the value of the preset variable after each calculation of the first intermediate value, can also be updated to the first intermediate value, so that the value of the preset variable can be compared with the value found from the preset array next time.
  • the XOR operation is performed on the value of the updated preset variable and the value found in the preset array.
  • the position of the last character in the target data stream may also be recorded as the character end position.
  • If the intermediate value corresponding to the character is less than the slice length threshold, it can be determined whether the position of the character is the character end position. If so, the length between the character end position and the slice start position is used as the slice length.
  • After the numerical value corresponding to the target character is calculated and the corresponding intermediate value is obtained, if the obtained intermediate value is greater than or equal to the slice length threshold, the target character alone can be used as the data block to be cut, and the target data stream is sliced accordingly.
  • FIG. 5 is a flow chart of data slicing provided in an embodiment of the present application.
  • Assume that the first character on the left is the target character i, the preset variable is x, and the slice length threshold is slen. The initial value of the preset variable x is one of the values in the preset array, or a randomly generated or specified value.
  • Assume further that the position of the last character in the target data stream is max; the values corresponding to characters i, t, w, and e are i', t', w', and e', respectively; and the positions of characters i, t, w, and e are id, td, wd, and ed, respectively.
  • The value i' corresponding to character i can be XORed with the value of the preset variable x to obtain the first intermediate value corresponding to character i.
  • The preset value can then be subtracted from the first intermediate value corresponding to character i to obtain the second intermediate value.
  • In this way, the second intermediate value corresponding to character i is obtained.
  • Similarly, the preset value can be subtracted from the first intermediate value corresponding to character w to obtain the second intermediate value. An XOR operation is then performed on the first intermediate value and the second intermediate value to obtain the intermediate value corresponding to character w, and this intermediate value is determined to be smaller than the slice length threshold slen.
  • This process corresponds to x^(x-1)<slen in step 3 of Figure 5.
  • The preset value can be subtracted from the first intermediate value corresponding to character e to obtain the second intermediate value. An XOR operation is then performed on the first intermediate value and the second intermediate value to obtain the intermediate value corresponding to character e, and this intermediate value is determined to be greater than the slice length threshold slen.
  • This process corresponds to x^(x-1)>slen in step 4 of Figure 5.
  • The position of character e is determined as the end position of the slice, and the length between character e and character i (including character e and character i) is then used as the slice length to slice the target data stream.
  • the slice length should be 4 (meaning 4 characters). In this way, itwe can be used as a data block to slice the target data stream.
  • the position of character a after character e can be used as the starting position of slicing, and character a can be used as the target character, and the above steps can be performed to continue slicing the target data stream.
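The Figure 5 walk-through can be sketched as a search for the slice length. This is a minimal sketch under several assumptions that the text does not fix: the preset value subtracted is 1 (following the formula x^(x-1)), arithmetic is treated as unsigned 32-bit, the initial value of x is 0, and the threshold and preset entries are hand-picked so that the characters i, t, and w stay below the threshold and e reaches it.

```python
MASK32 = 0xFFFFFFFF  # assumption: values are treated as unsigned 32-bit integers


def intermediate_value(x: int) -> int:
    """Compute x ^ (x - 1); the subtracted preset value is assumed to be 1."""
    return (x ^ ((x - 1) & MASK32)) & MASK32


def find_slice_length(data: bytes, start: int, preset: list[int],
                      slen: int, x0: int = 0) -> int:
    """Scan from `start` until the intermediate value reaches the threshold `slen`."""
    x = x0
    for pos in range(start, len(data)):
        # First intermediate value: XOR of the preset variable and the character's value.
        # The preset variable is then updated to this result.
        x = (x ^ preset[data[pos]]) & MASK32
        if intermediate_value(x) >= slen:
            return pos - start + 1        # slice end position found at `pos`
    return len(data) - start              # end of stream reached without hitting the threshold


# Hypothetical table chosen so that i, t, w stay below the threshold and e reaches it.
preset = [1] * 256
preset[ord("i")], preset[ord("t")] = 1, 2
preset[ord("w")], preset[ord("e")] = 4, 23

data = b"itweascold"                       # hypothetical sample stream
length = find_slice_length(data, 0, preset, slen=16)
first_block = data[:length]
```

With these values x takes the values 1, 3, 7, and 16 for i, t, w, and e, so the intermediate values are 1, 1, 1, and 31, and the slice ends at e with a slice length of 4.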
  • FIG. 6 is a flow chart of data slicing provided by another embodiment of the present application.
  • the obtained intermediate value is greater than the slice length threshold slen.
  • the target character i is treated as a separate data block and the target data stream is sliced.
  • FIG. 7 is a flow chart of data slicing provided by another embodiment of the present application.
  • The value corresponding to character d is d', and the position of character d is dd.
  • When the value corresponding to character d is calculated, the obtained intermediate value is less than the slice length threshold slen, but the position of character d is the position max of the last character of the target data stream. The position of character d can then be used as the character end position, and the characters cold can be used as a data block to slice the target data stream.
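The cases of Figures 5 to 7 can be combined into one slicing loop, sketched below. The same assumptions apply as for any illustration of this scheme: a subtracted preset value of 1, unsigned 32-bit arithmetic, and hand-picked table entries. In addition, resetting x to its initial value at every slice starting position is an assumption; the text does not state whether x is carried over between slices.

```python
MASK32 = 0xFFFFFFFF  # assumption: unsigned 32-bit arithmetic


def slice_stream(data: bytes, preset: list[int], slen: int, x0: int = 0) -> list[bytes]:
    """Cut `data` into variable-length blocks using the XOR-based end-position test."""
    blocks = []
    start = 0
    while start < len(data):
        x = x0                    # assumption: x is reset at each slice starting position
        end = len(data) - 1       # Figure 7 case: fall back to the last character (max)
        for pos in range(start, len(data)):
            x = (x ^ preset[data[pos]]) & MASK32
            mid = (x ^ ((x - 1) & MASK32)) & MASK32
            if mid >= slen:       # Figure 5 case; at pos == start this is the Figure 6 case
                end = pos
                break
        blocks.append(data[start:end + 1])
        start = end + 1
    return blocks


# Hypothetical table: i reaches the threshold immediately, c/o/l/d never do.
preset = [1] * 256
preset[ord("i")] = 16
preset[ord("c")], preset[ord("o")] = 1, 2
preset[ord("l")], preset[ord("d")] = 4, 8

blocks = slice_stream(b"icold", preset, slen=16)
```

In this example the character i forms a data block by itself, as in Figure 6, and the characters cold form the final block because the end of the stream is reached before the threshold, as in Figure 7.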
  • the target character at the starting position of the slice is obtained from the target data stream, and the value corresponding to the target character is searched from the preset array, and then the slice length is determined based on the value corresponding to the target character, to slice the target data stream.
  • the target data stream can be sliced into variable lengths, which can reduce the impact of added and deleted characters on data slicing in scenarios such as data addition and deletion, effectively improving the deduplication rate.
  • FIG. 8 is a schematic module diagram of a data slicing device according to an embodiment of the present application.
  • Data slicing devices include:
  • Data acquisition module used to obtain the target data stream to be sliced
  • the search module is used to obtain the target character at the starting position of the slice from the target data stream, and to find the value corresponding to the target character from the preset array, where at least some of the characters in the preset array have different values; and
  • the slicing module is used to determine the slice length according to the value corresponding to the target character, and slice the target data stream according to the slice length starting from the starting position of the slice.
  • FIG. 9 is a schematic diagram of a data slicing system provided by an embodiment of the present application.
  • the data slicing system includes a processor and a memory.
  • the memory is used to store computer programs.
  • when the computer program is executed by the processor, the above-mentioned data slicing method is implemented.
  • the processor may be a central processing unit (Central Processing Unit, CPU).
  • the processor can also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination of these types of chips.
  • the memory can be used to store non-transitory software programs, non-transitory computer executable programs and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention.
  • the processor executes various functional applications and data processing by running the non-transitory software programs, instructions and modules stored in the memory, thereby implementing the method of the above method embodiments.
  • the memory may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created by the processor, etc.
  • the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • the memory may optionally include memory located remotely from the processor, and these remote memories may be connected to the processor via a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
  • One embodiment of the present application also provides a computer-readable storage medium, which is used to store a computer program.
  • when the computer program is executed by a processor, the above-mentioned data slicing method is implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Input (AREA)

Abstract

A data slicing method, apparatus and system. The data slicing method comprises: acquiring a target data stream to be sliced (S21); acquiring, from the target data stream, a target character at a slice starting position, and searching a preset array for a numerical value corresponding to the target character, wherein the numerical values corresponding to at least some characters in the preset array are different (S22); and determining a slice length according to the numerical value corresponding to the target character and, starting from the slice starting position, slicing the target data stream according to the slice length (S23). The method can improve the deduplication rate.
PCT/CN2022/141819 2022-07-29 2022-12-26 Data slicing method, apparatus and system WO2024021491A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210910741.8 2022-07-29
CN202210910741.8A CN115470186A (zh) 2022-07-29 2022-07-29 Data slicing method, device and system

Publications (1)

Publication Number Publication Date
WO2024021491A1 (fr) 2024-02-01

Family

ID=84366324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141819 WO2024021491A1 (fr) 2022-07-29 2022-12-26 Data slicing method, apparatus and system

Country Status (2)

Country Link
CN (1) CN115470186A (fr)
WO (1) WO2024021491A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115470186A (zh) * 2022-07-29 2022-12-13 天翼云科技有限公司 Data slicing method, device and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101158954A (zh) * 2007-11-07 2008-04-09 上海爱数软件有限公司 Method for identifying duplicate data in computer storage
CN101320372A (zh) * 2008-05-22 2008-12-10 上海爱数软件有限公司 Method for compressing duplicate data
CN102682086A (zh) * 2012-04-23 2012-09-19 华为技术有限公司 Data chunking method and device
US20170242620A1 (en) * 2014-09-28 2017-08-24 Beijing Gupanchuangshi Science And Technology Development Co., Ltd. Data Block Storage Method, Data Query Method and Data Modification Method
CN111722787A (zh) * 2019-03-22 2020-09-29 华为技术有限公司 Chunking method and apparatus therefor
WO2021174839A1 (fr) * 2020-03-06 2021-09-10 平安科技(深圳)有限公司 Data compression method and apparatus, and computer-readable storage medium
CN115470186A (zh) * 2022-07-29 2022-12-13 天翼云科技有限公司 Data slicing method, device and system

Also Published As

Publication number Publication date
CN115470186A (zh) 2022-12-13

Similar Documents

Publication Publication Date Title
US10089360B2 (en) Apparatus and method for single pass entropy detection on data transfer
US20200150890A1 (en) Data Deduplication Method and Apparatus
CN108427539B (zh) Offline deduplication and compression method and apparatus for cache device data, and readable storage medium
US10678435B2 (en) Deduplication and compression of data segments in a data storage system
US11627207B2 (en) Systems and methods for data deduplication by generating similarity metrics using sketch computation
US10152389B2 (en) Apparatus and method for inline compression and deduplication
US11232073B2 (en) Method and apparatus for file compaction in key-value store system
US20210397350A1 (en) Data Processing Method and Apparatus, and Computer-Readable Storage Medium
CN107704202B (zh) Method and apparatus for fast data reading and writing
WO2014067063A1 (fr) Method and device for retrieving duplicate data
CN111125033B (zh) Space reclamation method and system based on an all-flash array
WO2014094479A1 (fr) Method and device for deleting duplicate data
US11249987B2 (en) Data storage in blockchain-type ledger
CN110389967B (zh) Data storage method, apparatus, server and storage medium
US9843802B1 (en) Method and system for dynamic compression module selection
US11995050B2 (en) Systems and methods for sketch computation
WO2024021491A1 (fr) Data slicing method, apparatus and system
WO2021226922A1 (fr) Data compression method, apparatus and device, and readable storage medium
CN111274245B (zh) Method and apparatus for optimizing data storage
US11675768B2 (en) Compression/decompression using index correlating uncompressed/compressed content
CN107423425B (zh) Method for fast storage and query of data in k/v format
WO2020192012A1 (fr) Data processing method and apparatus, and storage medium
Vikraman et al. A study on various data de-duplication systems
CN114461635A (zh) MySQL database data storage method, apparatus and electronic device
US8972360B2 (en) Position invariant compression of files within a multi-level compression scheme

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22952920

Country of ref document: EP

Kind code of ref document: A1