RU2622875C2

RU2622875C2 - Method of digital data prefix deduplication

Info

Publication number: RU2622875C2
Application number: RU2015118470A
Authority: RU
Inventors: Дмитрий Борисович Афанасьев; Максим Андреевич Жуков
Priority date: 2015-05-18
Filing date: 2015-05-18
Publication date: 2017-06-20
Also published as: RU2015118470A

Abstract

FIELD: information technology.

SUBSTANCE: digital data are divided into data blocks of equal length, and the metadata of these blocks are placed digit by digit in a prefix tree. The metadata are taken as segments of equal length directly from the data blocks. Whether a block identical to the one being processed exists among the blocks already processed is determined by traversing the prefix tree in a predetermined segment traversal order. If the current level of the prefix tree holds no link under the value of the segment that corresponds to the traversal order, the block being processed is recognized as unique, and a link to it is added at that level under the corresponding segment value. If a link to a previously processed data block is found in the prefix tree, the two blocks are compared in full: if they differ, the link to the previously processed block is replaced by a link to a new tree branch containing the sequence of nodes up to the first differing segment; if they match, the block being processed is declared a duplicate.

EFFECT: elimination of redundancy in the processed digital data.

2 dwg

Description

The invention relates to the field of data compression and can be used for storing large volumes of redundant data.

Data deduplication methods that perform at least the primary search for identical data blocks by the value of their hash sums are known from the prior art; they are described in Alexander Shcherbinin's publication "Data Deduplication Solutions" // Storage News. 2008. No. 2 (35) (http://old.i-teco.ru/article198.html). The disadvantages of these existing solutions are the computational overhead of calculating the hash function result for every data block, the need for hash collision resolution methods, and a large volume of metadata that is directly proportional to the volume of unique data blocks and depends on the size of the result of the hash function used.
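For orientation, the hash-based approach criticized above can be sketched as follows. This is an illustrative sketch only, not the method of the cited publication: the choice of SHA-256, the function name hash_dedup and the in-memory index are assumptions made for the example. It shows the per-block hash computation and the metadata volume growing with the number of unique blocks and the digest size.

```python
import hashlib

def hash_dedup(data: bytes, block_size: int = 4096):
    """Hash-indexed block deduplication (illustrative baseline only)."""
    index = {}    # digest -> reference (position in `unique`)
    unique = []   # stands in for the storage medium
    refs = []     # one reference per input block
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        # A hash is computed for every block, duplicate or not.
        digest = hashlib.sha256(block).digest()
        ref = index.get(digest)
        if ref is None:
            ref = len(unique)
            unique.append(block)
            index[digest] = ref
        # Note: no byte-wise check is made here, so a hash collision
        # would go unnoticed; real systems need a resolution strategy.
        refs.append(ref)
    # Metadata volume: one digest per unique block.
    metadata_bytes = len(index) * hashlib.sha256().digest_size
    return refs, unique, metadata_bytes
```

With 4-byte blocks, hash_dedup(b"aaaabbbbaaaa", 4) stores two unique blocks and keeps 64 bytes of digests, that is, 32 bytes per unique block regardless of block content.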

The closest to the claimed technical solution is a method for optimizing block data deduplication (US 8108353 B2, published January 31, 2012), which divides digital data into data blocks of equal length and places the metadata of these blocks, represented by the values of their hash functions, in a prefix tree. The disadvantages of this solution are the choice of hash function results as the metadata for each block of processed data, which incurs additional computational cost for calculating the hash function result for every block, and the need to store large volumes of the resulting metadata.

The problem addressed by the claimed invention is to reduce the volume of metadata, the computational overhead and the processing time.

This problem is solved as follows. In a method of prefix deduplication of digital data, according to which digital data are divided into data blocks of equal length and processed sequentially, with the metadata of these blocks placed digit by digit in a prefix tree, the novelty is that the metadata are selected as segments, also of equal length, directly from the data blocks. The presence of a block identical to the one being processed among the blocks already processed is determined by traversing the prefix tree in a predetermined segment traversal order. If the current level of the prefix tree holds no link under the value of the segment corresponding to the traversal order, the block being processed is recognized as unique, and a link to this block is added at this level of the prefix tree under the corresponding segment value. If a link to a previously processed data block is found in the prefix tree, the two blocks are compared in full. If the comparison reveals a difference, the link to the previously processed block is replaced by a link to a new tree branch containing the sequence of nodes up to the first differing segment; if the blocks match, the block being processed is determined to be a duplicate.

The technical result provided by this set of features is the elimination of redundancy in the processed digital data.

FIG. 1 shows the algorithm for processing a data block by the method of prefix deduplication of digital data. The method operates on data blocks of equal length obtained from the digital data. For each data block to be processed, the first segment to be processed is determined according to a pre-selected data block traversal order, for example the direct order, which means traversing the data block sequentially from the lowest-order segment to the highest-order one. Traversal of the prefix tree starts at its root node. A data block is taken from the digital data and divided into segments of equal length. The value of the first segment of this block, according to the selected traversal order, determines the transition from the root node to another node. The transition from the current node to the next node is made by the value of the current segment of the data block, provided that a link to a node exists under that value. If there is no link under the value of the current segment, the data block being processed is written to the medium, and the metadata are then updated by writing a reference to the written block into the current node of the prefix tree under the value of the current segment. On moving to the next node of the prefix tree, the next segment according to the selected traversal order becomes the current segment. If a reference to a data block on the storage medium exists under the value of the current segment, that data block is read from the medium and compared in full with the data block being processed. If the data blocks do not match, the data block being processed is written to the medium (the block is recognized as unique), a branch of the prefix tree is built from the current node up to the first differing segment of the two data blocks according to the selected traversal order, and references to both data blocks are written into the tree node under the values of the differing segments.
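The processing of a block described above can be sketched in code. The following is a minimal illustration under stated assumptions, not the patented implementation: the names Node and PrefixDedup, the use of dictionaries for node links, an in-memory list standing in for the storage medium, and list indices serving as block references are all hypothetical choices made for the example. The direct traversal order is assumed.

```python
class Node:
    """Prefix tree node: maps a segment value to a Node or a block reference."""
    def __init__(self):
        self.links = {}

class PrefixDedup:
    def __init__(self, block_size=4, seg_size=1):
        if block_size % seg_size:
            raise ValueError("block size must be a multiple of segment size")
        self.block_size, self.seg_size = block_size, seg_size
        self.root = Node()
        self.store = []  # assumed stand-in for the medium; index = reference

    def _segments(self, block):
        s = self.seg_size
        return [block[i:i + s] for i in range(0, len(block), s)]

    def _write(self, block):
        self.store.append(block)
        return len(self.store) - 1

    def add(self, block):
        """Process one block; return (reference, is_duplicate)."""
        if len(block) != self.block_size:
            raise ValueError("blocks must have equal length")
        segs = self._segments(block)
        node = self.root
        for depth, seg in enumerate(segs):
            link = node.links.get(seg)
            if link is None:
                # No link under this segment value: the block is unique.
                ref = self._write(block)
                node.links[seg] = ref
                return ref, False
            if isinstance(link, Node):
                node = link          # follow the link to the next level
                continue
            # Link is a reference to a stored block: compare in full.
            old = self.store[link]
            if old == block:
                return link, True    # duplicate
            ref = self._write(block)
            old_segs = self._segments(old)
            k = depth + 1            # find the first differing segment
            while old_segs[k] == segs[k]:
                k += 1
            # Replace the reference with a branch of nodes up to segment k.
            branch = Node()
            node.links[seg] = branch
            for i in range(depth + 1, k):
                nxt = Node()
                branch.links[segs[i]] = nxt
                branch = nxt
            branch.links[old_segs[k]] = link
            branch.links[segs[k]] = ref
            return ref, False
        raise AssertionError("unreachable for equal-length blocks")
```

In this sketch no hash is computed, and the stored block is read back for a full comparison only when the traversal reaches a block reference. A block that reaches an empty link is written without any comparison at all.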

In accordance with FIG. 1, FIG. 2 shows a partially filled prefix tree containing the metadata of 18 processed data blocks. In FIG. 2, references to data blocks on the medium are drawn as dotted lines. The segment size in this example is 1 byte, so the maximum number of links in a node is 256. Using this tree, blocks located on the medium can be found by their first three segments, according to the pre-selected data block traversal order. Segment values are shown on the edges of the tree. The example contains data blocks whose initial segments have the values 0, 2, 74 and 255. In particular, the metadata contain information about two processed blocks whose first segment has the value 74 and which differ in the value of the second segment (0 and 93). Under the values of the differing segments, the corresponding node holds references to the data blocks located on the storage medium, whereas the node reached by the first-segment value 0 holds both links to other nodes under the values of the differing segments of data blocks (0 and 125) and references to data blocks located on the storage medium (8).
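The relationship between segment size and node fan-out in this example, and the point at which two blocks diverge, can be illustrated with a short sketch. The function names max_links, segments and first_difference are hypothetical, and treating the first byte of a block as its lowest-order segment is an assumption. The reverse order is shown only as one possible alternative to the direct order.

```python
def max_links(segment_bytes: int) -> int:
    """Maximum number of links in a node for a given segment size."""
    return 2 ** (8 * segment_bytes)

def segments(block: bytes, segment_bytes: int = 1, direct: bool = True):
    """Split a block into equal-length segments, as integer values,
    in the direct order (assumed: first byte first) or the reverse order."""
    if len(block) % segment_bytes:
        raise ValueError("block length must be a multiple of segment size")
    segs = [int.from_bytes(block[i:i + segment_bytes], "big")
            for i in range(0, len(block), segment_bytes)]
    return segs if direct else segs[::-1]

def first_difference(a, b):
    """Index of the first differing segment, or None if all are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None
```

For 1-byte segments, max_links(1) gives 256, matching the example. Two blocks that both start with 74 and continue with 0 and 93 respectively first differ at segment index 1, so their references sit one level below the root.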

The proposed method can be implemented by four modules:

1) a data receiving module, responsible for receiving data and providing data blocks of fixed length;

2) a data verification module, which determines whether an identical block is already present in the storage;

3) a metadata storage module, which stores, searches and provides access to the metadata;

4) a storage access module, which interacts with the medium holding the deduplicated data.

The data receiving module extracts a fixed-length block from the data and passes it to the data verification module. The data verification module looks up the data block by traversing the metadata, interacting with the metadata storage module. If a reference to a data block exists, the metadata storage module returns the reference to that block on the medium, and the verification module obtains the data block from the storage access module and compares the two blocks in full. If the blocks match, the verification module recognizes the block being checked as a duplicate. Otherwise, it writes the block being checked through the storage access module and returns the reference of the written block to the metadata storage module, which builds a branch of the prefix tree from the node containing the reference to the block that was read, following the defined data block traversal order up to the first differing segment, and writes the references to both data blocks under the values of the differing segments. If no reference to a data block exists, the verification module initiates writing of the data block through the storage access module and, having received the reference of the written block, passes it to the metadata storage module, which writes the reference into the prefix tree node under the value of the last checked segment.
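The interaction of the four modules can be sketched as follows. This is an illustrative sketch under assumptions, not the patented implementation: the class and method names are hypothetical, segments are assumed to be 1 byte, the tree is kept as nested dictionaries, the medium is an in-memory list, and zero-padding of a short final block is an assumption the description does not address.

```python
import io

class Receiver:
    """Module 1: provides fixed-length blocks from a byte stream."""
    def __init__(self, stream, block_size):
        self.stream, self.block_size = stream, block_size
    def blocks(self):
        while chunk := self.stream.read(self.block_size):
            yield chunk.ljust(self.block_size, b"\0")  # padding is assumed

class Storage:
    """Module 4: access to the medium (here an in-memory list)."""
    def __init__(self):
        self._blocks = []
    def write(self, block):
        self._blocks.append(block)
        return len(self._blocks) - 1
    def read(self, ref):
        return self._blocks[ref]

class Metadata:
    """Module 3: prefix tree; a link is a dict (node) or an int (reference)."""
    def __init__(self):
        self.root = {}
    def find(self, block):
        """Return (node, depth, link); link is None or a block reference."""
        node = self.root
        for depth in range(len(block)):
            link = node.get(block[depth])
            if not isinstance(link, dict):
                return node, depth, link
            node = link
        raise AssertionError("unreachable for equal-length blocks")
    def add_ref(self, node, segment, ref):
        node[segment] = ref
    def split(self, node, depth, old, old_ref, new, new_ref):
        """Build a branch from `node` up to the first differing segment."""
        k = depth + 1
        while old[k] == new[k]:
            k += 1
        branch = node[new[depth]] = {}
        for i in range(depth + 1, k):
            branch[new[i]] = {}
            branch = branch[new[i]]
        branch[old[k]], branch[new[k]] = old_ref, new_ref

class Verifier:
    """Module 2: decides whether a block is a duplicate."""
    def __init__(self, metadata, storage):
        self.metadata, self.storage = metadata, storage
    def process(self, block):
        node, depth, ref = self.metadata.find(block)
        if ref is None:                      # no link: block is unique
            new_ref = self.storage.write(block)
            self.metadata.add_ref(node, block[depth], new_ref)
            return new_ref, False
        stored = self.storage.read(ref)      # full comparison
        if stored == block:
            return ref, True
        new_ref = self.storage.write(block)
        self.metadata.split(node, depth, stored, ref, block, new_ref)
        return new_ref, False

def deduplicate(data, block_size=4):
    """Wire the modules together; return per-block references and storage."""
    storage = Storage()
    verifier = Verifier(Metadata(), storage)
    receiver = Receiver(io.BytesIO(data), block_size)
    return [verifier.process(b)[0] for b in receiver.blocks()], storage
```

Keeping the tree logic in the metadata module and the comparison in the verifier mirrors the division of responsibilities in the description: the verifier never touches tree nodes directly beyond passing them back, and only the storage module reads or writes blocks.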

The result of the described technical solution is data from which redundancy has been eliminated at the block level.

Claims

A method of prefix deduplication of digital data, according to which digital data are divided into data blocks of equal length and processed sequentially, with the metadata of these blocks placed digit by digit in a prefix tree, characterized in that the metadata are selected as segments, also of equal length, directly from the data blocks; the presence of a block identical to the one being processed among the blocks already processed is determined by traversing the prefix tree in a predetermined segment traversal order; if the current level of the prefix tree holds no link under the value of the segment corresponding to the traversal order, the block being processed is recognized as unique and a link to this block is added at this level of the prefix tree under the corresponding segment value; if a link to a previously processed data block is found in the prefix tree, the two blocks are compared in full, and if the blocks differ, the link to the previously processed block is replaced by a link to a new tree branch containing the sequence of nodes up to the first differing segment, whereas if the blocks match, the block being processed is determined to be a duplicate.