RU2403677C1 - Method for lossless data compression and retrieval

Info

Publication number: RU2403677C1
Application number: RU2009104211/09A
Authority: RU
Inventors: Сергей Борисович Муллов (RU)
Original assignee: Сергей Борисович Муллов
Priority date: 2009-02-09
Filing date: 2009-02-09
Publication date: 2010-11-10
Also published as: RU2009104211A

Abstract

FIELD: information technology.

SUBSTANCE: the number of zeros n₀ and the number of ones n₁ are determined in the data stream to be compressed. An algorithm is selected for assigning non-repeating digital codes to all possible permutations with repetitions of n₀ zeros and n₁ ones, and for finding the permutation with repetitions that corresponds to a given digital code and the count of each symbol. The specific data stream of n₀ zeros and n₁ ones is assigned a digital code N_c in accordance with the selected algorithm. The total number of characters n_c in the digital code N_c is determined. The value d₁, equal to the sum of n₁ and n₀ minus n_c, is determined, as well as the value d₂, equal to half the difference between n₀ and n₁, after which the assigned digital code N_c and the values d₁ and d₂ are stored. To retrieve the data stream, the operations are reversed: the values n₀ and n₁ are recovered, and the selected algorithm is applied to n₀, n₁ and N_c to find the specific permutation with repetitions of n₀ zeros and n₁ ones that corresponds to the initial data stream.

EFFECT: higher compression ratio of data and possibility of compressing previously compressed data.

2 tbl

Description

The invention relates to reducing the redundancy of transmitted and stored information and can be used for lossless compression and recovery of digital data in information systems and telecommunication systems.

The foundations of information theory were laid by C. Shannon in 1948 in his pioneering work on the subject (D. Salomon, "Compression of Data, Images and Sound", Moscow: Tekhnosfera, 2004, p. 25), in which he introduced the concept of source entropy. The fundamental significance of this quantity is that it sets the lower bound on achievable compression (D. Salomon, "Compression of Data, Images and Sound", Moscow: Tekhnosfera, 2004, p. 8). This bound can be approached arbitrarily closely with a suitable source coding method. The entropy of a symbol a that has probability P is understood as the amount of information contained in a, which equals -P·log₂(P) (D. Salomon, "Compression of Data, Images and Sound", Moscow: Tekhnosfera, 2004, p. 25).
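As an illustration of the quantity defined above, the following minimal Python sketch sums the per-symbol terms -P·log₂(P) over the alphabet of a stream, taking each probability P from the symbol counts. The function name stream_entropy is hypothetical and is not part of the patent.

```python
import math
from collections import Counter


def stream_entropy(stream: str) -> float:
    """Entropy in bits per symbol: the sum of -P*log2(P) over the alphabet."""
    counts = Counter(stream)          # occurrences of each symbol
    total = len(stream)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# A skewed stream has low entropy; a balanced one reaches 1 bit per symbol.
print(stream_entropy("000100000000"))
print(stream_entropy("0101"))
```

The closer the symbol probabilities are to each other, the closer the result is to the maximum for the alphabet, which is the high-entropy case discussed later in the description.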

One of the main indicators of compression efficiency is the compression ratio K_c, defined as:

K_c = size of the output stream / size of the input stream.

A distinction is made between lossless compression, in which the original information can be restored completely, and lossy compression, in which the data is restored with a partial loss of information.

Statistical methods of data compression and recovery are known that exploit the statistical properties of the data being compressed (D. Salomon, "Compression of Data, Images and Sound", Moscow: Tekhnosfera, 2004, p. 25). Statistical properties here means the probability of each symbol in the data stream. To determine them, the number of occurrences of each symbol in the data stream is counted.

A statistical method of lossless data compression and recovery using variable-length codes is known (D. Salomon, "Compression of Data, Images and Sound", Moscow: Tekhnosfera, 2004, p. 26). It comprises choosing an alphabet, that is, a set of symbols a₁ to a_n; collecting statistics, that is, counting the occurrences of each symbol in the stream to be compressed and computing the probability of each symbol of the alphabet in that stream; constructing variable-length codes; and replacing the original symbols with those codes, so that frequent symbols receive short codes and rare symbols receive long ones. To restore the original data stream, the variable-length codes are replaced by the original symbols of the alphabet in accordance with the constructed code. The best known variable-length coding method is Huffman coding (D. Salomon, "Compression of Data, Images and Sound", Moscow: Tekhnosfera, 2004, p. 30).

The disadvantages of this known method are that:

- maximum compression is achieved only if the symbol probabilities are negative powers of 2;

- differing codeword lengths lead to uneven delays during data recovery;

- the method is ineffective if the symbols of the alphabet have similar probabilities;

- the method is not applicable to a two-symbol alphabet.

Also known is a statistical method of lossless data compression and recovery called arithmetic coding (D. Salomon, "Compression of Data, Images and Sound", Moscow: Tekhnosfera, 2004, p. 63), taken as the prototype, which includes:

- collecting statistics on the data stream to be compressed, that is, counting the occurrences of each symbol in it;

- setting the "current interval" to [0, 1);

- repeating the following steps for each symbol of the input data stream:

- dividing the current interval into parts proportional to the probabilities of the symbols;

- selecting the subinterval that corresponds to the symbol and making it the new current interval.

When the entire input data stream has been processed, the output of the algorithm is taken to be any point that uniquely identifies the current interval, written as a finite digital code.

To decode the resulting string of digits in the known method, the first digit of the code is read and the first symbol is determined from the table into which the symbols of the stream, their frequencies and probabilities, and the intervals assigned to them were entered at the compression stage. The effect of the first symbol is then removed from the code by subtracting the lower end of that symbol's interval and dividing by the length of the interval. This sequence of steps is repeated until the end of the code.
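The encoding and decoding steps of the prototype described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the cited book: it uses exact rational arithmetic (Python's fractions.Fraction) instead of a practical fixed-precision coder, it takes the lower end of the final interval as the output point, and it passes the stream length to the decoder. The names build_intervals, encode and decode are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def build_intervals(stream):
    """Assign each symbol a subinterval of [0, 1) proportional to its probability."""
    counts = Counter(stream)
    total = len(stream)
    intervals, low = {}, Fraction(0)
    for sym in sorted(counts):
        width = Fraction(counts[sym], total)
        intervals[sym] = (low, low + width)
        low += width
    return intervals


def encode(stream, intervals):
    """Narrow the current interval once per symbol; return a point inside it."""
    low, high = Fraction(0), Fraction(1)
    for sym in stream:
        span = high - low
        s_low, s_high = intervals[sym]
        low, high = low + span * s_low, low + span * s_high
    return low  # assumption: the lower end is used as the identifying point


def decode(code, intervals, length):
    """Find the symbol whose interval holds the code, then remove its effect."""
    out = []
    for _ in range(length):
        for sym, (s_low, s_high) in intervals.items():
            if s_low <= code < s_high:
                out.append(sym)
                # subtract the lower end and divide by the interval length
                code = (code - s_low) / (s_high - s_low)
                break
    return "".join(out)


stream = "000100000000"
table = build_intervals(stream)
assert decode(encode(stream, table), table, len(stream)) == stream
```

Exact fractions keep the sketch short and free of rounding issues; a production coder would instead emit bits incrementally with integer arithmetic.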

Arithmetic coding makes it possible to compress data down to the theoretical limit (D. Salomon, "Compression of Data, Images and Sound", Moscow: Tekhnosfera, 2004, p. 76). The disadvantage of this known method is that:

- it is ineffective if the probabilities of the symbols in the stream to be compressed are equal or close to one another, that is, when the entropy of the data stream approaches its maximum value.

The technical result of the claimed invention is:

- a higher degree of data compression;

- the ability to compress data that has already been compressed.

This technical result is achieved as follows. In a method of lossless data compression and recovery that includes counting the occurrences of each symbol in a data stream to be compressed, consisting of p different symbols, and denoting the counts n₁, n₂, …, n_p, according to the invention an algorithm is selected for assigning non-repeating digital codes to all possible permutations with repetitions of an arbitrary number of symbols and for finding, from an assigned digital code and the count of each symbol, the corresponding permutation with repetitions. The specific data stream of p different symbols with counts n₁, n₂, …, n_p is then assigned a specific digital code N_c in accordance with the selected algorithm, after which the assigned digital code N_c and the counts n₁, n₂, …, n_p are stored. To recover the data stream, the selected algorithm, the assigned digital code N_c and the values n₁, n₂, …, n_p are used to find the specific permutation with repetitions of p different symbols with counts n₁, n₂, …, n_p that corresponds to the original data stream.

This technical result is also achieved as follows. In a method of lossless data compression and recovery, the occurrences of each symbol in a data stream to be compressed, consisting of p different symbols, are counted and denoted n₁, n₂, …, n_p. According to the invention, an algorithm A is selected for assigning non-repeating digital codes to all possible permutations with repetitions of an arbitrary number of symbols and for finding, from an assigned digital code and the count of each symbol, the corresponding permutation with repetitions. The specific data stream of p different symbols with counts n₁, n₂, …, n_p is then assigned a specific digital code N_c in accordance with algorithm A, after which the total number of characters in the digital code N_c is counted and denoted n_c. An algorithm B is then selected for expressing the values n₁, n₂, …, n_p through the value n_c and new values d₁, d₂, …, d_p, and for finding n₁, n₂, …, n_p from n_c, d₁, d₂, …, d_p. The assigned digital code N_c and the values d₁, d₂, …, d_p are then stored. To recover the data stream, the total number of characters in the digital code N_c is counted and denoted n_c; the values n₁, n₂, …, n_p are found in accordance with algorithm B from n_c and d₁, d₂, …, d_p; and then, in accordance with algorithm A, the code N_c and the values n₁, n₂, …, n_p, the specific permutation with repetitions of p different symbols with counts n₁, n₂, …, n_p that corresponds to the original data stream is found. The method is carried out as follows.

In a data stream to be compressed, consisting of p different symbols, the occurrences of each symbol are counted and denoted n₁, n₂, …, n_p. An algorithm is then selected for assigning non-repeating digital codes to all possible permutations with repetitions of an arbitrary number of symbols and for finding, from an assigned digital code and the count of each symbol, the corresponding permutation with repetitions. The specific data stream of p different symbols with counts n₁, n₂, …, n_p is then assigned a specific digital code N_c in accordance with the selected algorithm, and the assigned digital code N_c and the counts n₁, n₂, …, n_p are stored.

Various algorithms are possible for finding N_c and for restoring the original data stream from the values N_c, n₁, n₂, …, n_p. The best algorithm is the one that requires the fewest permutations with repetitions, or the fewest computational operations, to find N_c and to restore the original data stream from N_c, n₁, n₂, …, n_p. Data compression in the claimed method is achieved because writing down the number of permutations with repetitions of p different symbols with counts n₁, n₂, …, n_p, which is given by the formula (n₁ + n₂ + … + n_p)! / (n₁!·n₂!·…·n_p!) (M. Hall, "Combinatorics", Moscow: Mir, 1970, p. 13), always requires less information (for example, fewer bits in the case of a two-symbol alphabet) than writing down the uncompressed data stream. That is, log_k[(n₁ + n₂ + … + n_p)! / (n₁!·n₂!·…·n_p!)] < n₁ + n₂ + … + n_p, where n₁, …, n_p are the counts of each symbol from 1 to p and log_k is the logarithm to base k.
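The counting argument above can be checked numerically with a short Python sketch. It evaluates the number of permutations with repetitions for given symbol counts and the number of bits needed to number all of them from zero. The function names permutation_count and code_bits are hypothetical, and the sketch covers only the code N_c; the cost of storing the counts themselves is not included.

```python
import math


def permutation_count(counts):
    """(n1 + ... + np)! / (n1! * ... * np!) for a list of symbol counts."""
    total = math.factorial(sum(counts))
    for c in counts:
        total //= math.factorial(c)   # each partial quotient is an integer
    return total


def code_bits(counts):
    """Bits needed to write the largest sequence number, count - 1."""
    return max(1, (permutation_count(counts) - 1).bit_length())


# Twelve-bit streams: one with a single one, one with six zeros and six ones.
for counts in ([11, 1], [6, 6]):
    print(counts, permutation_count(counts), code_bits(counts))
```

For eleven zeros and a single one there are 12 arrangements, so 4 bits suffice for N_c; for six zeros and six ones there are 924 arrangements and 10 bits are needed, against 12 bits for the raw stream in both cases.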

The technical result is achieved when storing or transmitting the values N_c and n₁, n₂, …, n_p requires less information than storing or transmitting the original data stream.

To recover the data stream, the selected algorithm, the assigned digital code N_c and the values n₁, n₂, …, n_p are used to find the specific permutation with repetitions of p different symbols with counts n₁, n₂, …, n_p that corresponds to the original data stream.
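The patent leaves the choice of algorithm open. One possible choice, shown here only as a sketch, numbers the arrangements in lexicographic order and computes the number directly instead of enumerating them. The names arrangements, rank and unrank are hypothetical, and the lexicographic ordering is an assumption of this sketch.

```python
import math


def arrangements(counts):
    """Number of permutations with repetitions for a dict of symbol counts."""
    total = math.factorial(sum(counts.values()))
    for c in counts.values():
        total //= math.factorial(c)
    return total


def rank(stream):
    """Compression side: sequence number N_c of the stream (as an integer)."""
    counts = {}
    for s in stream:
        counts[s] = counts.get(s, 0) + 1
    symbols = sorted(counts)
    code = 0
    for s in stream:
        for smaller in symbols:
            if smaller == s:
                break
            if counts[smaller] > 0:
                # skip every arrangement that has a smaller symbol here
                counts[smaller] -= 1
                code += arrangements(counts)
                counts[smaller] += 1
        counts[s] -= 1
    return code


def unrank(code, counts):
    """Recovery side: rebuild the stream from N_c and the symbol counts."""
    counts = dict(counts)
    out = []
    for _ in range(sum(counts.values())):
        for s in sorted(counts):
            if counts[s] == 0:
                continue
            counts[s] -= 1
            block = arrangements(counts)
            if code < block:
                out.append(s)
                break
            code -= block
            counts[s] += 1
    return "".join(out)


n_c_value = rank("000100000000")
print(format(n_c_value, "b"))
print(unrank(n_c_value, {"0": 11, "1": 1}))
```

With this ordering the stream 000100000000 receives the number 8, binary 1000, and unrank restores the stream from that number and the counts of zeros and ones.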

To apply the method to data that has already been compressed and has high entropy, the occurrences of each symbol in the data stream to be compressed, consisting of p different symbols, are counted and denoted n₁, n₂, …, n_p. An algorithm A is then selected for assigning non-repeating digital codes to all possible permutations with repetitions of an arbitrary number of symbols and for finding, from an assigned digital code and the count of each symbol, the corresponding permutation with repetitions. The specific data stream of p different symbols with counts n₁, n₂, …, n_p is then assigned a specific digital code N_c in accordance with algorithm A.

Various algorithms are possible for finding N_c and for restoring the original data stream from the values N_c, n₁, n₂, …, n_p. The best algorithm is the one that requires the fewest permutations with repetitions, or the fewest computational operations, to find N_c and to restore the original stream from N_c, n₁, n₂, …, n_p. Data compression in the claimed method is achieved because writing down the number of permutations with repetitions of p different symbols with counts n₁, n₂, …, n_p, which is given by the formula (n₁ + n₂ + … + n_p)! / (n₁!·n₂!·…·n_p!) (M. Hall, "Combinatorics", Moscow: Mir, 1970, p. 13), always requires less information (for example, fewer bits in the case of a two-symbol alphabet) than writing down the uncompressed data stream. That is, log_k[(n₁ + n₂ + … + n_p)! / (n₁!·n₂!·…·n_p!)] < n₁ + n₂ + … + n_p, where n₁, …, n_p are the counts of each symbol from 1 to p and log_k is the logarithm to base k.

Next, the total number of characters in the digital code N_c is counted and denoted n_c. An algorithm B is then selected for expressing the values n₁, n₂, …, n_p through the value n_c and new values d₁, d₂, …, d_p, and for finding n₁, n₂, …, n_p from n_c, d₁, d₂, …, d_p. The assigned digital code N_c and the values d₁, d₂, …, d_p are then stored.

Various algorithms are possible for expressing the values n₁, n₂, …, n_p through n_c and new values d₁, d₂, …, d_p, and for finding n₁, n₂, …, n_p from n_c, d₁, d₂, …, d_p. The purpose of this algorithm is to reduce the amount of information spent on storing and transmitting the values n₁, n₂, …, n_p. Because previously compressed data is more disordered than uncompressed data, that is, has higher entropy, in many cases n₁, n₂, …, n_p have similar values and can be stored more efficiently. For example, with a two-symbol alphabet the values n₀ and n₁, where n₀ is the number of zeros and n₁ is the number of ones in the data stream, may differ from n_c by similar amounts, so storing the new values instead of n₁, n₂, …, n_p is more economical.

To recover the data stream, the total number of characters in the digital code N_c is counted and denoted n_c. The values n₁, n₂, …, n_p are then found in accordance with algorithm B from n_c, d₁, d₂, …, d_p. After that, in accordance with algorithm A, the code N_c and the values n₁, n₂, …, n_p, the specific permutation with repetitions of p different symbols with counts n₁, n₂, …, n_p that corresponds to the original data stream is found.
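For a two-symbol alphabet, the abstract gives one concrete form of algorithm B: d₁ is the sum of n₀ and n₁ minus n_c, and d₂ is half the difference between n₀ and n₁. The sketch below applies those two formulas and inverts them. The function names are hypothetical, and the use of exact fractions for d₂ (so that an odd difference is not rounded) is an assumption of this sketch, not something the text specifies.

```python
from fractions import Fraction


def pack_counts(n0, n1, n_c):
    """Compression side: replace (n0, n1) by (d1, d2)."""
    d1 = n0 + n1 - n_c            # stream length minus code length
    d2 = Fraction(n0 - n1, 2)     # half the difference of the counts
    return d1, d2


def unpack_counts(d1, d2, n_c):
    """Recovery side: n_c is re-counted from the stored code N_c."""
    total = d1 + n_c              # n0 + n1
    n0 = Fraction(total, 2) + d2
    n1 = Fraction(total, 2) - d2
    return int(n0), int(n1)


# Eleven zeros, a single one, and a four-character code such as 1000.
d1, d2 = pack_counts(11, 1, 4)
print(d1, d2)
print(unpack_counts(d1, d2, 4))
```

With n₀ = 11, n₁ = 1 and n_c = 4 this gives d₁ = 8 and d₂ = 5, and the inverse step returns the original pair (11, 1).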

Example 1 of a specific implementation of the method.

As an example of carrying out the claimed method of lossless data compression and recovery, and for the sake of clarity and simplicity, a binary alphabet consisting of the symbols zero and one is chosen. In the binary data stream to be compressed, for example 000100000000, the number of zeros is counted and denoted n₀, and the number of ones is counted and denoted n₁. In this case n₀ equals eleven and n₁ equals one. Then, in accordance with the selected algorithm, the specific binary stream 000100000000, consisting of eleven zeros and a single one, is assigned exactly one non-repeating binary code N_c, which is not used to encode any other binary number made up of the same number of zeros and ones.

Various algorithms can be used to find the binary number N_c. For example, permutations with repetitions of eleven zeros and a single one are formed, placing all the zeros first and then all the ones, and this permutation is assigned the first sequence number, zero. Further permutations with repetitions of eleven zeros and a single one are then generated; the resulting permutations and their assigned sequence numbers are stored in ascending order, and the process continues until the result matches the original uncompressed binary data stream. Alternatively, permutations with repetitions of eleven zeros and a single one are formed, placing all the ones first and then all the zeros, and this permutation is assigned the first sequence number, zero. Further permutations with repetitions are then generated; the resulting permutations and their assigned sequence numbers are stored in descending order, and the process continues until the result matches the original uncompressed binary data stream.

For this Example 1 of the claimed method of lossless data compression and recovery, the following algorithm is chosen to find the number N_c. Permutations with repetitions of eleven zeros and a single one are formed, starting with the arrangement in which all zeros come first and all ones last; this arrangement is assigned the first sequence number, zero. Further permutations with repetitions of eleven zeros and a single one are then generated. The permutations and their assigned sequence numbers are entered in Table 1 in ascending order. Permutations are generated until one coincides with the original uncompressed binary data stream.

Table 1. Permutations with repetitions of eleven zeros and a single one, with their sequence numbers in binary.

Column 1 (No.)   Column 2 (permutation)   Column 3 (sequence number, binary)
1                000000000001             0
2                000000000010             1
3                000000000100             10
4                000000001000             11
5                000000010000             100
6                000000100000             101
7                000001000000             110
8                000010000000             111
9                000100000000             1000

Next, the following values are stored: n₀, which equals 1011 in binary and occupies four bits; n₁, which equals 1 in binary and occupies one bit; and the binary number N_c, four bits long, which equals 1000 in binary, appears in column 3, row 9 of Table 1, and is the sequence number of the permutation with repetitions of eleven zeros and a single one that matches the data stream 000100000000.

Thus, nine bits (4 + 1 + 4) are needed to record the original uncompressed twelve-bit data stream.
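The ranking step of Example 1 can be sketched in Python. The sketch does not enumerate permutations one by one as the text describes; it swaps in a closed-form count based on binomial coefficients, which gives the same ascending sequence number. The names `rank_permutation` and `stored_bits` are hypothetical, and the bit widths simply follow the example's convention of writing each value in plain binary.

```python
from math import comb


def rank_permutation(bits: str) -> int:
    """Zero-based sequence number of `bits` among all strings with the
    same counts of zeros and ones, taken in ascending binary order."""
    n = len(bits)
    ones_left = bits.count("1")
    rank = 0
    for i, b in enumerate(bits):
        if b == "1":
            # Strings sharing this prefix but holding a 0 here are smaller;
            # they place all remaining ones in the n - i - 1 later positions.
            rank += comb(n - i - 1, ones_left)
            ones_left -= 1
    return rank


def stored_bits(bits: str) -> int:
    """Bits for n0, n1 and N_c written in plain binary
    (assumption: a value of zero is counted as one bit)."""
    n1 = bits.count("1")
    n0 = len(bits) - n1
    values = (n0, n1, rank_permutation(bits))
    return sum(max(1, v.bit_length()) for v in values)


stream = "000100000000"
print(format(rank_permutation(stream), "b"))  # 1000
print(stored_bits(stream))                    # 9
```

For the stream of Example 1 this reproduces N_c = 1000 and the nine-bit total.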

To restore the original data stream, a permutation of length n₀ + n₁ bits, that is, 12 bits, is formed with all zeros first (here eleven zeros) followed by all ones (here a single one). This gives the permutation 000000000001. Permutations with repetitions of eleven zeros and a single one are then generated; the results are treated as binary numbers and generated in ascending order, and N_c such permutations are performed. The result of the last permutation, number N_c, corresponds to the uncompressed data stream. The restored data stream is shown in column 3, row 9 of Table 2.

Table 2. Sequence numbers in binary, with the corresponding permutations with repetitions of eleven zeros and a single one.

Column 1 (No.)   Column 2 (sequence number, binary)   Column 3 (permutation)
1                0                                    000000000001
2                1                                    000000000010
3                10                                   000000000100
4                11                                   000000001000
5                100                                  000000010000
6                101                                  000000100000
7                110                                  000001000000
8                111                                  000010000000
9                1000                                 000100000000
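The restoration procedure can be sketched as a loop that steps to the next larger permutation N_c times. This is a minimal illustration that assumes the ascending ordering of Example 1; the helper names `next_permutation` and `restore` are hypothetical. Stepping one permutation at a time is practical only for small N_c.

```python
def next_permutation(a: list) -> bool:
    """Rearrange `a` in place into the next larger permutation.
    Return False if `a` is already the largest one."""
    i = len(a) - 2
    while i >= 0 and a[i] >= a[i + 1]:
        i -= 1
    if i < 0:
        return False
    j = len(a) - 1
    while a[j] <= a[i]:
        j -= 1
    a[i], a[j] = a[j], a[i]
    a[i + 1:] = reversed(a[i + 1:])
    return True


def restore(n0: int, n1: int, code: int) -> str:
    """Start from all zeros followed by all ones (sequence number 0)
    and advance `code` times in ascending order."""
    bits = ["0"] * n0 + ["1"] * n1
    for _ in range(code):
        if not next_permutation(bits):
            raise ValueError("code exceeds the number of permutations")
    return "".join(bits)


print(restore(11, 1, 0b1000))  # 000100000000
```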

Example 2 of a specific implementation of the method.

The data stream of Example 1 can also be compressed in the following way. Permutations with repetitions of eleven zeros and a single one are formed, starting with the arrangement in which all zeros come first and all ones last; this arrangement is assigned the first sequence number, zero. Further permutations with repetitions of eleven zeros and a single one are then generated.

The permutations and their assigned sequence numbers are entered in Table 1 in ascending order. Permutations are generated until one coincides with the original uncompressed binary data stream.

To store the compressed data, one of the possible algorithms is chosen for determining the values n₀ and n₁ from n_c and the new values d₁ and d₂.

In this example, instead of storing the values N_c, n₀ and n₁, the following values are stored:

N_c;

d₁ = n₁ + n₀ - n_c;

d₂ = (n₁ + n₀)/2 - n₁.

To restore the original data stream, the total number of characters in the digital code N_c is read and denoted n_c. The values n₁ and n₀ are then found as follows:

- n₁ + n₀ = d₁ + n_c;

- n₁ = (n₁ + n₀)/2 - d₂;

- n₀ = d₁ + n_c - n₁.
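The relations above can be checked with a short Python sketch. It is only an illustration: the function names `pack_counts` and `unpack_counts` are hypothetical, and the sketch assumes that n₀ + n₁ is even so that d₂ is a whole number, a case the text does not discuss.

```python
def pack_counts(n0: int, n1: int, n_c: int) -> tuple:
    """Compute d1 and d2 from the counts and the code length n_c."""
    total = n0 + n1
    if total % 2:
        # Assumption of this sketch: an odd total would make d2 fractional.
        raise ValueError("n0 + n1 must be even in this sketch")
    d1 = total - n_c
    d2 = total // 2 - n1
    return d1, d2


def unpack_counts(d1: int, d2: int, n_c: int) -> tuple:
    """Recover n0 and n1 from d1, d2 and the code length n_c."""
    total = d1 + n_c          # n1 + n0 = d1 + n_c
    n1 = total // 2 - d2      # n1 = (n1 + n0)/2 - d2
    n0 = total - n1           # n0 = d1 + n_c - n1
    return n0, n1


d1, d2 = pack_counts(11, 1, 4)
print(d1, d2)                    # 8 5
print(unpack_counts(d1, d2, 4))  # (11, 1)
```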

Then a permutation of length n₀ + n₁ bits, that is, 12 bits, is formed with all zeros first (here eleven zeros) followed by all ones (here a single one). This gives the permutation 000000000001. Permutations with repetitions of eleven zeros and a single one are then generated; the results are treated as binary numbers and generated in ascending order, and N_c such permutations are performed. The result of the last permutation, number N_c, corresponds to the uncompressed data stream. The restored data stream is shown in column 3, row 9 of Table 2.

In this embodiment of the lossless data compression and recovery method:

N_c = 1000, and the binary number N_c is four bits long;

d₁ = n₁ + n₀ - n_c = 12 - 4 = 8, and the binary number d₁ is five bits long;

d₂ = (n₁ + n₀)/2 - n₁ = (1 + 11)/2 - 1 = 5, and the binary number d₂ is three bits long.

Thus, with this way of storing the binary numbers, 4 + 5 + 3 = 12 bits are needed to restore the data stream; that is, recording the compression results takes as many bits as the uncompressed data stream is long. In this particular example it is therefore preferable to store the values N_c, n₀ and n₁ directly.

Thus, the choice of how to store the compression results depends on the specific data stream and can easily be determined for it.

Let us compare these results with theoretical calculations that indicate the maximum possible compression of a particular data stream.

According to theory (D. Salomon, "Compression of Data, Images and Sound", Moscow: Tekhnosfera, 2004, pp. 26, 75), the maximum compression of a sequence of eleven zeros and a single one that can theoretically be achieved by arithmetic coding is -log₂[(1/12)¹ · (11/12)¹¹] ≈ 5 bits. At least four more bits are needed to store the number of zeros and one bit to store the number of ones, which are required to restore the data stream. In total, at least 10 bits are needed to store the compression results, and the compression ratio K_c of this binary data stream is:

K_c (arithmetic coding) = 10/12 = 0.833.
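The arithmetic-coding estimate above can be reproduced numerically. The following Python sketch is a hedged illustration: it evaluates the ideal code length from the symbol frequencies and adds the bits for the two counts as the text does; the variable names are hypothetical.

```python
from math import ceil, log2

n0, n1 = 11, 1
n = n0 + n1

# Ideal code length: minus log2 of the product of the symbol probabilities.
ideal_bits = -(n1 * log2(n1 / n) + n0 * log2(n0 / n))
payload_bits = ceil(ideal_bits)

# Bits for the counts of zeros and ones, written in plain binary.
count_bits = n0.bit_length() + n1.bit_length()
total_bits = payload_bits + count_bits

print(round(ideal_bits, 2))      # a little under 5 bits
print(total_bits)                # 10
print(round(total_bits / n, 3))  # 0.833
```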

In Example 1, 9 bits were needed to record this sequence, and the compression ratio of the stream is:

K_c (claimed method) = 9/12 = 0.75.

Thus, the claimed method outperforms arithmetic coding in degree of compression.

To illustrate the advantage of storing the values N_c, d₁ and d₂ rather than the three binary numbers N_c, n₀ and n₁, consider compressing a binary data stream 65536 bits long that consists of n₀ = 2¹⁵ = 32768 zeros and n₁ = 2¹⁵ = 32768 ones. Because the numbers of ones and zeros in this stream are equal, the stream cannot be compressed by arithmetic coding.

Recording the number of permutations with repetitions of 32768 zeros and 32768 ones requires:

log₂[(n₀ + n₁)!/(n₀! n₁!)] = 65528 bits (rounded up to a whole bit).

Recording the binary number d₁ = n₁ + n₀ - n_c = 32768 + 32768 - 65528 = 8 requires, allowing for the case where d₁ is zero, log₂(8 + 1) = 4 bits (rounded up). Recording the binary number d₂ = (n₁ + n₀)/2 - n₁ = 0 requires one bit. Thus, recording the original data stream requires 65528 + 4 + 1 = 65533 bits, which is 3 bits shorter than the original uncompressed data stream. If the values N_c, n₀ and n₁ were stored instead, log₂(N_c) + log₂(n₀) + log₂(n₁) = 65528 + 15 + 15 = 65558 bits would be required, which is 22 bits longer than the original data stream.
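The bit counts in this example can be recomputed exactly with Python's arbitrary-precision integers. This is only a sketch of the arithmetic in the text: the variable names are hypothetical, and the 15 bits per count follow the text's own use of log₂(32768) = 15.

```python
from math import comb

n0 = n1 = 2**15
n = n0 + n1

# Length in bits of the largest possible sequence number N_c.
n_c = (comb(n, n0) - 1).bit_length()

d1 = n - n_c
d2 = n // 2 - n1
d1_bits = max(1, d1.bit_length())
d2_bits = max(1, d2.bit_length())

with_d = n_c + d1_bits + d2_bits
# The text counts log2(32768) = 15 bits for each of n0 and n1.
with_counts = n_c + 2 * (n0.bit_length() - 1)

print(n_c, d1, d2)          # 65528 8 0
print(with_d, n - with_d)   # 65533 3
print(with_counts)          # 65558
```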

It should be noted that the claimed method of lossless data compression and recovery, given its high efficiency in terms of compression ratio, including for high-entropy data, can be used to compress data that has already been compressed by other methods. In addition, the claimed method allows data already compressed by this same method to be compressed repeatedly; whenever the cost of storing the compressed data is less than the size of the original uncompressed data stream, applying the claimed method will be effective.

The claimed method of lossless data compression and recovery can be applied to compress and subsequently restore, without loss, any type of data, for example image files, video files, database files and other data types. It may be especially relevant where data must be compressed in advance, the time needed for compression is not critical, and the compressed data is then transmitted over communication channels, including channels with a low data rate. In this case, repeated compression of the data can significantly reduce the time needed to transfer large volumes of data.

The claimed method is also effective for compressing data in real time. In this case, the data stream to be compressed can be divided into segments of known length, each of which the claimed method can compress in an acceptable time given the computing power available.

To speed up compression and recovery, the claimed method of lossless data compression and recovery can use algorithms that process data in parallel. Parallel data processing will also allow the claimed method to be used for compression and subsequent recovery of data in real time, for example when transmitting voice or video traffic over communication networks.

Claims

A method for lossless compression and recovery of binary data, in which the number of zeros in the data stream to be compressed is counted and denoted n₀ and the number of ones is counted and denoted n₁; an algorithm is then selected for assigning non-repeating digital codes to all possible permutations with repetitions of n₀ zeros and n₁ ones and for finding the corresponding permutation with repetitions from the assigned digital code and the number of each character; a specific digital code N_c is then assigned, in accordance with the selected algorithm, to the specific data stream of n₀ zeros and n₁ ones; characterized in that the total number of characters in the digital code N_c is counted and denoted n_c; the value d₁, equal to the sum of n₁ and n₀ minus n_c, and the value d₂, equal to half the difference between n₀ and n₁, are determined, after which the assigned digital code N_c and the values d₁ and d₂ are stored; and, to restore the data stream, the value n₁, equal to half the sum of d₁ and n_c minus d₂, and the value n₀, equal in turn to the sum of d₁ and n_c minus n₁, are first found from the stored values d₁ and d₂ and the calculated value n_c, which equals the length of the digital code N_c; then, in accordance with the selected algorithm, the specific permutation with repetitions of n₀ zeros and n₁ ones that corresponds to the original data stream is found from the values n₀, n₁ and N_c.