RU2417424C1

RU2417424C1 - Method of compressing multi-dimensional data for storing and searching for information in database management system and device for realising said method

Info

Publication number: RU2417424C1
Application number: RU2009149705/08A
Authority: RU
Inventors: Вадим Митрофанович Мельников (RU); Сергей Павлович Маркин (RU)
Original assignee: Закрытое акционерное общество Научно-производственное предприятие "Реляционные экспертные системы"
Priority date: 2009-12-30
Filing date: 2009-12-30
Publication date: 2011-04-27

Abstract

FIELD: information technology.

SUBSTANCE: the method can be used for any type of loaded information, including text and numerical information. It compresses indexes, data and repeating parts of elements using a generated group dictionary, and compresses repeating parts of text strings, dates and other data types using byte and bit templates. Repeating values within one column are compressed using a generated dictionary of unique values. The algorithm provides efficient decompression of input-stream elements owing to an analysis step and dynamic code generation for the decompression library. Queries with aggregate functions (sum, average, minimum, maximum, count) are optimised by using indexes.

EFFECT: increased storage volume and high reliability of information in the database.

11 cl, 10 dwg

Description

The invention relates to computer engineering, in particular to a method of compressing multidimensional data for storing, searching and analysing information in a database management system and to a device for implementing it. It can be used to increase recording capacity and to reliably store and retrieve various kinds of information in a database management system (DBMS), to support data warehouses, and in other cases of denormalized data storage.

Over recent decades, with the development of computing and the spread of computer technology, the task of storing large volumes of diverse information, searching and analysing the required information quickly, and managing the databases that hold it has become increasingly pressing. Most database vendors focus on supporting products of this class. One line of development is the processing and storage of large volumes of diverse information (text, numeric, date, etc.) for analysis. Various approaches to organising the analysis process are used that allow it to run on hardware with fairly modest resource requirements. This has become possible thanks to compression mechanisms and special methods for processing information in compressed form.

A known technical solution [RF patent No. 2270528, "Method for compressing and restoring digital data and device for its implementation", H04N 1/417] increases the recording capacity of a DVD (digital video disc: a high-capacity disc that provides high-quality playback of video and audio recorded on it in encoded (digital) form) or a CD (compact disc) by compressing the digital data. During recording, a floating compression ratio from 2 to 255 is formed by comparing the values of the stream codes, counting the number of equal consecutive codes, forming the binary code of that number and inserting it into the stream after the first code of the run, while the counted codes are removed from the run. During restoration, the code giving the number of equal codes is detected in the stream and decoded, and the first code is emitted as many times as codes were removed during compression.
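The run-counting scheme described above can be illustrated with a minimal run-length sketch. The framing is an assumption: the text does not say how the decoder tells a count from a data code, so this sketch always writes a count byte (1 to 255) after each code. The function names are hypothetical.

```python
def rle_encode(data: bytes) -> bytes:
    """Replace each run of equal bytes with (byte, run length)."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Count equal consecutive codes; a run is capped at 255 (one byte).
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((data[i], run))  # assumption: count always follows the code
        i += run
    return bytes(out)


def rle_decode(packed: bytes) -> bytes:
    """Emit each stored code as many times as its count says."""
    out = bytearray()
    for k in range(0, len(packed), 2):
        out += bytes([packed[k]]) * packed[k + 1]
    return bytes(out)
```

With this framing a run of up to 255 equal codes shrinks to two bytes, while data without repeats doubles in size, which is one reason such a scheme suits media streams better than mixed database columns.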

The invention of RF patent No. 2270528 cannot be used for analysis in DBMSs that store very large volumes of diverse information (text, numeric, date, etc.).

Also known is the invention of the patent [US 6317867 B1, 11/2001, "Method and system for clustering instructions within executable code for compression"], in which the compression of executable files is based on searching for bit patterns.

The invention of US patent No. 6317867 cannot be used for analysis in DBMSs that store large volumes of diverse information (text, numeric, date, etc.).

The closest technical solution to the claimed invention is the invention of [US patent No. 6122628, "Multidimensional data clustering and dimensional reduction for indexing and searching"]. The prototype method is intended for compressing multidimensional data in which each element is represented as a numeric n-dimensional vector, as follows from its use of matrix algebra to transform vectors. The block diagram of the device that carries out the prototype method is shown in FIG. 1 of the description of US patent No. 6122628, and FIG. 6 of that description illustrates the algorithm of the compression unit.

To allow the claimed invention to be compared with the prototype, the description and drawings of US patent No. 6122628 are attached to the materials of this application.

Let us consider the prototype method in more detail. It consists in the following:

- the initial set of vectors is divided into clusters (using known methods and knowledge of the nature of the data; for example, the Euclidean metric can be used to determine distance);

- for each cluster, its characteristics are stored, in particular its boundaries;

- for each cluster, the matrix of eigenvalues and the transformation matrix for the singular value decomposition are computed;

- the set of eigenvectors is sorted in descending order of absolute value;

- the dimensions that correspond to the eigenvectors with the smallest eigenvalues are removed, and the accuracy of query execution is tested on a random sample;

- the original elements of the cluster are transformed by matrix multiplication, and the coordinates that correspond to the removed dimensions are dropped (this is where compression occurs, since not all components of each vector are stored). An error in the representation of a vector in the index is tolerated as long as search operations still execute correctly;

- the result of the transformation, the transformation matrix and the eigenvectors are stored in the current node for the inverse transformation during search;

- the clustering and dimensionality reduction procedure is applied recursively to the elements of the cluster;

- the clustering process is stopped when the error exceeds the threshold at the next stage of dimensionality reduction;

- queries that search for a vector by equality and by an "in range" condition on vector components are executed. For this, the index tree is searched recursively, pruning the branches that cannot contain the required information, as determined from the boundaries of the cluster that corresponds to each branch.
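The last step of the prototype, the recursive tree search with pruning by cluster boundaries, can be sketched as follows. This is an illustration only: it leaves out the singular value decomposition and dimensionality reduction entirely, and the `Node` layout and `range_search` name are assumptions, not taken from US patent No. 6122628.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple  # per-dimension minimum of every vector under this node
    hi: tuple  # per-dimension maximum of every vector under this node
    vectors: list = field(default_factory=list)   # vectors stored at this node
    children: list = field(default_factory=list)  # sub-clusters


def range_search(node, qlo, qhi):
    """Return vectors whose every component lies in [qlo[i], qhi[i]]."""
    # Prune: if the cluster box misses the query box in any dimension,
    # nothing below this node can match.
    if any(h < ql or l > qh
           for l, h, ql, qh in zip(node.lo, node.hi, qlo, qhi)):
        return []
    hits = [v for v in node.vectors
            if all(ql <= x <= qh for x, ql, qh in zip(v, qlo, qhi))]
    for child in node.children:
        hits += range_search(child, qlo, qhi)
    return hits
```

An equality search is the special case `qlo == qhi`.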

FIG. 1 of the description of US patent No. 6122628 shows the block diagram of the device that carries out the prototype method. The prototype device comprises:

a user request generation unit (101) for generating user requests,

a network (102) through which the user request generation unit communicates with the data processing unit and the database system control unit,

a unit for analytical processing of user requests (103),

a unit for processing relational user requests (104),

data storage units (105),

a multi-index processing unit (107),

an index generation unit (110),

an index storage unit (108),

a data cluster information generation unit (111),

a compression unit (112).

The device operates as follows:

- users form a request through the user request generation unit (101);

- the request is passed through the network (102) to the unit for analytical processing of user requests (103) or to the unit for processing relational user requests (104), depending on the type of request;

- the unit for analytical processing of user requests (103) executes the request with the help of the unit for processing relational requests (104);

- the unit for processing relational user requests accesses data through the data storage units (105);

- the unit for processing relational requests accesses indexes through the multi-index processing unit (107);

- the multi-index processing unit (107) uses the index storage unit (108);

- the index storage unit (108) stores the indexes produced when the index generation unit (110) processes the data;

- the index generation unit (110) builds indexes with the help of the index storage unit (108), the data cluster information generation unit (111) and the compression unit (112);

- the cluster information generation unit (111) uses the index storage unit (108) to store information about clusters;

- the compression unit (112) uses the index storage unit (108) to store information about the dimensionality reduction (compression).

The invention of US patent No. 6122628 has the following disadvantages:

- the prototype method and device are not applicable to text information;

- they compress only the indexes they build, not the data, and in the indexes the data may be represented with an error;

- they cannot compress repeating parts of text fields;

- they use a complex algorithm for computing the singular value decomposition;

- they make several passes over the information (for clustering at each level), which also makes the algorithm complex to implement;

- packing and unpacking take a long time because matrix multiplication is performed;

- they do not optimise queries with aggregate functions (sum, average, minimum, maximum, count), i.e. they target only equality or fuzzy search.

These disadvantages of the invention of US patent No. 6122628 prevent its use in modern general-purpose database management systems (DBMS) to increase recording capacity and to reliably store and retrieve various information (including text), to support data warehouses, and in other cases of denormalized data storage.

The objective of the claimed invention is to increase the compression ratio of multidimensional data without loss of their useful function, to increase the volume of stored information, and to increase the reliability of storing and retrieving various information in databases (DB).

The problem is solved by the claimed method of compressing multidimensional data for storing and searching for information in a database management system, which consists in the following:

each input stream of multidimensional data, consisting of elements of identical structure, where each element is represented by a set of fields with given values and the values of one field across different elements form a column of values, is split into rows, each of which is a structured set of columns, each column having its own data type (text, numeric or date), and each data type being represented by a corresponding sequence of bytes or sequence of bits,

the functional purpose of each column is determined, thus forming at least three data groups:

search, aggregate and informational,

columns that serve to restrict the data selection are assigned to the search data group,

columns that represent a numeric measure and are used in data selection conditions during search are assigned to the aggregate data group,

columns with long text fields that do not take part in selection conditions during search and do not represent a numeric measure are assigned to the informational data group,

information about the formed data groups is stored,

the data in the formed data groups are analysed and, based on the results of the analysis, the formed data groups are compressed, for which:

a group dictionary is formed for the columns of the search data group, each element of the group dictionary consisting of values that belong to the input multidimensional data stream,

using the formed group dictionary, the data contained in the columns of the search data group are analysed so that each input element of the data stream is looked up in the group dictionary: if such an element is already present, its repetition count is incremented; if it is absent, an element corresponding to the input element of the data stream is added,

when the group dictionary overflows, the number of distinct values is counted for each column whose values are part of a group dictionary element,

the column with the largest number of distinct, non-repeating values is removed from the group dictionary,

the group dictionary is rebuilt by merging the elements that differ in the values of the fields of the removed column, thereby compressing the group dictionary,
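The group dictionary steps above (count repetitions, and on overflow drop the column with the most distinct values and merge the elements that differed only in it) can be sketched as follows. This is a minimal illustration under stated assumptions: rows are Python dicts, "overflow" means exceeding a fixed `capacity`, and the function name is hypothetical. The text does not say what happens if the dictionary still overflows with a single column left; this sketch simply stops dropping columns.

```python
from collections import Counter


def build_group_dictionary(rows, columns, capacity):
    """Return (kept_columns, Counter mapping value tuples to repeat counts)."""
    kept = list(columns)
    groups = Counter()
    for row in rows:
        # Present element: its count grows; absent element: it is added.
        groups[tuple(row[c] for c in kept)] += 1
        while len(groups) > capacity and len(kept) > 1:
            # Count distinct values per column among dictionary elements.
            distinct = [len({key[i] for key in groups})
                        for i in range(len(kept))]
            drop = distinct.index(max(distinct))
            kept.pop(drop)
            # Merge elements that differed only in the removed column.
            merged = Counter()
            for key, n in groups.items():
                merged[key[:drop] + key[drop + 1:]] += n
            groups = merged
    return kept, groups
```

In the example data used to check this sketch, an `id` column that is unique per row is the first to be evicted, leaving a small dictionary over `region` and `year`; the evicted column would then be handled by the dictionary of unique values.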

for the columns of the search data group whose elements were not included in the group dictionary, a dictionary of unique values is formed, in which each value occurs once and has its own number,

for each set of unique values, two types of templates are formed: the first for data types in which the semantic parts of a value are represented by a sequence of bytes, the second for data types whose values are represented by a sequence of bits,

the first template is the general form of all values within the columns for data types in which the semantic parts of the values are represented by a record with fields whose size is a multiple of a byte,

the second template is the general form of all values within the columns for data types in which the semantic parts of the values are represented by a record with fields whose size is a multiple of a bit,

the template being identified by comparing the changes in byte or bit values, respectively,

using the formed templates, the data columns are compressed so that only the parts of the column values that do not belong to the template are stored, and the parts that match the template are removed.
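A byte template of the kind described above can be sketched as follows. The sketch assumes that all values of the column have the same length and treats a byte position as part of the template when every value has the same byte there; the function names and the use of `None` for varying positions are illustrative choices, not taken from the patent. A bit template would work the same way on individual bits.

```python
def build_byte_template(values):
    """Template entry = the shared byte, or None where values differ."""
    first = values[0]
    if any(len(v) != len(first) for v in values):
        raise ValueError("this sketch assumes equal-length values")
    return [b if all(v[i] == b for v in values) else None
            for i, b in enumerate(first)]


def compress(value, template):
    """Keep only the bytes that are not part of the template."""
    return bytes(b for b, t in zip(value, template) if t is None)


def decompress(packed, template):
    """Re-insert template bytes around the stored varying bytes."""
    it = iter(packed)
    return bytes(next(it) if t is None else t for t in template)
```

For the dates `2009-12-30`, `2009-11-05` and `2009-12-01` stored as ten-byte strings, seven positions are constant, so each value is stored in three bytes.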

the columns of the informational group are compressed by converting the values of the columns of this group, represented by a sequence of bytes, into a sequence represented by fewer bytes, by transforming runs of consecutive identical byte values,

information about the compression procedures applied to the informational, aggregate and search data groups and to the data in them is stored,

to perform the search procedure, indexes are formed depending on the column type and the number of distinct values in the dictionary of unique values, such that

for the columns of the search data group, indexes are formed that allow relations among the values of the column to be computed during search, provided the number of distinct values in the column does not exceed a given threshold that is determined by the system configuration,

for the columns of the aggregate data group, indexes are formed that, depending on the information requested by the user, allow the search to compute relations among the values of the column, or to sum the column values, or to find the minimum or maximum among them, or their count, or their average,

the formed indexes are stored,
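The text does not fix the structure of these indexes, so the following is only one possible reading. The search index is sketched as a sorted inverted index (value to row numbers) that is built only when the number of distinct values stays within a threshold, and the aggregate index as a precomputed summary from which sum, minimum, maximum, count and average follow. All names are hypothetical.

```python
import bisect
from collections import defaultdict


def build_search_index(column, threshold):
    """Map each distinct value to its row numbers, or None if too many."""
    postings = defaultdict(list)
    for row_no, value in enumerate(column):
        postings[value].append(row_no)
    if len(postings) > threshold:
        return None  # distinct values exceed the configured threshold
    return sorted(postings), dict(postings)


def rows_in_range(index, low, high):
    """Row numbers whose value v satisfies low <= v <= high."""
    keys, postings = index
    lo = bisect.bisect_left(keys, low)
    hi = bisect.bisect_right(keys, high)
    return sorted(r for k in keys[lo:hi] for r in postings[k])


def build_aggregate_index(column):
    """Precompute the summary that the aggregate functions need."""
    return {"count": len(column), "sum": sum(column),
            "min": min(column), "max": max(column)}
```

With such a summary, a whole-column query for an average reduces to `agg["sum"] / agg["count"]` without decompressing any rows.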

a decompression library is generated dynamically for each input stream of multidimensional data:

for which standard entry points are generated for the source code of the decompression library, and source code in a programming language is generated for unpacking the values of the columns included in the group dictionary,

for each column not included in the group dictionary, source code in a programming language is generated for unpacking the values of that column,

a compiler is launched to build the decompression library from the source code,

the decompression library is stored,

the decompression library is registered for the input stream of multidimensional data,

generation of the decompression library is completed,
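The generate, compile and register sequence above can be illustrated in miniature. In this sketch Python's built-in `compile` and `exec` stand in for launching an external compiler, an in-memory dict stands in for library storage and registration, and the generated decoder is specialised to one byte template (a list holding a fixed byte value or `None` per position). The entry-point name `decompress_value` and all other names are assumptions.

```python
def generate_source(template):
    """Emit decoder source with the template bytes baked in as constants."""
    lines = [
        "def decompress_value(packed):",  # the assumed standard entry point
        "    it = iter(packed)",
        "    out = bytearray()",
    ]
    for t in template:
        if t is None:
            lines.append("    out.append(next(it))")  # varying byte: read it
        else:
            lines.append(f"    out.append({t})")      # template byte: constant
    lines.append("    return bytes(out)")
    return "\n".join(lines) + "\n"


LIBRARIES = {}  # stream id -> compiled entry point (stands in for a registry)


def register_library(stream_id, source):
    """Compile the generated source and register it for one input stream."""
    namespace = {}
    exec(compile(source, f"<decompress:{stream_id}>", "exec"), namespace)
    LIBRARIES[stream_id] = namespace["decompress_value"]
```

Because the template is unrolled into straight-line code, the decoder for a stream does no template lookups at run time, which is the kind of saving that specialised generated code is meant to give.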

at the request of a user, which contains information about the columns of the data stream to be selected, or conditions restricting the selected set of rows, or aggregate functions, the stored data is restored, for which

the user's command is recognized by translation, that is, by lexical and semantic analysis and conversion into an internal representation (form), which is used to determine the data columns to be returned to the user,

the set of search columns used to filter the data, and the indexes corresponding to them, are determined,

the set of columns for which aggregate functions are called, and the indexes corresponding to them, are determined,

the extraction order and the data decompression procedure are determined, and the result of the user's query is computed using the determined extraction order and decompression procedure, for which:

when indexes are used, the results of the relational operations for the columns of the search data group are determined as element numbers of the original data stream,

when indexes are used, the results of the aggregation operation are determined for the columns of the aggregated group,

the stored data and the decompression library are located, and the data elements needed to execute the query are decompressed by calling the decompression library through the standard entry points, thereby restoring the input stream of multidimensional data that satisfies the search conditions, which is passed to the user.
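The generation, compilation, registration and calling of the decompression library described above can be sketched as follows. This is only a minimal illustration in Python, not the implementation of the invention: the names REGISTRY, generate_source, build_and_register and decompress, the layout of ctx, and the use of a unique-value dictionary for the columns outside the group dictionary are all assumptions made for the example, and Python's built-in compile and exec stand in for an external compiler.

```python
import types

# Hypothetical registry: stream id -> generated decompression library (a module object).
REGISTRY = {}

def generate_source(dict_columns, other_columns):
    """Emit source text: one unpack function per column plus a standard entry point.

    Column names are assumed to be valid Python identifiers.
    """
    src = []
    # Columns included in the group dictionary: the stored code selects a dictionary row.
    for pos, name in enumerate(dict_columns):
        src.append(f"def unpack_{name}(code, ctx):")
        src.append(f"    return ctx['group_dict'][code][{pos}]")
    # Columns not included in the group dictionary: assumed here to use a per-column
    # dictionary of unique values.
    for name in other_columns:
        src.append(f"def unpack_{name}(code, ctx):")
        src.append(f"    return ctx['unique'][{name!r}][code]")
    # Standard entry point with a fixed name and signature.
    src.append("def decompress(column, code, ctx):")
    src.append("    return globals()['unpack_' + column](code, ctx)")
    return "\n".join(src) + "\n"

def build_and_register(stream_id, dict_columns, other_columns):
    """Compile the generated source, store the library and register it for the stream."""
    source = generate_source(dict_columns, other_columns)
    library = types.ModuleType(f"decomp_{stream_id}")
    exec(compile(source, library.__name__, "exec"), library.__dict__)
    REGISTRY[stream_id] = library
    return library

# Usage: build a library for a stream, then call it through the entry point.
lib = build_and_register("sales", ["region", "product"], ["comment"])
ctx = {
    "group_dict": [("EU", "tea"), ("US", "coffee")],
    "unique": {"comment": ["ok", "late"]},
}
print(lib.decompress("product", 1, ctx))  # coffee
```

Because the caller only ever uses the fixed entry point, the query executor does not need to know which compression scheme was chosen for a given column.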

The columns of the search data group are represented, for example, by a sequence of bytes or a sequence of bits.

The columns of the aggregated data group are represented, for example, by a sequence of bytes or a sequence of bits.

The columns of the information data group are represented by a sequence of bytes.

The group dictionary is, for example, a hash table in which the number of elements does not exceed the capacity (cardinality) of a machine word, and each element consists of the combined elements of all columns except the columns of the information data group.
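A group dictionary of this kind can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function name build_group_dictionary and the max_entries limit of 2**16 (standing in for the machine-word bound) are hypothetical, and a Python dict plays the role of the hash table.

```python
def build_group_dictionary(rows, max_entries=2**16):
    """Replace each distinct combination of column values with a small integer code.

    rows: iterable of rows holding the values of all columns except those of the
    information data group. max_entries is an assumed stand-in for the
    machine-word limit mentioned in the text.
    """
    codes = {}     # hash table: tuple of values -> code
    entries = []   # code -> tuple of values, used when decompressing
    encoded = []   # one code per input row
    for row in rows:
        key = tuple(row)
        code = codes.get(key)
        if code is None:
            if len(entries) >= max_entries:
                raise OverflowError("group dictionary exceeds the assumed limit")
            code = len(entries)
            codes[key] = code
            entries.append(key)
        encoded.append(code)
    return entries, encoded

rows = [("EU", "tea"), ("US", "coffee"), ("EU", "tea")]
entries, encoded = build_group_dictionary(rows)
print(entries)  # [('EU', 'tea'), ('US', 'coffee')]
print(encoded)  # [0, 1, 0]
```

Each row is then stored as one code, and the original values are recovered by indexing entries with that code.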

The columns of the information data group, represented by a sequence of bytes, are compressed by replacing each run of consecutive identical byte values with a sequence of two elements, the count and the value itself, thereby converting the data into a sequence represented by fewer bytes.
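The replacement of runs by (count, value) pairs is ordinary run-length encoding and can be sketched as follows. The sketch is in Python; the one-byte count field, which caps a run at 255, is an assumption of the example and is not stated in the text.

```python
def rle_encode(data: bytes) -> bytes:
    """Replace each run of identical bytes with a (count, value) pair."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the next byte repeats; cap at 255 so the count fits in one byte.
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)

def rle_decode(packed: bytes) -> bytes:
    """Expand (count, value) pairs back into the original byte sequence."""
    out = bytearray()
    for k in range(0, len(packed), 2):
        out += bytes([packed[k + 1]]) * packed[k]
    return bytes(out)

packed = rle_encode(b"aaaabbc")
print(list(packed))        # [4, 97, 2, 98, 1, 99]
print(rle_decode(packed))  # b'aaaabbc'
```

The encoding only shrinks data that actually contains runs: a sequence with no repeated neighbours doubles in size, so a real system would apply it only where analysis shows a benefit.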

Bitmap indexes are built for the columns of the search group. Such an index is a structure of several bit sequences, each of which corresponds to one unique value of the column being indexed.
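A Bitmap index of this kind can be sketched as follows. The Python sketch uses arbitrary-precision integers as bit sequences, with bit r set when row r holds the value; the function names build_bitmap_index and rows_of are hypothetical.

```python
def build_bitmap_index(column):
    """Return {value: bitmap}, where bit r of the bitmap is set if row r holds the value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """List the row numbers whose bits are set in the bitmap."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]

colour = ["red", "blue", "red", "green"]
index = build_bitmap_index(colour)
print(bin(index["red"]))                         # 0b101
# A condition such as colour = 'red' OR colour = 'green' becomes a bitwise OR.
print(rows_of(index["red"] | index["green"]))    # [0, 2, 3]
```

Conditions combined with AND, OR and NOT map onto the corresponding bitwise operations, so the filter is evaluated without decompressing the column itself.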

Bit-sliced indexes are built for the columns of the aggregated group. Such an index consists of bit sequences, each formed by concatenating the i-th bit of every column value, and the number of sequences is determined by the number of bits needed to represent each value.
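The construction of a Bit-sliced index can be sketched as follows. The Python sketch assumes non-negative integer values and stores each slice as an integer whose bit r is the i-th bit of the value in row r; the names build_bit_sliced_index and value_at are hypothetical.

```python
def build_bit_sliced_index(values, width=None):
    """Return a list of slices; slice i holds the i-th bit of every value (bit r = row r).

    Values are assumed to be non-negative integers.
    """
    if width is None:
        # Number of bits needed to represent the largest value (at least one slice).
        width = max(values, default=0).bit_length() or 1
    slices = [0] * width
    for row, v in enumerate(values):
        for i in range(width):
            if (v >> i) & 1:
                slices[i] |= 1 << row
    return slices

def value_at(slices, row):
    """Reassemble the value of one row from its bits across all slices."""
    return sum(((s >> row) & 1) << i for i, s in enumerate(slices))

amounts = [5, 3, 6]              # binary 101, 011, 110
slices = build_bit_sliced_index(amounts)
print(slices)                    # [3, 6, 5]
print(value_at(slices, 2))       # 6
```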

If it is established that Bitmap indexes can be applied to the search procedure for the columns of the search group, the data requested by the user is searched for using the Bitmap indexes.

If it is established that Bit-sliced indexes can be applied to the search procedure for the columns of the aggregated group, the data requested by the user is searched for using the Bit-sliced indexes.
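How the two kinds of index can cooperate in an aggregate query can be sketched as follows. This Python sketch is an illustration under assumptions, not the procedure of the invention: the helper names bit_slices, popcount and sliced_sum are hypothetical, values are assumed to be non-negative integers, and the row selection is assumed to arrive as a bitmap in which bit r marks row r.

```python
def bit_slices(values, width):
    """Slice i is an integer whose bit r is the i-th bit of values[r]."""
    return [sum(((v >> i) & 1) << row for row, v in enumerate(values))
            for i in range(width)]

def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def sliced_sum(slices, selected):
    """SUM over the selected rows: each set bit in slice i contributes 2**i."""
    return sum(popcount(s & selected) << i for i, s in enumerate(slices))

amounts = [5, 3, 6]
region = ["EU", "US", "EU"]

slices = bit_slices(amounts, width=3)
# Bitmap of rows where region == 'EU' (rows 0 and 2).
eu_rows = sum(1 << r for r, v in enumerate(region) if v == "EU")

total = sliced_sum(slices, eu_rows)
count = popcount(eu_rows)
print(total, count, total / count)  # 11 2 5.5
```

Sum, count and average are obtained here from bitwise ANDs and bit counts alone, without decompressing or visiting the individual values.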

The stated problem is also solved by the claimed device for compressing multidimensional data for storing and searching for information in a database management system, comprising

a user request generation unit,

a first network,

a multidimensional request processing unit,

a user request optimization and execution unit,

a compressed data storage unit,

a compressed index storage unit,

a multidimensional data and index compression unit,

a multidimensional data loading control unit,

a second network,

an operational multidimensional data storage unit,

a user response processing unit,

wherein the input of the user request generation unit is the first input of the device (the signal input); the output of the user request generation unit is connected through the first network to the first input of the multidimensional request processing unit, whose first output is connected to the first input of the user request optimization and execution unit, whose output is connected to the second input of the multidimensional request processing unit; the second input of the user request optimization and execution unit is connected to the output of the compressed data storage unit; the third input of the user request optimization and execution unit is connected to the output of the compressed index storage unit, whose input is connected to the first output of the multidimensional data and index compression unit; the input of the operational data storage unit is the second input of the device (the information input); the second output of the multidimensional request processing unit is connected through the first network to the input of the user response processing unit, whose output is the output of the device,

and wherein the multidimensional data and index compression unit contains an index generator,

according to the invention:

the output of the operational data storage unit is connected through the second network to the input of the multidimensional data loading control unit, whose output is connected to the input of the multidimensional data and index compression unit, whose second output is connected to the input of the compressed data storage unit,

the multidimensional data and index compression unit contains:

a node for splitting the input stream of multidimensional data,

a search data group generator with a memory element,

an aggregated data group generator with a memory element,

an information data group generator with a memory element,

a group dictionary generator,

a unique dictionary generator,

an information data group column compression node,

a first template generator,

a second template generator,

a data and index compression node,

a group dictionary compression node,

a first data column compression node,

a second data column compression node,

a decompression library generation node,

wherein the input of the node for splitting the input stream of multidimensional data is the input of the multidimensional data and index compression unit, the first output of the splitting node is connected to the input of the search data group generator with a memory element, its second output is connected to the input of the aggregated data group generator with a memory element, and its third output is connected to the input of the information data group generator with a memory element,

the first output of the search data group generator with a memory element is connected to the input of the group dictionary generator, its second output is connected to the input of the unique dictionary generator, and its third output is connected to the first input of the index generator, whose second input is connected to the output of the aggregated data group generator with a memory element,

the output of the group dictionary generator is connected to the first input of the group dictionary compression node, whose first output and second input are connected, respectively, to the first input and first output of the decompression library generation node,

the first and second outputs of the unique dictionary generator are connected, respectively, to the inputs of the first template generator and the second template generator,

the output of the first template generator is connected to the first input of the first data column compression node, whose first output and second input are connected, respectively, to the second input and second output of the decompression library generation node,

the output of the second template generator is connected to the first input of the second data column compression node, whose first output and second input are connected, respectively, to the third input and third output of the decompression library generation node,

the fourth input and fourth output of the decompression library generation node are connected, respectively, to the first output and second input of the data and index compression node, whose first input is connected to the output of the index generator,

the fifth input and fifth output of the decompression library generation node are connected, respectively, to the first output and second input of the information data group column compression node, whose first input is connected to the output of the information data group generator with a memory element,

the second outputs of the information data group column compression node, of the first and second data column compression nodes, and of the group dictionary compression node together form, via a bus, the second output of the multidimensional data and index compression unit,

the second output of the data and index compression node is the first output of the multidimensional data and index compression unit.

A comparative analysis of the claimed invention (method and device) against the prior art and its closest prototype shows that, owing to the proposed sequence of operations for compressing multidimensional data for storing and searching for information in a database management system, the claimed invention achieves a better technical effect, namely:

universality, since it is applicable to any type of loaded information (text, numeric, date, etc.), which is achieved by analyzing the internal representation of values as either a byte sequence or a bit sequence;

it compresses not only indexes but also data, which is achieved by the proposed compression procedure;

it compresses repeating parts of elements using the generated group dictionary;

it compresses repeating parts of text strings, dates and other data types using byte and bit templates;

it compresses repeating values within a single column using the generated dictionary of unique values;

the algorithm provides efficient decompression of input stream elements at the analysis stage and dynamic code generation for the decompression libraries;

it optimizes queries with aggregate functions (sum, average, minimum, maximum, count), which is achieved by using indexes, for example Bit-sliced indexes.
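Of the advantages listed above, compression with a byte template can be illustrated by the following sketch. It is only one possible reading, written in Python under explicit assumptions: the values of a column are assumed to have equal length, the template is assumed to record the byte positions at which all values agree, and the names build_byte_template, pack and unpack are hypothetical. The template formats actually used by the invention are not defined in this part of the description.

```python
def build_byte_template(values):
    """For equal-length byte strings, keep the shared byte at each position, or None if it varies."""
    first = values[0]
    return [first[i] if all(v[i] == first[i] for v in values) else None
            for i in range(len(first))]

def pack(value, template):
    """Store only the bytes at positions where the template has no shared byte."""
    return bytes(b for b, t in zip(value, template) if t is None)

def unpack(packed, template):
    """Merge the stored bytes back into the shared bytes of the template."""
    stored = iter(packed)
    return bytes(next(stored) if t is None else t for t in template)

dates = [b"2009-12-30", b"2009-11-05", b"2009-01-17"]
template = build_byte_template(dates)
packed = pack(dates[0], template)
print(packed)                    # b'1230'
print(unpack(packed, template))  # b'2009-12-30'
```

In this example the shared prefix "2009-" and the second hyphen are stored once in the template, so each ten-byte date is reduced to four bytes.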

Let us consider in more detail the features shared with the prototype and those that distinguish the invention from it. For clarity, we refer to Figs. 9 and 10 of the claimed device and to Fig. 1 of the description of US Patent No. 6122628.

For example, the user request generation unit 1, the network 2 and the multidimensional data loading control unit 8 of the claimed device (Fig. 9) are equivalent in general function to units 101, 102 and 103 of the prototype, respectively (description of US Patent No. 6122628). The multidimensional request processing unit 3, the user request optimization and execution unit 4 and the user response processing unit 10 are together equivalent in function to unit 104 of the prototype (description of US Patent No. 6122628). These units are therefore assigned, by their general function, to the preamble of the device claim.

The compressed data storage unit 5 and the operational data storage unit (server) 9 are also present in the prototype as the units numbered 105, but units 5 and 9 of the claimed device are grouped differently and have connections that differ from those of the prototype. The units are therefore assigned, by their general function, to the preamble of the claim, while the new connections are placed in its characterizing part: the second output of the multidimensional data and index compression unit 7 is connected to the input of the compressed data storage unit 5, and the input of the multidimensional data and index compression unit 7 is connected to the output of the multidimensional data loading control unit 8 (in the prototype, the analogous unit 103 is connected to the data storage units through unit 104).

Since units 107, 108, 110, 111 and 112 of the prototype build and process indexes, store them, and cluster data by means of indexes, the compressed index storage unit 6 and the index generator 11, as well as the connection in which the first output of the compression unit 7 is connected to the input of the compressed index storage unit 6, are assigned to the preamble of the claim by their general function. In this part, however, the claimed device operates according to a different algorithm and makes it possible to use any known indexes effectively for compressing multidimensional data, unlike the prototype, which compresses only the indexes developed by its authors rather than the data, and in whose indexes the data may be represented with an error.

The main distinguishing feature of the claimed device is the proposed structure of the multidimensional data and index compression unit, which, together with the other units of the device and the connections between them, implements all the features of the claimed method and accomplishes the stated task: increasing the degree of compression of multidimensional data without loss of their useful function, increasing the volume of stored information, and improving the reliability of storing and searching for various information in databases.

All of the above advantages make it possible to use the claimed method in general-purpose database management systems with a high degree of reliability and responsiveness.

The description of the claimed invention is further illustrated by embodiments and drawings.

Fig. 1 illustrates the algorithm for compressing multidimensional data (in general form) according to the claimed invention.

Fig. 2 shows the procedure for splitting the input stream of multidimensional data.

Fig. 3 shows the procedure for forming the group dictionary and compressing it.

Fig. 4 illustrates the algorithm for analyzing the data in the formed data groups.

Fig. 5 shows the algorithm of the compression procedure.

Fig. 6 shows the algorithm for building the indexes used when a user searches for information.

Fig. 7 shows the algorithm for dynamic generation of the decompression library.

Fig. 8 illustrates the general scheme of use of the invention when a user submits a request to the system to search for information.

Fig. 9 is a block diagram of the claimed device.

Fig. 10 is a block diagram of the multidimensional data and index compression unit.

The claimed device for compressing multidimensional data for storing and searching for information in a database management system (Figs. 9 and 10) contains:

1 - user request generation unit,

2₁ - first network,

2₂ - second network,

3 - multidimensional request processing unit,

4 - user request optimization and execution unit,

5 - compressed data storage unit,

6 - compressed index storage unit,

7 - multidimensional data and index compression unit,

8 - multidimensional data loading control unit,

9 - operational data storage unit,

10 - user response processing unit,

wherein the input of the user request generation unit 1 is the first input of the device (the signal input); the output of the user request generation unit 1 is connected through the first network 2₁ to the first input of the multidimensional request processing unit 3, whose first output is connected to the first input of the user request optimization and execution unit 4, whose output is connected to the second input of the multidimensional request processing unit 3; the second input of the user request optimization and execution unit 4 is connected to the output of the compressed data storage unit 5; the third input of the user request optimization and execution unit 4 is connected to the output of the compressed index storage unit 6, whose input is connected to the first output of the multidimensional data and index compression unit 7; the input of the operational data storage unit 9 is the second input of the device (the information input); the second output of the multidimensional request processing unit 3 is connected through the first network 2₁ to the input of the user response processing unit 10, whose output is the output of the device,

and wherein the multidimensional data and index compression unit 7 contains an index generator 11,

according to the invention:

the output of the operational data storage unit 9 is connected through the second network 2₂ to the input of the multidimensional data loading control unit 8, whose output is connected to the input of the multidimensional data and index compression unit 7, whose second output is connected to the input of the compressed data storage unit 5,

the multidimensional data and index compression unit 7 contains:

12 - node for splitting the input stream of multidimensional data,

13 - search data group generator with a memory element,

14 - aggregated data group generator with a memory element,

15 - information data group generator with a memory element,

16 - group dictionary generator,

17 - unique dictionary generator,

18 - information data group column compression node,

19 - first template generator,

20 - second template generator,

21 - data and index compression node,

22 - group dictionary compression node,

23 - first data column compression node,

24 - second data column compression node,

25 - decompression library generation node,

wherein the input of input multidimensional data stream splitting node 12 is the input of multidimensional data and index compression unit 7; the first output of splitting node 12 is connected to the input of search data group former with a memory element 13, its second output is connected to the input of aggregated data group former with a memory element 14, and its third output is connected to the input of information data group former with a memory element 15,

the first output of search data group former with a memory element 13 is connected to the input of group dictionary former 16, its second output is connected to the input of unique dictionary former 17, and its third output is connected to the first input of index former 11, whose second input is connected to the output of aggregated data group former with a memory element 14,

the output of group dictionary former 16 is connected to the first input of group dictionary compression node 22, whose first output and second input are connected, respectively, to the first input and the first output of decompression library generation node 25,

the first and second outputs of unique dictionary former 17 are connected, respectively, to the inputs of first template former 19 and second template former 20,

the output of first template former 19 is connected to the first input of first data column compression node 23, whose first output and second input are connected, respectively, to the second input and the second output of decompression library generation node 25,

the output of second template former 20 is connected to the first input of second data column compression node 24, whose first output and second input are connected, respectively, to the third input and the third output of decompression library generation node 25,

the fourth input and the fourth output of decompression library generation node 25 are connected, respectively, to the first output and the second input of data and index compression node 21, whose first input is connected to the output of index former 11,

the fifth input and the fifth output of decompression library generation node 25 are connected, respectively, to the first output and the second input of information data group column compression node 18, whose first input is connected to the output of information data group former with a memory element 15,

the second outputs of information data group column compression node 18, of first 23 and second 24 data column compression nodes, and of group dictionary compression node 22 together form, over a bus, the second output of multidimensional data and index compression unit 7,

the second output of data and index compression node 21 is the first output of multidimensional data and index compression unit 7.

Embodiments of the invention are described below with reference to the drawings and illustrations.

Figs. 1-7 illustrate the claimed method of compressing multidimensional data in detail: Fig. 1 shows the compression algorithm in general form, Fig. 2 the procedure for splitting the input stream of multidimensional data, Fig. 3 the formation of the group dictionary and its compression, Fig. 4 the data analysis algorithm, Fig. 5 the compression procedure, Fig. 6 the algorithm for building indexes for the user's information search procedure, and Fig. 7 the algorithm for dynamic generation of the decompression library.

Consider first Fig. 1, which illustrates the algorithm for compressing multidimensional data in general form.

1. Each input stream of multidimensional data is split into rows, each of which is a structured set of columns, and each column has its own data type.

2. The functional purpose of each column is determined, thereby forming at least three data groups: search, aggregated, and information.

3. The data in the formed groups are analyzed and, based on the results of the analysis, compressed, for which:

- a group dictionary is formed for the columns of the search data group, each element of which consists of values that belong to the input multidimensional data stream;

- the group dictionary is compressed;

- for the columns of the search data group whose elements did not enter the group dictionary, a dictionary of unique values is formed in which each value occurs once and has its own number, and two types of templates are formed and used to compress the columns;

- bit-sliced indexes are formed for the columns of the search and aggregated data groups and are used to perform compression;

- the columns of the information data group are compressed.

4. The decompression library is generated dynamically.

5. The compressed data are stored.

For a better understanding of the claimed method, consider Figs. 2-7.

Each input stream of multidimensional data consists of elements of the same structure, where each element is a set of fields with given values, and the values of the same field across different elements form a column of values. The stream is split into rows (Figs. 2 and 4), each of which is a structured set of columns. Each column has its own data type (text, numeric, or date), and each data type is represented by a corresponding sequence of, for example, N bytes and/or a sequence of, for example, M bits. As shown in Fig. 2, this yields M rows and N columns.

The columns of the search data group are represented, for example, by a sequence of bytes or a sequence of bits. The columns of the aggregated data group are represented by a sequence of bytes or a sequence of bits. The columns of the information data group are represented, for example, by a sequence of bytes.

The functional purpose of each column is determined, thereby forming at least three data groups: search, aggregated, and information.

The search data group receives the columns that serve to restrict the data selection. The columns of the search data group are the set of columns used to select the data (information) required for analysis.

The aggregated data group receives the columns that represent a numeric measure and are used in data selection conditions during a search. That is, the columns of the aggregated data group represent numeric characteristics on which aggregation functions are performed. These data are used both at the top level of the SELECT statement and in nested subqueries. With large data volumes, such operations are subject to high speed requirements (especially in analytical systems).

The information data group receives columns with long text fields that do not take part in selection conditions during a search and do not represent a numeric measure. The information data group includes columns whose contents are used neither as a selection condition nor as a quantitative measure for aggregate functions. As a rule, such columns are used only at the top level of the SQL SELECT statement.

Information about the formed data groups is stored.
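The division of columns into the three groups can be illustrated with a minimal Python sketch. The `Role` enumeration, the `split_into_groups` function, and the sample rows are hypothetical names and data introduced here for illustration only; the description does not prescribe how a column's functional purpose is recorded.

```python
from enum import Enum


class Role(Enum):
    """Functional purpose of a column (hypothetical labels)."""
    SEARCH = "search"
    AGGREGATE = "aggregate"
    INFO = "info"


def split_into_groups(rows, roles):
    """Transpose rows into columns and bucket each column by its role."""
    columns = list(zip(*rows))  # one tuple of values per column
    groups = {role: {} for role in Role}
    for position, (name, role) in enumerate(roles):
        groups[role][name] = columns[position]
    return groups


# Illustrative input stream: every row has the same structure.
rows = [
    ("Moscow", "2009-12-30", 120, "long free-text comment A"),
    ("Moscow", "2009-12-31", 80, "long free-text comment B"),
    ("Kazan", "2009-12-30", 45, "long free-text comment C"),
]
roles = [
    ("city", Role.SEARCH),
    ("day", Role.SEARCH),
    ("amount", Role.AGGREGATE),
    ("note", Role.INFO),
]
groups = split_into_groups(rows, roles)
```

In this sketch the roles are supplied by the caller; in practice they would follow from how each column is used in queries, as described above.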

The data (information) in the formed data groups are analyzed (Fig. 4) and, based on the results of the analysis, the formed data groups are compressed (Fig. 5).

To this end, a group dictionary is formed for the columns of the search data group. An example of forming the group dictionary and compressing it is shown in Fig. 3. Each element of the group dictionary consists of values that belong to the input multidimensional data stream. The group dictionary can be implemented, for example, as a hash table in which the number of elements does not exceed the capacity of a machine word and each element consists of the set of elements of all columns except the columns of the information data group. Algorithms for working with hash tables are well known, for example [D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, translated from English, 2nd ed., Moscow: Williams, 2004, 832 pp., section 6.4], [M. Sibul and Yamamoto, "Data Processing Algorithms", Moscow: Mir, 1995, 218 pp.], or [Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom, "Database Systems: The Complete Book", Moscow: Williams, 2003, 1088 pp., section 13.4].

Using the formed group dictionary (Figs. 3 and 4), the data in the columns of the search data group are analyzed as follows: each input element of the data stream is looked up in the group dictionary; if the element is already present, its repetition count is incremented; if it is absent, an element corresponding to the input element of the data stream is added.
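The counting step can be sketched with an ordinary Python dictionary standing in for the hash table. The function name `build_group_dictionary` and the sample rows are assumptions made for illustration, and the capacity limit of a machine word is not modeled here.

```python
def build_group_dictionary(rows):
    """Map each distinct tuple of column values to its repetition count."""
    dictionary = {}
    for row in rows:
        key = tuple(row)
        # Present already: increment the count; absent: add with count 1.
        dictionary[key] = dictionary.get(key, 0) + 1
    return dictionary


# Rows contain only the columns that take part in the group dictionary.
rows = [
    ("Moscow", "A"),
    ("Kazan", "B"),
    ("Moscow", "A"),
    ("Moscow", "A"),
]
group_dictionary = build_group_dictionary(rows)
```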

When the group dictionary overflows, the number of distinct values is counted for each column whose values are part of a group dictionary element.

The column with the largest number of distinct values is removed from the group dictionary (Fig. 4).

The group dictionary is rebuilt by merging elements that differ only in the values of the removed column, thereby compressing the group dictionary (as shown in Fig. 3).
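The overflow handling described above can be sketched as follows. The function `drop_widest_column` is a hypothetical helper; it assumes the group dictionary is a mapping from value tuples to repetition counts, and it resolves ties between columns by taking the first one, a choice the description does not specify.

```python
def drop_widest_column(dictionary, column_names):
    """Remove the column with the most distinct values and merge entries."""
    # Count the distinct values of every column that is part of the keys.
    distinct = [
        len({key[i] for key in dictionary})
        for i in range(len(column_names))
    ]
    victim = distinct.index(max(distinct))  # first column on a tie (assumption)
    merged = {}
    for key, count in dictionary.items():
        reduced = key[:victim] + key[victim + 1:]
        # Entries that differed only in the removed column collapse into one.
        merged[reduced] = merged.get(reduced, 0) + count
    remaining = column_names[:victim] + column_names[victim + 1:]
    return merged, remaining, column_names[victim]


dictionary = {
    ("Moscow", "A", 1): 2,
    ("Moscow", "A", 2): 1,
    ("Kazan", "B", 3): 1,
}
merged, remaining, removed = drop_widest_column(
    dictionary, ["city", "code", "id"]
)
```

The removed column would then be handled through the dictionary of unique values rather than the group dictionary.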

For the columns of the search data group whose elements did not enter the group dictionary, a dictionary of unique values is formed (Fig. 4), in which each value occurs once and has its own number. For this purpose, two types of templates are formed: the first for data types in which the semantic parts of a value are represented by a sequence of bytes, the second for data types whose values are represented by a sequence of bits.
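A dictionary of unique values of the kind just described can be sketched in a few lines. The function name `build_unique_dictionary` and the numbering by order of first appearance are assumptions made for illustration.

```python
def build_unique_dictionary(column):
    """Assign each distinct value a number in order of first appearance."""
    numbers = {}
    encoded = []
    for value in column:
        if value not in numbers:
            numbers[value] = len(numbers)
        encoded.append(numbers[value])
    values = list(numbers)  # position in this list is the value's number
    return values, encoded


values, encoded = build_unique_dictionary(
    ["red", "blue", "red", "green", "blue"]
)
restored = [values[number] for number in encoded]
```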

The first template is the common form of all values within a column for data types in which the semantic parts of the values are represented as a record with fields whose size is a multiple of a byte.

The second template is the common form of all values within a column for data types in which the semantic parts of the values are represented as a record with fields whose size is a multiple of a bit.

The template is identified by comparing how the byte or bit values, respectively, change from one value to the next.

Using the formed templates, the data columns are compressed (Fig. 5) so that only the parts of a value that do not belong to the template are stored, while the part of the value that matches the template is removed.
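A byte-level template can be sketched as follows, assuming fixed-width values. The functions `build_byte_template`, `compress`, and `decompress` are hypothetical names, and the representation of the template as a list with `None` at the varying positions is an illustrative choice, not the format used by the invention. A bit-level template would follow the same idea with bit positions in place of byte positions.

```python
def build_byte_template(values):
    """Return the shared byte at each constant position, None elsewhere."""
    first = values[0]
    if any(len(v) != len(first) for v in values):
        raise ValueError("fixed-width values are assumed in this sketch")
    return [
        first[i] if all(v[i] == first[i] for v in values) else None
        for i in range(len(first))
    ]


def compress(value, template):
    """Keep only the bytes at positions the template leaves open."""
    return bytes(b for b, t in zip(value, template) if t is None)


def decompress(packed, template):
    """Re-insert the template bytes around the stored variable bytes."""
    variable = iter(packed)
    return bytes(next(variable) if t is None else t for t in template)


dates = [b"2009-12-30", b"2009-12-31", b"2009-11-05"]
template = build_byte_template(dates)
packed = [compress(d, template) for d in dates]
restored = [decompress(p, template) for p in packed]
```

For these three dates the template fixes the bytes of `2009-1` and the second hyphen, so each ten-byte value is stored in three bytes.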

The columns of the information group are compressed by converting their values, represented by a sequence of bytes, into a sequence represented by fewer bytes, by transforming runs of consecutive identical byte values. Because most columns of the information data group hold text that is needed only at the top level of SELECT queries, the data need to be decompressed only once. Compression can therefore be performed without subsequently indexing these columns. The only requirement is a high compression ratio for text, so well-known compression methods can be used, such as RLE (Run Length Encoding), Huffman coding, LZW (Lempel-Ziv-Welch), or others. For example, columns of the information data group represented by a sequence of N bytes can be compressed by replacing each run of consecutive identical byte values with a pair of elements: the count and the value itself (the RLE method). Compression methods are described in more detail in [D. Salomon, "Compression of Data, Images and Sound", Moscow: Tekhnosfera, 2004, 368 pp.; LZW, p. 97; Huffman, p. 30].
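The RLE variant mentioned above, in which each run is replaced by a count and a value, can be sketched as follows. The function names and the representation of the output as a list of pairs are assumptions made for illustration; a real encoder would also fix a byte layout for the pairs.

```python
def rle_encode(data):
    """Replace each run of identical bytes with a (count, value) pair."""
    runs = []
    for byte in data:
        if runs and runs[-1][1] == byte:
            runs[-1][0] += 1  # extend the current run
        else:
            runs.append([1, byte])  # start a new run
    return [(count, value) for count, value in runs]


def rle_decode(runs):
    """Expand (count, value) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in runs)


encoded = rle_encode(b"aaabccdddd")
decoded = rle_decode(encoded)
```

RLE only pays off when runs are long; for ordinary text, Huffman coding or LZW, both named above, would typically compress better.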

Information is stored about the compression procedures applied to the search, aggregated, and information data groups and to the data in them.

To perform the search procedure, indexes are formed depending on the column type and on the number of distinct values in the dictionary of unique values (Fig. 6).

For the columns of the search data group, indexes are formed that allow relational comparisons among the values of a given column to be evaluated during a search, provided that the number of distinct values in the column does not exceed a given threshold, which is set depending on the system configuration. For example, Bitmap indexes are formed for the columns of the search group. A Bitmap index is a structure of several bit sequences, each of which corresponds to one unique value of the indexed column; a bit set to "1" in the n-th position of a sequence means that the n-th row contains, in that column, the value corresponding to the sequence.
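A Bitmap index of this kind can be sketched with Python integers used as bit sequences. The function names `build_bitmap_index` and `rows_of` are hypothetical, and rows are numbered from 0 with bit n standing for row n.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit n is set for row n."""
    index = {}
    for n, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << n)
    return index


def rows_of(bitmap):
    """List the row numbers whose bit is set in the bitmap."""
    return [n for n in range(bitmap.bit_length()) if bitmap >> n & 1]


column = ["Moscow", "Kazan", "Moscow", "Omsk"]
index = build_bitmap_index(column)

# A condition such as city = 'Moscow' OR city = 'Omsk' becomes a bitwise OR.
selected = rows_of(index["Moscow"] | index["Omsk"])
```

Combining conditions reduces to bitwise operations on the sequences, which is why the approach suits columns with few distinct values.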

For the columns of the aggregated data group, indexes are formed that, depending on the information requested by the user, allow a search to evaluate relational comparisons among the values of a given column, sum the column values, find the minimum or maximum value, or compute the number of values or their average. For example, Bit-sliced indexes are formed for the columns of the aggregated group. A Bit-sliced index is a set of bit sequences in which the i-th sequence is the concatenation of the i-th bit of every value in the column, so the number of sequences equals the number of bits needed to represent each value.
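A Bit-sliced index and a sum computed from it can be sketched as follows, assuming non-negative integer values. The function names are hypothetical, and Python integers stand in for the bit sequences, with bit n of each slice belonging to row n.

```python
def build_bit_sliced_index(values):
    """Slice i is an int whose bit n equals bit i of values[n]."""
    width = max(values).bit_length() if values else 0
    slices = [0] * width
    for n, value in enumerate(values):
        for i in range(width):
            if value >> i & 1:
                slices[i] |= 1 << n
    return slices


def sliced_sum(slices, selection):
    """Sum the selected rows' values without reconstructing them."""
    # Each set bit in slice i contributes 2**i to the total.
    return sum(
        bin(s & selection).count("1") << i for i, s in enumerate(slices)
    )


values = [5, 3, 6, 1]
slices = build_bit_sliced_index(values)
total = sliced_sum(slices, 0b1111)    # all four rows
partial = sliced_sum(slices, 0b0101)  # rows 0 and 2 only
```

The `selection` argument can be the bitmap produced by a search condition, so filtering and aggregation combine without decompressing the column; the count is the number of set bits in `selection`, and the average follows from the sum and the count.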

The formed indexes are stored.

The decompression library is generated dynamically for each input stream of multidimensional data (Fig. 7), for which:

standard entry points are generated for the source code of the decompression library,

for the columns of the group dictionary, source code is generated in a programming language (for example, ASM) to unpack the values of the columns that entered the group dictionary,

for each column that did not enter the group dictionary, source code is generated in a programming language (for example, ASM) to unpack its values,

generation of the decompression library is completed.
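The idea of generating unpacking code per input stream can be illustrated with a small sketch that emits Python source instead of the ASM mentioned above. Everything here is an assumption made for illustration: the entry point names `unpack_group` and `unpack_<column>`, the byte-template format (a list with `None` at varying positions), and the use of `exec` to load the generated source. Column names are assumed to be valid identifiers.

```python
def generate_decompression_source(group_columns, group_entries, templates):
    """Emit Python source with one unpack function per stored structure."""
    lines = [
        f"GROUP_COLUMNS = {tuple(group_columns)!r}",
        f"GROUP_ENTRIES = {list(group_entries)!r}",
        # Entry point for columns that entered the group dictionary.
        "def unpack_group(number):",
        "    return dict(zip(GROUP_COLUMNS, GROUP_ENTRIES[number]))",
    ]
    # One entry point per column compressed with its own template.
    for name, template in templates.items():
        lines += [
            f"def unpack_{name}(packed):",
            f"    template = {list(template)!r}",
            "    variable = iter(packed)",
            "    return bytes(next(variable) if t is None else t"
            " for t in template)",
        ]
    return "\n".join(lines) + "\n"


def load_library(source):
    """Execute the generated source and return its namespace."""
    namespace = {}
    exec(source, namespace)
    return namespace


day_template = list(b"2009-1") + [None, ord("-"), None, None]
source = generate_decompression_source(
    ["city", "code"],
    [("Moscow", "A"), ("Kazan", "B")],
    {"day": day_template},
)
library = load_library(source)
```

Because the template and dictionary contents are baked into the generated functions, the unpacking code needs no per-call lookup of the stream's layout, which is one plausible motivation for generating the library per stream.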

По запросу пользователя (фиг.8), который содержит информацию о выбираемых столбцах потока данных или условиях, ограничивающих выбираемый набор строк, или функции агрегирования, осуществляют восстановление запомненных данных, для чегоAt the request of the user (Fig. 8), which contains information about the selected columns of the data stream or the conditions that limit the selected set of rows, or the aggregation function, the stored data is restored, for which

распознают команду пользователя путем трансляции - лексического и семантического анализа и преобразование его во внутреннее представление (форму), используя которое определяют те столбцы данных, которые должны быть переданы пользователю, известные алгоритмы такой процедуры описаны, например, [Ахо "Компиляторы. Принципы, технология, инструменты" М.: Вильямс, 2003. - 768 с.];they recognize the user’s command by translating - lexical and semantic analysis and converting it into an internal representation (form), using which the data columns to be transmitted to the user are determined, known algorithms of this procedure are described, for example, [Aho "Compilers. Principles, technology, tools "M .: Williams, 2003. - 768 p.];

определяют набор поисковых столбцов, используя которые осуществляют фильтрацию данных;determine a set of search columns using which filter data;

определяют набор столбцов, для которых вызывают агрегируемые функции;define a set of columns for which aggregated functions are called;

определяют набор индексов для использования операций поиска и агрегирования в зависимости от операций, указанных в запросе, если при вычислении запроса используются функции агрегирования, то выбирается Bit-sliced индекс для соответствующих столбцов, для вычисления операций отношений над столбцами выбирают В-tree или Bitmap индексы, в зависимости от их наличия;determine a set of indices for using search and aggregation operations depending on the operations specified in the query, if aggregation functions are used in calculating the query, then the Bit-sliced index is selected for the corresponding columns, B-tree or Bitmap indices are selected to calculate the relations between the columns, depending on their availability;

определяют порядок извлечения и процедуру декомпрессии данных, вычисляют результат запроса пользователя, используя определенные порядок извлечения и процедуру декомпрессии данных, для чегоdetermine the extraction order and the data decompression procedure, calculate the result of the user’s request using certain extraction procedures and the data decompression procedure, for which

when indexes are used for the columns of the search data group, the results of relational operations are determined as element numbers of the original data stream, for example as described in [Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom, "Database Systems: The Complete Book", Moscow: Williams, 2003, 1088 p., chapters 15, 16];

when indexes are used for the columns of the aggregated group, the results of the aggregation operation are determined, for example as described in [P. O'Neil, D. Quass, Improved Query Performance with Variant Indexes. Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 38-49, 1997];

the stored data and the decompression library are located, and the required data elements are decompressed by calling the decompression library through its standard entry points, thereby restoring the input stream of multidimensional data that satisfies the search conditions, which is returned to the user.
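The two index-based steps above (relational results expressed as element numbers, and aggregation computed from the index) can be illustrated by a minimal Python sketch. It assumes that a Bitmap index is held as one integer bit vector per distinct value and that a Bit-sliced index is held as one bit vector per bit position of non-negative integer values. The function names, column names and sample data are hypothetical and are not taken from the cited sources.

```python
def build_bitmap_index(column):
    """One bit vector (a Python int) per distinct value; bit i is set when row i holds it."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def build_bit_sliced_index(values):
    """One bit vector per bit position; slice i collects the i-th bit of every value."""
    width = max(values, default=0).bit_length()
    return [sum(1 << row for row, v in enumerate(values) if v >> i & 1)
            for i in range(width)]


def row_numbers(bits):
    """Element numbers (row positions) of the set bits in a bit vector."""
    return [i for i in range(bits.bit_length()) if bits >> i & 1]


def sliced_sum(slices, selection):
    """Sum of the selected rows: each slice contributes 2**i per selected set bit."""
    return sum(bin(s & selection).count("1") << i for i, s in enumerate(slices))


# Hypothetical sample data: one search column and one aggregated column.
region = ["N", "S", "N", "E", "S", "N"]
amount = [5, 3, 8, 1, 7, 2]

bitmap = build_bitmap_index(region)
slices = build_bit_sliced_index(amount)

selection = bitmap["N"] | bitmap["E"]    # region = 'N' OR region = 'E'
matching_rows = row_numbers(selection)   # element numbers of the matching rows
total = sliced_sum(slices, selection)    # SUM(amount) over the matching rows
```

In this sketch the aggregate is obtained without touching the compressed column values: only the bit vectors are combined and counted.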

Consider the implementation of the claimed method on a device whose block diagram is shown in Figs. 9 and 10.

In response to the output signal of the multidimensional data loading control unit 8 (Fig. 9), which is fed to the input of the multidimensional data and index compression unit 7, each input stream of multidimensional data is loaded through the second network 2₂ from the online data storage unit 9, whose input receives the data (information) write signal from the second (information) input of the device.

From the input of the multidimensional data and index compression unit 7, each input stream of multidimensional data is fed to the input of the node for splitting the input stream of multidimensional data 12 (Fig. 10), which splits it into rows, each of which is a structured set of columns, each column having its own data type, and determines the functional purpose of each column. Thus, from the first, second and third outputs of the splitting node 12, the structured sets of columns, grouped by functional purpose, are fed respectively to the inputs of the shaper of the search data group with memory element 13, the shaper of the aggregated data group with memory element 14 and the shaper of the information data group with memory element 15.

Shapers 13, 14 and 15 form three data groups, respectively (search, aggregated and information), and store information about the formed data groups and the data in them.
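The assignment of columns to the three groups can be sketched as follows. This is a minimal Python illustration under stated assumptions: the rule that every numeric column is a measure, the `filterable` flag and the `LONG_TEXT_BYTES` cutoff are hypothetical simplifications, since the text specifies only the purpose of each group and not how the purpose is detected.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    dtype: str               # "text", "numeric" or "date"
    max_len: int = 0         # longest value in bytes (text columns only)
    filterable: bool = True  # whether the column appears in selection conditions


LONG_TEXT_BYTES = 256  # hypothetical cutoff for a "long text field"


def classify(col: Column) -> str:
    """Return 'search', 'aggregated' or 'information' for one column."""
    if col.dtype == "numeric":
        return "aggregated"   # treated here as a numerical measure
    if col.dtype == "text" and col.max_len >= LONG_TEXT_BYTES and not col.filterable:
        return "information"  # long text that never takes part in selection
    return "search"           # column that restricts the selection


def split_columns(columns):
    """Group column names by functional purpose."""
    groups = {"search": [], "aggregated": [], "information": []}
    for col in columns:
        groups[classify(col)].append(col.name)
    return groups


columns = [
    Column("region", "text", max_len=12),
    Column("sale_date", "date"),
    Column("amount", "numeric"),
    Column("comment", "text", max_len=2000, filterable=False),
]
groups = split_columns(columns)
```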

The data in the formed groups are analyzed and, based on the results of the analysis, compressed, for which purpose:

- for the columns of the search data group, a group dictionary is formed in the group dictionary shaper 16, each element of which consists of values that belong to the input multidimensional data stream;

- the group dictionary is compressed in the group dictionary compression node 22;

- for the columns of the search data group whose elements are not included in the group dictionary, a dictionary of unique values is formed in the unique dictionary shaper 17, in which each value occurs once and has its own number, for which purpose two types of templates (a byte template and a bit template, respectively) are formed in the first 19 and second 20 template shapers, and these templates are used to perform compression in the first 23 and second 24 data column compression nodes, respectively;

- for the columns of the search and aggregated data groups, piecewise-bit (Bit-sliced) indexes are formed using the index generator 11; using the formed indexes, compression is performed in the data and index compression node 21;

- the columns of the information data group are compressed in the node for compressing the columns of the information data group 18;

- the compressed data and indexes are stored.
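The group dictionary step of the list above can be sketched as follows, assuming (as in claim 5) that the dictionary is a hash table keyed by the combination of column values and that a fixed `capacity` defines when it is full. The class name, the `capacity` parameter and the sample rows are hypothetical.

```python
from collections import Counter


class GroupDictionary:
    """Hash table: tuple of column values -> number of repetitions."""

    def __init__(self, columns, capacity):
        self.columns = list(columns)
        self.capacity = capacity  # hypothetical limit on the number of elements
        self.counts = Counter()

    def _key(self, row):
        return tuple(row[c] for c in self.columns)

    def add(self, row):
        key = self._key(row)
        # When the dictionary is full, drop columns until the element fits.
        while key not in self.counts and len(self.counts) >= self.capacity and self.columns:
            self._drop_widest_column()
            key = self._key(row)
        self.counts[key] += 1  # new element, or one more repetition

    def _drop_widest_column(self):
        # Count the distinct values of each column inside the dictionary.
        distinct = [len({k[i] for k in self.counts}) for i in range(len(self.columns))]
        worst = distinct.index(max(distinct))
        del self.columns[worst]
        # Rebuild: merge the elements that differed in the removed column.
        merged = Counter()
        for key, n in self.counts.items():
            merged[key[:worst] + key[worst + 1:]] += n
        self.counts = merged


gd = GroupDictionary(["region", "year", "order_id"], capacity=3)
for row in [
    {"region": "N", "year": 2009, "order_id": 1},
    {"region": "N", "year": 2009, "order_id": 2},
    {"region": "S", "year": 2009, "order_id": 3},
    {"region": "S", "year": 2009, "order_id": 4},
]:
    gd.add(row)
```

In this sketch the removed column (`order_id`) would then be handled through the dictionary of unique values, like any other search column whose elements are not included in the group dictionary.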

The compressed data from the second outputs of nodes 22, 23, 24 and 18 are fed to the second output of the multidimensional data and index compression unit 7, and the compressed indexes from the second output of node 21 are fed to its first output.
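The byte-template compression performed with the first template shaper 19 and the first data column compression node 23 can be illustrated by a minimal Python sketch. It assumes fixed-width values and takes as the template every byte position that is identical in all values of the column; the function names and the sample dates are hypothetical.

```python
def build_byte_template(values):
    """Per position: the shared byte if all values agree there, otherwise None."""
    length = len(values[0])
    assert all(len(v) == length for v in values), "sketch assumes fixed-width values"
    return [values[0][i] if all(v[i] == values[0][i] for v in values) else None
            for i in range(length)]


def compress(value, template):
    """Keep only the bytes that are not part of the template."""
    return bytes(b for b, t in zip(value, template) if t is None)


def decompress(packed, template):
    """Re-insert the template bytes around the stored bytes."""
    it = iter(packed)
    return bytes(next(it) if t is None else t for t in template)


dates = [b"2009-12-30", b"2009-11-05", b"2009-12-01"]
template = build_byte_template(dates)
packed = [compress(d, template) for d in dates]
restored = [decompress(p, template) for p in packed]
```

Each ten-byte date is stored here as three bytes, because the positions shared by all values are kept once in the template.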

For the information retrieval procedure, the decompression library is generated dynamically in the decompression library generation node 25.

In response to a user query arriving at the input of the user query generation unit 1 (Fig. 9), which contains information about the selected columns of the data stream, or conditions restricting the selected set of rows, or aggregate functions, the stored data are restored, for which purpose,

in the multidimensional query processing unit 3:

- the user's command is recognized by translation (lexical and semantic analysis) and converted into an internal representation (form), which is used to determine the data columns to be returned to the user;

- a set of search columns is determined, which is used to filter the data;

- a set of columns for which aggregate functions are called is determined;

- a set of indexes for the search and aggregation operations is determined according to the operations specified in the query: if the query uses aggregate functions, a Bit-sliced index is selected for the corresponding columns; to evaluate relational operations on columns, B-tree or Bitmap indexes are selected, depending on their availability in unit 4.
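The index selection rule in the last item can be sketched as a small Python function. The operation names, the index labels and in particular the preference of Bitmap over B-tree when both exist are assumptions made for the illustration; the text states only that the choice depends on availability.

```python
AGGREGATES = {"sum", "avg", "min", "max", "count"}
RELATIONAL = {"=", "<>", "<", "<=", ">", ">="}


def choose_index(operation, available):
    """Pick an index kind for one operation on one column, or None if none fits.

    `available` is the set of index kinds that exist for the column.
    """
    if operation in AGGREGATES:
        return "bit-sliced" if "bit-sliced" in available else None
    if operation in RELATIONAL:
        # Hypothetical preference order; the source only says "depending on availability".
        for kind in ("bitmap", "b-tree"):
            if kind in available:
                return kind
    return None


plan = {
    ("amount", "sum"): choose_index("sum", {"bit-sliced"}),
    ("region", "="): choose_index("=", {"bitmap", "b-tree"}),
    ("sale_date", "<"): choose_index("<", {"b-tree"}),
}
```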

In the user query optimization and execution unit 4, the extraction order and the data decompression procedure are determined, and the result of the user's query is computed using the determined extraction order and data decompression procedure, for which purpose:

when indexes are used for the columns of the search data group, the results of relational operations are determined as element numbers of the original data stream, using information from the compressed index storage unit 6;

when indexes are used for the columns of the aggregated group, the results of the aggregation operation are determined using information from the compressed index storage unit 6;

the stored data and the decompression library are located in node 25 (Fig. 10), and the required data elements are decompressed by the compressed data storage unit 5 (Fig. 9), which calls the decompression library through its standard entry points, thereby restoring the input stream of multidimensional data that satisfies the search conditions, which is returned to the user from the response processing unit 10.
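The idea of a dynamically generated decompression library called through standard entry points can be sketched in Python, with the built-in `compile` and `exec` standing in for the compiler run and library loading. This is an assumption-laden illustration: the entry point name `decompress`, the per-column `unpack_` functions and the byte-template format are hypothetical, and column names are assumed to be valid identifiers.

```python
def generate_library_source(templates):
    """Emit source text: one unpack function per column plus one entry point.

    `templates` maps a column name to its byte template (ints, or None for
    positions whose bytes are stored in the compressed value).
    """
    lines = []
    for name, template in templates.items():
        lines.append(f"def unpack_{name}(packed):")
        lines.append(f"    template = {template!r}")
        lines.append("    it = iter(packed)")
        lines.append("    return bytes(next(it) if t is None else t for t in template)")
        lines.append("")
    # Standard entry point: the caller needs only the column name.
    lines.append("def decompress(column, packed):")
    lines.append("    return globals()['unpack_' + column](packed)")
    return "\n".join(lines)


def build_library(templates):
    """Compile the generated source and return its namespace as the 'library'."""
    namespace = {}
    code = compile(generate_library_source(templates), "<decompression_library>", "exec")
    exec(code, namespace)
    return namespace


# Template for values of the form "2009-1?-??" (None marks the stored bytes).
template = [50, 48, 48, 57, 45, 49, None, 45, None, None]
library = build_library({"sale_date": template})
value = library["decompress"]("sale_date", b"230")
```

Because the template is baked into the generated code, the caller needs only the entry point and the compressed bytes to restore a value.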

The claimed device can be implemented either on a microcomputer or in hardware.

Thus, the claimed invention achieves a better technical effect than known technical solutions in this field, namely:

universality, since the method is applicable to any type of loaded information (text, numeric, date, etc.), which is achieved by analyzing the internal representation of values based on either a byte or a bit sequence.

This makes it possible to increase the degree of compression of multidimensional data without loss of their useful function, to increase the volume of stored information, and to improve the reliability of storing and retrieving various information in databases.
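As one concrete instance of byte-level compression, the replacement of runs of identical bytes by a count and a value, which claim 6 applies to the columns of the information data group, can be sketched in Python. The one-byte count (runs capped at 255) and the function names are assumptions made for the illustration.

```python
def rle_encode(data: bytes) -> bytes:
    """Replace each run of identical bytes with the pair (count, value)."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Hypothetical limit: the count is stored in one byte, so runs stop at 255.
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(packed: bytes) -> bytes:
    """Expand (count, value) pairs back into the original byte sequence."""
    out = bytearray()
    for i in range(0, len(packed), 2):
        out += bytes([packed[i + 1]]) * packed[i]
    return bytes(out)


text = b"aaaabbbc"
packed = rle_encode(text)
restored = rle_decode(packed)
```

Such a scheme shortens the data only where runs actually occur; a sequence without repeated bytes doubles in size under this encoding.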

Claims

1. A method of compressing multidimensional data for storing and retrieving information in a database management system, in which each input stream of multidimensional data, consisting of elements of the same structure, where each element is represented by a set of fields with specified values and the set of values of the same field in different elements forms a column of values, is divided into rows, each of which is a structured set of columns, each column having its own data type (text, numeric or date), and each data type being represented by a corresponding sequence of bytes or sequence of bits; the functional purpose of each column is determined, thereby forming at least three data groups: search, aggregated and information; the search data group is assigned those columns that serve to restrict the data selection; the aggregated data group is assigned columns that represent a numerical measure and are used in data selection conditions during search; the information data group is assigned columns with long text fields that do not participate in selection conditions during search and do not represent a numerical measure; information about the formed data groups is stored; the data (information) in the formed data groups are analyzed and, based on the results of the analysis, the formed data groups are compressed, for which purpose a group dictionary is formed for the columns of the search data group, each element of the group dictionary consisting of values that belong to the input multidimensional data stream; using the formed group dictionary, the data contained in the columns of the search data group are analyzed such that, for each input element of the data stream, an element is added to the group dictionary: if such an element already exists, its repetition count is increased, and if it is not in the group dictionary, the element corresponding to the input element of the data stream is added; when the group dictionary is full, the number of distinct values is counted for each column whose values are part of a group dictionary element, the column with the largest number of distinct (non-repeating) values is removed from the group dictionary, and the group dictionary is rebuilt by merging the elements that differ in the values of the fields of the removed column, thereby compressing the group dictionary; for the columns of the search data group whose elements are not included in the group dictionary, a dictionary of unique values is formed, in which each value occurs once and has its own number; for each set of unique values, two types of templates are formed, the first for a data type in which the semantic parts of a value are represented by a sequence of bytes, and the second for a data type whose values are represented by a sequence of bits; the first template is the common form of all values within the columns for a data type in which the semantic parts of values are represented by a record with fields whose size is a multiple of a byte; the second template is the common form of all values within the columns for a data type in which the semantic parts of values are represented by a record with fields whose size is a multiple of a bit; the template is selected by comparing the changes in byte or bit values, respectively; using the formed templates, the data columns are compressed such that only the parts not belonging to the template are stored and the parts of the column values that correspond to the template are removed; the columns of the information group are compressed by converting the column values of this group, represented by a sequence of bytes, into a sequence represented by fewer bytes, by converting consecutive identical byte values; information is stored about the procedures used to compress the information, aggregated and search data groups and the data in them; to perform the search procedure, indexes are formed depending on the type of the columns and the number of distinct values in the dictionary of unique values, such that, for the columns of the search data group, indexes are formed that make it possible during search to evaluate relational operations among the values of the given column, provided that the number of distinct values in the column does not exceed a predetermined threshold, which is set depending on the system configuration, and, for the columns of the aggregated data group, indexes are formed that, when searching for the information requested by the user, make it possible to evaluate relational operations among the values of the given column, or to sum the column values, or to find the minimum or maximum value among the values of the given column, or the number of values, or the average value among the values of the given column; the formed indexes are stored; a decompression library is generated dynamically for each input multidimensional data stream, for which purpose standard entry points are generated for the source code of the decompression library, source code in a programming language is generated for unpacking the values of the columns included in the group dictionary, and, for each column not included in the group dictionary, source code in a programming language is generated for unpacking the values of the columns not included in the group dictionary; a compiler is run to build the decompression library from the source code, the decompression library is stored and registered for the input multidimensional data stream, and the generation of the decompression library is completed; in response to a user query containing information about the selected columns of the data stream, or conditions restricting the selected set of rows, or aggregate functions, the stored data are restored, for which purpose the user's command is recognized by translation (lexical and semantic analysis) and converted into an internal representation (form), which is used to determine the data columns to be returned to the user; a set of search columns used to filter the data, and the corresponding indexes, are determined; a set of columns for which aggregate functions are called, and the corresponding indexes, are determined; the extraction order and the data decompression procedure are determined; the result of the user's query is computed using the determined extraction order and data decompression procedure, for which purpose, when indexes are used for the columns of the search data group, the results of relational operations are determined as element numbers of the original data stream, and, when indexes are used for the columns of the aggregated group, the results of the aggregation operation are determined; the stored data and the decompression library are located, and the required data elements are decompressed by calling the decompression library through its standard entry points, thereby restoring the input stream of multidimensional data that satisfies the search conditions, which is returned to the user.

2. The method according to claim 1, characterized in that the columns of the data search group are represented by a sequence of bytes or a sequence of bits.

3. The method according to claim 1, characterized in that the columns of the aggregated data group are represented by a sequence of bytes or a sequence of bits.

4. The method according to claim 1, characterized in that the columns of the data information group are represented by a sequence of bytes.

5. The method according to claim 1, characterized in that a hash table is used as a group dictionary, the number of elements in which does not exceed the power of a machine word, and each element consists of a combination of elements of all columns except the columns of an information data group.

6. The method according to claim 1, characterized in that the columns of the information data group, represented by a sequence of bytes, are compressed by replacing each run of consecutive identical byte values with a sequence of two elements, the count and the value itself, thereby converting the data into a sequence represented by fewer bytes.

7. The method according to claim 1, characterized in that Bitmap indexes are created for the columns of the search group, each index being a structure of several sequences of bits, each of which corresponds to a unique value of the column for which the index is built.

8. The method according to claim 1, characterized in that piecewise-bit (Bit-sliced) indexes are formed for the columns of the aggregated group, which are sequences of bits, each sequence being formed by concatenating the i-th bit of each column value, and the number of sequences being determined by the number of bits required to represent each value.

9. The method according to claim 1, characterized in that if it is determined that it is possible to apply Bitmap indexes to the search group columns, then the data requested by the user is searched using Bitmap indexes.

10. The method according to claim 1, characterized in that if it is determined that piecewise-bit (Bit-sliced) indexes can be applied to perform the search procedure for the columns of the aggregated group, then the data requested by the user are searched using the piecewise-bit (Bit-sliced) indexes.

11. A multidimensional data compression device for storing and retrieving information in a database management system, comprising a user query generation unit, a first network, a multidimensional query processing unit, a user query optimization and execution unit, a compressed data storage unit, a compressed index storage unit, a multidimensional data and index compression unit, a multidimensional data loading control unit, a second network, an online multidimensional data storage unit and a unit for processing responses to users, wherein the input of the user query generation unit is the first input of the device (a signal input); the output of the user query generation unit is connected through the first network to the first input of the multidimensional query processing unit, the first output of which is connected to the first input of the user query optimization and execution unit, the output of which is connected to the second input of the multidimensional query processing unit; the second input of the user query optimization and execution unit is connected to the output of the compressed data storage unit; the input of the user query optimization and execution unit is connected to the output of the compressed index storage unit, the input of which is connected to the first output of the multidimensional data and index compression unit; the input of the online data storage unit is the second input of the device (an information input); the second output of the multidimensional query processing unit is connected through the first network to the input of the unit for processing responses to users, the output of which is the output of the device; and the multidimensional data and index compression unit contains an index generator, characterized in that the output of the online data storage unit is connected through the second network to the input of the multidimensional data loading control unit, the output of which is connected to the input of the multidimensional data and index compression unit, the second output of which is connected to the input of the compressed data storage unit; the multidimensional data and index compression unit contains a node for splitting the input stream of multidimensional data, a shaper of the search data group with a memory element, a shaper of the aggregated data group with a memory element, a shaper of the information data group with a memory element, a group dictionary shaper, a unique dictionary shaper, a node for compressing the columns of the information data group, a first template shaper, a second template shaper, a data and index compression node, a group dictionary compression node, a first data column compression node, a second data column compression node and a decompression library generation node, wherein the input of the node for splitting the input stream of multidimensional data is the input of the multidimensional data and index compression unit; the first output of the splitting node is connected to the input of the shaper of the search data group with a memory element, its second output is connected to the input of the shaper of the aggregated data group with a memory element, and its third output is connected to the input of the shaper of the information data group with a memory element; the first output of the shaper of the search data group with a memory element is connected to the input of the group dictionary shaper, its second output is connected to the input of the unique dictionary shaper, and its third output is connected to the first input of the index generator, the second input of which is connected to the output of the shaper of the aggregated data group with a memory element; the output of the group dictionary shaper is connected to the first input of the group dictionary compression node, the first output and second input of which are connected respectively to the first input and first output of the decompression library generation node; the first and second outputs of the unique dictionary shaper are connected respectively to the inputs of the first template shaper and the second template shaper; the output of the first template shaper is connected to the first input of the first data column compression node, the first output and second input of which are connected respectively to the second input and second output of the decompression library generation node; the output of the second template shaper is connected to the first input of the second data column compression node, the first output and second input of which are connected respectively to the third input and third output of the decompression library generation node; the fourth input and fourth output of the decompression library generation node are connected respectively to the first output and second input of the data and index compression node, the first input of which is connected to the output of the index generator; the fifth input and fifth output of the decompression library generation node are connected respectively to the first output and second input of the node for compressing the columns of the information data group, the first input of which is connected to the output of the shaper of the information data group with a memory element; the second outputs of the node for compressing the columns of the information data group, of the first and second data column compression nodes and of the group dictionary compression node form the bus of the second output of the multidimensional data and index compression unit; and the second output of the data and index compression node is the first output of the multidimensional data and index compression unit.