RU2490702C1

RU2490702C1 - Method of accelerating processing of multiple SELECT-type queries to an RDF database using a graphics processor

Info

Publication number: RU2490702C1
Application number: RU2012117709/08A
Authority: RU
Inventors: Антон Александрович Жерздев; Джехюок РЮ; Гю-тае ПАРК; Хюнсик ШИМ
Original assignee: Корпорация "САМСУНГ ЭЛЕКТРОНИКС Ко., Лтд."
Priority date: 2012-05-02
Filing date: 2012-05-02
Publication date: 2013-08-20

Abstract

FIELD: information technology.

SUBSTANCE: disclosed is a method for parallel processing of multiple queries to RDF databases using a graphics processor, characterised in that a binding request received at a server is first broken down into elementary queries, after which said elementary queries are placed into a common input queue, from where they are loaded in blocks into the memory of the graphics processor and passed through the graphics processor pipeline, where a set of elementary answers is calculated for each elementary query, and the obtained answers are then merged into a list of triples, which is the answer to the binding request.

EFFECT: higher throughput of the query processing server.

18 cl, 5 dwg, 2 tbl

Description

The invention relates to information processing technologies, and more specifically to methods of providing access to information stored in databases.

Data and databases play a key role in the modern world. With the development of network technologies, there has been a trend toward sharing network resources: a single database serves many clients over the network. The most important characteristic of a database thus becomes its throughput, that is, the number of queries it can process per unit of time. The bottleneck is often the computing power of the machine. Various schemes for parallel query processing have been developed to address this problem.

One such attempt is described in [1]. The authors propose a method for decomposing an SQL query, together with an optimal strategy for processing the query components in parallel on several processors. The drawback of this method is that its potential degree of parallelism is limited by the number of query components. Some of the simplest kinds of SQL queries cannot be decomposed at all.

In [2] it is proposed to bind each processor in a cluster to a separate portion of the data. The SQL query is then applied to all portions simultaneously, and the results are merged afterwards. In theory, the degree of parallelism of this approach is limited only by the number of processors. In practice, however, merging the results can take considerable time when the number of portions is large. The data link between the processors in the cluster can also become a bottleneck. In addition, the cost-effectiveness of using a cluster with a large number of machines for this task is doubtful.

At present, graphics processing units (GPUs) are a relatively inexpensive alternative to multi-core processors and clusters. A GPU is built around a graphics multiprocessor, which consists of many processors (streaming processors) that are logically combined into groups. Such a multiprocessor can run hundreds of parallel threads (as many as there are processors), each a copy of one and the same procedure called a kernel. For several classes of problems, the prior art includes parallel algorithms that solve them on GPUs far more efficiently than on a multi-core processor or in a cluster.

The main difficulty in using a graphics multiprocessor for general-purpose computing is that an individual processor is not very powerful, which is compensated for by their large number. Moreover, modern GPUs are designed so that fully parallel execution is possible only for the simplest kernels. Branching and memory accesses in the kernel code can cause delays and greatly reduce the degree of parallelism. Hence any task has to be split into as many very simple, independent subtasks as possible.

In [3], a technique was proposed in which the query-processing kernel is launched simultaneously for many different queries. For the use of a GPU to be justified, hundreds of queries must be processed at the same time. Clearly, all the data that a query touches during the search must reside in GPU memory (this may be the whole database or some segment of it). The data found during the search must also be held in GPU memory before being transferred to main memory. Running hundreds of threads simultaneously therefore requires a very large amount of GPU memory for storing results. Furthermore, the code of a kernel that processes an entire SELECT-type query will evidently be rather complex, and the actual degree of parallelism when it runs will be far from the maximum.

Of particular interest in connection with this problem are semantic databases, in particular those represented in RDF (Resource Description Framework). In essence, such a database stores triples (s, p, o): subject, predicate, and object. Each field is a string, but a fairly common technique is to store triples of integer hash codes, which is sufficient for identification. SELECT-type queries to an RDF database are usually formulated in SPARQL (SPARQL Protocol and RDF Query Language).
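
The storage of triples as integer hash codes can be illustrated with a minimal sketch. The choice of a 64-bit BLAKE2b digest, the function names `term_hash` and `encode_triple`, and the sample terms are assumptions made for illustration only; the source does not specify a hash function or code width.

```python
import hashlib

def term_hash(term: str) -> int:
    """Map an RDF term (a string) to a 64-bit integer code.

    Assumption: BLAKE2b truncated to 8 bytes; the source names no hash function.
    """
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def encode_triple(s: str, p: str, o: str) -> tuple:
    """Replace each string field of a triple by its integer hash code."""
    return (term_hash(s), term_hash(p), term_hash(o))

# Hypothetical sample terms
triple = encode_triple("ex:Alice", "ex:knows", "ex:Bob")
```

Fixed-width integers of this kind are convenient for GPU buffers, since every triple then occupies the same number of bytes.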

The inherent simplicity of the RDF database structure gives reason to hope that very simple algorithms amenable to parallel implementation can be constructed for SELECT-type queries. The core of a SPARQL query is a formula over triple templates joined by the logical operations AND and OR. Each template contains three fields, each holding either a specific value or a variable name.

The currently widespread technique for processing a SPARQL query is to derive from the given set of templates a formula over a different type of template, likewise joined by AND and OR. Each such template contains, in each position, either a set of admissible values or a special mask meaning that any value is admissible in that position. Each such template is in effect the simplest query for retrieving data from the database (a so-called binding request). The result of executing it is a set of specific triples retrieved from the database. The formula is then evaluated over these sets (AND is replaced by set intersection and OR by set union) to obtain the resulting set of triples.
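
The final evaluation step described above (AND as set intersection, OR as set union) can be sketched as follows. The nested-tuple encoding of the formula, the function name `evaluate`, and the template names `t1` to `t3` are hypothetical; the source does not prescribe a representation.

```python
def evaluate(formula, results):
    """Evaluate an AND/OR formula over sets of triples.

    formula: either a template name (str) or a tuple (op, left, right)
             with op in {"and", "or"}  (hypothetical encoding).
    results: dict mapping a template name to the set of triples that
             its binding request returned.
    """
    if isinstance(formula, str):
        return results[formula]
    op, left, right = formula
    a = evaluate(left, results)
    b = evaluate(right, results)
    if op == "and":
        return a & b          # AND -> set intersection
    if op == "or":
        return a | b          # OR -> set union
    raise ValueError(f"unknown operation: {op!r}")

# Hypothetical results of three binding requests
results = {
    "t1": {(1, 2, 3), (4, 5, 6)},
    "t2": {(1, 2, 3)},
    "t3": {(7, 8, 9)},
}
answer = evaluate(("and", "t1", ("or", "t2", "t3")), results)
```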

In [4], a method of processing SELECT-type queries to an RDF database is described in which the whole answer-retrieval process outlined above is preceded by optimization of the query: it is rewritten as another, equivalent set of templates characterized by a shorter data search path in the database. This is done on the basis of semantic rules derived automatically from the current contents of the database at a preprocessing stage. In principle, this optimization can be applied regardless of how data retrieval itself is implemented, although its effect may depend on that.

The problem addressed by the claimed invention is to increase the throughput of a server that processes SELECT-type queries to an RDF database.

The technical result is achieved through the development and use of an original algorithm for parallel processing of multiple binding requests, executed on a graphics processing unit (GPU), and of an improved RDF database system based on a GPU.

To solve this problem, the claimed invention proposes a method for parallel processing of multiple queries to RDF databases using a graphics processor, characterized in that a binding request received at the server is first broken down into elementary queries, after which these elementary queries are placed in a common input queue, from which they are loaded in blocks into the memory of the graphics processing unit (GPU) and passed through a GPU pipeline, where a set of elementary answers is computed for each elementary query, and the answers obtained are then merged into a list of triples that constitutes the answer to the binding request.

According to the claimed invention, the GPU is a device whose computing power can be used for general-purpose computing by means of OpenCL (Open Computing Language) technology.

According to the claimed invention, the binding request is a set of three lists containing the admissible values for the subject, the predicate, and the object, respectively; an empty list means that any value is admissible in the corresponding position.

According to the claimed invention, the elementary query is a template with three fields, each of which contains either a specific value for the subject, predicate, or object, or a mask meaning that any value is admissible in that position.

According to the claimed invention, the conversion of a binding request into a set of elementary queries consists in computing the Cartesian product of the sets given by the lists that make up the binding request.

According to the claimed invention, each elementary query contains a record identifying the binding request that generated it.
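
The decomposition of a binding request into elementary queries can be sketched as a Cartesian product over the three lists. The names `ANY`, `ElementaryQuery`, and `decompose`, and the use of `None` as the mask value, are illustrative assumptions; the source only states that an empty list means "any value" and that each elementary query carries a record identifying its parent request.

```python
from itertools import product
from typing import NamedTuple, Optional

ANY = None  # hypothetical mask value: any value is admissible in this position

class ElementaryQuery(NamedTuple):
    request_id: int          # identifies the binding request that generated it
    s: Optional[int]
    p: Optional[int]
    o: Optional[int]

def decompose(request_id, subjects, predicates, objects):
    """Turn a binding request (three lists) into elementary queries."""
    # An empty list means "any value", so it contributes a single mask entry.
    axes = [list(vals) if vals else [ANY]
            for vals in (subjects, predicates, objects)]
    return [ElementaryQuery(request_id, s, p, o) for s, p, o in product(*axes)]

# Two admissible subjects, any predicate, three admissible objects
queries = decompose(7, subjects=[1, 2], predicates=[], objects=[10, 11, 12])
```

Under these assumptions the number of elementary queries is the product of the lengths of the non-empty lists, here 2 × 3 = 6.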

According to the claimed invention, elementary queries are processed on the GPU pipeline in three stages. At each stage, for every elementary query received from the previous stage, a set of elementary queries for the next stage is computed, and in each of the resulting queries the field corresponding to the current stage is assigned a specific value such that, in combination with the values of the fields corresponding to the previous stages, it forms an admissible combination.

According to the claimed invention, when queries are processed on the GPU pipeline, binding by predicate, binding by subject, and binding by object are performed in sequence.

According to the claimed invention, at each stage of the GPU pipeline, a GPU expansion procedure first computes, for each elementary query in the input buffer, a set of elementary queries for the next stage and writes them to an intermediate buffer in which space is allocated for the maximum possible number of queries, checking for each resulting query whether it contains an admissible combination of fields; a GPU discard procedure then moves the queries that passed the check to the output buffer, which also serves as the input buffer of the next stage.
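
The two-step structure of a stage (expansion into a fixed-size intermediate buffer, then discarding of invalid slots) can be emulated sequentially as below. This is only a sketch under stated assumptions: the real procedures run as GPU kernels, and the names `expand`, `discard`, `candidates_for_subject_stage`, the `subjects_of` lookup, and the use of `None` for both the mask and an empty slot are hypothetical. The example shows the subject-binding stage.

```python
EMPTY = None  # marks an intermediate-buffer slot that holds no valid query

def expand(input_buffer, candidates_for, max_fanout):
    """Emulate the expansion step: each input query owns max_fanout slots."""
    intermediate = [EMPTY] * (len(input_buffer) * max_fanout)
    for i, query in enumerate(input_buffer):      # on a GPU: one thread per query
        # max_fanout is assumed to be an upper bound, so nothing is truncated
        for j, cand in enumerate(candidates_for(query)[:max_fanout]):
            intermediate[i * max_fanout + j] = cand
    return intermediate

def discard(intermediate):
    """Emulate the discard step: compact valid queries into the output buffer."""
    return [q for q in intermediate if q is not EMPTY]

# Hypothetical lookup: predicate -> subjects that form an admissible pair with it
subjects_of = {5: [1, 2], 6: [3]}

def candidates_for_subject_stage(query):
    rid, s, p, o = query
    subs = subjects_of.get(p, [])
    if s is None:                                 # mask: every admissible subject
        return [(rid, x, p, o) for x in subs]
    return [query] if s in subs else []           # fixed subject: keep or drop

input_buffer = [(0, None, 5, None), (1, 9, 6, None), (2, 3, 6, 4)]
intermediate = expand(input_buffer, candidates_for_subject_stage, max_fanout=2)
output_buffer = discard(intermediate)
```

Reserving a fixed number of slots per input query lets every thread write without coordinating with the others, at the cost of a sparsely filled intermediate buffer that the discard step then compacts.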

According to the claimed invention, after the GPU finishes the current processing stage, the next stage to be processed is chosen so that it uses the maximum number of GPU threads.

In addition, to solve this problem, the claimed invention also proposes an RDF database system based on a graphics processing unit (GPU) for parallel processing of multiple queries, comprising:

- a server for receiving binding requests to the database, on which a binding request is converted into a set of elementary queries,

- a database,

- a graphics processing unit configured to process queries to the database in parallel and comprising

- a memory and

- a GPU pipeline configured to compute a set of elementary answers for each elementary query and then merge the answers obtained into a list of triples that constitutes the answer to the binding request to the database.

According to the claimed invention, the pipeline of the GPU consists of three sequentially executed stages of elementary query processing.

According to the claimed invention, the stages of elementary query processing on the GPU pipeline are binding by predicate, binding by subject, and binding by object.

According to the claimed invention, at each stage of the GPU pipeline, an elementary query generates zero or more queries for the next stage.

According to the claimed invention, one part of the GPU memory is intended for storing the database, and another part is set aside for buffers holding elementary queries awaiting processing at one of the stages of the GPU pipeline, and for intermediate buffers.

According to the claimed invention, the buffers for storing elementary queries are organized as a ring (circular) queue.
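
A ring queue over a fixed-size buffer can be sketched as follows. The class name `RingQueue` and its block-oriented methods are hypothetical, and the sketch runs on the host; the source only states that the buffers form a ring queue and that queries are loaded in blocks.

```python
class RingQueue:
    """Fixed-capacity FIFO that reuses its storage by wrapping indices."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0        # index of the oldest stored item
        self.size = 0        # number of stored items

    def free(self):
        """Number of slots still available for writing."""
        return len(self.buf) - self.size

    def push_block(self, items):
        items = list(items)
        if len(items) > self.free():
            raise OverflowError("not enough free space in the ring queue")
        for item in items:
            self.buf[(self.head + self.size) % len(self.buf)] = item
            self.size += 1

    def pop_block(self, n):
        """Remove and return up to n of the oldest items."""
        n = min(n, self.size)
        out = [self.buf[(self.head + k) % len(self.buf)] for k in range(n)]
        self.head = (self.head + n) % len(self.buf)
        self.size -= n
        return out

q = RingQueue(4)
q.push_block([1, 2, 3])
first = q.pop_block(2)       # frees two slots at the front
q.push_block([4, 5, 6])      # the write position wraps around the buffer end
rest = q.pop_block(4)
```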

According to the claimed invention, the database is stored in memory as a hierarchy of balanced binary trees, at the base of which is a predicate tree; each of its nodes corresponds to a particular predicate and refers to a tree of the subjects that form an admissible pair with that predicate.

According to the claimed invention, when the database is stored in memory, balanced binary trees of subjects are used; each node of such a tree corresponds to a particular subject and refers to a tree of the objects that form an admissible triple with that subject and with the predicate to which the subject tree belongs.
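
The predicate, subject, object hierarchy can be approximated on the host with nested lookups. Note the substitution: this sketch uses Python dictionaries for the predicate and subject levels and a sorted list with binary search for the object level, as a stand-in for the balanced binary trees named in the source. The names `build_index` and `contains` are hypothetical.

```python
from bisect import bisect_left

def build_index(triples):
    """Build {predicate: {subject: sorted list of objects}} from (s, p, o) triples."""
    index = {}
    for s, p, o in triples:
        index.setdefault(p, {}).setdefault(s, set()).add(o)
    return {p: {s: sorted(objs) for s, objs in subs.items()}
            for p, subs in index.items()}

def contains(index, s, p, o):
    """Check a triple by descending predicate -> subject -> object."""
    objs = index.get(p, {}).get(s, [])
    i = bisect_left(objs, o)          # binary search, as in a balanced tree lookup
    return i < len(objs) and objs[i] == o

# Hypothetical integer-coded triples (s, p, o)
index = build_index([(1, 5, 10), (1, 5, 11), (2, 5, 10), (1, 6, 12)])
```

The ordering of the levels matches the order in which the pipeline binds fields: a predicate is resolved first, then a subject under that predicate, then an object under that pair.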

For a better understanding of the claimed invention, a detailed description follows with reference to the drawings.

Fig. 1 - Scheme for processing a binding request.

Fig. 2 - Scheme for processing a SPARQL query.

Fig. 3 - Structure of a pipeline stage.

Fig. 4 - Flowchart of the generation procedure.

Fig. 5 - Flowchart of the discard procedure.

Table 1. Stages of the GPU pipeline.

Table 2. Results of stage 2 depending on the input data.

The scheme of the method for processing a binding request is shown in Fig. 1. When a request 101 arrives at the database server, it is converted into a set of elementary queries 102. An elementary query is essentially a template with three fields. Each field either specifies a particular value of the subject (predicate, object) or contains a mask meaning that any value may appear in that position. The conversion amounts to nothing more than computing the Cartesian product of the lists that make up the request. Each elementary query contains a record identifying the binding request that generated it.

Elementary queries from different binding requests (and even from different clients) enter a common input queue 103. From there they are loaded in blocks into GPU memory and enter the GPU pipeline 105. The output of the GPU pipeline 105 is a set of elementary answers. Each answer is the same three-field template, but now all of its fields are fixed (they hold specific values). The presence of such an answer in the output queue 107 means that the corresponding triple was found in the database and satisfies the conditions of the binding request. The number of elementary answers thus depends on how many such triples were found.

The list of triples 108 obtained from the elementary answers that correspond to one and the same binding request is the answer to that request.
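
Collecting elementary answers into per-request lists of triples can be sketched as a simple grouping by the identification record. The tuple layout `(request_id, s, p, o)` and the function name `collect_answers` are assumptions for illustration.

```python
from collections import defaultdict

def collect_answers(output_queue):
    """Group elementary answers by the binding request that generated them."""
    by_request = defaultdict(list)
    for request_id, s, p, o in output_queue:
        by_request[request_id].append((s, p, o))
    return dict(by_request)

# Hypothetical output queue holding answers of two interleaved binding requests
output_queue = [(0, 1, 5, 10), (1, 2, 5, 10), (0, 1, 5, 11)]
answers = collect_answers(output_queue)
```

A binding request that matched nothing simply has no entry in the result, which corresponds to an empty list of triples.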

Fig. 2 shows the prior-art technique for processing a SPARQL query 201. From the given set of templates, a formula is derived over a different type of template, likewise joined by AND and OR. Here each template contains, in each position, either a set of admissible values or a special mask meaning that any value is admissible in that position. Each such template 202 is in effect the simplest query for retrieving data from the database. The result of executing it is a set of specific triples 203 retrieved from the database. The formula is then evaluated over these sets (AND is replaced by set intersection and OR by set union) to obtain the resulting set of triples 204.

The operating principle of the GPU pipeline is illustrated in Fig. 3 and Table 1.

In the claimed invention, the GPU is a device whose computing power can be used for general-purpose computing by means of OpenCL (Open Computing Language) technology.

The GPU pipeline consists of three stages (see Table 1). At any given moment the graphics processor is busy with only one stage, but it processes several elementary queries at once. At that point it makes no difference to the processor which binding request originally generated a given elementary query: the set of queries is completely homogeneous.

At each stage, every elementary query may generate zero or more queries for the next stage. As a result, for each elementary query that entered the input queue, the pipeline produces zero or more elementary answers, which are written to the output queue.
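
The way each stage turns one query into zero or more queries can be emulated end to end on the host. This is a sequential sketch, not the GPU implementation: the nested-dictionary index `{p: {s: [o, ...]}}`, the tuple layout `(request_id, s, p, o)`, the mask value `ANY`, and the names `bind` and `run_pipeline` are all assumptions.

```python
ANY = None  # hypothetical mask value

def bind(value, available):
    """Return the admissible concrete values for one field (zero or more)."""
    if value is ANY:
        return list(available)
    return [value] if value in available else []

def run_pipeline(queries, index):
    """queries: (request_id, s, p, o) tuples; index: {p: {s: [o, ...]}}."""
    # Stage 1: binding by predicate
    stage1 = [(rid, s, p2, o)
              for rid, s, p, o in queries for p2 in bind(p, index)]
    # Stage 2: binding by subject, under the already fixed predicate
    stage2 = [(rid, s2, p, o)
              for rid, s, p, o in stage1 for s2 in bind(s, index[p])]
    # Stage 3: binding by object; every field is now concrete
    return [(rid, s, p, o2)
            for rid, s, p, o in stage2 for o2 in bind(o, index[p][s])]

index = {5: {1: [10, 11], 2: [10]}, 6: {1: [12]}}
queries = [(0, 1, ANY, ANY), (1, ANY, 5, 10), (2, 3, 6, ANY)]
elementary_answers = run_pipeline(queries, index)
```

In this example the first query fans out to three answers, the second to two, and the third is dropped at the subject stage because subject 3 never occurs with predicate 6.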

Whenever a stage finishes processing, it must be decided which stage will occupy the graphics processor next. To use the graphics processor's capacity as efficiently as possible, the stage with the greatest potential in terms of the number of threads is chosen. This depends both on the number of queries in the stage's input buffer and on the amount of free space in its output buffer.
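
One possible reading of this selection rule is sketched below. The source states only that the choice depends on the number of pending input queries and on the free output space; the concrete formula `min(pending, out_free // max_fanout)`, the dictionary keys, and the name `pick_next_stage` are assumptions made for illustration.

```python
def pick_next_stage(stages):
    """Return the index of the stage that can launch the most threads, or None.

    stages: list of dicts with hypothetical keys
      'pending'    - queries waiting in the stage's input buffer
      'out_free'   - free slots in the stage's output buffer
      'max_fanout' - maximum number of output queries per input query
    """
    def potential(stage):
        # A query can be launched only if its worst-case output fits
        return min(stage["pending"], stage["out_free"] // stage["max_fanout"])

    best = max(range(len(stages)), key=lambda i: potential(stages[i]))
    return best if potential(stages[best]) > 0 else None

stages = [
    {"pending": 100, "out_free": 40, "max_fanout": 4},    # potential 10
    {"pending": 30, "out_free": 1000, "max_fanout": 8},   # potential 30
    {"pending": 0, "out_free": 500, "max_fanout": 2},     # potential 0
]
chosen = pick_next_stage(stages)
```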

The GPU consists of a graphics multiprocessor and a certain amount of memory. This memory is divided into local memory (for individual processors and groups of processors) and global memory. Most of the memory is global memory, which holds data that any processor may need to access.

With the claimed method, part of the global memory is allocated to the database itself (the database may reside entirely on one GPU or be split into segments distributed across several devices), and another part is allocated to buffers that hold elementary requests awaiting processing at one of the pipeline stages. It is important to note that all the data needed to answer any request resides permanently in GPU memory, because loading data from main memory into GPU memory is very expensive.

The described pipeline performs binding sequentially: first by predicate, then by subject and, finally, by object. The database must therefore be stored in memory in a way that makes the following operations fast.

1) Check whether a given predicate p is valid.

2) Get the list of predicates in the database.

3) Check whether a given predicate-subject pair (p, s) is valid.

4) Get the list of subjects that form a valid pair with the predicate p.

5) Check whether a given triple (p, s, o) is valid.

6) Get the list of objects that form a valid triple with a given pair (p, s).

These operations are implemented efficiently by storing the data as a hierarchy of balanced binary trees. At the base of the hierarchy is the predicate tree. Each of its nodes corresponds to one valid predicate and refers to a tree of the subjects that form a valid pair with that predicate. Each node of a subject tree, in turn, refers to a tree of the objects that form a valid triple with that subject and the parent predicate. A binary tree is stored as a linear array of nodes in which the children of node n have the numbers 2*n (left) and 2*n+1 (right). Finding a node in such a tree takes O(log N) time, where N is the number of nodes in the tree.
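A minimal sketch of this layout is given below. It assumes integer identifiers for predicates, subjects and objects and 1-based node numbering (so that the children of node n are 2*n and 2*n+1); the function names `build_tree`, `find`, `build_store` and `has_triple` are hypothetical and do not come from the patent.

```python
def build_tree(items):
    """Lay out (key, payload) pairs, sorted by key, as a balanced search tree
    in a linear array. Slot 0 is unused so that node n has children 2n and 2n+1."""
    count = len(items)
    nodes = [None] * (count + 1)
    source = iter(items)

    def fill(n):
        # In-order fill of a complete tree yields a valid search tree.
        if n > count:
            return
        fill(2 * n)
        nodes[n] = next(source)
        fill(2 * n + 1)

    fill(1)
    return nodes


def find(nodes, key):
    """Return the node number holding key, or 0 if absent; O(log N) steps."""
    n = 1
    while n < len(nodes):
        node_key = nodes[n][0]
        if key == node_key:
            return n
        n = 2 * n if key < node_key else 2 * n + 1
    return 0


def build_store(triples):
    """Build the predicate -> subject -> object hierarchy from (p, s, o) triples."""
    grouped = {}
    for p, s, o in triples:
        grouped.setdefault(p, {}).setdefault(s, set()).add(o)
    predicate_items = []
    for p in sorted(grouped):
        subject_items = []
        for s in sorted(grouped[p]):
            object_tree = build_tree([(o, None) for o in sorted(grouped[p][s])])
            subject_items.append((s, object_tree))
        predicate_items.append((p, build_tree(subject_items)))
    return build_tree(predicate_items)


def has_triple(store, p, s, o):
    """Operation 5 from the list above: is (p, s, o) present?"""
    i = find(store, p)
    if not i:
        return False
    subjects = store[i][1]
    j = find(subjects, s)
    if not j:
        return False
    return find(subjects[j][1], o) != 0
```

Operations 1 and 3 correspond to stopping after the first or second `find` call, and the list operations (2, 4, 6) amount to enumerating the nodes of the corresponding array.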

The pipeline requires the following set of buffers:

1) Elementary requests awaiting processing at stage 1 (new elementary requests from main memory are loaded into the free part of the same buffer; on some architectures this can happen in parallel with data processing by the graphics processor).

2) Elementary requests awaiting processing at stage 2 (elementary requests output by stage 1 go into the free part of the same buffer).

3) Elementary requests awaiting processing at stage 3 (elementary requests output by stage 2 go into the free part of the same buffer).

4) Elementary answers (elementary requests output by stage 3 go here; since all their fields are already fixed, they can be called elementary answers. From here they are unloaded to main memory; on some architectures this can happen in parallel with data processing by the graphics processor).

5) Intermediate buffer (used for temporary storage of data while a stage is running).

Each of the buffers, except the intermediate one, is a ring queue (a queue with a limit on the number of elements it can hold at once). The current positions of the "head" and the "tail" of the queue are known. When the next pipeline stage is launched, the maximum possible number of elements is taken from the "head" of the input queue, and space for the stage's results is allocated at the "tail" of the output queue.
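A ring queue with block-wise reads from the head and block-wise writes at the tail might look like the sketch below. The class name `RingQueue` and its methods are illustrative assumptions; a GPU implementation would operate on a preallocated device buffer rather than a Python list.

```python
class RingQueue:
    """Fixed-capacity circular queue with block push and block pop."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.slots = [None] * capacity
        self.head = 0   # index of the oldest element
        self.count = 0  # number of stored elements

    @property
    def free(self):
        return len(self.slots) - self.count

    def push_block(self, items):
        """Write items at the tail, wrapping around the end of the array."""
        if len(items) > self.free:
            raise OverflowError("not enough free space in the queue")
        capacity = len(self.slots)
        tail = (self.head + self.count) % capacity
        for k, item in enumerate(items):
            self.slots[(tail + k) % capacity] = item
        self.count += len(items)

    def pop_block(self, max_items):
        """Remove and return up to max_items elements from the head."""
        capacity = len(self.slots)
        n = min(max_items, self.count)
        block = [self.slots[(self.head + k) % capacity] for k in range(n)]
        self.head = (self.head + n) % capacity
        self.count -= n
        return block
```

In this sketch a stage would call `pop_block` on its input queue and `push_block` on its output queue, with the block size bounded by the output queue's `free` value.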

Stage structure.

Consider stage 2 as an example (the others work in a similar way). Each elementary request in the input buffer of this stage contains a template in which a specific predicate value is given.

We say that a subject s satisfies a predicate p (forms a valid pair with it) if the database contains at least one triple of the form (p, s, o), i.e. there is at least one object that completes this pair to a full triple. The exact number of child requests generated by one elementary request depends on how many subjects satisfy the predicate specified in it, and also on the template. All possible cases are listed in Table 2 (the "object" field is of no interest at this stage and is therefore marked with a question mark).

The structure of one pipeline stage is shown in Fig. 3. Processing takes place in three steps:

1) Generation.

2) Offset calculation.

3) Dropping.

In the first step, the generation kernel is launched. The number of threads is determined by how much free space there is in output buffer 307. The flowchart of the generation kernel (for stage 2) is shown in Fig. 4 (steps 401-411). Each thread 303 generates one elementary request 305, for which space is allocated in intermediate buffer 304. Initially each thread 303 has only one parameter (a number) that distinguishes it from the others: the so-called global ID. From this ID the thread computes which request 302 in input buffer 301 will be its parent, and the template from the parent request is copied into the new request.

This template already specifies the predicate unambiguously (in the case of stage 2). A certain number of subjects in the database satisfy this predicate. Using the global ID, each thread is assigned a particular subject from this list. The formula for determining the parent request is such that the number of threads 303 that identify themselves as children of a given input request 302 equals the number of subjects in this list.
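The description does not state the parent-selection formula itself. One way to obtain the stated property, shown here purely as an assumption, is to take the running totals of the per-request subject counts and let each thread binary-search its global ID in them. The helper names `make_ends` and `locate` are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def make_ends(child_counts):
    """Running totals: ends[i] is the number of children of requests 0..i."""
    return list(accumulate(child_counts))


def locate(global_id, ends):
    """Map a thread's global ID to (parent request index, subject index).

    Assumed scheme, not taken from the patent: request i owns the ID range
    [ends[i-1], ends[i]), so it gets exactly as many threads as it has subjects.
    """
    parent = bisect_right(ends, global_id)
    start = ends[parent - 1] if parent else 0
    return parent, global_id - start
```

For subject counts `[2, 0, 3]`, global IDs 0 and 1 map to request 0, IDs 2 to 4 map to request 2, and request 1, whose predicate has no subjects, receives no threads.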

Thus, each thread 303 creates in intermediate buffer 304 a child request 305 with a template in which the predicate and the subject are fixed. It remains to determine whether this request 305 is valid. A request is valid if the "subject" position of the parent template either holds a mask that allows any value or holds a specific subject value that matches the one selected for the thread. All other generated requests are marked as invalid.

To compact the array of generated requests, the well-known prefix sum algorithm is used. For each valid request, 1 is written to the corresponding cell of a dedicated offset array, and for each invalid request, 0. Then the prefix sum kernel is launched with a number of threads equal to the number of generated requests. As a result, each cell of the offset array holds the sum of the values of all preceding cells. This number can be used as the offset of the corresponding request in the output buffer.

Finally, in the third step, the drop kernel is launched with a number of threads 306 equal to the number of generated requests 303. The flowchart of the drop kernel is shown in Fig. 5 (steps 501-506). Each thread uses its global ID to determine the offset of the corresponding request 305 in intermediate buffer 304. If request 305 is valid, the thread copies it to output buffer 307 at the offset read from the offset array.
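The offset calculation and drop steps can be illustrated with the sequential sketch below, in which each loop iteration stands in for one GPU thread. The function names `exclusive_scan` and `drop_invalid` are hypothetical, and a real kernel would compute the prefix sum in parallel and write into a ring-queue output buffer rather than a fresh list.

```python
def exclusive_scan(flags):
    """Return (offsets, total), where offsets[i] is the sum of flags[0..i-1]."""
    offsets = []
    total = 0
    for flag in flags:
        offsets.append(total)
        total += flag
    return offsets, total


def drop_invalid(intermediate, valid):
    """Compact the intermediate buffer, keeping only requests marked valid."""
    flags = [1 if v else 0 for v in valid]
    offsets, total = exclusive_scan(flags)
    output = [None] * total
    for global_id in range(len(intermediate)):  # one iteration per GPU thread
        if flags[global_id]:
            # Each valid request has a unique offset, so writes never collide.
            output[offsets[global_id]] = intermediate[global_id]
    return output
```

Because the offsets come from an exclusive prefix sum, the valid requests keep their relative order and land in consecutive output slots.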

It is easy to see that the procedures run on the GPU are quite simple, with few branches. At the same time, the number of threads launched in parallel is limited only by the number of requests submitted for processing and by the buffer sizes. The buffer size limit is not significant, since the memory needed by one thread equals the size of the elementary request structure, which is only about 30 bytes. Thus the method provides a higher degree of parallelism and a higher GPU load (given a sufficient number of input requests) than the method described in [3].

Nothing prevents the optimization proposed in [4] from being applied together with the claimed method, but its effect cannot be guaranteed in this case, since it is designed for a particular data retrieval method whose speed depends on the length of the search path, and shortening that path is the goal of the optimization.

Performance measurements of the described embodiment of the claimed method show that, for some kinds of requests, a threefold speedup can be obtained relative to the processing algorithm running on a modern central processor. As the number of processors in the graphics multiprocessor and the amount of memory on the GPU device grow, the benefit of the method will increase.

References

1. Program product for optimizing parallel processing of database queries (US 6009265, 12/28/1999)

2. Database early parallelism method and system (US 20050131893, 06/16/2005)

3. GPU enabled database systems (US 20110264626, 10/27/2011)

4. Method and server for handling database queries (WO 2011162645, 12/29/2011)

Claims

1. A method for parallel processing of multiple queries to RDF databases using a GPU, characterized in that a binding request received on the server is first divided into elementary requests, after which these elementary requests are placed in a common input queue, from where they are loaded in blocks into the memory of the GPU device and pass through the GPU pipeline, where for each elementary request a set of elementary answers is calculated, and the received answers are then combined into a list of triples, which is the response to the binding request.

2. The method according to claim 1, characterized in that the GPU device is configured to use its computing power for general purpose computing using the Open Computing Language technology.

3. The method according to claim 1, characterized in that the binding request is represented as a set of three lists, which contain the sets of valid values for the subject, predicate and object, respectively, and an empty list means that any value is allowed in the corresponding position.

4. The method according to claim 1, characterized in that the elementary query is a template with three fields, each of which contains either a specific value for the subject, predicate and object, or a mask, which means that any value is allowed in this position.

5. The method according to claim 1, characterized in that the conversion of the binding request into a set of elementary queries is performed by calculating the Cartesian product of the sets specified by the lists constituting the binding request.

6. The method according to claim 1, characterized in that each elementary request contains an entry for identification with the binding request that generated it.

7. The method according to claim 1, characterized in that the processing of elementary requests on the GPU pipeline is carried out in three stages, at each of which, for each elementary request received from the previous stage, a set of elementary requests for the next stage is calculated, and in each of the resulting requests a specific value is written to the field corresponding to the current stage, such that, in combination with the values of the fields corresponding to the previous stages, it forms a valid combination.

8. The method according to claim 7, characterized in that when requests are processed on the GPU pipeline, predicate binding, subject binding and object binding are performed sequentially.

9. The method according to claim 7, characterized in that at each stage of the GPU pipeline, first, using a generation procedure on the GPU, a set of elementary requests for the next stage is calculated for each elementary request from the input buffer and written to an intermediate buffer, where space is allocated for the maximum possible number of requests, and each resulting request is checked for whether it contains a valid combination of fields; then, using a drop procedure on the GPU, the requests that passed the check are moved to the output buffer, which is at the same time the input buffer for the next stage.

10. The method according to claim 7, characterized in that after the GPU device finishes the current data processing stage, the next stage to be processed is selected so that it uses the maximum number of GPU threads.

11. An RDF database system based on a GPU device for parallel processing of multiple queries, including a server for receiving binding requests to the database, where a binding request is converted into a set of elementary requests; a database; and a GPU device capable of parallel processing of requests to the database, containing memory and a GPU pipeline configured to calculate a set of elementary answers for each elementary request and subsequently combine the received answers into a list of triples, which is the response to the binding request to the database.

12. The system according to claim 11, characterized in that the pipeline of the GPU device consists of three sequentially executed stages of processing elementary requests.

13. The system according to claim 12, characterized in that the stages of processing elementary requests are predicate binding, subject binding and object binding.

14. The system according to claim 12, characterized in that at each stage of the GPU pipeline an elementary request generates zero or more requests for the next stage.

15. The system according to claim 12, characterized in that part of the memory of the GPU device is configured to store the database, and another part is configured to be used as buffers for storing elementary requests awaiting processing at one of the stages of the GPU pipeline, and for intermediate buffers.

16. The system according to claim 15, characterized in that the database is stored in memory in the form of a hierarchy of balanced binary trees, which is based on a predicate tree, each node of which corresponds to a specific predicate and refers to a tree of subjects that form a valid pair with this predicate.

17. The system according to claim 15, characterized in that when the database is stored in memory, balanced binary trees of subjects are used, each node of which corresponds to a specific subject and refers to a tree of objects forming a valid triple with this subject and the predicate to which the subject tree is subordinate.

18. The system according to claim 12, characterized in that each of the buffers for storing elementary requests is a ring queue.