CN117667994A - Data processing system, method and parallel processing module - Google Patents


Info

Publication number: CN117667994A
Application number: CN202310703441.7A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: data, blocks, memory, parallel processing, block
Legal status: Pending
Inventors: 龚瑞楠, 施云峰, 李峰, 张振祥
Current Assignee / Original Assignee: Alibaba China Co Ltd
Application filed by Alibaba China Co Ltd
Priority to CN202310703441.7A
Publication of CN117667994A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application provides a data processing system, a data processing method, and a parallel processing module. In the embodiment, data queries are offloaded to the parallel processing module for parallel querying, which improves query speed compared with conventional CPU-based data querying. When the host splits the data to be queried, it directly uses S times the memory page size as the split boundary and does not need to search for element boundaries, which improves splitting efficiency and thus the speed of subsequent queries. Furthermore, because splitting at S times the memory page size means splitting at memory page boundaries, the resulting data blocks are cache-line aligned; the parallel processing module therefore needs no cache-line alignment operation when querying a data block, which reduces the number of times a data block is copied and further improves query speed.

Description

Data processing system, method and parallel processing module
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing system, a data processing method, and a parallel processing module.
Background
With the development of information technology, data has grown explosively, and databases have been continuously developed and exploited. As the core and foundation of information technology, databases carry a multitude of critical data and, owing to their advantages, are widely used for data storage, management, maintenance, and querying.
A database is a computer software system that stores and manages data according to a data structure, and it often needs to provide different dimensions of data to different users. Data querying is therefore a basic function of a database. In some existing schemes, the central processing unit (Central Processing Unit, CPU) of the device where the query engine is located uses operators such as a Filter to perform data filtering and thereby implement data queries. This CPU-based query approach is relatively slow.
Disclosure of Invention
Aspects of the present application provide a data processing system, method, and parallel processing module for improving data query speed.
An embodiment of the present application provides a data processing system, including a host and a parallel processing module, the host being communicatively connected to the parallel processing module.
The host is configured to store the data to be queried corresponding to a query request into a continuous memory space of the host, and to split the data to be queried into a plurality of data blocks according to the target data amount that the parallel processing module supports processing. Data blocks adjacent in memory overlap head to tail by N memory pages; the data amount of every data block except the last one is S times the memory page size; the data amount of each data block is less than or equal to the target data amount; N and S are positive integers with S > N.
The parallel processing module reads a target data block from the plurality of data blocks into a continuous memory space of the parallel processing module.
The parallel processing module is configured to query the target data block in parallel to obtain its query result; the query result of the target data block is read from the parallel processing module into a result memory space preset for the target data block in the host; and, when the target data block is not the first of the plurality of data blocks, the query results of the N memory pages at the head of the target data block are deleted while its query results are being read.
An embodiment of the present application further provides a data processing method, including the following steps:
the host stores the data to be queried corresponding to a query request into a continuous memory space of the host;
the host splits the data to be queried into a plurality of data blocks according to the target data amount that the parallel processing module supports processing; data blocks adjacent in memory overlap head to tail by N memory pages; the data amount of every data block except the last one is S times the memory page size; the data amount of each data block is less than or equal to the target data amount; N and S are positive integers with S > N;
the parallel processing module reads a target data block from the plurality of data blocks into a continuous memory space of the parallel processing module;
the parallel processing module queries the target data block in parallel to obtain its query result; the query result of the target data block is read from the parallel processing module into a result memory space preset for the target data block in the host; and, when the target data block is not the first of the plurality of data blocks, the query results of the N memory pages at the head of the target data block are deleted while its query results are being read.
An embodiment of the present application further provides a parallel processing module, including a memory and a computing unit, the memory being electrically connected to the computing unit.
The parallel processing module is configured to be communicatively connected to a host and to read a target data block, from a plurality of data blocks stored in a continuous memory space of the host, into a continuous memory space of the memory. The plurality of data blocks are obtained by the host splitting the data to be queried corresponding to a query request according to the target data amount that the parallel processing module supports processing; data blocks adjacent in memory overlap head to tail by N memory pages; the data amount of every data block except the last one is S times the memory page size; the data amount of each data block is less than or equal to the target data amount; N and S are positive integers with S > N.
The computing unit is configured to query the target data block in parallel to obtain its query result.
The parallel processing module is further configured to read the query result of the target data block from the parallel processing module into a result memory space preset for the target data block in the host and, when the target data block is not the first of the plurality of data blocks, to delete the query results of the N memory pages at the head of the target data block while its query results are being read.
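As an illustration only, the block splitting and head-of-block result trimming described above can be sketched in Python. This is not the patented implementation: the byte string, the `predicate` filter, and the parameters `s` (pages per block), `n` (overlap pages), and `page` (page size in bytes) are stand-ins for the host memory, the Filter operator, S, N, and the memory page.

```python
PAGE = 4 * 1024  # assumed 4 kB memory page


def split_blocks(data: bytes, s: int, n: int, page: int = PAGE) -> list:
    """Split `data` into blocks of s pages each (the last may be shorter);
    blocks adjacent in memory overlap head to tail by n pages."""
    assert 0 < n < s
    step = (s - n) * page            # distance between adjacent block starts
    blocks, start = [], 0
    while True:
        blocks.append(data[start:start + s * page])
        if start + s * page >= len(data):
            return blocks            # this block reached the end of the data
        start += step


def query_block(block: bytes, predicate, is_first: bool, n: int,
                page: int = PAGE) -> list:
    """Query one block; for a non-first block, drop the results of its
    first n pages, which repeat the tail of the previous block."""
    skip = 0 if is_first else n * page
    return [b for b in block[skip:] if predicate(b)]


def query(data: bytes, predicate, s: int, n: int, page: int = PAGE) -> list:
    """Full pipeline: split, query every block, concatenate trimmed results."""
    blocks = split_blocks(data, s, n, page)
    return [v for i, blk in enumerate(blocks)
            for v in query_block(blk, predicate, i == 0, n, page)]
```

With the head-of-block results dropped for every block but the first, each byte of an overlapped region contributes to the final result exactly once.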
In the embodiment of the present application, data queries are offloaded to the parallel processing module for parallel querying, which improves query speed compared with conventional CPU-based data querying. Because the query is offloaded to the parallel processing module, the load on the CPU of the device where the query engine is located is reduced, CPU resource consumption is lowered, and the probability of a CPU performance bottleneck is reduced. In addition, when the host splits the data to be queried, it directly uses S times the memory page size as the split boundary and does not need to search for element boundaries, which improves splitting efficiency and thus the speed of subsequent queries. Furthermore, because splitting at S times the memory page size means splitting at memory page boundaries, the resulting data blocks are cache-line aligned; the parallel processing module therefore needs no cache-line alignment operation when querying a data block, which reduces the number of times a data block is copied and further improves query speed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a schematic diagram of a data processing system according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a data query process according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a data distribution according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a data block splitting process according to an embodiment of the present application;
FIG. 5 is a schematic diagram of another data distribution according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a parallel processing module according to an embodiment of the present application.
Detailed Description
To make the purposes, technical solutions, and advantages of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art from the present disclosure without undue burden fall within the scope of the present disclosure.
In some conventional data query schemes, the CPU of the device where the query engine is located performs data filtering using a Filter operator or the like to implement data queries. When the CPU performs the query, on the one hand the software query speed is slow; on the other hand, the query occupies a large share of the device's CPU resources and easily causes a CPU performance bottleneck.
To improve data query efficiency, some schemes offload the data query function of the database to a heterogeneous resource device, which performs data filtering with a Filter operator or the like, thereby implementing and offloading the query. A heterogeneous resource device is a device or component that has computing resources in forms other than a CPU, such as a field-programmable gate array (Field-Programmable Gate Array, FPGA). This heterogeneous offloading of data queries improves query speed on the one hand; on the other hand, because the query is offloaded to the heterogeneous resource device, the load on the CPU of the device where the query engine is located is reduced, CPU resource consumption is lowered, and the probability of a CPU performance bottleneck is reduced.
The memory capacity of the computing device where the query engine resides (i.e., the host) differs from that of the heterogeneous resource device: typically, the memory of the heterogeneous resource device is relatively small while the memory of the host is relatively large. In conventional schemes, therefore, the data to be queried in host memory is split, along the element boundaries of the database, into a plurality of data blocks whose data amount does not exceed the memory capacity of the heterogeneous resource device, and the data blocks are then delivered to the heterogeneous resource device for querying.
Here, an element of the database is an attribute of a data object. For example, if the data object is a commodity, its elements may be, without limitation, the commodity's identifier, number, manufacturer, unit price, production date, and so on. In big data scenarios, elements vary in length. The host therefore has to traverse the data to be queried and search for element boundaries when cutting data blocks, so splitting is inefficient and the subsequent query speed suffers.
On the other hand, element boundaries often lie at non-page boundaries, so the resulting data blocks are often not cache line (Cacheline) aligned. During a Copy, an entire cache line is loaded at once from a cache-line-aligned address (typically 64 B); there is no finer-grained transfer. Therefore, before the heterogeneous resource device can query a data block, the block must first be copied into an idle buffer (Buffer), cache-line aligned in that buffer, and then the aligned block copied into the memory of the heterogeneous resource device. These multiple data copies clearly reduce query speed.
In some embodiments of the present application, to increase data query speed, data queries are offloaded to the parallel processing module for parallel querying, which improves query speed compared with conventional CPU-based data querying. Because the query is offloaded to the parallel processing module, the load on the CPU of the device where the query engine is located is reduced, CPU resource consumption is lowered, and the probability of a CPU performance bottleneck is reduced. In addition, when the host splits the data to be queried, it directly uses S times the memory page size as the split boundary and does not need to search for element boundaries, which improves splitting efficiency and thus the speed of subsequent queries. Furthermore, because splitting at S times the memory page size means splitting at memory page boundaries, the resulting data blocks are cache-line aligned; the parallel processing module therefore needs no cache-line alignment operation when querying a data block, which reduces the number of times a data block is copied and further improves query speed.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
It should be noted that: like reference numerals denote like objects in the following figures and embodiments, and thus once an object is defined in one figure or embodiment, further discussion thereof is not necessary in the subsequent figures and embodiments.
FIG. 1 is a schematic diagram of a data processing system according to an embodiment of the present application. With reference to FIG. 1, the data processing system includes: a host 10 and a parallel processing module 20.
In this embodiment, the host 10 is any computer device with computing, storage, and communication functions; for example, it may be a server, a computer, a mobile phone, or the like. The host may include one or more general-purpose processing units; this embodiment does not limit their number. Each general-purpose processing unit may be a single-core or a multi-core processing unit.
The general-purpose processing unit is typically a processing chip mounted on the motherboard of the host 10, such as the host's central processing unit (Central Processing Unit, CPU) 101, and cannot be expanded within a single machine. A general-purpose processing unit may be any processing device with computing capability, serial or parallel; for example, it may be a general-purpose processor such as a CPU. A parallel processing unit is a processing device that can perform parallel computation, such as a graphics processing unit (Graphics Processing Unit, GPU) or a field-programmable gate array (Field-Programmable Gate Array, FPGA). Optionally, the memory of the general-purpose processing unit is larger than the memory of the parallel processing unit. FIG. 1 illustrates the general-purpose processing unit as a CPU by way of example, but this is not limiting.
The parallel processing module 20 refers to any device or means having parallel computing and storage functions. Parallel processing module 20 may be a GPU or a programmable hardware device, etc.
The programmable hardware device may be a hardware processor built from electronic devices, or a hardware processor programmed for data processing with a hardware description language (Hardware Description Language, HDL). The hardware description language may be the Very High-Speed Integrated Circuit Hardware Description Language (VHDL), Verilog HDL, SystemVerilog, SystemC, and so on. Accordingly, the parallel processing module 20 may be an FPGA, a programmable array logic device (Programmable Array Logic, PAL), a generic array logic device (Generic Array Logic, GAL), a complex programmable logic device (Complex Programmable Logic Device, CPLD), or the like. Alternatively, the parallel processing module 20 may be an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC) or a data processing unit (Data Processing Unit, DPU). A DPU is a special-purpose electronic circuit with a hardware acceleration function, used for data-centric computation.
In the present embodiment, the parallel processing module 20 is communicatively connected to the host 10, specifically to the CPU 101 of the host 10, for example through a bus interface. The bus interface may be a serial bus interface, such as a Peripheral Component Interconnect Express (PCIe) bus interface, a Peripheral Component Interconnect (PCI) bus interface, an Ultra Path Interconnect (UPI) bus interface, a Universal Serial Bus (USB) serial interface, an RS485 interface, an RS232 interface, or the like. Preferably, the bus interface is a PCIe interface, which increases the data transfer rate between the parallel processing module 20 and the host 10.
The bus interfaces of the host 10 can be extended according to its specification, and the host 10 typically has a plurality of communication interfaces. In the embodiments of the present application, "a plurality" means more than one, i.e., two or more. When parallel processing modules 20 are communicatively connected to the host 10 through bus interfaces, there may be one or more of them, implemented in the same form or in partially or completely different forms. For example, in some embodiments all of the parallel processing modules 20 may be FPGAs, or all DPUs, ASICs, GPUs, or the like; in other embodiments, some may be FPGAs and others DPUs, ASIC chips, GPUs, or the like, but this is not limiting.
In some embodiments, the parallel processing module 20 and the host 10 may be disposed on different physical machines and connected through network communication; for example, they may be deployed in different cloud servers that communicate over a network. FIG. 1 illustrates the host 10 and the parallel processing module 20 on the same physical machine, but this is not limiting.
In actual use, a database typically separates computation from storage: storage nodes store the data, while compute nodes perform the computing operations of the query process, such as aggregation. To increase query speed, some databases also deploy cache nodes that hold hot data queried with high frequency. Storage nodes generally store data on disk, whereas cache nodes use memory or solid-state drives, so a cache node reads data faster than a storage node. A cache node and a compute node may be implemented on the same physical machine or on different physical machines.
In this embodiment, the host 10 may be a physical machine where a computing node is located, or may be a physical machine where a cache node is located, or may be a physical machine where a query engine of a data source is located, etc. The data source is for storing data. The data source may be a database or a server where the database resides. The database may be an analytical database, a relational database, or other type of database. In this embodiment, the host 10 is deployed with a query engine that can obtain a query request for a data source. The query request may include filter criteria. The filtering conditions are used to define the query scope and the conditions of the final data screened out. For data sources in column storage format, the filter criteria define the scope of the query in terms of columns to be queried. In some embodiments, the data source stores data in a parallel file format such as par.
In this embodiment, the host 10 may obtain the query request; and obtaining the filtering condition from the query request, and then reading the data to be queried from the data source according to the filtering condition. In some embodiments, the data source stores data in a column storage file format, and the host 10 may read, as the data to be queried, a column to be queried corresponding to the query request from the data source according to the filtering condition. That is, the host 10 reads the column to be queried defined by the filter conditions from the data source as data to be queried. Further, the host 10 may store the data to be queried to a contiguous memory space of the memory 102 of the host 10.
Because the memory storage format differs from the data source's storage format, the host 10 may also convert the data to be queried (i.e., the column to be queried) into a column storage format supported by memory, such as the Arrow format; Arrow is a column storage format supported by memory. Accordingly, the host 10 may store the column to be queried, in a memory-supported column storage format (e.g., the Arrow format), into a continuous memory space of the memory 102 of the host 10.
In this embodiment, to relieve the CPU of the host 10 and reduce the probability of the CPU reaching its performance bottleneck, the host 10 may map the parallel processing module 20 to a virtual device of the host through virtualization technology, and offload data processing functions to the parallel processing module 20 via the mapped virtual device. With its strong parallel computing capability, the parallel processing module 20 can increase data processing speed.
In the present embodiment, the parallel processing module 20 includes: a memory 201. The memory 201 of the parallel processing module 20 is typically smaller than the memory 102 of the host 10. To conserve memory resources, memory 201 and memory 102 both store data in contiguous memory spaces.
In this embodiment, the parallel processing module 20 further includes one or more computing units 202, where "more" means two or above. A Computing Unit (CU) 202 is the elementary unit in which the computing power of the parallel processing module 20 is measured; the computing units 202 included in a parallel processing module 20 together provide its parallel computing power. A computing unit 202 may be a hardware processor built from electronic devices or a computing unit coded in a hardware description language. To increase parallel computing power, the parallel processing module 20 generally includes a plurality of computing units 202. The present application takes a parallel processing module 20 that includes a plurality of computing units 202 as an example to describe the data processing method provided by the embodiments of the present application.
In this embodiment, the parallel processing module 20 may read a portion of the data stored in succession from the data to be queried stored in the host 10 to the continuous memory space of the memory 201 of the parallel processing module 20. The data amount of the read part of the continuously stored data is smaller than or equal to the memory capacity of the parallel processing module 20 and does not exceed the target data amount that the parallel processing module 20 supports processing. The target data amount that the parallel processing module 20 supports for processing is used to characterize the data processing capability of the parallel processing module 20, and may be the maximum data amount that the parallel processing module 20 can process, where the maximum data amount does not exceed the memory capacity of the parallel processing module 20.
In the present embodiment, the specific data amount of the partial, continuously stored data read by the parallel processing module 20 is not limited. In some embodiments, the parallel processing module 20 may read the continuously stored data in batches, in increasing order of offset address in the memory 102, into the continuous memory space of its memory 201, according to the target data amount it supports processing. The data read in each batch is the portion of continuously stored data corresponding to that batch, and the data amount of each batch is less than or equal to the target data amount.
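As a minimal sketch (not the patented implementation), reading the continuously stored data in batches of at most the target data amount, in increasing order of offset address, might look like this; `host_mem` and `target` are stand-ins for the contiguous region of memory 102 and the target data amount:

```python
def read_batches(host_mem: bytes, target: int):
    """Yield (offset, data) batches of at most `target` bytes, walking the
    contiguous region in increasing order of offset address."""
    for offset in range(0, len(host_mem), target):
        yield offset, host_mem[offset:offset + target]
```

Each yielded batch is the continuously stored portion corresponding to the current round and never exceeds the target data amount.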
In some embodiments, in conjunction with FIG. 1 and FIG. 2, the host 10 may split the data to be queried into a plurality of data blocks according to the target data amount that the parallel processing module 20 supports processing, where "a plurality" means two or more. Preferably, the number of data blocks is an integer multiple of the number of parallel processing modules 20.
The data amount of every data block except the last one is S times the memory page size, where S is a positive integer, generally S ≥ 2, and in some embodiments S ≥ 3. The memory offset addresses within one data block are continuous, and the last of the plurality of data blocks is the one with the largest starting memory offset address. The data amount of the last data block is determined by the total amount of data to be queried and may or may not be an integer multiple of the memory page size. In general, the target data amount that the parallel processing module 20 supports processing is characterized by the maximum number of memory pages that the parallel processing module 20 can process; accordingly, S may be determined by the data processing capability of the parallel processing module 20, i.e., by that target data amount.
For example, let the target data amount that the parallel processing module 20 supports processing be denoted Split_len, where Split_len equals S times the memory page size; if the memory page is 4 kB, Split_len = S × 4 kB. Each data block may be marked with a sequence number n, n = 0, 1, 2, 3, …. The starting memory offset address at which data block n is split off is: Split_len × n − N × n × memory page size. Accordingly, the data to be queried can be split into a plurality of data blocks according to these starting memory offset addresses.
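The starting-offset arithmetic can be checked with a few lines of Python. Note an assumption: the patent's printed formula reads "Split_len × n − n × memory page size", and this sketch reads it as eliding the overlap factor N (i.e., as Split_len × n − N × n × page size), which is the form consistent with an N-page head-to-tail overlap between adjacent blocks.

```python
PAGE = 4 * 1024  # assumed 4 kB memory page


def block_start(n: int, s: int, overlap: int, page: int = PAGE) -> int:
    """Starting memory offset of data block n when each block spans s pages
    (Split_len = s * page) and adjacent blocks overlap by `overlap` pages."""
    split_len = s * page
    return split_len * n - overlap * n * page   # = n * (s - overlap) * page
```

For s = 8 and overlap = 2, consecutive blocks start (8 − 2) × 4 kB = 24 kB apart, so each block begins exactly `overlap` pages before the previous block ends.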
In other embodiments, there are a plurality of parallel processing modules 20. To increase query speed, the plurality of parallel processing modules 20 may query the data to be queried in parallel. In this case, the host 10 may split the data to be queried into a plurality of data blocks according to the target data amounts that the plurality of parallel processing modules 20 support processing. The data processing capabilities of the modules may be the same or different; that is, their supported target data amounts may be the same or different. In an embodiment where these target data amounts differ, the memory page multiple S of the data amount of the blocks other than the last block differs from module to module.
Because the elements of a database differ in length, the data amount of some elements is not an integer multiple of the memory page, so some element boundaries are not memory page boundaries; that is, the data of some elements spans memory pages. For example, as shown in fig. 3, if the memory page is 4 kB and the data amount of an element is 9 kB, the element spans 3 memory pages. If the slicing boundary of a data block is boundary a, the head of the next data block contains only partial data of the element (e.g., the last 1 kB of the 9 kB element), so a data query over the head of the next data block is invalid.
To solve this technical problem, referring to fig. 1 and fig. 2, in the embodiment of the present application, data blocks that are adjacent in memory location overlap by N memory pages end to end, where N is a positive integer and N < S. Data blocks adjacent in memory location are those with sequence numbers n and (n + 1). "Overlapping N memory pages end to end" means that the tail of the earlier data block (data block n) and the head of the later data block (data block n + 1) contain the same N memory pages of data.
The value of N is determined by the data amount Q of the element to which the data to be queried belongs. Specifically, N ≥ X, where X is the result of dividing Q by the memory page size P and rounding up, namely X = ⌈Q / P⌉. Q represents the data amount of the element to which the data to be queried belongs and is generally determined by the database, which specifies a maximum data amount for the element; for example, if the price of a commodity is specified not to exceed 7 kB, Q = 7 kB. P represents the memory page size, e.g., P = 4 kB, and ⌈·⌉ denotes rounding up. With Q = 7 kB and P = 4 kB, X = 2, so N ≥ 2. To reduce the amount of computation, N = X is generally taken; that is, data blocks adjacent in memory location overlap by X memory pages end to end.
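As a concrete illustration of the segmentation rule above, the following Python sketch (the function name and parameter defaults are hypothetical, not from the patent) computes N = ⌈Q / P⌉ and the start/end offsets of data blocks whose adjacent pairs overlap by N memory pages:

```python
import math

def split_with_overlap(total_len, page=4096, s=4, q=7 * 1024):
    """Split a buffer of total_len bytes into blocks of at most S memory
    pages, where adjacent blocks overlap by N = ceil(Q / P) pages."""
    n_overlap = math.ceil(q / page)       # N = ceil(Q / P); here 2
    assert n_overlap < s                  # the embodiment requires N < S
    split_len = s * page                  # block size: S memory pages
    step = split_len - n_overlap * page   # distance between block starts
    blocks, start = [], 0
    while start < total_len:
        end = min(start + split_len, total_len)
        blocks.append((start, end))       # half-open byte range
        if end == total_len:
            break
        start += step
    return n_overlap, blocks
```

With a 4 kB page, S = 4 and Q = 7 kB this yields N = 2, so each block begins 2 pages before the previous one ends; only the last block may be shorter than S pages.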
Referring to fig. 1 and fig. 2, after the plurality of data blocks is obtained, the parallel processing module 20 may read a target data block from the data to be queried stored in the host into the continuous memory space of the parallel processing module 20. The target data block read by a parallel processing module 20 is a contiguously stored portion of the data to be queried.
In this embodiment, the specific manner in which the parallel processing module 20 reads the target data block from the data to be queried stored in the host is not limited. Multiple parallel processing modules 20 may read data blocks in parallel from the data to be queried stored in the host into their respective continuous memory spaces, each parallel processing module 20 reading one data block at a time.
In this embodiment, as shown in fig. 1, the parallel processing module 20 may use direct memory access (Direct Memory Access, DMA) to read the target data block from the data to be queried stored in the host 10 into the continuous memory space of the parallel processing module 20.
Specifically, as shown in fig. 1, a DMA driver 103 is deployed on the host 10 side, and a DMA engine 203 is deployed on the parallel processing module 20. The DMA driver 103 is a software functional module, plug-in, or the like for driving the DMA engine 203. In this embodiment, after splitting the data to be queried into a plurality of data blocks, the host 10 may invoke the DMA driver 103 to drive the DMA engine 203. The DMA driver 103 may issue a data query instruction to the DMA engine 203 carrying the starting memory offset address and the length of the data block to be read, the data block to be read being the target data block corresponding to the parallel processing module 20. In response, the DMA engine 203 reads the target data block by DMA, according to its starting memory offset address and length, from the data to be queried stored in the host 10 into the continuous memory space of the parallel processing module 20.
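The instruction flow just described can be sketched with an illustrative host-side model (the class and function names are hypothetical), carrying only the two fields the text says the DMA driver passes to the DMA engine:

```python
from dataclasses import dataclass

@dataclass
class DmaQueryCommand:
    start_offset: int  # starting memory offset address of the block to read
    length: int        # length in bytes of the block to read

def build_commands(blocks):
    """Build one command per (start, end) block produced by the host's split."""
    return [DmaQueryCommand(start, end - start) for start, end in blocks]
```

Each command would then be handed to a DMA engine, which copies the addressed range into the module's contiguous memory.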
Referring to fig. 1 and fig. 2, the multiple parallel processing modules 20 may query their respectively read target data blocks in parallel to obtain the query results of the plurality of data blocks. In this embodiment of the present application, a parallel processing module 20 may query its read target data block asynchronously to obtain the data block's query result.
In an embodiment with a single parallel processing module 20, to increase the data query speed, the parallel processing module 20 may query the target data block asynchronously; that is, it may read the next data block without waiting for the query of the current data block to complete.
In the asynchronous query scenario, or in embodiments with multiple parallel processing modules 20, the order in which the queries of the data blocks complete cannot be determined. A result storage area may therefore be set in advance for each data block in the memory of the host 10, with the result storage areas of adjacent data blocks being adjacent. The host 10 may allocate a result storage area for each data block before, after, or while distributing the data blocks to the parallel processing modules 20; the size of a result storage area may equal the data amount of its data block. In this way, even though the host 10 cannot determine the order in which the parallel processing modules 20 return query results, it can store the query result of each data block into the result storage area preset for that data block.
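A minimal sketch of the host-side result-area layout just described (the function name is hypothetical): each block gets an area equal to its own size, and the areas of adjacent blocks are laid out back to back, so results can be written as they arrive, in any completion order:

```python
def allocate_result_areas(blocks):
    """Map block index -> (start, end) offsets inside one result buffer;
    adjacent blocks get adjacent areas, each the same size as its block."""
    areas, offset = {}, 0
    for n, (start, end) in enumerate(blocks):
        size = end - start
        areas[n] = (offset, offset + size)
        offset += size
    return areas
```

Because the destination of block n's result is fixed in advance, a late-arriving result simply lands in areas[n] regardless of completion order.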
Because data blocks adjacent in memory location overlap by N memory pages end to end, their query results also overlap by N memory pages. Since a data block is queried sequentially from the smallest memory offset address to the largest, the query results of the N tail memory pages of the earlier data block are correct, while the query results of the N head memory pages of the later data block may be invalid, because those pages may contain only the latter half of an element's data.
Based on this, the parallel processing module 20 may read the query result of the target data block from the parallel processing module 20 into the memory space preset by the host 10 for that target data block; specifically, the DMA engine in the parallel processing module 20 may do so by DMA.
Further, referring to fig. 1 and fig. 2, when the target data block is not the first of the plurality of data blocks, the parallel processing module 20 may delete the query results of the N head memory pages of the target data block while reading out its query results, thereby obtaining the query result of the contiguously stored data (e.g., data block n) read by the parallel processing module. In this way, the tail data of the previous data block extends across the block boundary, correcting the match failures at the head of the next data block that are caused by incomplete data due to data segmentation.
For example, assume the memory page is 4 kB and the data amount Q of the element to which the data to be queried belongs is at most 4 kB, so adjacent data blocks overlap by 1 memory page, i.e., 4 kB of data. The tail of data block n overlaps the head of data block (n + 1) by 1 memory page (4 kB), so the query result of the last 4 kB of data block n overlaps the query result of the first 4 kB of data block (n + 1). The query result of the first 4 kB of data block (n + 1) can therefore be deleted while reading out the query result of data block (n + 1); the tail data of the previous data block extends across the boundary, correcting the match failures at the head of the next data block caused by incomplete data due to data segmentation.
Correspondingly, if the target data block is the first of the plurality of data blocks, its query result can be read directly from the parallel processing module 20 into the memory space preset for it in the host 10. The query results of all the data blocks together form the query result of the data to be queried.
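The readback rule of the preceding paragraphs — keep the first block's result whole, drop the first N pages of every later block's result — can be sketched with a toy per-byte result model (the function name and the tiny page size in the example are illustrative only):

```python
def merge_block_results(results, n_overlap, page=4096):
    """Concatenate per-block query results, deleting the results of the
    N head memory pages of every block except the first; those pages were
    already queried correctly as the tail of the previous block."""
    merged = bytearray()
    for n, res in enumerate(results):
        merged += res if n == 0 else res[n_overlap * page:]
    return bytes(merged)
```

With 4-byte toy "pages" and one page of overlap, merging [b"AAAABBBB", b"BBBBCCCC"] keeps the first block intact and drops the duplicated head of the second.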
In this embodiment, offloading the data query to the parallel processing module for parallel querying improves the data query speed compared with conventional CPU-based querying. It also reduces the load on the CPU of the device where the query engine is located, lowering CPU resource consumption and the probability of a CPU performance bottleneck. In addition, when segmenting the data to be queried, the host takes S times the memory page directly as the segmentation boundary and does not need to search for element boundaries, improving segmentation efficiency and hence the subsequent query speed. Moreover, because the data is cut on memory page boundaries, the resulting data blocks are cache-line aligned, so the parallel processing module needs no cache-line alignment operation when querying them, reducing the number of data block copies and further improving the query speed.
In the above embodiment, the data query process is the same for each of the parallel processing modules 20; the internal data query process is described below using any one parallel processing module 20 as an example.
Referring to fig. 2 and 4, the parallel processing module 20 includes one or more computing units 202. The parallel processing module 20 may use multiple computing units 202 to query the read target data block in parallel, increasing the data query speed. To this end, for a contiguously stored portion Φ of the data in the memory of the parallel processing module 20, the parallel processing module 20 may divide Φ into a plurality of data sub-blocks whose data amounts are integer multiples of the memory page. The data amount of each data sub-block, except the last, is M times the memory page, where M is a positive integer, N < M < S, and generally M ≥ 2. The last data sub-block is the one with the largest memory offset address, and its data amount is determined by the size of the read portion Φ: with Y the data amount of Φ and P the memory page size, the last data sub-block holds the data of Y remaining after the other sub-blocks (each of M × P bytes) are cut off.
Considering that the element sizes in a database differ, the data amount of some elements is not an integer multiple of the memory page, so some element boundaries are not memory page boundaries; that is, some element data spans memory pages. As shown in fig. 3, the head of the next data sub-block then holds incomplete element data (e.g., the last 1 kB of the 9 kB element in fig. 3), so a data query over that head is invalid.
To solve this problem, in this embodiment, two data sub-blocks adjacent in memory location may be set to overlap by N memory pages end to end. For the value of N, see the related content of the above embodiment, which is not repeated here.
In some embodiments, the parallel processing module 20 may divide the contiguously stored portion it has read (e.g., data block n) into a plurality of data sub-blocks whose data amounts are integer multiples of the memory page, according to the data amount Q of the element to which the data to be queried belongs and the number K of computing units.
Specifically, the data amount Z of a data sub-block may be determined from the number K of computing units and the data amount Y of the contiguously stored portion Φ (data block n): Z = Y / K. Further, as shown in fig. 4, the target data block (e.g., data block n) may be divided by Z into a plurality of initial data sub-blocks (initial data sub-blocks 0 to (K − 1) in fig. 4), each of data amount Z. The number of initial data sub-blocks equals the number of data sub-blocks, both equal to the number K of computing units.
Further, the number N of memory pages by which two data sub-blocks adjacent in memory location overlap end to end may be determined from X = ⌈Q / P⌉, i.e., the data amount Q of the element to which the data to be queried belongs divided by the memory page size P and rounded up. Generally, N = X.
Further, as shown in fig. 4, for two initial data sub-blocks adjacent in memory location, the data of the N memory pages at the head of the later initial data sub-block may be appended to the tail of the earlier one, yielding the plurality of data sub-blocks (data sub-blocks 0 to (K − 1) in fig. 4). In fig. 4, the diagonally filled portion represents the N memory pages shared end to end by two adjacent data sub-blocks; the head of the later initial data sub-block still retains the data of those N memory pages. Thus two data sub-blocks adjacent in memory location overlap by N memory pages end to end, and the data amount of each data sub-block except the last is Z + N × P, where P is the memory page size.
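The sub-block construction of the last paragraphs can be sketched as follows (the function name is hypothetical): cut K equal initial sub-blocks of Z = Y / K bytes, then extend every sub-block except the last by the N = ⌈Q / P⌉ head pages of its successor:

```python
import math

def split_subblocks(block_len, k, q, page=4096):
    """Return K (start, end) ranges covering a block of block_len bytes,
    where every range except the last is Z + N*P bytes long."""
    z = block_len // k                # Z: initial sub-block size
    n = math.ceil(q / page)          # N: number of overlapped pages
    subs = []
    for i in range(k):
        start = i * z
        if i == k - 1:
            end = block_len          # last sub-block runs to the block end
        else:
            end = min((i + 1) * z + n * page, block_len)
        subs.append((start, end))
    return subs
```

For a 16-page block split among K = 4 units with Q = one page, each non-last sub-block spans its own 4 pages plus the first page of its successor.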
Since X = ⌈Q / P⌉, i.e., the data amount Q of the element to which the data to be queried belongs divided by the memory page size and rounded up, even if an element is stored across memory pages, the earlier of two data sub-blocks adjacent in memory location retains all of the element's data. This avoids match failures during subsequent queries caused by incomplete data at the head of the later data sub-block.
After the parallel processing module 20 has sliced out the multiple data sub-blocks, it may distribute them to its multiple computing units 202, which query the data sub-blocks in parallel to obtain the query results of the data sub-blocks.
In the embodiment of the present application, the specific manner in which the computing unit 202 queries a data sub-block is not limited. In some embodiments, the computing unit 202 may character-match the target data sub-block allocated to it against the filtering condition corresponding to the query request using a string matching algorithm, and mark the matching result of each character in the target data sub-block with a binary bit value, thereby obtaining the query result of the target data sub-block. The bit value is 0 or 1; different string matching algorithms use different bit values to mark a match.
In the embodiment of the present application, the specific string matching algorithm is not limited. Optionally, it may be the Shift-and algorithm, the Shift-or algorithm, the Sunday algorithm, the Rabin-Karp algorithm, etc. The Shift-and algorithm marks a match at each character of the target data sub-block with a binary 1; the Shift-or algorithm marks it with a binary 0.
The data query process over a target data sub-block is illustrated below using the Shift-and algorithm. Its general idea is to preprocess the pattern string (the string corresponding to the filtering condition) into a special encoded form and then match the text string (the target data sub-block) bit by bit using that encoding.
First, the string corresponding to the filtering condition is preprocessed and encoded with binary digits: each character is encoded as a binary number with as many bits as the string has characters, with a 1 at each bit position where the character occurs and 0 elsewhere. For example, if the string corresponding to the filtering condition is "abac", a occurs at bits 0 and 2, so its code is binary 0101; likewise, b occurs at bit 1, so its code is 0010; c occurs at bit 3, so its code is 1000.
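This preprocessing step can be sketched directly (the function name is illustrative): build, for each character of the pattern, a bitmask with bit i set exactly where the character occurs at position i:

```python
def build_masks(pattern):
    """Shift-and encoding: masks[ch] has bit i set iff pattern[i] == ch."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks
```

For the pattern "abac" this reproduces the codes in the text: a → 0101, b → 0010, c → 1000.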
Further, for each character of the target data sub-block, a state code D is defined: bit i of D is 1 if and only if, ending at this character, the text completely matches bits 0 to i of the string corresponding to the filtering condition. For example, let the string corresponding to the filtering condition be "acbace" and consider the second "a" in the text "…acbaef…". Ending at this "a", the text matches either bit 0 alone or the first 4 bits (bits 0 to 3) of the filtering-condition string, so the state code of this "a" is D = 2^0 + 2^3 = 9.
The initial state is D = 0. Suppose the previous state code D has been determined, e.g., D = 2^0 + 2^3 = 9 in the example above, meaning the first 1 or 4 characters of the filtering-condition string are completely matched. For the next text character text[i], it is then checked whether bit 1 or bit 4 of the filtering-condition string can also be matched. The new state code equals (D << 1) & code[text[i]], where D << 1 shifts the previous state code left by one bit, code[text[i]] is the binary code of the i-th character of the target string, and & is the bitwise AND.
In another case, if bit 0 of code[text[i]] is set, i.e., text[i] is exactly the character at bit 0 of the filtering-condition string, then bit 0 of the new D should also be 1. Combining the two cases, the new state code is D = ((D << 1) | 1) & code[text[i]], where | is the bitwise OR. Whenever bit (j − 1) of D is found to be 1 (j being the length of the filtering-condition string), the filtering condition is completely matched, i.e., the match succeeds.
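Putting the recurrence together, the following is a compact sketch of the whole Shift-and scan (illustrative Python, not the patent's hardware implementation) returning the 0-based end positions of matches:

```python
def shift_and(text, pattern):
    """Report every index i at which a full match of pattern ends, via the
    recurrence D = ((D << 1) | 1) & code[text[i]]; a set bit (j-1) means
    a complete match, j being the pattern length."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)   # bit j-1 of the state code
    d, hits = 0, []
    for i, ch in enumerate(text):
        d = ((d << 1) | 1) & masks.get(ch, 0)
        if d & accept:
            hits.append(i)
    return hits
```

Running it on the text "xxabacyy" with pattern "abac" reports a single match ending at index 5.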
The above takes the Shift-and algorithm as an example; the character matching between the target data sub-block and the filtering condition corresponding to the query request is not limited to it. The Shift-or algorithm has the same principle as Shift-and, except that it turns the core AND operation of Shift-and into an OR, saving the extra "| 1" of "(D << 1) | 1": the new state code is D = (D << 1) | code[text[i]], one bit operation (the "or 1") fewer than Shift-and. The Shift-or algorithm uses a binary 0 to indicate a character match.
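The Shift-or variant can be sketched the same way (illustrative): the masks are complemented, D starts as all ones, the update drops the "| 1", and a match is signalled by a zero at bit j − 1:

```python
def shift_or(text, pattern):
    """Shift-or scan: D = (D << 1) | code[text[i]], with a ZERO bit
    marking each matched prefix; bit j-1 clear means a full match."""
    m = len(pattern)
    full = (1 << m) - 1                # all-ones state, width m
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, full) & ~(1 << i)
    d, hits = full, []
    for i, ch in enumerate(text):
        d = ((d << 1) | masks.get(ch, full)) & full
        if not d & (1 << (m - 1)):     # zero at bit j-1: complete match
            hits.append(i)
    return hits
```

On the same inputs as the Shift-and sketch it reports the same end positions, only with the match polarity inverted internally.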
When a string matching algorithm is used to character-match the target data sub-block against the filtering condition corresponding to the query request, the query result of the target data sub-block is a binary string whose length equals that of the target data sub-block. For the Shift-and algorithm, if bit i of the query result is 1, the portion of the target data sub-block ending at character i matches the filtering condition; for the Shift-or algorithm, the same is indicated by bit i being 0.
The above describes the query process over the target data sub-blocks allocated within one parallel processing module 20; the multiple computing units 202 query the data sub-blocks in parallel to obtain the query results corresponding to the data sub-blocks.
Since the order in which the multiple computing units 202 in the parallel processing module 20 complete their queries cannot be determined, in this embodiment a result storage area may be set in advance in the parallel processing module 20 for each data sub-block, with the result storage areas of adjacent data sub-blocks being adjacent. The parallel processing module 20 may allocate a result storage area for each data sub-block before, after, or while distributing the data sub-blocks to the computing units 202; the size of a result storage area may equal the data amount of its data sub-block. In this way, even though the parallel processing module 20 cannot determine the order in which the computing units 202 return query results, it can store the query result of each data sub-block into the result storage area preset for it.
Because data sub-blocks adjacent in memory location overlap by N memory pages end to end, their query results also overlap by N memory pages. Since a data sub-block is queried sequentially from the smallest memory offset address to the largest, the query results of the N tail memory pages of the earlier sub-block are correct, while the query results of the N head memory pages of the later sub-block may be invalid, because those pages may contain only the latter half of an element's data. Therefore, while storing the query results of the data sub-blocks, for two sub-blocks adjacent in memory location, the query results of the last N memory pages of the earlier sub-block may be used to overwrite the query results of the first N memory pages of the later sub-block, yielding the query result of the target data block (e.g., data block n) read by the parallel processing module. In this way, the tail data of the earlier sub-block extends across the sub-block boundary, correcting the match failures at the head of the later sub-block caused by incomplete data due to data division.
For example, as shown in fig. 5, assume the memory page is 4 kB and the data amount Q of the element to which the data to be queried belongs is at most 4 kB, so adjacent data sub-blocks overlap by 1 memory page, i.e., 4 kB of data. Let r be the data amount of an initial data sub-block divided by the memory page size (e.g., 4 kB); then the starting memory address of data sub-block n is n × 4k × r and the data amount of the sub-block is 4k × (r + 1). The query result of data sub-block n may be stored in the n-th pre-allocated result storage area.
Since the tail of data sub-block n overlaps the head of data sub-block (n + 1) by 1 memory page (4 kB), the query result of the last 4 kB of data sub-block n overlaps the query result of the first 4 kB of data sub-block (n + 1). While storing the query results of the data sub-blocks, the query result of the last 4 kB of data sub-block n can therefore be used to overwrite the query result of the first 4 kB of data sub-block (n + 1), yielding the query result of the target data block (e.g., data block n). The tail data of the earlier sub-block thus extends across the boundary, correcting the match failures at the head of the later sub-block caused by incomplete data due to data division.
Further, the parallel processing module 20 may provide the query result of the target data block it has read to the host 10. Specifically, the parallel processing module 20 may use the DMA engine to read the query result of the target data block by DMA into the memory area preset for that data block in the host 10. When the target data block is not the first of the plurality of data blocks, the parallel processing module 20 may delete the query results of the N head memory pages of the target data block while reading out its query results, obtaining the query result of the contiguously stored portion (e.g., data block n) it has read. In this way, the tail data of the previous data block extends across the block boundary, correcting the match failures at the head of the next data block caused by incomplete data due to data segmentation.
Correspondingly, if the target data block is the first of the plurality of data blocks, its query result can be read directly from the parallel processing module 20 into the memory space preset for it in the host 10. The query results of all the data blocks together form the query result of the data to be queried.
When a string matching algorithm is used to query the target data blocks, the query result of the data to be queried is the bit-marked matching result of the filtering condition corresponding to the query request against each character in the data to be queried. After obtaining this query result, the host 10 may further screen out, from the data to be queried, the target data satisfying the filtering condition included in the query request according to the bit-marked matching result, and provide the target data to the device that issued the query request, and so on.
In the embodiment of the application, the parallel processing module makes reasonable use of tail data crossing the boundary to correct the match failures caused by incomplete data at the heads of data sub-blocks when brute-force segmentation (directly taking an integer multiple of a memory page as the boundary) is used on the computing units. It thus ensures a continuous query result over the data block without any data replication, while letting the upper-layer software enable parallel operation transparently.
In addition to the above-described data processing system, the embodiments of the present application further provide a data processing method, which is described below by way of example.
Fig. 6 is a flow chart of a data processing method according to an embodiment of the present application. As shown in fig. 6, the method mainly includes:
601. The host stores the data to be queried corresponding to the query request into the continuous memory space of the host.
602. The host segments the data to be queried into a plurality of data blocks according to the target data amount that the parallel processing module supports processing, with data blocks adjacent in memory location overlapping by N memory pages end to end; the data amount of each data block except the last is S times the memory page, and the data amount of a data block is at most the target data amount, where N and S are positive integers and S > N.
603. The parallel processing module reads a target data block of the plurality of data blocks into the continuous memory space of the parallel processing module.
604. The parallel processing module queries the target data block in parallel to obtain the query result of the target data block.
605. The parallel processing module reads the query result of the target data block from the parallel processing module into the result memory space preset for the target data block in the host.
606. When the target data block is not the first of the plurality of data blocks, the parallel processing module deletes the query results of the N head memory pages of the target data block while reading out the query results of the target data block.
In this embodiment, regarding the implementation manner of the parallel processing module and the host, reference may be made to the relevant content of the above system embodiment, which is not described herein again.
In this embodiment, a query engine is deployed on the host and can obtain a query request for a data source. The query request may include a filtering condition, which defines the query scope and the conditions the finally screened-out data must satisfy. For a data source in column storage format, the filtering condition defines the query scope in terms of the columns to be queried.
In this embodiment, the host may obtain the query request; and obtaining the filtering condition from the query request, and then reading the data to be queried from the data source according to the filtering condition. In some embodiments, the data source stores data in a column storage file format, and the host may read, as the data to be queried, a column to be queried corresponding to the query request from the data source according to the filtering condition. That is, the host reads the column to be queried defined by the filtering condition from the data source as the data to be queried. Further, the host may store the data to be queried to a contiguous memory space of the host's memory.
Because the in-memory storage format differs from the data source's storage format, the host may also convert the data to be queried (i.e., the columns to be queried) into a column storage format supported by memory, for example the Arrow format, which is an in-memory column storage format. Accordingly, the host may store the columns to be queried, in the column storage format supported by memory (e.g., the Arrow format), into a contiguous memory space of the host's memory.
In this embodiment, to relieve the load on the CPU of the host and reduce the probability that the CPU reaches a performance bottleneck, the host may map the parallel processing module to a virtual device of the host through virtualization, and offload the data processing function to the parallel processing module through the mapped virtual device. With its strong parallel computing capability, the parallel processing module can increase the data processing speed.
In this embodiment, the parallel processing module includes a memory. The memory capacity of the parallel processing module is typically smaller than that of the host. To save memory resources, both the host's memory and the module's memory store data in contiguous memory space.
In this embodiment, the parallel processing module further includes one or more computing units ("a plurality" here means two or more). The parallel processing module may read a portion of contiguously stored data from the data to be queried stored on the host into the contiguous memory space of its own memory. The amount of data read is less than or equal to the memory capacity of the parallel processing module and does not exceed the target data amount the module supports processing. The target data amount characterizes the data processing capability of the parallel processing module; it may be the maximum data amount the module can process, which does not exceed the module's memory capacity.
This embodiment does not limit the specific amount of the portion of contiguously stored data read by the parallel processing module. In some embodiments, the parallel processing module may read the contiguously stored data in batches into the contiguous memory space of its memory, in ascending order of offset address of the data to be queried in memory, according to the target data amount it supports processing; the data read in each batch is the portion of contiguously stored data corresponding to that batch.
In some embodiments, the host may segment the data to be queried into a plurality of data blocks according to the target data amount that the parallel processing module supports processing ("a plurality" means two or more). Preferably, the number of data blocks is an integer multiple of the number of parallel processing modules.
The data amount of every data block except the last is S times the memory page, where S is a positive integer, preferably S ≥ 3. The memory offset addresses within one data block are contiguous, and the last data block is the one with the largest starting memory offset address among the plurality of data blocks. The data amount of the last data block is determined by the total data amount of the data to be queried and may or may not be an integer multiple of the memory page. In general, the target data amount a parallel processing module supports processing is characterized by the maximum number of memory pages the module can process; accordingly, S may be determined from that target data amount.
In other embodiments, there are multiple parallel processing modules. To increase the data query speed, the data to be queried can be queried in parallel by the multiple parallel processing modules. In this case, the host may segment the data to be queried into a plurality of data blocks according to the target data amount that each of the modules supports processing. In embodiments where those target data amounts differ between modules, the memory-page multiple S of the data blocks preceding the last block differs accordingly.
Because the element lengths in a database vary, the data amount of some elements is not an integer multiple of the memory page. Some element boundaries therefore fall on non-page boundaries, i.e., some element data is stored across memory pages. If a segmentation boundary cuts through such an element, the head data of the next data block (such as the last 1 kB of the 9 kB element in fig. 3) is incomplete, causing the query on that data block to fail.
To solve this technical problem, in the embodiment of the present application, data blocks adjacent in memory location overlap by the data of N memory pages head to tail, where N is a positive integer and N < S. Data blocks adjacent in memory location are the data blocks with sequence numbers n and n+1; overlapping head to tail means the tail of the preceding block (data block n) and the head of the following block (data block n+1) carry the same N memory pages of data.
The value of N is determined by the data amount Q of the element to which the data to be queried belongs. Specifically, N ≥ X, where X = ⌈Q / P⌉, i.e., the data amount Q divided by the memory page size P and rounded up. Q is generally determined by the database, which specifies a maximum data amount for the element.
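As a minimal sketch of this segmentation rule (the function name and the small test numbers are illustrative, not from the patent): non-last blocks are S pages long, block starts advance by S − N pages, so adjacent blocks share N pages, and N is derived from the element size as ⌈Q / P⌉.

```python
import math

def split_blocks(total_len: int, page_size: int, s_pages: int, elem_max: int):
    """Return (start_offset, length) pairs for the data blocks.

    Every block except the last is S memory pages long; block starts
    advance by (S - N) pages, so adjacent blocks overlap by N pages
    head to tail, with N = ceil(elem_max / page_size) = X."""
    n_pages = math.ceil(elem_max / page_size)   # N = X = ceil(Q / P)
    assert 0 < n_pages < s_pages, "requires N < S"
    block_len = s_pages * page_size
    stride = (s_pages - n_pages) * page_size
    blocks, start = [], 0
    while True:
        if start + block_len >= total_len:      # last block holds the remainder
            blocks.append((start, total_len - start))
            break
        blocks.append((start, block_len))
        start += stride
    return blocks
```

With a 4-byte "page", S = 3 and a 5-byte element, each non-last block is 12 bytes and consecutive blocks share 2 pages (8 bytes), so any element cut by a block boundary appears whole in the preceding block.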
After obtaining a plurality of data blocks, the parallel processing module can read target data blocks from data to be queried stored by the host to the continuous memory space of the parallel processing module. The target data block read by the parallel processing module is the data which is continuously stored in the part read by the parallel processing module.
In this embodiment, the specific implementation manner of the parallel processing module for reading the target data block from the data to be queried stored in the host is not limited. For a plurality of parallel processing modules, a plurality of data blocks can be read from data to be queried stored in a host in parallel to respective continuous memory spaces. Each parallel processing module may read one block of data at a time.
In this embodiment of the present application, the parallel processing module may use DMA to read the target data block from the data to be queried stored on the host into the module's contiguous memory space.
Specifically, the host side is deployed with a DMA driver and the parallel processing module with a DMA engine. After segmenting the data to be queried into a plurality of data blocks, the host can use the DMA driver to issue a data query instruction to the DMA engine; the instruction carries the starting memory offset address and the length of the data block the DMA engine is to read, i.e., the target data block corresponding to that parallel processing module. In response to the instruction, the parallel processing module uses the DMA engine to read the target data block, by DMA, from the data to be queried stored on the host into the module's contiguous memory space, according to the carried starting memory offset address and length.
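The data query instruction above can be pictured as a plain descriptor. A hedged stand-in follows; the field names and the byte-slice "DMA engine" are illustrative only, not the patent's actual driver/engine interface:

```python
from dataclasses import dataclass

@dataclass
class QueryInstruction:
    # the two fields the text says the instruction carries
    start_offset: int   # starting memory offset of the block to read
    length: int         # length of the block to read, in bytes

def dma_read(host_memory: bytes, instr: QueryInstruction) -> bytearray:
    """Stand-in for the DMA engine: copy the addressed block from the
    host's contiguous region into the module's contiguous buffer."""
    return bytearray(host_memory[instr.start_offset:instr.start_offset + instr.length])
```

The host would issue one such instruction per data block; each parallel processing module services its own instruction independently.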
The plurality of parallel processing modules can query their respective target data blocks in parallel to obtain the query results of the plurality of data blocks. In this embodiment of the present application, a parallel processing module may query the read target data block asynchronously to obtain the query result of the block.
In an embodiment with one parallel processing module, to increase the data query speed, the module may query the target data blocks asynchronously: it may read the next data block without waiting for the query on the current block to complete.
In an asynchronous query scenario, or in embodiments with multiple parallel processing modules, the order in which the data blocks finish querying cannot be determined. A result storage area may therefore be set in advance for each data block in the host's memory, with the areas of adjacent data blocks adjacent to each other. The host may allocate the result storage areas before, after, or while distributing the data blocks to the parallel processing modules, and the size of each area may equal the data amount of its block. In this way, even though the host cannot know the order in which the parallel processing modules return results, each query result can be stored into the result storage area preset for its data block.
Because adjacent data blocks overlap by the data of N memory pages head to tail, their query results also overlap by N memory pages. Since a data block is queried sequentially from small to large memory offset addresses, the query results of the N memory pages at the tail of the preceding block are correct, while the query results of the N memory pages at the head of the following block may be invalid, because those pages may hold only the second half of an element's data.
On this basis, the parallel processing module can write the query result of the target data block back to the memory space preset for the target data block on the host; specifically, the DMA engine in the parallel processing module may use DMA for this transfer.
Further, if the target data block is not the first of the plurality of data blocks, the parallel processing module may drop the query results of the first N memory pages of the target data block while writing back its query result, yielding the query result of the portion of contiguously stored data (such as data block n) read by the module. In this way, the out-of-bound tail data of the preceding data block corrects the match failures that segmentation would otherwise cause at the incomplete head of the following data block.
Correspondingly, if the target data block is the first of the plurality of data blocks, its query result can be written back directly from the parallel processing module to the memory space preset for it on the host. Together these yield the query results of all data blocks, i.e., the query result of the data to be queried.
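The write-back rule of steps 605/606 — the first block kept whole, every later block losing its first N pages of results — can be sketched as follows. This is a toy model over Python lists standing in for the preset result areas, not the DMA path; names are illustrative.

```python
def write_back(blocks, results, n_pages, page_size, out):
    """Place each block's per-byte query result into its preset result
    area; for every block except the first, the results of the first N
    memory pages are dropped, because the same bytes were already
    queried (with the element complete) at the tail of the preceding
    block."""
    skip = n_pages * page_size
    for i, ((start, length), res) in enumerate(zip(blocks, results)):
        if i == 0:
            out[start:start + length] = res
        else:
            out[start + skip:start + length] = res[skip:]
    return out
```

With two 12-byte blocks overlapping by 2 two-byte "pages", the second block contributes only its last 8 results; the overlapped region keeps the first block's (correct) results.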
In this embodiment, the data query is offloaded to the parallel processing module for parallel query, so that the data query speed can be improved compared with the conventional CPU data query. Because the data query is unloaded to the parallel processing module, the load of the CPU of the equipment where the query engine is located is reduced, the CPU resource consumption is reduced, and the probability of occurrence of CPU performance bottleneck is reduced. In addition, when the host computer performs data segmentation on the data to be queried, the host computer directly takes the S times of the memory page as the segmentation boundary, and does not need to search the element boundary, so that the data segmentation efficiency can be improved, and the follow-up data query speed can be improved. On the other hand, the host machine segments the data to be queried by taking the S times of the memory page as a segmentation boundary, namely, the data is segmented by taking the memory page boundary, so that the segmented data blocks are aligned with cache lines, and when the parallel processing module queries the data blocks, the cache line alignment operation is not needed, so that the replication times of the data blocks can be reduced, and the data query speed is further improved.
In the above embodiments, the data query processes of the plurality of parallel processing modules are the same; the data query process inside a parallel processing module is described below, taking any one parallel processing module as an example.
The parallel processing module includes one or more computing units and may use multiple computing units to query the read target data block in parallel, increasing the data query speed. To this end, for the portion of contiguously stored data in its memory, the parallel processing module may segment that data into a plurality of data sub-blocks whose data amounts are integer multiples of the memory page. The data amount of every data sub-block except the last is M times the memory page, where M is a positive integer, N < M < S, and generally M ≥ 2. The last data sub-block is the one with the largest memory offset address among the plurality of data sub-blocks, and its data amount is determined by the size of the read portion of contiguously stored data. With Y the data amount of that data and P the memory page size, the last sub-block holds the remainder of Y after each preceding sub-block takes M × P bytes (for example, with two sub-blocks it equals Y − M × P).
Because element lengths in a database vary, the data amount of some elements is not an integer multiple of the memory page, so some element boundaries fall on non-page boundaries, i.e., some element data is stored across memory pages. The head data of the sub-block after a cut boundary (such as the last 1 kB of the 9 kB element in fig. 3) is then incomplete, and the query on that sub-block fails, as shown in fig. 3.
To solve this problem, in this embodiment, data of N memory pages may be set to overlap from the beginning to the end of two data sub-blocks adjacent to each other in memory location. Regarding the value of N, reference may be made to the related content of the above embodiment, and the description is omitted herein.
In some embodiments, the parallel processing module 20 may divide the read portion of the data (such as the data block n) stored in succession into a plurality of data sub-blocks with a data size that is an integer multiple of the memory page according to the data size Q of the element to which the data to be queried belongs and the number K of the plurality of computing units.
Specifically, the data amount Z of a data sub-block may be determined from the number K of computing units and the data amount Y of the read target data block (data block n): Z = Y / K. Further, as shown in fig. 4, the target data block (e.g., data block n) may be divided into a plurality of initial data sub-blocks (initial data sub-blocks 0 to K−1 in fig. 4), each of data amount Z. The number of initial data sub-blocks equals the number of data sub-blocks, both equal to the number K of computing units.
Further, the number N of memory pages overlapped head to tail between two data sub-blocks adjacent in memory location may be determined from X = ⌈Q / P⌉, i.e., the data amount Q of the element to which the data to be queried belongs divided by the memory page size P and rounded up. Generally, N = X.
Further, for two initial data sub-blocks adjacent in memory location, the data of the first N memory pages of the following initial sub-block may be appended to the tail of the preceding one, yielding the plurality of data sub-blocks.
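A hedged sketch of this sub-block construction: K equal initial sub-blocks of Z = Y / K bytes, each non-last one extended by the first N pages of its successor. It assumes K divides the block size evenly; names and test numbers are illustrative.

```python
def split_subblocks(block: bytes, k_units: int, n_pages: int, page_size: int):
    """Cut a block into K initial sub-blocks of Z = Y / K bytes, then
    append the first N pages of each following sub-block to the tail of
    the preceding one, so an element crossing a sub-block boundary stays
    whole in the earlier sub-block."""
    y = len(block)
    z = y // k_units                      # Z = Y / K (assumes K divides Y)
    overlap = n_pages * page_size
    subs = []
    for i in range(k_units):
        start = i * z
        end = start + z
        if i < k_units - 1:               # non-last: tail repeats successor's head
            end = min(end + overlap, y)
        subs.append(block[start:end])
    return subs
```

With a 24-byte block, K = 3 and a 1-page (2-byte) overlap, the first two sub-blocks are 10 bytes each and the last holds the remaining 8, with each boundary repeated once.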
Because X = ⌈Q / P⌉ rounds the data amount Q of the element up to whole memory pages, even an element stored across memory pages is retained in full in the preceding of two adjacent data sub-blocks. The subsequent query therefore avoids the match failures that incomplete data would cause at the head of the following sub-block.
After the parallel processing module cuts out the plurality of data sub-blocks, it can distribute them to the computing units in the parallel processing module, which query the sub-blocks in parallel to obtain the query results of the data sub-blocks.
The embodiment of the present application does not limit how the computing unit queries a data sub-block. In some embodiments, the computing unit may use a string matching algorithm to match the characters of its assigned target data sub-block against the filter condition of the query request, and mark the match result of each character with a binary bit value to obtain the query result of the sub-block. The bit value is 0 or 1; which value marks a match depends on the string matching algorithm.
The embodiment of the present application does not limit the specific string matching algorithm. Optionally, it may be the Shift-And algorithm, the Shift-Or algorithm, the Sunday algorithm, the Rabin-Karp algorithm, etc. The Shift-And algorithm marks the match result of each character in the target data sub-block with a binary 1; the Shift-Or algorithm marks it with a binary 0.
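To illustrate the bit-marking idea, here is a minimal textbook Shift-And matcher (not the module's actual implementation) that emits one mark per text character, 1 where a full match of the pattern ends:

```python
def shift_and(text: str, pattern: str):
    """Bit-parallel Shift-And: per-character marks, 1 where a full match
    of `pattern` ends at that character."""
    m = len(pattern)
    masks = {}
    for j, c in enumerate(pattern):      # bit j of masks[c] set iff pattern[j] == c
        masks[c] = masks.get(c, 0) | (1 << j)
    accept = 1 << (m - 1)                # a match ends when this bit is set
    d, marks = 0, []
    for c in text:
        d = ((d << 1) | 1) & masks.get(c, 0)
        marks.append(1 if d & accept else 0)
    return marks
```

For example, `shift_and("abcabc", "abc")` marks positions 2 and 5, the characters where the two occurrences of `"abc"` end; a Shift-Or variant would invert the bit convention.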
Since the order in which the computing units in the parallel processing module finish querying their data sub-blocks cannot be determined, in this embodiment a result storage area may be set in advance for each data sub-block inside the parallel processing module, with the areas of adjacent sub-blocks adjacent to each other. The module may allocate the areas before, after, or while distributing the sub-blocks to the computing units, and the size of each area may equal the data amount of its sub-block. In this way, even though the module cannot know the order in which the computing units return results, each query result can be stored into the result storage area preset for its data sub-block.
Because data sub-blocks adjacent in memory location overlap by the data of N memory pages head to tail, their query results also overlap by N memory pages. Since a data sub-block is queried sequentially from small to large memory offset addresses, the query results of the N memory pages at the tail of the preceding sub-block are correct, while those of the N memory pages at the head of the following sub-block may be invalid, because those pages may hold only the second half of an element's data. Therefore, while storing the query results of the data sub-blocks, for two sub-blocks adjacent in memory location, the query results of the last N memory pages of the preceding sub-block can overwrite the query results of the first N memory pages of the following sub-block, yielding the query result of the target data block (such as data block n) read by the parallel processing module. In this way, the out-of-bound tail data of the preceding sub-block corrects the match failures that segmentation would otherwise cause at the incomplete head of the following sub-block.
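For per-entry result lists, the covering rule above — tail results of the preceding sub-block replacing the head results of the one that follows — reduces to the following sketch, where `overlap` is the number of overlapped result entries (N pages × page size); the function name is illustrative:

```python
def merge_subblock_results(sub_results, overlap):
    """Join results of overlapping sub-blocks: keep the last `overlap`
    results of each preceding sub-block (where a boundary-crossing
    element was seen whole) and drop the first `overlap` results of the
    sub-block that follows it."""
    merged = list(sub_results[0])
    for res in sub_results[1:]:
        merged.extend(res[overlap:])      # preceding tail already covers these
    return merged
```

With sub-blocks of 5 and 4 result entries overlapping by 2, the merged result has 7 entries, and the two disputed entries come from the preceding sub-block.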
Further, the parallel processing module may provide the query result of the read target data block to the host; specifically, its DMA engine may use DMA to write the result to the memory area preset for the target data block on the host. If the target data block is not the first of the plurality of data blocks, the module may drop the query results of the first N memory pages of the target data block during the write-back, yielding the query result of the contiguously stored data (data block n) it read. In this way, the out-of-bound tail data of the preceding data block corrects the match failures that segmentation would otherwise cause at the incomplete head of the following data block.
Correspondingly, if the target data block is the first of the plurality of data blocks, its query result can be written back directly from the parallel processing module to the memory space preset for it on the host. Together these yield the query results of all data blocks, i.e., the query result of the data to be queried.
For embodiments that query the target data block with a string matching algorithm, the query result of the data to be queried is the bit-marked match result of each character in the data against the filter condition of the query request. After obtaining this result, the host can screen out, according to the bit marks, the target data that satisfies the filter condition from the data to be queried, and provide the target data to the device that issued the query request, and so on.
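A toy illustration of this screening step, assuming Shift-And-style end marks (1 where a match of a fixed-length pattern ends) and a pattern length m; the function name is illustrative:

```python
def extract_matches(text: str, marks, m: int):
    """Use the per-character marks (1 = a match of length m ends here)
    to screen the matching substrings out of the queried data."""
    return [text[i - m + 1:i + 1] for i, bit in enumerate(marks) if bit]
```

Given the marks for pattern `"abc"` over `"abcabc"`, this recovers both occurrences; a real query engine would map the marks back to whole elements or rows instead of raw substrings.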
In the embodiment of the present application, the parallel processing module makes deliberate use of the out-of-bound tail data to correct the match failures that brute-force segmentation (cutting directly at integer multiples of the memory page) would otherwise cause at the incomplete heads of data sub-blocks when the computing units run. It thus keeps the query result of a data block continuous without copying data, while letting the upper-layer software enable parallel operation transparently.
It should be noted that, the execution subjects of each step of the method provided in the above embodiment may be the same device, or the method may also be executed by different devices. For example, the execution subject of steps 601 and 602 may be device a; for another example, the execution body of step 601 may be device a, and the execution body of step 602 may be device B; etc.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations appearing in a specific order are included, but it should be clearly understood that the operations may be performed out of the order in which they appear herein or performed in parallel, the sequence numbers of the operations such as 601, 602, etc. are merely used to distinguish between the various operations, and the sequence numbers themselves do not represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.
Fig. 7 is a schematic structural diagram of a parallel processing module according to an embodiment of the present application. As shown in fig. 7, the parallel processing module includes: memory 701 and a computing unit 702. The memory 701 is electrically connected to the computing unit 702.
The parallel processing module is communicatively connected to the host and is configured to read a target data block, from a plurality of data blocks stored in a contiguous memory space of the host, into the contiguous memory space of the memory 701. The data blocks are obtained by the host segmenting the data to be queried corresponding to the query request according to the target data amount the parallel processing module supports processing; data blocks adjacent in memory location overlap by the data of N memory pages head to tail; the data amount of every data block except the last is S times the memory page; the data amount of each data block is less than or equal to the target data amount; N and S are positive integers with S > N.
The computing unit 702 is configured to query the target data block in parallel to obtain a query result of the target data block. The parallel processing module is further configured to write the query result of the target data block back to the result memory space preset for the target data block on the host and, if the target data block is not the first of the plurality of data blocks, to drop the query results of the first N memory pages of the target data block during the write-back.
In some embodiments, as shown in fig. 7, the parallel processing module further includes a DMA engine 703, configured to obtain the data query instruction issued by the host; obtain from the instruction the starting memory offset address and the length of the target data block; and read the target data block, by DMA, into the contiguous memory space of the parallel processing module according to that starting memory offset address and length.
The parallel processing module further includes a memory controller 704, configured to segment the target data block into a plurality of data sub-blocks whose data amounts are integer multiples of the memory page, with two data sub-blocks adjacent in memory location overlapping by the data of N memory pages head to tail; the data amount of every data sub-block except the last is M times the memory page; M and N are positive integers, N < M < S, and S ≥ 3.
In some embodiments, the computing unit 702 is a plurality. The plurality of computing units 702 perform parallel queries on the plurality of data sub-blocks to obtain query results of the plurality of data sub-blocks. The memory controller 704 is configured to store query results of the plurality of data sub-blocks in a result storage area preset for the plurality of data sub-blocks in the parallel processing module; and in the storage process, utilizing the query results of the data of the N memory pages at the tail of the previous data sub-block in the two adjacent data sub-blocks in the memory position to cover the query results of the data of the N memory pages at the head of the next data sub-block so as to obtain the query results of the target data block.
When segmenting the target data block into a plurality of data sub-blocks whose data amounts are integer multiples of the memory page, the memory controller 704 is specifically configured to segment it according to the data amount of the element to which the data to be queried belongs and the number of computing units.
When doing so, the memory controller 704 is specifically configured to: determine the data amount of a data sub-block from the number of computing units and the data amount of the target data block; divide the target data block into a plurality of initial data sub-blocks according to that data amount, the number of initial data sub-blocks equaling the number of data sub-blocks; determine the number N of memory pages overlapped head to tail between two adjacent data sub-blocks from the value X, i.e., the data amount Q of the element to which the data to be queried belongs divided by the memory page size and rounded up, with N ≥ X; and, for two initial data sub-blocks adjacent in memory location, append the data of the first N memory pages of the following initial sub-block to the tail of the preceding one, obtaining the plurality of data sub-blocks.
Any computing unit 702 of the plurality of computing units performs character matching between the target data sub-block allocated to it and the filtering condition corresponding to the query request by using a string matching algorithm, and marks the matching result of each character in the target data sub-block with a binary bit value, so as to obtain the query result of the target data sub-block.
Optionally, the string matching algorithm is a shift-or algorithm or a shift-and algorithm.
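As an illustrative sketch of the shift-or variant (a hedged Python rendering of the classic bit-parallel Baeza-Yates–Gonnet algorithm; the function name and the one-flag-per-character output layout are assumptions about the embodiment), each text position is marked with a bit value indicating whether an occurrence of the pattern ends there:

```python
def shift_or_match(text: bytes, pattern: bytes):
    """Bit-parallel shift-or string matching.

    Returns one flag per character of `text`: 1 where an occurrence of
    `pattern` ends, 0 elsewhere. Python ints act as unbounded bit masks,
    so no explicit word-size handling is needed in this sketch.
    """
    m = len(pattern)
    masks = [~0] * 256                   # per-byte masks: bit i is clear
    for i, c in enumerate(pattern):      # iff pattern[i] equals that byte
        masks[c] &= ~(1 << i)
    state = ~0                           # all ones: no partial match yet
    flags = [0] * len(text)
    for j, c in enumerate(text):
        state = (state << 1) | masks[c]
        if state & (1 << (m - 1)) == 0:  # bit m-1 clear: match ends at j
            flags[j] = 1
    return flags
```

For example, `shift_or_match(b"abab", b"ab")` marks positions 1 and 3, where the two occurrences of the pattern end.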
The parallel processing module provided by this embodiment can, when connected to a host, take over offloaded data queries and execute them in parallel; compared with conventional CPU-based data queries, this improves query speed. Because the query is offloaded to the parallel processing module, the load on the CPU of the device where the query engine is located is reduced, CPU resource consumption decreases, and the probability of a CPU performance bottleneck is lowered. In addition, when the host splits the data to be queried, it directly uses S times the memory page as the split boundary and does not need to search for element boundaries, which improves splitting efficiency and thus subsequent query speed. Moreover, because the host splits the data to be queried at S times the memory page, i.e., on memory-page boundaries, the resulting data blocks are cache-line aligned; the parallel processing module therefore needs no cache-line alignment operation when querying a data block, which reduces the number of times the data block is copied and further improves query speed.
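The host-side splitting described above (blocks of S memory pages, adjacent blocks overlapping by N pages, no element-boundary search) can be sketched as follows; Python, the function name, and the 4096-byte default page size are illustrative assumptions:

```python
def split_host_data(total_bytes, s_pages, n_pages, page_size=4096):
    """Split a contiguous buffer into [start, end) byte ranges.

    Every block except possibly the last spans s_pages memory pages; blocks
    adjacent in memory share n_pages pages head to tail (requires S > N).
    All boundaries are page multiples, so blocks start cache-line aligned
    and no scan for element boundaries is needed.
    """
    step = (s_pages - n_pages) * page_size  # next block starts S - N pages later
    size = s_pages * page_size
    blocks, off = [], 0
    while True:
        end = min(off + size, total_bytes)
        blocks.append((off, end))
        if end >= total_bytes:
            return blocks
        off += step
```

With S = 2 and N = 1, a three-page buffer splits into two two-page blocks, [(0, 8192), (4096, 12288)], whose middle page is shared.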
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) involved in the present application are information and data authorized by the user or fully authorized by each party; the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entries are provided for the user to choose to authorize or refuse.
It should be further noted that the descriptions "first" and "second" herein are used to distinguish different messages, devices, modules, etc.; they do not represent a sequence, nor do they limit "first" and "second" to being of different types.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, CD-ROM (Compact Disc Read-Only Memory), optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (or systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (e.g., CPUs, etc.), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory, random-access memory (RAM), and/or nonvolatile memory in a computer-readable medium, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
The storage medium of the computer is a readable storage medium, which may also be referred to as a readable medium. Readable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by the computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (14)

1. A data processing system, comprising: a host and a parallel processing module; the host is in communication connection with the parallel processing module;
the host is configured to store data to be queried corresponding to a query request into a continuous memory space of the host, and to divide the data to be queried into a plurality of data blocks according to a target data volume that the parallel processing module supports processing; wherein data blocks adjacent in memory overlap head to tail by the data of N memory pages; the data volume of each data block other than the last one of the plurality of data blocks is S times the memory page; the data volume of each data block is less than or equal to the target data volume; and N and S are positive integers with S > N;
the parallel processing module is configured to read a target data block from the plurality of data blocks into a continuous memory space of the parallel processing module;
the parallel processing module is configured to query the target data block in parallel to obtain a query result of the target data block; the query result of the target data block is read from the parallel processing module into a result memory space preset for the target data block in the host; and, in the case that the target data block is not the first of the plurality of data blocks, the query results of the N memory pages at the head of the target data block are deleted in the process of reading the query result of the target data block.
2. The system of claim 1, wherein there are a plurality of parallel processing modules, and the target data volumes that the plurality of parallel processing modules support processing are the same or different;
the host is configured to call a direct memory access (DMA) driver and issue a data query instruction to DMA engines in the plurality of parallel processing modules;
any one of the parallel processing modules is configured to acquire, from the data query instruction, a starting memory offset address of the target data block and a length of the target data block, and to read, according to the starting memory offset address and the length, the target data block into the continuous memory space of that parallel processing module in a DMA manner by using the DMA engine.
3. The system of claim 1, wherein the parallel processing module comprises a plurality of computing units;
the parallel processing module is further configured to divide the target data block into a plurality of data sub-blocks whose data volume is an integer multiple of the memory page; wherein data sub-blocks adjacent in memory overlap head to tail by the data of N memory pages; the data volume of each data sub-block other than the last one of the plurality of data sub-blocks is M times the memory page; M is a positive integer, N < M < S, and S ≥ 3;
the plurality of computing units query the plurality of data sub-blocks in parallel to obtain query results of the plurality of data sub-blocks; the query results of the plurality of data sub-blocks are stored into a result storage area preset for the plurality of data sub-blocks in the parallel processing module; and, during storage, for two data sub-blocks adjacent in memory, the query result of the data of the N memory pages at the head of the latter data sub-block is overwritten with the query result of the data of the N memory pages at the tail of the former data sub-block, so as to obtain the query result of the target data block.
4. A method of data processing, comprising:
a host stores data to be queried corresponding to a query request into a continuous memory space of the host;
the host divides the data to be queried into a plurality of data blocks according to a target data volume that a parallel processing module supports processing; wherein data blocks adjacent in memory overlap head to tail by the data of N memory pages; the data volume of each data block other than the last one of the plurality of data blocks is S times the memory page; the data volume of each data block is less than or equal to the target data volume; and N and S are positive integers with S > N;
the parallel processing module reads a target data block from the plurality of data blocks into a continuous memory space of the parallel processing module;
the parallel processing module queries the target data block in parallel to obtain a query result of the target data block; the query result of the target data block is read from the parallel processing module into a result memory space preset for the target data block in the host; and, in the case that the target data block is not the first of the plurality of data blocks, the query results of the N memory pages at the head of the target data block are deleted in the process of reading the query result of the target data block.
5. The method of claim 4, wherein there are a plurality of parallel processing modules, and the target data volumes that the plurality of parallel processing modules support processing are the same or different; the method comprises:
the host calls a direct memory access (DMA) driver and issues a data query instruction to DMA engines in the plurality of parallel processing modules;
the reading, by the parallel processing module, of a target data block from the plurality of data blocks into a continuous memory space of the parallel processing module comprises:
acquiring, from the data query instruction, a starting memory offset address of the target data block and a length of the target data block; and reading, according to the starting memory offset address and the length, the target data block into the continuous memory space of the parallel processing module in a DMA manner by using the DMA engine.
6. The method of claim 4, wherein the parallel processing module comprises a plurality of computing units, and the querying of the target data block in parallel by the parallel processing module comprises:
the parallel processing module divides the target data block into a plurality of data sub-blocks whose data volume is an integer multiple of the memory page; wherein data sub-blocks adjacent in memory overlap head to tail by the data of N memory pages; the data volume of each data sub-block other than the last one of the plurality of data sub-blocks is M times the memory page; M is a positive integer, N < M < S, and S ≥ 3;
the plurality of computing units query the plurality of data sub-blocks in parallel to obtain query results of the plurality of data sub-blocks;
the query results of the plurality of data sub-blocks are stored into a result storage area preset for the plurality of data sub-blocks in the parallel processing module;
and, during storage, for two data sub-blocks adjacent in memory, the query result of the data of the N memory pages at the head of the latter data sub-block is overwritten with the query result of the data of the N memory pages at the tail of the former data sub-block, so as to obtain the query result of the target data block.
7. The method of claim 6, wherein the dividing, by the parallel processing module, of the target data block into a plurality of data sub-blocks whose data volume is an integer multiple of the memory page comprises:
dividing, by the parallel processing module, the target data block into the plurality of data sub-blocks according to the data volume of the element to which the data to be queried belongs and the number of the plurality of computing units.
8. The method of claim 7, wherein the dividing, by the parallel processing module, of the target data block into the plurality of data sub-blocks according to the data volume of the element to which the data to be queried belongs and the number of the plurality of computing units comprises:
the parallel processing module determines the data volume of each data sub-block according to the number of the plurality of computing units and the data volume of the target data block;
the target data block is divided into a plurality of initial data sub-blocks according to the data volume of each data sub-block, the number of initial data sub-blocks being equal to the number of data sub-blocks;
the number N of overlapping head-and-tail memory pages of two data sub-blocks adjacent in memory is determined according to a value X obtained by dividing the data volume Q of the element to which the data to be queried belongs by the memory page size and rounding up, where N ≥ X;
and, for two initial data sub-blocks adjacent in memory, the data of the first N memory pages of the latter initial data sub-block is appended to the tail of the former initial data sub-block, so as to obtain the plurality of data sub-blocks.
9. The method of claim 6, wherein the querying, by the parallel processing module, of the plurality of data sub-blocks in parallel by using the plurality of computing units comprises:
any one of the plurality of computing units performs character matching between the target data sub-block allocated to it and the filtering condition corresponding to the query request by using a string matching algorithm, and marks the matching result of each character in the target data sub-block with a binary bit value, so as to obtain a query result of the target data sub-block.
10. The method of claim 9, wherein the string matching algorithm is a shift-or algorithm or a shift-and algorithm.
11. The method of any one of claims 4 to 10, wherein the query result of the data to be queried is a bit-value-marked matching result between each character in the data to be queried and the filtering condition corresponding to the query request; the method further comprises:
the host screens out, from the data to be queried, target data satisfying the filtering condition according to the bit-value-marked matching result of each character in the data to be queried; and
provides the target data to the device that issued the query request.
12. The method of any one of claims 4 to 10, wherein, before storing the data to be queried corresponding to the query request into the continuous memory space of the host, the method further comprises:
reading, by the host, a column to be queried corresponding to the query request from a data source as the data to be queried; and
converting the column to be queried into a columnar storage format supported by the memory;
wherein the storing of the data to be queried corresponding to the query request into the continuous memory space of the host specifically comprises:
storing the column to be queried, in the columnar storage format supported by the memory, into the continuous memory space of the host.
13. A parallel processing module, comprising: memory and computing unit; the memory is electrically connected with the computing unit;
the parallel processing module is used for being in communication connection with a host, and is used for reading target data blocks from a plurality of data blocks stored in a continuous memory space of the host to the continuous memory space of the memory; the data blocks are obtained by dividing the data to be queried corresponding to the query request according to the target data quantity supported to be processed by the parallel processing module by the host; overlapping the data of N memory pages from head to tail of the data blocks adjacent to the memory positions in the plurality of data blocks; and the data quantity of other data blocks except the last data block in the plurality of data blocks is S times of the memory page; the data volume of the data block is smaller than or equal to the target data volume; wherein N and S are positive integers, S > N;
the computing unit is used for carrying out parallel query on the target data block so as to obtain a query result of the target data block;
the parallel processing module is further configured to read a query result of the target data block from the parallel processing module to a result memory space preset for the target data block in the host; and deleting the query results of the N memory pages at the head of the target data block in the process of reading the query results of the target data block aiming at the condition that the target data block is not the first data block in the plurality of data blocks.
14. The module of claim 13, wherein the parallel processing module comprises a plurality of the computing units and further comprises a memory controller;
the memory controller is configured to divide the target data block into a plurality of data sub-blocks whose data volume is an integer multiple of the memory page; wherein data sub-blocks adjacent in memory overlap head to tail by the data of N memory pages; the data volume of each data sub-block other than the last one of the plurality of data sub-blocks is M times the memory page; M is a positive integer, N < M < S, and S ≥ 3;
the plurality of computing units query the plurality of data sub-blocks in parallel to obtain query results of the plurality of data sub-blocks;
the memory controller is further configured to store the query results of the plurality of data sub-blocks into a result storage area preset for the plurality of data sub-blocks in the parallel processing module; and, during storage, for two data sub-blocks adjacent in memory, to overwrite the query result of the data of the N memory pages at the head of the latter data sub-block with the query result of the data of the N memory pages at the tail of the former data sub-block, so as to obtain the query result of the target data block.
CN202310703441.7A 2023-06-13 2023-06-13 Data processing system, method and parallel processing module Pending CN117667994A (en)
CN117667994A true CN117667994A (en) 2024-03-08
Family ID: 90070176.


Legal Events: PB01 Publication; SE01 Entry into force of request for substantive examination.