CN115145953A

CN115145953A - Data query method

Info

Publication number: CN115145953A
Application number: CN202210751435.4A
Authority: CN
Inventors: 叶杨; 陈伟; 王维军
Original assignee: Shanghai Zhuochen Info Tech Co ltd
Current assignee: Shanghai Zhuochen Info Tech Co ltd
Priority date: 2021-10-22
Filing date: 2021-11-03
Publication date: 2022-10-04
Also published as: CN114020779A; CN114020779B

Abstract

The invention provides a data query method, which is applied to a self-adaptive optimization retrieval performance database for big data storage and mainly comprises the following steps: step 1: inputting a query request, and receiving and analyzing the query request by a query module to obtain a query condition; step (ii) of 2: judging whether the same query conditions exist in the cache module, if so, directly obtaining a query result from the cache module, and if not, entering the step 3; and step 3: adjusting the query resources distributed to each block of data in the storage module by the query module according to the reward and punishment function of the optimization module, and querying to obtain a query result; and 4, step 4: recording the information of each inquired block data in the inquiry process, wherein the information comprises inquiry conditions, inquiry time and inquiry results, and combining the inquiry conditions, the inquiry time and the inquiry results into an inquiry result set; and 5: and caching the query conditions and the query results into a cache module. The invention solves the problem that the prior art can not improve the query efficiency according to the real-time query condition of mass data.

Description

Data query method

The application is a divisional application of Chinese patent application with the application number of 202111291885.1 and the name of "adaptive optimization retrieval performance database and data query method" filed on 3.11 months in 2021.

Technical Field

The invention relates to a data query method, and belongs to the field of big data storage.

Background

Data processing can be broadly divided into two broad categories: online transaction Processing (OLTP) and online Analytical Processing (OLAP). OLTP is the primary application of traditional relational databases, primarily for basic, everyday transactions such as banking transactions. OLAP is a major application of data warehouse systems, supports complex analytical operations, focuses on decision support (and is therefore also referred to as DSS decision support system), and provides intuitive and straightforward query results.

In the OLAP scenario, the most basic and effective optimization of data storage is to store rows instead of columns. Data compression is a common optimization means in the storage field, and the storage space of data on a disk is greatly reduced by controllable CPU overhead, so that the cost can be saved, and the overhead of cross-thread and cross-node network transmission of I/O and data in a memory can be reduced. Compression algorithms are not as high as possible, and algorithms with high compression ratios tend to have slower compression and decompression speeds, requiring tradeoffs between CPU and I/O depending on hardware configuration and usage scenarios. Data encoding may be understood as lightweight compression, including RLE and data dictionary encoding, among others. In the column storage mode, the efficiency of data compression and encoding is much higher than in the row storage mode.

When the OLAP system accesses a large amount of data, the OLAP system is limited by a data storage mode, common query data and non-common data cannot be distinguished, different resources occupied by different query operations are not optimized uniformly, and query efficiency cannot be improved according to the real-time query state of mass data.

In view of the above, it is necessary to provide a new data query method to solve the above problems.

Disclosure of Invention

The invention aims to provide a data query method to solve the problem that the existing data storage system cannot improve the query efficiency according to the real-time query condition of mass data.

In order to achieve the above object, the present invention provides a data query method, which is applied to a self-adaptive optimization retrieval performance database for big data storage, wherein the self-adaptive optimization retrieval performance database comprises a query module, a cache module, an optimization module and a storage module, and the data query method mainly comprises the following steps:

step 1: inputting a query request, and receiving and analyzing the query request by a query module to obtain a query condition;

step 2: judging whether the same query conditions exist in the cache module, if so, directly obtaining a query result from the cache module, and if not, entering the step 3;

and step 3: adjusting the query resources distributed to each block of data in the storage module by the query module according to the reward and punishment function of the optimization module, and querying to obtain a query result;

and 4, step 4: recording the information of each inquired block data in the inquiry process, wherein the information comprises inquiry conditions, inquiry time and inquiry results, and combining the inquiry conditions, the inquiry time and the inquiry results into an inquiry result set;

and 5: caching the query conditions and the query results into a cache module;

wherein, the optimization module evaluates through a reward and punishment function in the step 3The weight alpha of the partition data i in the query process is calculated, and the information entropy of the query instruction is calculated

Wherein p is _i For inquiry instruction a _m Entropy of information in class i, j denotes a _m There are j categories in total, and then the conditional information entropy of each query resource is calculated

In which resources r are queried _n The total of k different attribute values, query resource r _n ＝{r _n1 ,r _n2 ,…r _nk }，E(a _m |r _n ) To query a resource r _n Lower a _m The conditional information entropy of the query instruction is calculated, and then the information gain G of the n query resources for the m query instructions is calculated _m (r _n )＝I(a _m )-E(a _m |r _n ) Finally, the mth query instruction is obtained by normalization to query the resource r _n Weight of (2)

Adjusting the query resources distributed to each block data in real time when the query module queries according to the weight alpha of the block data i, wherein the reward and punishment function is specifically as follows:

wherein n represents a total of n block data, E (d) represents a time complexity average value at the time of block data query, d _i And the time complexity of inquiring the block data i is represented, lambda is a penalty coefficient, and alpha is the weight of the block data i.

As a further improvement of the present invention, the adaptive optimization search performance database further includes an index module that records blocking information of each block data, and step 3 specifically includes:

step 31: concurrently executing filtering of the feature information of the blocks in the query condition to the index module, and summarizing and filtering the obtained feature block data to be queried;

step 32: the feature block data to be inquired are screened in a multithread concurrent execution mode to the storage module, and row indexes of screened screening blocks are obtained;

step 33: and returning a query result.

As a further improvement of the present invention, in step 3, when the weight α of the block data in the reward and punishment function is greater than 1, the forward allocation weight formula of the query resource is as follows:

wherein, w _mn Querying resource r for mth query instruction _n The weight of (c).

As a further improvement of the present invention, in step 3, when the weight α =1 of the block data in the reward and punishment function, the query resource allocated to the block data is not changed.

As a further improvement of the present invention, in step 3, when the weight α of the block data in the reward and punishment function is less than 1, the inverse distribution weight formula of the query resource is:

where wmn is the weight of the mth query instruction in the query resource rn.

As a further improvement of the present invention, the adaptive optimization retrieval performance database further includes a data blocking module, and the block data is obtained by the data blocking module through multi-threading or multi-process blocking processing of data to be stored and stored in the storage module.

As a further improvement of the present invention, the data blocking module scans data to be stored and determines a data type of the data to be stored, and then performs blocking processing according to the data type.

As the invention proceedsIn one improvement, the resource allocation optimization of the query resource allocated by the query module to each block of data in the storage module is mainly based on calculating the gain of each query resource to the query instruction, wherein the query resource set is R = { R = ₁ ,r ₂ ,…,r _n Represents that there are n query resources, and the query instruction set is A = { a = ₁ ,a ₂ ,…,a _m Represents that there are m query instructions.

As a further improvement of the present invention, the query resources include but are not limited to thread number, CPU core number, memory and hard disk cache.

As a further improvement of the present invention, the query instruction includes, but is not limited to, the number of scan lines, the execution time, and the number of returned results.

The invention has the beneficial effects that: according to the method, the query module is optimized and updated through the optimization module by using the reward and punishment function, the query resources distributed to each block when the query module queries are adjusted in real time, the query time complexity of each block is changed, the query efficiency is improved, the query process is optimized and retrieved in a self-adaptive manner, and the problem that the query efficiency cannot be improved according to the real-time query condition of mass data in the existing data storage system is solved.

Drawings

FIG. 1 is a block diagram of the architecture of the adaptive optimized search performance database of the present invention.

FIG. 2 is a flow chart of a data query method of the present invention.

FIG. 3 is a detailed flow chart of the query module of the present invention when executing a query.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

As shown in fig. 1, the present invention discloses a self-adaptive optimized search performance database 100, which is applied to big data storage and specifically includes the following modules:

the data blocking module 1 is used for blocking data to be stored in a multithreading or multiprocessing mode to obtain blocked data;

the storage module 2 is used for storing the block data in the data block module 1;

the index module 3 is used for recording the blocking information of each block of blocking data when the data to be stored is blocked and forming a data index table;

the query module 4 is used for querying the stored block data;

the cache module 5 is used for caching the query conditions and the query results within preset time;

the optimization module 6 evaluates the query process and the query result through the reward and punishment function, optimizes and updates the query module 4, adjusts the query resources distributed to each block data when the query module 4 is queried in real time, and the reward and punishment function is specifically as follows:

wherein n represents a total of n block data, E (d) represents a time complexity average value at the time of block data query, d _i And the time complexity of inquiring the block data i is shown, lambda is a penalty coefficient, and alpha is the weight of the block data i.

For one copy of data to be stored, the data blocking module 1 is configured to scan data in the data to be stored in a multi-thread or multi-process manner, determine a data type of the data to be stored, select a corresponding blocking method according to the data type, and block the data to be stored.

The data types of the data to be stored specifically include: structured data and unstructured data.

When the data type of the data to be stored is structured data, namely table data, the data to be stored is logically partitioned, firstly, field contents in the data to be stored are identified, and then, the identified field contents are partitioned according to numerical characteristics or coding formats.

Numerical characteristics include, but are not limited to: and the time, the place, the certificate number, the transaction account number, the amount, the contact way, the ip and other preset basic data attributes. Encoding formats include, but are not limited to: numerical type, character string type, time type (date type), acsii code, utf-8, etc.

And when the block processing is carried out according to the numerical characteristics, the data block is divided according to the main data attributes corresponding to the numerical characteristics. The main data attribute refers to the data attribute which accounts for the largest proportion of the data to be stored. If the main data attribute is a time value, the data field can be processed in blocks according to the day; if the primary data attribute is geographic coordinates, the data fields may be chunked by geographic partition. When the data is blocked, the fine granularity of the block selected is changed according to the self characteristics of the data attribute. If the data volume in the block data after being blocked according to the preset fine granularity is still larger, the blocking fine granularity can be further reduced, and the block data with the larger data volume is further divided into a plurality of block data with smaller data volume.

For example, in an enterprise employee data database, all employees are processed as employee data by rows according to specific attributes such as department, gender, year of employment, identity information and the like, and structured data can be stored in blocks after being blocked according to employee identity information (numerical characteristics such as identity card numbers) or department codes (coding formats).

When the data type of the data to be stored is unstructured data, namely text information, dimension blocking is conducted on the data to be stored, data cube cutting is conducted according to different dimensions of the data to be stored, a plurality of block data are obtained, each block data comprises at least one unstructured data with preset dimensions, and the preset dimensions are at least one-dimensional.

And storing the block data blocked by the data blocking module 1 into a storage module 2, wherein the storage module 2 comprises a plurality of distributed storage nodes, and at least one block data is stored in each distributed storage node.

The index module 3 is used for recording the blocking information of each block of blocking data when the data to be stored is blocked, and forming a data index table.

Specifically, when the data to be stored is blocked, blocking information of each block data is recorded, the blocking information includes but is not limited to a block name, a block number, and a block characteristic, the blocking information is recorded in a block index table associated with the block data, and each recorded data is simultaneously added with an index record.

If the data type of the block data is structured data, tree indexes are built when the indexes are built.

If the data type of the block data is unstructured data, establishing an inverted index when establishing the index, wherein the process of establishing the index is as follows: index module 3-cache module 5-storage module 2.

And summarizing the established block index tables to obtain a current total index set, namely a data index table.

The query module 4 is used for querying the stored data.

The cache module 5 is configured to cache the query condition and the query result within a preset time, where the cache module 5 stores the query condition and the query result of at least one query, and the preset cache time is determined by a client, which is not limited herein. Specifically, in this embodiment, the caching preset time is preferably seven days, and the caching module 5 caches the query condition and the query result for querying within seven days. When the query module 4 queries, the actual query conditions obtained by analysis are compared with the query conditions stored in the cache module 5, and when the query conditions are the same, the stored corresponding query results can be directly obtained from the cache module 5 without scanning and querying the storage module 2, so that the query speed and efficiency can be effectively improved.

The cache module 5 plays a role of storing when the size of the data to be stored is 8-256 GB, and directly stores the data to be stored in the cache module 5, and when the size of the data to be stored is greater than 256GB, only caches the query condition and the query result within a preset time. Of course, the range of "8 to 256GB" is merely illustrated as a preferred embodiment, and in other embodiments, the range may be adjusted according to actual situations, and is not limited herein.

Referring to fig. 2, the present invention further provides a data query method, which is applied to the foregoing adaptive optimization search performance database 100, and mainly includes the following steps:

step 1: inputting a query request, and receiving and analyzing the query request by the query module 4 to obtain a query condition;

step 2: judging whether the same query conditions exist in the cache module 5, if so, directly obtaining a query result from the cache module 5, and if not, entering the step 3;

and step 3: adjusting the query resource distributed by the query module 4 to each block of data in the storage module 2 according to the reward and punishment function of the optimization module 6, and querying to obtain a query result;

and 5: and caching the query conditions and the query results into a cache module 5.

Referring to fig. 3, the specific steps of performing the query in step 3 include:

step 31: concurrently executing filtering of the block feature information in the query condition to the index module 3, and summarizing and filtering the obtained feature block data to be queried;

step 32: the data of the characteristic blocks to be inquired are screened in a multithread concurrent execution mode to the storage module 2, and row indexes of screened blocks are obtained;

step 33: and returning a query result.

The optimization module 6 evaluates the query process and the query result through the reward and punishment function, optimizes and updates the query module 4, and adjusts the query resources distributed to each block of data in real time when the query module 4 queries so as to improve the query efficiency.

The optimization module 6 further establishes a query reward and punishment function corresponding to the block data by acquiring a query result set comprising the query conditions, the query time and the query result in the step 4; and optimizing the resources distributed when each block data is subjected to query operation according to the reward and punishment function, wherein the reward and punishment function aims to enable the query time complexity of each block data to be close to each other, so that the optimal solution of the total query efficiency is obtained.

The cost function of query optimization is as follows, and the smaller the function value is, the optimal query efficiency is represented:

The optimization goal of the reward and penalty function is to minimize the cost function. The reward and punishment function formula is as follows:

the weighted value alpha of each block data is obtained through calculation of a reward and punishment function, whether resource distribution optimization is carried out or not is determined, if alpha is larger than 1, forward resource optimization is carried out, the time complexity of block data query is reduced, if alpha =1, resource optimization operation is not carried out, if alpha is smaller than 1, reverse resource optimization is carried out, and the time complexity of block data query is improved.

The resource allocation optimization of the query resources allocated to each block of data in the storage module 2 by the query module 4 is mainly based on calculating the gain of each query resource to the query instruction, and the query resource set R = { R = { R = ₁ ,r ₂ ,…,r _n Denotes that there are n query resources, the query resources include but are not limited to thread number, CPU core number, memory and/or hard disk cache, and the query instruction set a = { a = ₁ ,a ₂ ,…,a _m And m query instructions are represented, and include but are not limited to scanning line number, execution time, returned result number and other instructions.

Firstly, calculating the information entropy of a query instruction:

wherein p is _i For inquiry instruction a _m Entropy of information in class i, j denotes a _m There are j categories, in this embodiment, taking the scanning line number in the query instruction as an example, the scanning line number is classified by less than 5000 lines, 5000-1000 lines, and more than 10000 lines, and then j =3 in this embodiment.

Then, calculating the condition information entropy of each query resource:

wherein, for a query resource r _n There are a total of k different attribute values, so the resource r is queried _n ＝{r _n1 ,r _n2 ,…r _nk }，E(a _m |r _n ) To query a resource r _n Lower a _m The condition information entropy of (1).

Query resource r _n The corresponding information gain can be expressed as:

G _m (r _n )＝I(a _m )-E(a _m |r _n )

by calculating the information gain G of n query resources for m query instructions _m (r _n ) The influence degree of each query resource on m query instructions can be obtained.

Obtaining the m-th query instruction by normalization and querying the resource r _n The weight of (c):

when the weight alpha of block data in the reward and punishment function is more than 1, the forward distribution weight formula of the query resource is as follows:

the forward distribution weight after the optimization of the query resource distribution can reduce the complexity d of the query time of the blocks _i Punishment if awardingIf the block weight alpha in the function is greater than 1, forward resource allocation optimization is carried out on the query resources, namely the query resources allocated to the block data are increased, and the improvement of the allocation quantity of the query resources can lead to lower time consumption in the query process, reduce the complexity of the query time of the blocks and improve the query speed of the block data.

When the weight α =1 of the block data in the reward and punishment function, the query resource allocated to each block data is not changed.

When the weight alpha of the block data in the reward and punishment function is less than 1, the reverse distribution weight formula of the query resource is as follows:

the query resource allocation optimized inverse distribution weight can improve the complexity d of the query time of the block _i If the weight alpha of the block data in the reward and punishment function is smaller than 1, performing reverse resource allocation optimization on the query resources, namely reducing the query resources allocated to the block data, wherein the reduction of the allocation quantity of the query resources can lead to more time spent in the query process, improve the query time complexity of the block data and reduce the query speed of the block data.

By changing the weight of the distribution of the query resources to the block data, the query time of the block data is increased or reduced, so that the query time among the block data is dynamically balanced, a small time difference is always kept, and the query efficiency is improved.

The adaptive optimized retrieval performance database 100 of the present invention is an OLAP type database, and when retrieving data in the database, since a processing method for data blocking is adopted, a retrieval task can be executed simultaneously by multiple threads or multiple processes, each thread can execute a query instruction, and a result set is recorded respectively. The larger the number of threads, the more query tasks the system can allocate. For example, in order to obtain data of each day, if the threads are enough, the data of each thread corresponding to an hour can be obtained, and finally, the query results are pieced together and returned.

The query speed of a single query instruction is high, the total return time is not the fastest, the query process needs to be optimally planned, different threads execute different query instructions, and different CPU (central processing unit) core numbers, internal memories and/or hard disk caches and other query resources are distributed to the different threads, and the query resources distributed when each block executes the query are dynamically optimized, so that the query of a plurality of block data can be dynamically distributed according to the system load, the time spent by each thread for executing the query instruction required to be executed is changed, the time spent by each thread for finishing the last completion is close to each other, the optimization of the total query efficiency is achieved, the query resources are fully utilized, and the overall time spent is reduced.

In summary, the adaptive optimization retrieval performance database 100 of the present invention optimizes and updates the query module 4 through the optimization module 6 by using a reward and punishment function, adjusts the query resource allocated to each block data when the query module 4 queries in real time, and changes the query time complexity of each block data, so as to improve the query efficiency, and the adaptive optimization retrieval query process solves the problem that the query efficiency cannot be improved according to the real-time query condition of mass data in the existing data storage system; the data blocking module 1 blocks the data and blocks the large data, so that the data blocking can be processed and inquired in a multi-thread or multi-process mode; the index module 3 is used for establishing indexes for each block data and summarizing the indexes to form a data index table, so that the query process can be simplified during query, the query speed is increased, the query can be executed in parallel aiming at the index information of a plurality of block data, and the query efficiency is improved.

Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims

1. A data query method is applied to a self-adaptive optimization retrieval performance database for large data storage, and the self-adaptive optimization retrieval performance database comprises a query module, a cache module, an optimization module and a storage module, and is characterized by mainly comprising the following steps:

and 5: caching the query conditions and the query results into a cache module;

in the step 3, the optimization module evaluates the weight alpha of the block data i in the query process through a reward and punishment function, and calculates the information entropy of the query instruction

Wherein p is _i For a query instruction a _m Entropy of information in class i, j represents a _m There are j categories in total, and then the conditional information entropy of each query resource is calculated

In which resources r are queried _n A total of k different attribute values, query resource r _n ＝{r _n1 ,r _n2 ,…r _nk }，E(a _m |r _n ) To query a resource r _n Lower a _m The conditional information entropy of the query instruction is calculated, and then the information gain G of the n query resources for the m query instructions is calculated _m (r _n )＝I(a _m )-E(a _m |r _n ) Finally, the mth query is obtained by normalizationInstructing a query on a resource r _n Weight of (2)

Adjusting query resources distributed to each block data in real time when the query module queries according to the weight alpha of the block data i, wherein the reward and punishment function is specifically as follows:

2. The data query method according to claim 1, wherein the adaptive optimization retrieval performance database further includes an index module that records blocking information of each block data, and at this time, step 3 specifically includes:

step 33: and returning a query result.

3. The data query method of claim 1, wherein: in step 3, when the weight α of the block data in the reward and punishment function is greater than 1, the forward distribution weight formula of the query resource is as follows:

4. The data query method of claim 1, wherein: in step 3, when the weight α =1 of the block data in the reward and punishment function, the query resource allocated to the block data is not changed.

5. The data query method of claim 1, wherein: in step 3, when the weight α of the block data in the reward and punishment function is less than 1, the inverse distribution weight formula of the query resource is as follows:

where wmn is the weight of the mth query instruction in the query resource rn.

6. The data query method of claim 1, wherein: the self-adaptive optimization retrieval performance database also comprises a data blocking module, wherein the data blocking module blocks data to be stored through multiple threads or multiple processes, and the data blocking module stores the data in the storage module.

7. The data query method of claim 6, wherein: the data blocking module scans data to be stored and judges the data type of the data to be stored, and then blocks the data according to the data type.

8. The data query method of claim 1, wherein: the resource allocation optimization of the query resource allocated by the query module to each block of data in the storage module is mainly based on calculating the gain of each query resource to a query instruction, wherein the query resource set is R = { R = ₁ ,r ₂ ,…,r _n Indicates that there are n query resourcesThe query instruction set is A = { a = ₁ ,a ₂ ,…,a _m Represents that there are m query instructions.

9. The data query method of claim 8, wherein: the query resources include, but are not limited to, thread count, CPU core count, memory, and hard disk cache.

10. The data query method of claim 8, wherein: the query instructions include, but are not limited to, the number of scan lines, the execution time, and the number of results returned.