CN113704577A

CN113704577A - Data query method and device based on multithreading concurrent processing

Info

Publication number: CN113704577A
Application number: CN202111054754.1A
Authority: CN
Inventors: 孙李坤
Original assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Current assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Priority date: 2021-09-09
Filing date: 2021-09-09
Publication date: 2021-11-26

Abstract

The application discloses a data query method and device based on multithreaded concurrent processing. The method comprises: acquiring a first query request for requesting to query data; determining the total number of pieces of data to be queried that meet the query condition and the total time range over which they are distributed; evenly dividing the total time range into a plurality of first time ranges based on the number of concurrently processing threads; determining a first number of pieces of data to be queried actually distributed in each first time range and an average number of pieces of data to be queried per first time range; judging whether the numerical relationship between the first number and the average number meets a first condition; if so, concurrently processing a plurality of first query tasks through a plurality of threads to obtain a query result; and if not, concurrently processing a plurality of second query tasks through a plurality of threads to obtain a query result. The method has high query efficiency and short query time.

Description

Data query method and device based on multithreading concurrent processing

Technical Field

The present application relates to the field of data query technologies, and in particular, to a data query method and apparatus based on multithread concurrent processing.

Background

With the development of database-related technologies, some conventional search engines can meet the ordinary data query requirements of common users well by supporting functions such as distributed deployment, shard computation and full-text search. However, when faced with massive amounts of data, conventional search engines still suffer from slow query speed, especially when the amount of data returned by a single query is large.

Disclosure of Invention

In view of the above problems in the prior art, the present application provides a data query method and apparatus based on multithread concurrent processing, and the technical solution adopted in the embodiments of the present application is as follows:

a data query method based on multi-thread concurrent processing comprises the following steps:

acquiring a first query request for requesting to query data, wherein the first query request comprises query conditions;

determining the total number of the data to be queried which accord with the query condition and the total time range of distribution;

evenly dividing the total time range into a plurality of first time ranges based on the number of concurrently processing threads;

determining a first number of pieces of data to be queried which are actually distributed in each first time range and an average number of pieces of data to be queried in the first time range;

judging whether the numerical relationship between the first number and the average number meets a first condition, wherein the first condition represents that the data to be inquired are uniformly distributed in the total time range;

if yes, respectively establishing a plurality of first query tasks based on the first time ranges, and concurrently processing the first query tasks through a plurality of threads to obtain query results;

and if not, re-dividing the total time range into a plurality of second time ranges each containing the average number of pieces of data to be queried, respectively creating a plurality of second query tasks based on the second time ranges, and concurrently processing the plurality of second query tasks through a plurality of threads to obtain query results.

In some embodiments, the determining the total number of pieces of data to be queried which meet the query condition and the total time range of distribution includes:

sending a second query request to a database based on the query condition;

and receiving feedback information of the database, wherein the feedback information at least comprises the total number, the total time range and second numbers of the data to be inquired which are actually distributed in each unit time range in the total time range.

In some embodiments, the repartitioning the total time range into a plurality of second time ranges respectively containing the average number of pieces of data to be queried includes:

and sequentially superposing a plurality of unit time ranges according to the time sequence, and determining the time range formed by superposing the unit time ranges as the second time range when the sum of the corresponding second number accords with the average number.

In some embodiments, said determining whether a numerical relationship between said first number and said average number meets a first condition comprises:

determining a difference between each first number and the average number;

determining that the numerical relationship between the first number and the average number does not meet the first condition if the difference between any first number and the average number is greater than a first threshold;

determining that the numerical relationship between the first number and the average number meets the first condition if the difference between each first number and the average number is smaller than the first threshold.

In some embodiments, the method further comprises:

and summarizing the obtained query results, and feeding back the summarized query results to the user terminal sending the first query request.

A data query apparatus, comprising:

an acquisition module, configured to acquire a first query request for requesting to query data, wherein the first query request comprises query conditions;

the first determining module is used for determining the total number of the data to be queried which accord with the query condition and the total time range of distribution;

the dividing module is used for evenly dividing the total time range into a plurality of first time ranges based on the number of concurrently processing threads;

a second determining module, configured to determine a first number of pieces of data to be queried actually distributed in each first time range, and an average number of pieces of data to be queried in the first time range;

the judging module is used for judging whether the numerical relationship between the first number and the average number meets a first condition, wherein the first condition represents that the data to be inquired are uniformly distributed in the total time range;

a first query module, configured to create a plurality of first query tasks based on each first time range respectively when a numerical relationship between the first number and the average number meets the first condition, and concurrently process the plurality of first query tasks through a plurality of threads to obtain a query result;

and the second query module is used for subdividing the total time range into a plurality of second time ranges respectively containing the average number of data to be queried under the condition that the numerical relationship between the first number and the average number does not meet the first condition, respectively creating a plurality of second query tasks based on the second time ranges, and concurrently processing the plurality of second query tasks through a plurality of threads to obtain query results.

In some embodiments, the first determining module is specifically configured to:

sending a second query request to a database based on the query condition;

In some embodiments, the second query module is specifically configured to:

In some embodiments, the determining module is specifically configured to:

In some embodiments, the apparatus further comprises:

and the feedback module is used for summarizing the obtained query results and feeding back the summarized query results to the user terminal sending the first query request.

According to the data query method based on multithreaded concurrent processing of the embodiments of the present application, by judging whether the numerical relationship between the first number and the average number meets the first condition, it can be determined whether the distribution of the data to be queried over time exhibits obvious data skew. If there is no obvious data skew, a plurality of first query tasks are created by dividing the time evenly, and the plurality of first query tasks are processed concurrently by a plurality of threads. If there is obvious data skew, a plurality of second query tasks are created by dividing the number of pieces evenly, and the plurality of second query tasks are processed concurrently by the plurality of threads. In this way, multiple threads perform the data query concurrently and synchronously, so the query efficiency is high and the query time is short.

Drawings

FIG. 1 is a flow chart of a data query method according to an embodiment of the present application;

FIG. 2 is a flowchart of step S105 of a data query method according to an embodiment of the present application;

FIG. 3 is a schematic view of the scenarios of steps S161 and S162 of the data query method according to the embodiment of the present application;

FIG. 4 is a block diagram of a data query device according to an embodiment of the present application;

FIG. 5 is a block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Various aspects and features of the present application are described herein with reference to the drawings.

It will be understood that various modifications may be made to the embodiments of the present application. Accordingly, the foregoing description should not be construed as limiting, but merely as exemplifications of embodiments. Those skilled in the art will envision other modifications within the scope and spirit of the application.

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the application and, together with a general description of the application given above and the detailed description of the embodiments given below, serve to explain the principles of the application.

These and other characteristics of the present application will become apparent from the following description of alternative forms of embodiment, given as non-limiting examples, with reference to the attached drawings.

It is also to be understood that although the present application has been described with reference to some specific examples, those skilled in the art are able to ascertain many other equivalents to the practice of the present application.

The above and other aspects, features and advantages of the present application will become more apparent in view of the following detailed description when taken in conjunction with the accompanying drawings.

Specific embodiments of the present application are described hereinafter with reference to the accompanying drawings; however, it is to be understood that the disclosed embodiments are merely exemplary of the application, which can be embodied in various forms. Well-known and/or repeated functions and constructions are not described in detail to avoid obscuring the application with unnecessary or redundant detail. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present application in virtually any appropriately detailed structure.

The specification may use the phrases "in one embodiment," "in another embodiment," "in yet another embodiment," or "in other embodiments," which may each refer to one or more of the same or different embodiments in accordance with the application.

FIG. 1 is a flowchart of a data query method based on multithreaded concurrent processing according to an embodiment of the present application. As shown in FIG. 1, the data query method may specifically include the following steps:

S101, acquiring a first query request for requesting to query data, wherein the first query request comprises query conditions.

The first query request may be a query request sent by the user terminal to the data query device, or may be a query request obtained from another location. The first query request includes a query condition, which may include, for example, a data type, a time, and/or a storage address.

S102, determining the total number of the data to be queried which meet the query conditions and the total time range of distribution.

Once the query condition is obtained, the total number of pieces of data to be queried that meet the query condition and the total time range over which that data is distributed may be determined based on the query condition. For example, when querying log-type files, the total number of log files satisfying the query condition and the start and end times of the log files satisfying the query condition may be determined based on the query condition.

In a specific embodiment, the determining the total number of the data to be queried meeting the query condition and the total time range of distribution includes:

sending a second query request to a database based on the query condition;

That is, once the query condition is obtained, a second query request may be sent to the database based on the query condition to request the database to perform a pre-query operation, which returns the total number and the total time range of the data to be queried that meet the query condition, as well as the second number of pieces of data to be queried actually distributed in each unit time range. For example, in the case where the total time range is 24 hours, the unit time range may be 10 minutes, half an hour, or one hour. That is, while the total number is determined, the actual distribution of the data to be queried, namely the number of pieces in each unit time range, is also determined.

Of course, the total time range is not limited to be determined by the pre-query operation, and the total time range may be obtained from the first query request as a query condition.
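The feedback information of the pre-query can be illustrated with a minimal sketch. It assumes, purely for illustration, that the matching records are available as an in-memory list of timestamps in seconds; a real database would compute the same figures with an aggregation. The names `pre_query`, `timestamps` and `unit_seconds` are hypothetical and do not come from the application.

```python
from collections import Counter

def pre_query(timestamps, unit_seconds):
    """Return (total_count, (start, end), unit_counts) for matching records.

    timestamps: times (in seconds) of the records that meet the query
    condition; a stand-in for the database (assumption of this sketch).
    unit_counts[i] is the "second number" of the i-th unit time range.
    """
    if not timestamps:
        return 0, None, []
    start, end = min(timestamps), max(timestamps)
    # Number of unit time ranges needed to cover [start, end].
    n_units = (end - start) // unit_seconds + 1
    buckets = Counter((t - start) // unit_seconds for t in timestamps)
    unit_counts = [buckets.get(i, 0) for i in range(n_units)]
    return len(timestamps), (start, end), unit_counts
```

For instance, `pre_query([0, 5, 600, 1250], 600)` groups four records into three 10-minute unit time ranges.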

S103, evenly dividing the total time range into a plurality of first time ranges based on the number of concurrently processing threads.

The number of concurrently processing threads may be the maximum number of threads that can process concurrently. For example, when a multithreaded concurrent processing task is executed based on a thread pool, the maximum number of threads that the thread pool can provide can be obtained, and the total time range is divided evenly according to this maximum number of threads. For example, in the case that the maximum number of threads is 10, the total time range can be evenly divided into 10 first time ranges, which are also referred to as sub-time ranges.
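The even division of step S103 can be sketched as follows. This is only an illustration under the assumption that times are plain numbers (for example, seconds); the function name `split_evenly` is hypothetical.

```python
def split_evenly(start, end, n_threads):
    """Split [start, end) into n_threads contiguous first time ranges."""
    if n_threads <= 0:
        raise ValueError("n_threads must be positive")
    span = end - start
    # Derive every boundary from the original endpoints so that rounding
    # error does not accumulate from one range to the next.
    bounds = [start + span * i / n_threads for i in range(n_threads + 1)]
    return list(zip(bounds[:-1], bounds[1:]))
```

With a 24-hour total time range (86400 seconds) and 10 threads, each first time range spans 8640 seconds.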

S104, determining the first number of the data to be inquired which are actually distributed in each first time range and the average number of the data to be inquired in the first time range.

The first number of pieces of data to be queried that are actually distributed in each first time range may be obtained in various ways. In one specific embodiment, once the respective first time ranges are determined, a third query request may be sent to the database based on the first time ranges, and further feedback information may be obtained from the database; this further feedback information may include the first number of pieces of data to be queried that are actually distributed in each first time range.

In another embodiment, if the second number of pieces of data to be queried in each unit time range within the total time range has already been determined, the first number of pieces of data to be queried actually distributed in each first time range may be determined based on the second numbers. For example, in the case where the first time range is 1 hour and the unit time range is 10 minutes, the first number of one first time range can be obtained by summing the second numbers of 6 consecutive unit time ranges. Of course, the first number may be determined in other ways in actual applications.

Regarding the average number of pieces, when the number of the divided first time ranges and the total number of pieces are determined, an averaging operation may be performed based on these two parameters, that is, the average number of pieces may be obtained.
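The second way of obtaining the first numbers, together with the averaging operation, can be sketched as below. The sketch assumes each first time range covers a whole number of unit time ranges; `first_counts` and `average_count` are hypothetical names.

```python
def first_counts(unit_counts, units_per_range):
    """Sum consecutive groups of unit counts into one count per first time range."""
    return [sum(unit_counts[i:i + units_per_range])
            for i in range(0, len(unit_counts), units_per_range)]

def average_count(total_count, n_ranges):
    """Average number of pieces per first time range."""
    return total_count / n_ranges
```

For example, with 10-minute unit time ranges and 1-hour first time ranges, `units_per_range` is 6.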

S105, judging whether the numerical relation between the first number and the average number meets a first condition or not.

Wherein, the first condition represents that the data to be inquired are distributed uniformly in the total time range. It should be noted that the uniform distribution does not mean that the data to be queried which meet the query condition are absolutely evenly distributed at equal time intervals, and it should be understood that the data to be queried which meet the query condition are basically evenly distributed in time sequence, and a large amount of data to be queried is not obviously gathered in a certain short time range.

Once the first numbers and the average number are obtained, it is determined for each first number whether its numerical relationship with the average number meets the first condition. Only when every such relationship meets the first condition is the data to be queried considered uniformly distributed without obvious data skew, in which case the numerical relationship between the first number and the average number is determined to meet the first condition. If the numerical relationship between any one first number and the average number does not meet the first condition, the data to be queried is determined to be unevenly distributed with obvious data skew, and the numerical relationship between the first number and the average number is determined not to meet the first condition.

S161, if yes, respectively creating a plurality of first query tasks based on each first time range, and concurrently processing the plurality of first query tasks through a plurality of threads to obtain query results.

That is, when the data to be queried is distributed evenly over time and there is no obvious data skew, a plurality of first query tasks are created based on the respective first time ranges. Each first query task then covers a first number of pieces of data to be queried, the first numbers are substantially the same, and the number of first query tasks equals the number of concurrently processing threads. The plurality of first query tasks are processed concurrently by a plurality of threads to obtain query results, as shown in FIG. 3. Because the data query is performed by multiple threads in parallel, the query efficiency can be significantly improved and the query time shortened. Optionally, the plurality of first query tasks may be processed concurrently by, for example, a plurality of threads in a Java thread pool, which also manages each first query task.
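A minimal sketch of processing one query task per time range on a thread pool is given below. The text mentions a Java thread pool as one option; this sketch instead uses Python's `concurrent.futures.ThreadPoolExecutor` as a stand-in. The function `run_query_tasks` and the callback `query_fn(start, end)` are hypothetical placeholders for whatever actually queries the database.

```python
from concurrent.futures import ThreadPoolExecutor

def run_query_tasks(time_ranges, query_fn, max_threads):
    """Run one query task per time range on a thread pool.

    query_fn(start, end) is a caller-supplied function (hypothetical here)
    that queries one time range. Results are returned in range order.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(lambda r: query_fn(*r), time_ranges))
```

The same helper applies to the second query tasks, since only the time ranges passed in differ.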

S162, if not, re-dividing the total time range into a plurality of second time ranges each containing the average number of pieces of data to be queried, respectively creating a plurality of second query tasks based on the second time ranges, and concurrently processing the second query tasks through a plurality of threads to obtain query results.

When the data to be queried is unevenly distributed over time and there is obvious data skew, creating tasks based on the first time ranges would leave some threads idle while other threads carry a much heavier load, so the query efficiency could not be effectively improved. Therefore, when obvious data skew is determined to exist, the total time range can be re-divided into a plurality of second time ranges based on the average number, so that the number of pieces of data to be queried contained in each second time range is substantially consistent with the average number. A plurality of second query tasks are then created based on the respective second time ranges, so that the second query tasks cover substantially the same number of pieces of data to be queried, and the plurality of second query tasks are processed concurrently by a plurality of threads to obtain query results.

Practice has shown that, taking a single query of 200,000 pieces of data as an example, the total time consumed to execute the query task using a search engine is 754 seconds, whereas with the data query method of the embodiment of the present application the total time consumed to query the 200,000 pieces of data is 149 seconds. The query time is thus reduced by about 80%, so the query efficiency is significantly improved.

As shown in conjunction with fig. 1, in some embodiments, the method further comprises:

S107, summarizing the obtained query results, and feeding back the summarized query results to the user terminal that sent the first query request.

In a specific implementation, a query task may produce one of two results. One is that the data to be queried is acquired. The other is that an exception occurs: the query task is re-executed up to the configured upper limit on the number of retries, and if the number of retries reaches that upper limit, the exception is recorded and also serves as a query result.

And summarizing all query results under the condition that all the first query tasks or the second query tasks are executed, wherein the query results comprise the acquired query data and abnormal condition records, and the summarized query results are fed back to the user terminal.
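The retry and summarizing behaviour can be sketched as follows. This is an illustration only: the result dictionaries, the function names `run_with_retry` and `summarize`, and the choice to catch every `Exception` are assumptions of the sketch, not details given in the application.

```python
def run_with_retry(task, max_retries):
    """Run task(); on an exception retry up to max_retries times,
    then record the exception as the query result."""
    attempts = 0
    while True:
        try:
            return {"ok": True, "data": task()}
        except Exception as exc:  # sketch only; narrow this in real code
            if attempts >= max_retries:
                return {"ok": False, "error": repr(exc)}
            attempts += 1

def summarize(results):
    """Merge acquired data and exception records into one response."""
    rows, errors = [], []
    for result in results:
        if result["ok"]:
            rows.extend(result["data"])
        else:
            errors.append(result["error"])
    return {"rows": rows, "errors": errors}
```

Recording the exception instead of raising it lets the remaining query tasks finish, so the summarized result can still be fed back to the user terminal.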

In some embodiments, referring to fig. 2, the determining whether the numerical relationship between the first number and the average number meets a first condition includes:

In a specific implementation, a first number may be greater than or less than the average number. Where the first number is greater than the average number, the average number may be subtracted from the first number; where the first number is less than the average number, the first number may be subtracted from the average number. That is, the difference between the first number and the average number is to be understood as an absolute value.

The first threshold is a threshold for indicating that the distribution of the data to be queried over time is skewed. When the difference between any one first number and the average number is greater than the first threshold, this indicates at least that the data to be queried in the corresponding first time range is not evenly distributed, and that this first time range may contain noticeably more or noticeably fewer pieces of data to be queried. As soon as this occurs, data skew is determined to exist. When the differences between all the first numbers and the average number are smaller than the first threshold, the number of pieces of data to be queried in every first time range differs only slightly from the average number, the data distribution is balanced, and there is no obvious data skew.

Of course, in practical applications, it is not limited to determine whether the numerical relationship between the first number and the average number meets the first condition by the difference, but may also determine whether the numerical relationship between the first number and the average number meets the first condition by, for example, a ratio or other numerical relationship. That is, the first condition is not limited to the first threshold configured based on the difference value, but may be another threshold configured based on, for example, a ratio, or another threshold configured based on other numerical relationships.
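The difference-based check and a ratio-based variant can be sketched as below. The text does not say how a difference exactly equal to the threshold is treated; this sketch treats it as not meeting the first condition, which is an assumption. The names `is_uniform`, `is_uniform_by_ratio` and `tolerance` are hypothetical.

```python
def is_uniform(first_counts, average, threshold):
    """First condition: every first number is within `threshold` of the average."""
    return all(abs(count - average) < threshold for count in first_counts)

def is_uniform_by_ratio(first_counts, average, tolerance):
    """Variant: every count/average ratio deviates from 1 by less than `tolerance`."""
    if average == 0:
        return all(count == 0 for count in first_counts)
    return all(abs(count / average - 1) < tolerance for count in first_counts)
```

A ratio-based threshold scales with the size of the query, whereas a difference-based threshold has to be chosen with the expected number of pieces in mind.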

That is, where the second number of pieces of data to be queried in each unit time range has been determined, the second numbers of consecutive unit time ranges may be summed in sequence until the sum matches the average number, and the time range formed by the summed unit time ranges is determined as a second time range. Matching should not be understood as requiring the sum of the second numbers to equal the average number exactly; it suffices that the sum is close to the average number, that is, that the difference between the sum of the second numbers and the average number is less than a second threshold. The second threshold may be the same as or different from the first threshold.

For example, in the case that the total time range is from 00:00 to 24:00 on a certain day and the unit time range is 10 minutes, the second numbers of the unit time ranges 00:00 to 00:10, 00:10 to 00:20, 00:20 to 00:30 and so on are summed in sequence until the difference between the sum of the second numbers and the average number is smaller than the second threshold. If, for example, this happens when the summation reaches 01:30, the time range from 00:00 to 01:30 is taken as the first second time range, the start time of the next second time range is 01:30, and so on until the division of the total time range is completed. In this way, the number of pieces of data to be queried contained in each second query task can be kept substantially consistent.
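The sequential summation can be sketched as follows. The sketch works on unit-range indices and makes two assumptions that the text leaves open: a group is also closed when a single unit time range pushes the running sum past the average, and any trailing unit time ranges that never reach the average form the last second time range. The function name `repartition` is hypothetical.

```python
def repartition(unit_counts, average, threshold):
    """Group consecutive unit time ranges into second time ranges.

    Returns (first_unit, end_unit_exclusive) index pairs. A group is closed
    once its running sum has reached the average or is within `threshold`
    of it (the overshoot rule is an assumption of this sketch).
    """
    ranges, begin, running = [], 0, 0
    for i, count in enumerate(unit_counts):
        running += count
        if running >= average or average - running < threshold:
            ranges.append((begin, i + 1))
            begin, running = i + 1, 0
    if begin < len(unit_counts):
        # Leftover unit time ranges form the final second time range.
        ranges.append((begin, len(unit_counts)))
    return ranges
```

An index pair `(a, b)` corresponds to the time range from `start + a * unit` to `start + b * unit`; with 10-minute units starting at 00:00, the pair `(0, 9)` corresponds to 00:00 to 01:30.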

Referring to fig. 4, an embodiment of the present application further provides a data query apparatus based on multi-thread concurrent processing, including:

an obtaining module 201, configured to obtain a first query request for requesting to query data, where the first query request includes a query condition;

a first determining module 202, configured to determine a total number of pieces of data to be queried that meet the query condition, and a total time range of distribution;

a dividing module 203, configured to divide the total time range into a plurality of first time ranges on average based on the number of concurrently processed threads;

a second determining module 204, configured to determine a first number of pieces of data to be queried actually distributed in each first time range, and an average number of pieces of data to be queried in the first time range;

a determining module 205, configured to determine whether a numerical relationship between the first number and the average number meets a first condition, where the first condition represents that data to be queried is uniformly distributed in a total time range;

a first query module 261, configured to, when a numerical relationship between the first number and the average number meets the first condition, create a plurality of first query tasks based on each of the first time ranges, and concurrently process the plurality of first query tasks through a plurality of threads to obtain a query result;

a second query module 262, configured to, when the numerical relationship between the first number and the average number does not meet the first condition, subdivide the total time range into a plurality of second time ranges that respectively include the average number of pieces of data to be queried, create a plurality of second query tasks based on the second time ranges, and concurrently process the plurality of second query tasks through a plurality of threads to obtain a query result.

In some embodiments, the first determining module 202 is specifically configured to:

sending a second query request to a database based on the query condition;

In some embodiments, the second query module 262 is specifically configured to:

In some embodiments, the determining module 205 is specifically configured to:

In some embodiments, the apparatus further comprises:

Referring to fig. 5, an electronic device is further provided in an embodiment of the present application, and includes at least a memory 301 and a processor 302, where the memory 301 stores a program, and the processor 302 implements the data query method according to any of the above embodiments when executing the program on the memory 301.

The embodiment of the present application further provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions in the computer-readable storage medium are executed, the data query method according to any of the above embodiments is implemented.

It will be apparent to one skilled in the art that embodiments of the present application may be provided as methods, electronic devices, computer-readable storage media, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. A general-purpose processor may be a microprocessor or any conventional processor or the like.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

The readable storage medium may be a magnetic disk, an optical disk, a DVD, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), etc., and the specific form of the storage medium is not limited in this application.

The above embodiments are only exemplary embodiments of the present application, and are not intended to limit the present application, and the protection scope of the present application is defined by the claims. Various modifications and equivalents may be made by those skilled in the art within the spirit and scope of the present application and such modifications and equivalents should also be considered to be within the scope of the present application.

Claims

1. A data query method based on multi-thread concurrent processing is characterized by comprising the following steps:

2. The method of claim 1, wherein the determining the total number of pieces of data to be queried which meet the query condition and the total time range of distribution comprises:

sending a second query request to a database based on the query condition;

3. The method of claim 2, wherein the repartitioning of the total time range into a plurality of second time ranges respectively containing the average number of pieces of data to be queried comprises:

4. The method of claim 1, wherein said determining whether a numerical relationship between said first number and said average number meets a first condition comprises:

5. The method of claim 1, further comprising:

6. A data query apparatus, comprising:

7. The apparatus of claim 6, wherein the first determining module is specifically configured to:

sending a second query request to a database based on the query condition;

8. The apparatus of claim 7, wherein the second query module is specifically configured to:

9. The apparatus of claim 6, wherein the determining module is specifically configured to:

10. The apparatus of claim 6, further comprising: