CN113626491B

CN113626491B - Data query method, device and distributed data query system

Info

Publication number: CN113626491B
Application number: CN202010388675.3A
Authority: CN
Inventors: 陈国栋; 赵世范; 姜伟浩
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2020-05-09
Filing date: 2020-05-09
Publication date: 2023-08-04
Anticipated expiration: 2040-05-09
Also published as: CN113626491A

Abstract

Embodiments of the invention provide a data query method, a data query device and a distributed data query system. The method comprises: querying each piece of original data in turn according to a first predicate judgment sequence to determine whether the piece of original data is target data that hits a plurality of predicates simultaneously, until the number of queried pieces of original data satisfies a preset sequence replacement condition or all the original data has been queried; when the number of queried pieces of original data satisfies the preset sequence replacement condition, calculating a current score of each of the plurality of predicates according to data characteristics of the queried original data; generating a second predicate judgment sequence by arranging the predicates from high to low according to the efficiency represented by their current scores; and taking the second predicate judgment sequence as a new first predicate judgment sequence and returning to the step of querying each piece of original data in turn according to the first predicate judgment sequence. Data query efficiency can thereby be improved.

Description

Data query method, device and distributed data query system

Technical Field

The present invention relates to the field of big data analysis technologies, and in particular, to a data query method, a device, and a distributed data query system.

Background

Data analysis often faces massive amounts of original data, not all of which is of interest to the user. The user may input a plurality of conditional expressions (hereinafter, predicates) representing filtering conditions, and an electronic device having a data query function, such as a computing engine, may perform a data query according to the predicates input by the user, so as to determine, from the original data, the target data that simultaneously satisfies the filtering conditions represented by all the predicates, and return all the target data. The user can then perform data analysis on the returned target data, which reduces the amount of computation required for the analysis. For convenience of description, "hitting a predicate" hereinafter means satisfying the filtering condition represented by that predicate.

When data query is performed, the electronic device may query for each original data in turn to determine whether the original data is target data. In the query process, the electronic device can sequentially use each input predicate to judge the original data according to the predicate judgment sequence, if the original data hits the predicate, the next predicate is continuously used for judging the original data, and if the original data does not hit the predicate, the judgment is stopped and the original data is determined not to be target data. If the original data hits all predicates, the original data is determined to be the target data.

Assume that the input predicates are predicates 1 to 5, and that a piece of original data hits predicates 1 to 4 but misses predicate 5. If predicate 5 is in the first position of the predicate judgment order, the original data can be determined not to be target data after judging one predicate; if predicate 5 is in the fifth position, five predicates must be judged before the same determination can be made. It can be seen that the predicate judgment order used during querying affects the efficiency of the data query. Therefore, how to set the predicate judgment order reasonably so as to improve data query efficiency is a technical problem to be solved.
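The effect of the judgment order in the example above can be checked with a minimal sketch. The function name `evaluations_needed` and the representation of predicates as integers are assumptions made only for illustration; they are not part of the described method.

```python
def evaluations_needed(order, missed):
    """Count how many predicates are judged before a verdict is reached.

    order  -- predicate identifiers in judgment order
    missed -- set of predicate identifiers that the piece of data misses
    """
    count = 0
    for predicate_id in order:
        count += 1
        if predicate_id in missed:
            # The first miss settles that the data is not target data.
            break
    return count


missed = {5}  # the data hits predicates 1 to 4 and misses predicate 5
print(evaluations_needed([5, 1, 2, 3, 4], missed))  # predicate 5 judged first
print(evaluations_needed([1, 2, 3, 4, 5], missed))  # predicate 5 judged last
```

With predicate 5 in the first position one judgment suffices, and with predicate 5 in the fifth position five judgments are needed, matching the text.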

Disclosure of Invention

The embodiment of the invention aims to provide a data query method, a data query device and a distributed data query system, so as to improve the data query efficiency. The specific technical scheme is as follows:

in a first aspect of an embodiment of the present invention, there is provided a data query method, including:

sequentially inquiring each original data according to a first predicate judgment sequence to determine whether the original data are target data hit by a plurality of predicates at the same time until the number of the inquired original data meets a preset sequence replacement condition or all the original data are inquired, wherein the first predicate judgment sequence is used for indicating the judgment sequence of the predicates in the inquiring process, and the first predicate judgment sequence is a preset initial sequence at the beginning;

When the number of the queried original data meets a preset sequence replacement condition, calculating a current score of each predicate of the plurality of predicates according to the data characteristics of the queried original data, wherein the current score is used for representing the efficiency when the predicate is used for judging whether the current queried original data is target data or not, and the data characteristics comprise the condition that the original data hits the predicate, and/or the time consumption of querying the original data;

generating a second predicate judgment sequence according to the sequence from high to low of the efficiency represented by the current score of each predicate;

and taking the second predicate judgment sequence as a new first predicate judgment sequence, and returning to execute the step of sequentially inquiring each piece of original data according to the first predicate judgment sequence.

In a possible embodiment, the calculating the current score of the predicate according to the data features of the original data already queried includes:

and calculating the current score of the predicate according to the data characteristics of the queried data in the stage, wherein the queried data in the stage is the original data which is queried by using the current first predicate judgment sequence.

In a possible embodiment, the calculating the current score of the predicate according to the data feature of the queried data of the stage includes:

calculating a stage score of the predicate according to the data characteristics of the queried data in the stage, wherein the stage score is used for representing the efficiency when the predicate is used for judging whether the queried data in the stage is target data or not;

a current score of the predicate is calculated based on the stage score of the predicate and a history score of the predicate, the history score initially being a preset initial score and the history score being a score according to which the predicate was last generated in a second predicate-judging order, the efficiency represented by the current score being positively correlated with the efficiency represented by the stage score and positively correlated with the efficiency represented by the history score.

In a possible embodiment, the calculating the stage score of the predicate according to the data feature of the queried data of the stage includes:

according to the data characteristics of the queried data in the stage, calculating the hit rate of the predicate and the time-consuming cost, wherein the hit rate is used for representing the proportion of the queried data in the stage to hit the original data of the predicate, and the time-consuming cost is used for representing the time consumed for judging whether the queried data in the stage hit the predicate;

A stage score of the predicate is calculated based on the hit rate and the time-consuming cost of the predicate, the stage score representing an efficiency inversely related to the hit rate and inversely related to the time-consuming cost.

In one possible embodiment, the hit rate is calculated by:

sampling from the queried data of the stage according to a preset sampling rule to obtain sampling data of the stage;

and counting the proportion of the original data that hits the predicate in the sampling data of the stage, and taking the proportion as the hit rate of the predicate.

In one possible embodiment, the time-consuming cost is calculated by:

counting the accumulated time consumed in judging whether the sampling data of the stage hits the predicate;

and carrying out mean normalization processing on the accumulated time consumption to obtain the time consumption cost of the predicate.
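The hit rate and time-consuming cost computations described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the sampling rule or the normalization formula, so taking every `step`-th record and dividing each accumulated time by the mean over all predicates are hypothetical choices, as are the names `sample_stage`, `measure` and `mean_normalize`.

```python
import time


def sample_stage(stage_records, step=10):
    """Assumed sampling rule: keep every `step`-th record of the stage."""
    return stage_records[::step]


def measure(predicate, sample):
    """Return (hit_rate, accumulated_seconds) for one predicate on the sample."""
    hits = 0
    start = time.perf_counter()
    for record in sample:
        if predicate(record):
            hits += 1
    elapsed = time.perf_counter() - start
    hit_rate = hits / len(sample) if sample else 0.0
    return hit_rate, elapsed


def mean_normalize(accumulated_times):
    """Assumed normalization: divide each time by the mean over all predicates."""
    mean = sum(accumulated_times) / len(accumulated_times)
    if mean == 0:
        return [0.0 for _ in accumulated_times]
    return [t / mean for t in accumulated_times]


records = [{"camera_id": i} for i in range(1000)]
sample = sample_stage(records, step=10)
hit_rate, seconds = measure(lambda r: r["camera_id"] >= 500, sample)
print(hit_rate, mean_normalize([1.0, 3.0]))
```

Under this reading the normalized costs are dimensionless and average to 1 across predicates, which makes them comparable with hit rates when a stage score is formed.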

In a possible embodiment, the calculating the current score of the predicate according to the phase score of the predicate and the history score of the predicate includes:

and carrying out weighted addition on the stage score of the predicate and the history score of the predicate to obtain the current score of the predicate.

In a second aspect of the embodiment of the present invention, there is provided a data query apparatus, the apparatus including:

the query module is used for sequentially querying each original data according to a first predicate judgment sequence to determine whether the original data are target data hit by a plurality of predicates at the same time until the number of the queried original data meets a preset sequence replacement condition or the query of all the original data is completed, wherein the predicate judgment sequence is used for representing the judgment sequence of the predicates in the query process, and the first predicate judgment sequence is a preset initial sequence at the beginning;

a score calculating module, configured to calculate, for each predicate of the plurality of predicates, a current score of the predicate when the number of the original data that has been queried satisfies a preset order replacement condition, the current score being used to represent efficiency when the predicate is used to determine whether the original data that has been queried currently is target data;

the reordering module is used for generating a second predicate judgment sequence according to the sequence from high to low of the efficiency represented by the current scoring of each predicate;

and the query module is further configured to use the second predicate judgment sequence as a new first predicate judgment sequence, and continue to execute the step of sequentially querying each piece of original data according to the first predicate judgment sequence.

In one possible embodiment, the score calculation module calculates a current score of the predicate from data features of the original data that have been queried, including:

calculating the current score of the predicate according to the data characteristics of the queried data of the stage, wherein the queried data of the stage is the original data queried by using the current first predicate judgment sequence.

In one possible embodiment, the score calculating module calculates a current score of the predicate according to the data feature of the queried data at the present stage, including:

In one possible embodiment, the score calculating module calculates a stage score of the predicate according to the data feature of the queried data of the stage, including:

In one possible embodiment, the hit rate is calculated by:

In one possible embodiment, the time-consuming cost is calculated by:

In one possible embodiment, the score calculation module calculates a current score of the predicate from the phase score of the predicate and a history score of the predicate, including:

In a third aspect of the embodiment of the present invention, there is provided an electronic device, including:

a memory for storing a computer program;

a processor for implementing the method steps of any of the above first aspects when executing a program stored on a memory.

In a fourth aspect of the embodiments of the present invention, there is provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the method steps of any of the first aspects described above.

The embodiments of the present invention also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the data query methods described above.

The embodiment of the invention has the beneficial effects that:

According to the data query method, the data query device and the distributed data query system provided by the embodiments of the invention, the predicate judgment sequence can be adaptively adjusted according to the queried original data, so that predicates with higher efficiency are arranged at earlier positions. Because the original data tends to have a certain similarity, a predicate that is more efficient in judging whether the queried original data is target data can be considered to be also more efficient in judging whether the not-yet-queried original data is target data. Arranging such predicates at earlier positions therefore improves the efficiency of querying the not-yet-queried original data, that is, improves the data query efficiency. Of course, it is not necessary for any one product or method practicing the invention to achieve all of the above advantages at the same time.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a data query method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a current score calculating method according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of a stage score calculating method according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart of another method for querying data according to an embodiment of the present invention;

FIG. 5a is a schematic diagram of a distributed data query system according to an embodiment of the present invention;

FIG. 5b is a schematic diagram of a pluggable unit enablement flow provided by an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a data query device according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order to describe the data query method provided by the embodiment of the present invention more clearly, one possible application scenario of the method is described below by way of example. It will be understood that, in other possible embodiments, the data query method provided by the embodiment of the present invention may be applied to other application scenarios, and the following example does not constitute any limitation.

Assume that 1,000,000 pieces of personnel data are stored in a database, and that a user, according to actual needs, wants to acquire specific personnel data therein, for example, personnel data that simultaneously satisfies the following screening conditions:

screening condition 1: the collection time is earlier than 2018-11-15 06:00:00;

screening condition 2: the collection time is later than 2018-11-15 00:00:01;

screening condition 3: the number of the camera for collecting the personnel data is greater than or equal to 9999;

screening condition 4: the number of the camera for collecting the personnel data is less than or equal to 19999;

screening condition 5: the number of the terminal equipment for collecting the personnel data is more than 153;

screening condition 6: the jacket is not worn;

screening condition 7: the sex is male.

The user may input a query instruction with predicates 1-7 into the computation engine of the database, where predicate 1 represents screening condition 1, predicate 2 represents screening condition 2, and so on. It will be appreciated that depending on the programming language used, and the programming specifications, the programming language representation of the various predicates and query instructions may vary, and that exemplary query instructions may be as follows:

where time < 2018-11-15 06:00:00 and time > 2018-11-15 00:00:01 and camera_id >= 9999 and camera_id <= 19999 and device_id > 153 and jacket_type != 1 and sex = male

Here, "time < 2018-11-15 06:00:00" is predicate 1, "time > 2018-11-15 00:00:01" is predicate 2, "camera_id >= 9999" is predicate 3, "camera_id <= 19999" is predicate 4, "device_id > 153" is predicate 5, "jacket_type != 1" is predicate 6, and "sex = male" is predicate 7.

After receiving the query instruction, the computing engine determines a predicate judgment sequence according to the statement order of the predicates in the query instruction: the earlier a predicate appears in the statement order, the earlier it is placed in the predicate judgment sequence. Taking the above query instruction as an example, the predicate judgment sequence may be: {predicate 1; predicate 2; predicate 3; predicate 4; predicate 5; predicate 6; predicate 7}. The computing engine may then query each of the 1,000,000 pieces of personnel data in turn.

Since the predicate judgment order is {predicate 1; predicate 2; predicate 3; predicate 4; predicate 5; predicate 6; predicate 7}, when querying a piece of personnel data the computing engine first judges it using predicate 1. If the personnel data misses predicate 1, it is determined not to be target data and the computing engine begins to query the next piece of personnel data. If the personnel data hits predicate 1, the computing engine continues to judge it using predicate 2. If the personnel data misses predicate 2, it is determined not to be target data and the computing engine begins to query the next piece of personnel data; if it hits predicate 2, the computing engine continues to judge it using predicate 3, and so on. If the personnel data hits every predicate up to and including predicate 7, it can be determined to be target data.
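The judgment procedure just described can be sketched for the example query. The field names come from the query instruction above; the record layout (a dictionary with a `datetime` value for `time` and the string "male" for `sex`) and the names `PREDICATES` and `is_target` are assumptions made for illustration.

```python
from datetime import datetime

# Predicates 1 to 7 of the example query, keyed by predicate number.
PREDICATES = {
    1: lambda r: r["time"] < datetime(2018, 11, 15, 6, 0, 0),
    2: lambda r: r["time"] > datetime(2018, 11, 15, 0, 0, 1),
    3: lambda r: r["camera_id"] >= 9999,
    4: lambda r: r["camera_id"] <= 19999,
    5: lambda r: r["device_id"] > 153,
    6: lambda r: r["jacket_type"] != 1,
    7: lambda r: r["sex"] == "male",
}


def is_target(record, order):
    """Judge one record with the predicates in the given order."""
    for predicate_id in order:
        if not PREDICATES[predicate_id](record):
            # First miss: stop judging, the record is not target data.
            return False
    # Every predicate was hit.
    return True


record = {
    "time": datetime(2018, 11, 15, 3, 0, 0),
    "camera_id": 10000,
    "device_id": 200,
    "jacket_type": 2,
    "sex": "male",
}
print(is_target(record, [1, 2, 3, 4, 5, 6, 7]))
```

The result for a given record does not depend on the order; only the number of predicates judged before a miss does.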

It can be seen that, if a piece of personnel data is not target data, the earlier the predicate that the personnel data misses is placed in the predicate judgment order, the less time it takes to query that personnel data. The predicate judgment order therefore affects the efficiency of querying the personnel data, and the theoretically optimal predicate judgment order differs depending on the personnel data in the database. Because the data in the database may change over time, it is difficult for the user to predict a good predicate judgment order.

The statement order of the predicates in the query instruction is often determined by the user from experience. On the one hand, the personnel data in the database may change over time, and these changes are somewhat unpredictable, so a statement order determined from experience may not suit the changed personnel data, resulting in low data query efficiency. On the other hand, if the user performs statistical analysis on the personnel data in the database before each data query in order to determine a better statement order, the statistical analysis itself takes time, which also lowers the data query efficiency.

Based on this, the embodiment of the present invention provides a data query method, and referring to fig. 1, fig. 1 is a schematic flow diagram of the data query method provided by the embodiment of the present invention, which may include:

S101, sequentially querying each piece of original data according to the first predicate judgment sequence to determine whether the piece of original data is target data that hits a plurality of predicates simultaneously, until the number of queried pieces of original data satisfies a preset sequence replacement condition or all the original data has been queried.

S102, when the number of the queried original data meets a preset sequence replacement condition, calculating the current score of each predicate in a plurality of predicates according to the data characteristics of the queried original data.

S103, generating a second predicate judgment sequence according to the sequence from high to low of the efficiency represented by the current score of each predicate.

S104, taking the second predicate judgment sequence as a new first predicate judgment sequence, and returning to S101.
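Steps S101 to S104 can be sketched as a single loop. This is a minimal sketch rather than the claimed implementation: the scoring function is passed in as a parameter, and the stand-in scorer `miss_rate_scores` (score equals the fraction of queried data that misses the predicate, higher meaning more efficient) is an assumption, as are all function and parameter names. For brevity the sketch keeps every queried record in memory.

```python
def miss_rate_scores(queried, predicates):
    """Stand-in scorer (assumption): score = fraction of data missing the predicate."""
    return {
        name: 1 - sum(1 for r in queried if pred(r)) / len(queried)
        for name, pred in predicates.items()
    }


def adaptive_query(records, predicates, initial_order, interval, score_fn):
    order = list(initial_order)
    targets = []
    queried = []
    for count, record in enumerate(records, start=1):
        # S101: judge the record in the current order; all() stops at the first miss.
        if all(predicates[name](record) for name in order):
            targets.append(record)
        queried.append(record)
        if count % interval == 0:
            # S102: score every predicate from the data queried so far.
            scores = score_fn(queried, predicates)
            # S103: arrange predicates from high to low efficiency.
            # S104: the new order is used for the records that follow.
            order = sorted(order, key=lambda name: scores[name], reverse=True)
    return targets, order


predicates = {"even": lambda x: x % 2 == 0, "big": lambda x: x >= 90}
targets, final_order = adaptive_query(
    range(100), predicates, ["even", "big"], interval=50, score_fn=miss_rate_scores
)
print(targets, final_order)
```

In this toy run the predicate "big" is missed far more often than "even", so it is moved to the front after the first 50 records, while the set of target data returned is unaffected by the reordering.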

With this embodiment, the predicate judgment sequence can be adaptively adjusted according to the queried original data, so that predicates with higher efficiency are placed at earlier positions. Changes in the original data often have a certain continuity. The continuity may be spatial, for example, original data collected in adjacent areas may be highly similar, or temporal, for example, original data collected in two adjacent time windows may be highly similar. The original data can therefore be considered to have a certain similarity, so a predicate that is more efficient in judging whether the queried original data is target data can be considered to be also more efficient in judging whether the not-yet-queried original data is target data. Placing such predicates at earlier positions therefore improves the efficiency of querying the not-yet-queried original data, that is, improves the data query efficiency.

In S101, the first predicate judgment order is used to represent the judgment orders of the predicates in the query process, where the first predicate judgment order is a preset initial order at the beginning. The preset initial sequence may be preset by a user according to experience, or may be calculated according to a preset rule, which is not limited in this embodiment.

Queried original data refers to original data on which a query has been performed according to the first predicate judgment sequence; for how a query is performed according to a predicate judgment sequence, reference may be made to the foregoing description, which is not repeated here. The preset sequence replacement condition may differ according to the application scenario. In an exemplary embodiment, the preset sequence replacement condition may be that the number is an integer multiple of a preset replacement interval, and the preset replacement interval may be set according to actual requirements or user experience. For example, the preset replacement interval may be 3,000,000, that is, the preset sequence replacement condition is satisfied whenever the number of queried pieces of original data reaches 3,000,000, 6,000,000, 9,000,000, and so on.
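The exemplary replacement condition amounts to a divisibility check, sketched below. The function name `should_replace_order` is hypothetical, and treating a count of zero as not satisfying the condition is an assumption.

```python
def should_replace_order(queried_count, interval=3_000_000):
    """True when the queried count is a positive integer multiple of the interval."""
    return queried_count > 0 and queried_count % interval == 0


print([n for n in (2_999_999, 3_000_000, 6_000_000, 9_000_000) if should_replace_order(n)])
```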

Theoretically, the number of queried pieces of original data should satisfy the preset order replacement condition at least once before all the original data has been queried. If the condition is never satisfied before all the original data has been queried, the total amount of original data can be considered small, and in that case the data query efficiency has little effect on the time taken to complete the data query. For example, assuming that a data query performed in a specific predicate judgment order takes 0.3 ms, even if optimizing the predicate judgment order improved the efficiency by 200%, the time consumption would be shortened by only 0.2 ms. In this case the technical problem to be solved by the data query method provided by the embodiment of the present invention does not arise, so this case is not discussed further below.

In S102, the data characteristics include a condition that the original data hits the predicate, and/or a time consuming query of the original data. How the data features affect the predicate efficiency will be described in the following embodiments, and will not be described here.

The current score is used to represent the efficiency when the predicate is used to determine whether the original data that has been currently queried is target data. The calculation of the current score will be described in the following embodiments, and will not be described here.

It may be appreciated that the representation of the current score may be different according to the actual application scenario, for example, the current score may be a numerical value, may be a letter, or may be another character, which is not limited in this embodiment. And, when the current score is a numerical value, the numerical value of the current score may be positively or negatively correlated with the efficiency represented by the current score. For example, in one possible embodiment, the lower the current score, the higher the efficiency that the current score represents, and in another possible embodiment, the higher the current score, the higher the efficiency that the current score represents.

In S103, if a piece of original data is target data, all the predicates need to be used to judge it; if a piece of original data is not target data, the predicates located after the first predicate that it misses in the predicate judgment order need not be used. Therefore, the earlier the more efficient predicates are placed in the predicate judgment order, the higher the efficiency of querying the original data is in theory. It can therefore be considered that querying the already queried original data in the second predicate judgment order would theoretically be more efficient than querying it in the first predicate judgment order.

In S104, as analyzed above, the original data often has a certain similarity. The similarity may depend on the continuity of changes in the original data, and may also depend on other factors, which will be described in the following embodiments and are not repeated here. For example, assuming that 50% of the 1st to 3,000,000th pieces of original data are personnel data of males, the proportion of personnel data of males among the 3,000,001st to 6,000,000th pieces of original data is, in theory, likely to be close to 50%. It can therefore be considered that querying the already queried original data and querying the not-yet-queried original data in the same predicate judgment order have similar efficiency. Combined with the analysis in S103, querying the not-yet-queried original data in the second predicate judgment order is theoretically more efficient than querying it in the first predicate judgment order. Therefore, taking the second predicate judgment order as the new first predicate judgment order and continuing the query can effectively improve the data query efficiency.

Whether the current score is calculated accurately affects whether the predicates in the second predicate judgment sequence are accurately ordered by efficiency. As analyzed above, the earlier the more efficient predicates are placed in the predicate judgment order, the higher the efficiency of querying the original data is in theory. Therefore, whether the current score can be calculated accurately directly affects the efficiency of the data query.

Based on this, in one possible embodiment, the current score of the predicate may be calculated from the data characteristics of the queried data at this stage. The queried data in the stage is original data which is queried by using the current first predicate judgment sequence.

If the number of queried pieces of original data satisfies the preset order replacement condition for the first time when the current score is calculated, the current first predicate judgment order is the initial first predicate judgment order, that is, the preset initial order. Otherwise, the current first predicate judgment order is the second predicate judgment order generated most recently.

Take as an example the preset sequence replacement condition that the number is an integer multiple of the preset replacement interval, with a preset replacement interval of 3000000. When the number of queried pieces of original data reaches 3000000, the preset sequence replacement condition is satisfied for the first time; at this time, the current first predicate judgment sequence is the preset initial sequence, and the queried data of the current stage is the 1st to 3000000th pieces of original data. When the number of queried pieces of original data reaches 6000000, the preset sequence replacement condition is satisfied for the second time; the current first predicate judgment sequence is the second predicate judgment sequence generated when the number of queried pieces of original data reached 3000000, and the queried data of the current stage is the 3000001st to 6000000th pieces of original data, and so on.
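The mapping from the queried count to the queried data of the current stage in this example can be written out as a small helper. The name `stage_bounds` and the 1-based inclusive convention are assumptions made to mirror the numbering used in the text.

```python
def stage_bounds(queried_count, interval=3_000_000):
    """Return the 1-based inclusive range of records in the stage that just ended."""
    if queried_count <= 0 or queried_count % interval != 0:
        raise ValueError("the replacement condition is not satisfied at this count")
    return queried_count - interval + 1, queried_count


print(stage_bounds(3_000_000))
print(stage_bounds(6_000_000))
```

The two calls correspond to the first and second times the replacement condition is satisfied in the example.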

It can be understood that the original data have similarity, and since the original data are arranged according to a certain rule, the similarity between the two original data is inversely related to the number of the original data spaced between the two original data. For example, assuming that the original data are performance parameters of an electronic device at different moments, since the performance parameters of the electronic device often change regularly in a certain period of time, the shorter the interval between the acquisition times of the two original data is, the stronger the similarity between the two original data is, whereas in the application scenario, the original data are often arranged according to the acquisition times, so the fewer the number of the original data at the interval between the two original data is, the stronger the similarity between the two original data is.

Thus, it can be considered that, among the queried original data, the queried data of the current stage is more similar to the not-yet-queried original data than the other queried original data is. The current score calculated from the data characteristics of the queried data of the current stage therefore better reflects the efficiency of using the predicate to judge whether the not-yet-queried original data is target data. Adopting this embodiment can thus effectively improve the accuracy of the calculated current score and further improve the data query efficiency.

In some application scenarios, if the number of the queried original data many times meets the preset order replacement condition, the number of queried data in the stage may be smaller than the number of the queried original data, and the original data has certain randomness, so the queried data in the stage may be limited to the limited number, the data characteristics of the queried data in the stage may be greatly influenced by the randomness of the data, so the calculated current score may not reflect the efficiency of using the predicate on whether the queried original data is the target data or not, and the stability of the data query efficiency is lower.

Based on this, in a possible embodiment, referring to fig. 2, fig. 2 is a schematic flow chart of a current score calculating method according to an embodiment of the present invention, which may include:

S201, calculating the stage score of the predicate according to the data characteristics of the queried data of this stage.

The phase score is used for indicating the efficiency of judging whether the queried data in the phase is target data or not by using the predicate.

S202, calculating the current score of the predicate according to the stage score of the predicate and the history score of the predicate.

The history score is the score of the predicate on which the last generation of the second predicate judgment sequence was based, and its initial value is a preset score. That is, if the current score is being calculated when the number of queried original data satisfies the preset sequence replacement condition not for the first time, the history score of the predicate is the current score of the predicate when the second predicate judgment sequence was last generated. If the number of queried original data satisfies the preset sequence replacement condition for the first time, the history score of the predicate is the preset score. The efficiency represented by the current score is positively correlated with the efficiency represented by the stage score and positively correlated with the efficiency represented by the history score.

In one possible embodiment, the preset score may be used to represent the estimated efficiency of using the predicate to judge whether original data are target data. The preset score may be set in advance according to user experience, or may be calculated according to a preset rule, which is not limited in this embodiment. In this embodiment, the preset initial sequence is generated by ordering the predicates from high to low by the estimated efficiency represented by their preset scores.

It will be appreciated that the history score can represent the efficiency of using the predicate to judge whether the queried original data other than the queried data of this stage are target data, while the stage score represents the efficiency of using the predicate to judge whether the queried data of this stage are target data. As analyzed above, the queried data of this stage are more similar to the original data not yet queried, while the queried original data as a whole are usually more numerous than the queried data of this stage. Calculating the current score from both the stage score and the history score therefore lets the current score reflect more accurately the efficiency of using the predicate on original data not yet queried, and makes it more robust to the randomness of the original data.

In one possible embodiment, the current score may be obtained by weighted addition of the stage score and the history score. For example, assuming a stage score of R, a history score of S, and a current score of T, the current score may be calculated according to the following equation:

T = f*S + (1-f)*R

Here, f is a preset contact factor whose value range is (0, 1). The value of f may differ between practical application scenarios; for example, in one possible embodiment, f = 0.35.
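The weighted addition can be illustrated with a short sketch. The function name `current_score`, the preset score of 0.50, and the three stage scores are hypothetical values chosen only for the example; the update `T = f*S + (1-f)*R` and the default f = 0.35 follow the text above.

```python
def current_score(stage_score: float, history_score: float, f: float = 0.35) -> float:
    """Blend the stage score R and the history score S as T = f*S + (1 - f)*R."""
    if not 0.0 < f < 1.0:
        raise ValueError("f must lie in the open interval (0, 1)")
    return f * history_score + (1.0 - f) * stage_score


preset_score = 0.50               # hypothetical preset score (initial history score)
stage_scores = [0.80, 0.20, 0.60]  # hypothetical stage scores of three successive stages

history = preset_score
trace = []
for r in stage_scores:
    t = current_score(r, history)
    trace.append(t)
    # The current score of this round serves as the history score of the next round.
    history = t
```

With these numbers the current score moves from 0.695 to about 0.373 and then to about 0.521, so a single unusual stage (0.20) pulls the score down without erasing what earlier stages indicated.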

The calculation of the stage score is described below. Because the history score is either the score on which the last generation of the second predicate judgment sequence was based or the preset score, its determination follows from the related descriptions and is not repeated here.

Referring to fig. 3, fig. 3 is a schematic flow chart of a stage score calculating method according to an embodiment of the present invention, which may include:

S301, calculating the hit rate and the time-consuming cost of the predicate according to the data characteristics of the queried data of this stage.

The hit rate is used to represent the proportion of the queried data of this stage that hit the predicate. For example, assuming that the queried data of this stage comprise 3000000 pieces of original data in total, of which 1500000 hit the predicate, the hit rate of the predicate is 50%.

It will be appreciated that counting over all the queried data of this stage requires a large amount of computation. Therefore, in one possible embodiment, the queried data of this stage may be sampled according to a sampling rule to obtain sampled data of this stage, and the proportion of the sampled data of this stage that hit the predicate is counted and used as the hit rate of the predicate. This embodiment reduces the amount of computation needed to calculate the stage score.

The time-consuming cost is used to represent the time spent judging whether the queried data of this stage hit the predicate. It can be understood that determining the time-consuming cost from all the queried data of this stage requires a large amount of computation. Therefore, in one possible embodiment, the queried data of this stage may be sampled according to a sampling rule to obtain sampled data of this stage; the accumulated time spent judging whether the sampled data of this stage hit the predicate is counted, and average normalization is applied to the accumulated time to obtain the time-consuming cost of the predicate. This embodiment reduces the amount of computation needed to calculate the stage score.

S302, calculating the stage score of the predicate according to the hit rate and the time-consuming cost of the predicate.

The efficiency represented by the stage score is negatively correlated with the hit rate and negatively correlated with the time-consuming cost. It will be appreciated that a lower hit rate means that original data are less likely to hit the predicate. Since original data that fail to hit any one predicate can be determined not to be target data, a lower hit rate means the predicate is more likely to establish that original data are not target data, that is, the predicate is more efficient for judging whether original data are target data. Efficiency is also inversely related to time consumption: the lower the time-consuming cost of a predicate, the more efficient it is to use that predicate to judge whether original data are target data.

With this embodiment, the stage score can be calculated in a relatively simple and accurate manner based on the time-consuming cost and hit rate.

On the premise that the efficiency represented by the stage score is inversely related to the hit rate and inversely related to the time-consuming cost, the specific calculation formula of the stage score can be different according to the application. Illustratively, in one possible embodiment, it may be calculated according to the following formula:

wherein, H is hit rate, and C is time-consuming cost.
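As a minimal sketch of S301 and S302, the following code derives hit rates and time-consuming costs from sampled counts and then scores each predicate. Two parts are assumptions rather than the method of this embodiment: "average normalization" is taken here to mean dividing each predicate's per-sample time by the mean over all predicates, and the scoring function `(1 - H) / C` is only one hypothetical choice that falls as either the hit rate or the time-consuming cost rises. All function names and sample figures are illustrative.

```python
from typing import List


def hit_rates(hit_counts: List[int], sample_num: int) -> List[float]:
    """Share of sampled rows that hit each predicate."""
    return [h / sample_num for h in hit_counts]


def time_costs(accumulated_seconds: List[float], sample_num: int) -> List[float]:
    """Per-sample time of each predicate divided by the mean over all predicates.

    Assumed interpretation of 'average normalization': a predicate of average
    cost gets 1.0, cheaper predicates get less, more expensive ones get more.
    """
    per_sample = [t / sample_num for t in accumulated_seconds]
    mean = sum(per_sample) / len(per_sample)
    if mean == 0.0:
        return [1.0] * len(per_sample)  # no measurable cost: treat all as equal
    return [c / mean for c in per_sample]


def stage_score(hit_rate: float, cost: float) -> float:
    """Hypothetical score: decreases as hit rate H or cost C increases."""
    return (1.0 - hit_rate) / max(cost, 1e-12)


sample_num = 1000                      # hypothetical number of sampled rows
hits = [500, 100, 900]                 # hypothetical hit counts of three predicates
seconds = [0.002, 0.004, 0.006]        # hypothetical accumulated judging time

H = hit_rates(hits, sample_num)
C = time_costs(seconds, sample_num)
scores = [stage_score(h, c) for h, c in zip(H, C)]
# Predicate indices ordered from the highest to the lowest stage score.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this example the first predicate ranks highest because it is the cheapest, even though the second predicate has the lowest hit rate; a different scoring function could weigh the two factors differently.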

As analyzed above, when the total number of original data is small, data query efficiency has little effect on the time taken to complete the data query; that is, whichever predicate judgment sequence is selected, the time taken to complete the data query is within an acceptable range. Moreover, calculating the current score and generating the second predicate judgment sequence themselves consume a certain amount of computation. Therefore, in one possible embodiment, it may be judged whether the total number of original data is greater than a preset number threshold. If the total number is greater than the preset number threshold, the data query method provided by the embodiment of the present invention may be used for the query; if the total number is not greater than the preset number threshold, any data query method may be used.
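The threshold check amounts to a single strict comparison, sketched below. The function name `use_adaptive_query` is hypothetical, and the default of 9000000 is merely an example value.

```python
def use_adaptive_query(total_rows: int, threshold: int = 9_000_000) -> bool:
    """True only when the total number of rows is strictly greater than the threshold."""
    return total_rows > threshold


at_threshold = use_adaptive_query(9_000_000)     # not greater than the threshold
above_threshold = use_adaptive_query(9_000_001)  # greater than the threshold
```

A total exactly equal to the threshold is "not greater than" it, so the reordering logic is skipped in that case.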

For a clearer description of the data query method provided by the embodiment of the present invention, the following description will be made with reference to fig. 4, where fig. 4 is another flow chart of the data query method provided by the embodiment of the present invention, and the flow chart may include:

S401, judging whether the total number of original data is greater than a preset number threshold, denoted threshold; if the total number is greater than threshold, executing S402; if the total number is not greater than threshold, ending the data query.

After the data query is ended, any other data query method may be used to perform the data query, which is not limited in this embodiment. The value of threshold may differ between application scenarios, for example 9000000.

S402, querying the next piece of original data, and incrementing rowsNum, the count of queried original data.

The flow of the query may be referred to the foregoing related description, and will not be described herein.

S403, judging whether rowsNum is divisible by a preset sampling interval sampleInterval; if rowsNum is divisible by sampleInterval, executing S404; if not, returning to execute S402.

The value of sampleInterval may be different according to different application scenarios, for example, 3000.

S404, sampling the current piece of original data, and accumulating the sampled-data count sampleNum, the hit count array hitNumArray, and the time-consuming cost array costArray.

Each time a piece of original data is sampled, sampleNum is incremented by 1. hitNumArray is a sequence in which each element corresponds to one predicate; each time the sampled original data hit a predicate, the value of the element corresponding to that predicate is incremented by 1. The initial value of each element in the sequence is 0.

costArray is a sequence in which each element corresponds to one predicate. Each time a piece of original data is sampled, the time spent judging whether that original data hits each predicate is measured and added to the value of the element corresponding to that predicate. The initial value of each element in the sequence is 0.

S405, judging whether rowsNum is divisible by the preset replacement interval computeInterval; if rowsNum is divisible by computeInterval, executing S406; if not, returning to execute S402.

In this example, computeInterval is an integer multiple of sampleInterval, so rowsNum can be divisible by computeInterval only when it is divisible by sampleInterval; S405 is therefore performed after S403. In other possible embodiments, if computeInterval is not an integer multiple of sampleInterval, S405 may be performed in parallel with S403, which is not limited in this example.

For computeInterval, reference may be made to the foregoing description of the preset replacement interval, which is not repeated here. In this example, the value of computeInterval may be 3000000.

S406, calculating a stage score.

For calculation of the stage score, reference may be made to the descriptions of S301 and S302, which are not described herein. In this example, the hit rate of one predicate may be calculated by:

The hit rate of the predicate may be the quotient obtained by dividing the value of the element corresponding to the predicate in hitNumArray by sampleNum.

The time-consuming cost of each predicate may be obtained by applying average normalization to the values of the elements of costArray, each normalized value being the time-consuming cost of the predicate corresponding to that element.

S407, calculating the current score.

For calculating the current score according to the phase score, reference may be made to the description of the step S202, which is not repeated here.

S408, adjusting the predicate judgment sequence according to the current score.

That is, a second predicate judgment sequence is generated based on the current scores, and the second predicate judgment sequence is used as the new first predicate judgment sequence. For how the second predicate judgment sequence is generated, reference may be made to the related description of S103, which is not repeated here.

S409, judging whether the inquiry of all the original data is completed, if so, ending the data inquiry, and if not, returning to execute S402.

After the data query is completed, all target data may be returned.
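The flow of S401 to S409 can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function name `adaptive_query` is hypothetical; inputs at or below the threshold fall back to a plain fixed-order scan; the stage score uses the hypothetical form `(1 - H) / C`; the cost is normalized by the mean over predicates; and the sampling counters are reset after every reordering so that each stage score reflects only its own stage.

```python
import time
from typing import Any, Callable, List, Sequence

Predicate = Callable[[Any], bool]


def adaptive_query(rows: Sequence[Any], predicates: List[Predicate],
                   preset_scores: List[float], threshold: int,
                   sample_interval: int, compute_interval: int,
                   f: float = 0.35) -> List[Any]:
    # S401: for small inputs, skip the adaptive logic (assumed fallback: fixed order).
    if len(rows) <= threshold:
        return [r for r in rows if all(p(r) for p in predicates)]

    n = len(predicates)
    history = list(preset_scores)
    # Preset initial sequence: highest preset score first.
    order = sorted(range(n), key=lambda i: history[i], reverse=True)
    sample_num, hit_num_array, cost_array = 0, [0] * n, [0.0] * n
    targets: List[Any] = []
    rows_num = 0

    for row in rows:                              # S402; loop exit is S409
        rows_num += 1
        # all() stops at the first predicate that is not hit.
        if all(predicates[i](row) for i in order):
            targets.append(row)
        if rows_num % sample_interval != 0:       # S403
            continue
        # S404: evaluate every predicate on the sampled row and record statistics.
        sample_num += 1
        for i, p in enumerate(predicates):
            start = time.perf_counter()
            hit = p(row)
            cost_array[i] += time.perf_counter() - start
            hit_num_array[i] += int(hit)
        if rows_num % compute_interval != 0:      # S405
            continue
        mean_cost = (sum(cost_array) / n) or 1.0
        for i in range(n):
            h = hit_num_array[i] / sample_num             # S406: hit rate
            c = max(cost_array[i] / mean_cost, 1e-12)     # S406: normalized cost
            stage = (1.0 - h) / c                         # hypothetical stage score
            history[i] = f * history[i] + (1.0 - f) * stage   # S407
        # S408: the new sequence runs from the highest to the lowest current score.
        order = sorted(range(n), key=lambda i: history[i], reverse=True)
        sample_num, hit_num_array, cost_array = 0, [0] * n, [0.0] * n
    return targets


rows = list(range(100))
preds = [lambda x: x % 2 == 0, lambda x: x > 90]
result = adaptive_query(rows, preds, [0.5, 0.5], threshold=10,
                        sample_interval=5, compute_interval=20)
```

Because the target data must hit all predicates, the returned rows are the same whichever judgment sequence is in effect; only the amount of work per row changes. With the example values mentioned in this flow (sampleInterval of 3000 and computeInterval of 3000000), each stage would contribute 1000 samples.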

Because the amount of computation in a data query is often high, the related art often adopts a distributed data query system comprising a master node and a plurality of slave nodes. For example, Apache Spark (a distributed data query system) includes a Driver (master node) and a plurality of Executors (slave nodes). The Driver is the main entry and result aggregation port of an application program, and is responsible for distributing original data and for scheduling and monitoring data query tasks. Each Executor executes the data query tasks scheduled by the Driver.

In one possible embodiment, when the master node receives a data query task, the data query task may be divided into a plurality of data query sub-tasks, and each data query sub-task is scheduled to each slave node, so that each slave node executes the scheduled data query sub-task, thereby completing the data query task received by the master node.

When executing a data query subtask, a slave node may receive a first predicate judgment sequence issued by the master node, query the original data targeted by the subtask according to that sequence, and feed part of the queried original data back to the master node. The master node then calculates the current score of each predicate based on the queried original data fed back by the slave nodes, generates a second predicate judgment sequence, and issues it to the slave nodes, so that each slave node uses the second predicate judgment sequence as the new first predicate judgment sequence and continues to query the original data. In this embodiment, however, feeding queried original data back from the slave nodes to the master node wastes bandwidth between the master node and the slave nodes. In addition, the original data queried by different slave nodes may differ, while the master node adjusts the predicate judgment sequence globally; as a result, the second predicate judgment sequence may improve the data query efficiency of only some slave nodes and reduce that of others.

Based on this, an embodiment of the present invention provides a distributed data query system, which may include a master node 501 and a plurality of slave nodes 502 as shown in fig. 5 a. Wherein the master node 501 is configured to allocate raw data to each slave node 502. The slave node 502 is configured to perform data query on the allocated original data according to any data query method provided by the embodiment of the present invention, so as to return the target data to the master node 501.

By adopting this embodiment, each slave node can adjust the predicate judgment sequence independently without feeding queried original data back to the master node, which effectively reduces the bandwidth pressure between the master node and the slave nodes. Each slave node can also adjust the predicate judgment sequence specifically for the original data allocated to it, so the adjusted sequence is better suited to that data. The data query efficiency of every slave node can thus be fully improved, and the efficiency of the data query system is improved as a whole.
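The division of work between the master node and the slave nodes can be sketched with threads standing in for slave nodes. Everything here is an assumption made for illustration: the names `slave_query` and `master_query`, the strided partitioning, and in particular the local reordering rule, which simply moves the predicates that rejected the most rows to the front and is a crude stand-in for the scoring described earlier.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence

Predicate = Callable[[Any], bool]


def slave_query(partition: Sequence[Any], predicates: List[Predicate],
                reorder_every: int = 1000) -> List[Any]:
    """Runs on one slave node: keeps its own predicate order, returns only target data."""
    order = list(range(len(predicates)))
    misses = [0] * len(predicates)
    targets: List[Any] = []
    for count, row in enumerate(partition, start=1):
        for i in order:
            if not predicates[i](row):
                misses[i] += 1   # this predicate rejected the row
                break
        else:
            targets.append(row)  # no predicate rejected the row
        if count % reorder_every == 0:
            # Local adjustment only: nothing is sent back to the master node.
            order.sort(key=lambda i: misses[i], reverse=True)
            misses = [0] * len(predicates)
    return targets


def master_query(rows: Sequence[Any], predicates: List[Predicate],
                 num_slaves: int = 4) -> List[Any]:
    """Master node: allocates original data, then collects the returned target data."""
    partitions = [rows[k::num_slaves] for k in range(num_slaves)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        results = list(pool.map(lambda part: slave_query(part, predicates), partitions))
    return sorted(t for part in results for t in part)


data = list(range(10_000))
found = master_query(data, [lambda x: x % 3 == 0, lambda x: x % 5 == 0])
```

Only target data travel from the slave nodes to the master node in this sketch, and each slave node's order may diverge from the others according to its own partition.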

In one possible embodiment, the slave node 502 may include a pluggable plug-in. The slave node 502 may be specifically configured to load a running environment set for the pluggable plug-in and, in that running environment, to execute any of the data query methods provided by the embodiments of the present invention on the allocated original data through the pluggable plug-in.

By adopting the embodiment, the logic of the predicate judgment sequence adjustment of the slave node can be independently maintained, upgraded and expanded, the interference to other business logic of the slave node is avoided, and the running stability of the slave node and the flexibility of data query are improved.

The business logic by which the slave node 502 enables the pluggable plug-in may include, as shown in fig. 5 b:

S510, loading a running environment set for the pluggable plug-in.

The configuration file set for the pluggable plug-in may be read to load the configuration parameters it contains, thereby loading the running environment set for the pluggable plug-in.

S520, judging whether the running environment is loaded successfully, if so, executing S530, and if not, executing S550.

S530, judging whether an enabling switch set for the pluggable plug-in is turned on; if the enabling switch is turned on, executing S540; if the enabling switch is not turned on, ending the enabling of the pluggable plug-in.

S540, enabling the pluggable plugin to realize any data query method provided by the embodiment of the invention through the pluggable plugin.

S550, outputting the abnormality information.

It will be appreciated that if loading fails, the running environment set for the pluggable plug-in may contain a setting error; exception information may therefore be output to alert relevant personnel to perform maintenance.
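The enabling flow of S510 to S550 can be sketched as below. The JSON configuration format, the `enabled` key, the function name `enable_plugin`, and the three status strings are all assumptions made for this illustration; the embodiment does not specify a file format.

```python
import json
import logging
import os
import tempfile


def enable_plugin(config_path: str) -> str:
    """Return 'enabled', 'disabled', or 'error' for the pluggable plug-in."""
    try:
        # S510: load the running environment from the plug-in's configuration file.
        with open(config_path, encoding="utf-8") as fh:
            config = json.load(fh)
        if not isinstance(config, dict):
            raise ValueError("configuration root must be a JSON object")
    except (OSError, ValueError) as exc:
        # S520 failed, so S550: output exception information for maintenance.
        logging.error("plug-in running environment failed to load: %s", exc)
        return "error"
    # S530: check the enabling switch (hypothetical key name).
    if not config.get("enabled", False):
        return "disabled"
    # S540: the plug-in would now run the data query method on the allocated data.
    return "enabled"


with tempfile.TemporaryDirectory() as tmp:
    on_path = os.path.join(tmp, "on.json")
    off_path = os.path.join(tmp, "off.json")
    with open(on_path, "w", encoding="utf-8") as fh:
        json.dump({"enabled": True}, fh)
    with open(off_path, "w", encoding="utf-8") as fh:
        json.dump({"enabled": False}, fh)
    status_on = enable_plugin(on_path)
    status_off = enable_plugin(off_path)
    status_missing = enable_plugin(os.path.join(tmp, "absent.json"))
```

A switch that is turned off is treated as a normal outcome rather than an error, so only a failed load produces exception information.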

Referring to fig. 6, fig. 6 is a schematic structural diagram of a data query device according to an embodiment of the present invention, which may include:

the query module 601 is configured to query each piece of original data sequentially according to a first predicate judgment sequence, so as to determine whether the original data is target data hitting a plurality of predicates at the same time, until the number of queried original data meets a preset sequence replacement condition or the query of all original data is completed, where the predicate judgment sequence is used to represent a judgment sequence of the plurality of predicates in the query process, and the first predicate judgment sequence is a preset initial sequence at the beginning;

a score calculating module 602, configured to calculate, for each predicate of the plurality of predicates, a current score of the predicate when the number of the original data that have been queried satisfies a preset order replacement condition, according to a data characteristic of the original data that have been queried, the current score being used to represent an efficiency when the predicate is used to determine whether the original data that have been queried currently is target data, the data characteristic including a condition that the original data hits the predicate, and/or a time-consuming time for querying the original data;

a reordering module 603, configured to generate a second predicate judgment sequence according to a sequence from high to low of the efficiency represented by the current scoring of each predicate;

The query module 601 is further configured to use the second predicate judgment sequence as a new first predicate judgment sequence, and continue to execute the step of sequentially querying each piece of original data according to the first predicate judgment sequence.

In one possible embodiment, the score computation module 602 computes a current score of the predicate from the data characteristics of the original data that has been queried, including:

In one possible embodiment, the score calculation module 602 calculates a current score of the predicate according to the data features of the queried data at the present stage, including:

In one possible embodiment, the score calculating module 602 calculates a stage score of the predicate according to the data feature of the queried data at the stage, including:

In one possible embodiment, the hit rate is calculated by:

In one possible embodiment, the time-consuming cost is calculated by:

In one possible embodiment, the score calculation module 602 calculates a current score of the predicate from the phase score of the predicate and a history score of the predicate, including:

The embodiment of the invention also provides an electronic device, as shown in fig. 7, including:

a memory 701 for storing a computer program;

the processor 702 is configured to execute the program stored in the memory 701, and implement the following steps:

In one possible embodiment, the hit rate is calculated by:

In one possible embodiment, the time-consuming cost is calculated by:

The memory mentioned for the above electronic device may include a random access memory (Random Access Memory, RAM) or a non-volatile memory (Non-Volatile Memory, NVM), such as at least one magnetic disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment of the present invention, a computer readable storage medium is provided, in which a computer program is stored, which when executed by a processor implements the steps of any of the data querying methods described above.

In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the data query methods of the above embodiments.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the descriptions of the distributed data query system, the device, the electronic device, the computer-readable storage medium, and the computer program product are relatively brief because they are substantially similar to the method embodiments; for relevant details, see the description of the method embodiments.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. A method of querying data, the method comprising:

the second predicate judgment sequence is used as a new first predicate judgment sequence, and the step of sequentially inquiring each piece of original data according to the first predicate judgment sequence is returned to be executed;

the calculating the current score of the predicate according to the data characteristics of the queried original data comprises the following steps:

according to the data characteristics of the queried data in the stage, calculating the current score of the predicate, wherein the queried data in the stage is the original data which is queried by using the current first predicate judgment sequence;

the step of calculating the current score of the predicate according to the data characteristics of the queried data in the stage comprises the following steps:

and calculating the current score of the predicate according to the stage score of the predicate and the history score of the predicate, wherein the history score is a preset initial score initially, and the history score is a score according to the predicate when the second predicate judgment sequence is generated last time, and the efficiency represented by the current score is positively related to the efficiency represented by the stage score and is positively related to the efficiency represented by the history score.

2. The method of claim 1, wherein calculating the predicate's stage score based on the data characteristics of the queried data at the present stage comprises:

3. The method of claim 2, wherein the hit rate is calculated by:

4. The method of claim 2, wherein the time-consuming cost is calculated by:

5. The method of claim 1, wherein the calculating the current score for the predicate based on the phase score for the predicate and the history score for the predicate comprises:

6. A data querying device, the device comprising:

the query module is used for sequentially querying each original data according to a first predicate judgment sequence to determine whether the original data are target data hit by a plurality of predicates at the same time until the number of the queried original data meets a preset sequence replacement condition or the query of all the original data is completed, wherein the first predicate judgment sequence is used for indicating the judgment sequence of the predicates in the query process, and the first predicate judgment sequence is a preset initial sequence at the beginning;

A score calculating module, configured to calculate, for each predicate of the plurality of predicates, a current score of the predicate according to a data feature of the original data that has been queried when the number of the original data that has been queried satisfies a preset order replacement condition, the current score being used to represent efficiency when the predicate is used to determine whether the original data that has been queried currently is target data;

the query module is further configured to use the second predicate judgment sequence as a new first predicate judgment sequence, and continue to execute the step of sequentially querying each piece of original data according to the first predicate judgment sequence;

the score calculating module is specifically configured to calculate a current score of the predicate according to data features of queried data in the current stage, where the queried data in the current stage is original data queried by using a current first predicate judgment sequence;

the score calculation module is specifically configured to calculate a stage score of the predicate according to the data feature of the queried data in the stage, where the stage score is used to represent efficiency when the predicate is used to determine whether the queried data in the stage is target data;

7. A distributed data query system, comprising a master node and a plurality of slave nodes;

the master node is used for distributing original data for each slave node;

the slave node is configured to perform a data query on the allocated original data according to the data query method of any one of claims 1 to 5.

8. The distributed data query system of claim 7, wherein the slave node comprises a pluggable plug-in;

the slave node is specifically configured to load a running environment set for the pluggable plug-in; in the running environment, the data query method according to any one of claims 1 to 5 is performed on the allocated original data by the pluggable plug-in.

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for carrying out the method steps of any one of claims 1-5 when executing a program stored on a memory.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-5.