CN112965991A

CN112965991A - Pre-calculation result generation method and device, electronic equipment and storage medium

Info

Publication number: CN112965991A
Application number: CN202110252174.7A
Authority: CN
Inventors: 梁乐平
Original assignee: Migu Cultural Technology Co Ltd; China Mobile Communications Group Co Ltd
Current assignee: Migu Cultural Technology Co Ltd; China Mobile Communications Group Co Ltd
Priority date: 2021-03-08
Filing date: 2021-03-08
Publication date: 2021-06-15
Anticipated expiration: 2041-03-08
Also published as: CN112965991B

Abstract

Embodiments of the invention disclose a pre-calculation result generation method and device, an electronic device, and a storage medium. The method comprises: acquiring a feature vector set, wherein the feature vector set is determined based on the dimension combinations queried by query statements within a predetermined time; inputting the feature vector set into a query heat model to obtain a score for each feature vector in the set, wherein the query heat model is pre-trained based on static features of a data table and historical query statements; determining a feature vector score expectation value according to the scores of the feature vectors; and screening, from the feature vector set, target feature vectors that meet the query expectation based on the feature vector score expectation value, and performing pre-calculation based on the target feature vectors. According to the embodiments, pre-calculation results are computed efficiently, fewer pre-calculation results are generated, less storage space is occupied, and storage cost is reduced.

Description

Pre-calculation result generation method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of big data multidimensional analysis, in particular to a method and a device for generating a precomputation result, electronic equipment and a storage medium.

Background

As data volumes grow, ad hoc multidimensional analysis of big data becomes increasingly valuable. However, the latency of conventional distributed computing frameworks makes it difficult to meet the requirements of ad hoc multidimensional analysis. Pre-computing metric values by trading space for time, using the computing and storage capacity of a big data platform, can meet the requirements of ad hoc multidimensional queries over big data. However, this approach stores the conditions, dimension combinations, metric calculation results, and intermediate calculation results of all queries, so the data volume of the pre-calculation results is large, and the storage overhead keeps growing as results accumulate day by day. In addition, some pre-calculation results are never or only rarely queried by users yet occupy a large amount of storage space, which wastes storage.

Disclosure of Invention

Based on the problems in the prior art, embodiments of the present invention provide a method and an apparatus for generating a pre-calculation result, an electronic device, and a storage medium.

In a first aspect, an embodiment of the present invention provides a method for generating a pre-calculation result, including:

acquiring a feature vector set, wherein the feature vector set is determined based on the dimension combinations queried by query statements within a predetermined time;

inputting the feature vector set into a query heat model to obtain the score of each feature vector in the feature vector set, wherein the query heat model is obtained by pre-training based on static features of a data table and historical query statements;

determining a feature vector score expectation value according to the score of each feature vector;

and screening target feature vectors meeting the query expectation from the feature vector set based on the feature vector score expectation value, and performing pre-calculation based on the target feature vectors.

Further, the obtaining a feature vector set includes:

acquiring the query statement in the preset time;

acquiring a dimension combination of the query in the query statement;

converting the dimension combination queried in the query statement into a feature vector, and obtaining the feature vector set from the feature vectors so converted.

Further, the determining a feature vector score expectation value according to the score of each feature vector includes:

converting the scores of the feature vectors into a histogram, wherein the histogram represents the probability distribution of the feature vectors within the predetermined time;

and obtaining the feature vector score expectation value of the feature vector set in the preset time based on the probability distribution of the feature vectors in the preset time.

Further, the screening out a target feature vector satisfying a query expectation from the feature vector set based on the feature vector score expectation includes:

judging whether the score of each feature vector is smaller than the expected score value of the feature vector;

and deleting the feature vectors with the scores smaller than the expected score values of the feature vectors from the feature vector set, and taking the feature vectors reserved in the feature vector set as target feature vectors meeting the query expectation.

Further, the pre-computing based on the target feature vector comprises:

converting the target feature vector into a target feature combination;

and performing pre-calculation based on the target feature combination to obtain and store a pre-calculation result.

Further, before the step of inputting the feature vector set into a query heat model and obtaining the score of each feature vector in the feature vector set, the method further includes:

and acquiring static characteristics and historical query sentences of a data table, and training the query heat model based on the static characteristics and the historical query sentences of the data table.

Further, the obtaining static features and historical query statements of the data table, and training the query heat model based on the static features and the historical query statements of the data table includes:

acquiring static features of a data table, wherein the static features of the data table comprise the table dimensions, the dimension cardinality (dimension base number), the table row number, and the table capacity, and obtaining an associated weight parameter of each table dimension based on the dimension cardinality, the table row number, and the table capacity;

obtaining historical query statements, obtaining the historical dimension combinations queried in the historical query statements, converting the historical dimension combinations into historical feature vectors, and obtaining the importance probability of each historical feature vector based on the query frequency of that historical feature vector in the historical query statements and the number of historical query statements;

and training the query heat model based on the associated weight parameters of the table dimensions and the importance probabilities of the historical feature vectors.

In a second aspect, an embodiment of the present invention provides a pre-calculation result generation apparatus, including:

an acquisition module, configured to acquire a feature vector set, wherein the feature vector set is determined based on the dimension combinations queried by query statements within a predetermined time;

a score determining module, configured to input the feature vector set into a query heat model to obtain the score of each feature vector in the feature vector set, wherein the query heat model is obtained by pre-training based on static features of a data table and historical query statements;

an expected value determining module, configured to determine a feature vector score expectation value according to the score of each feature vector;

and a pre-calculation module, configured to screen target feature vectors meeting the query expectation from the feature vector set based on the feature vector score expectation value, and to perform pre-calculation based on the target feature vectors.

In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the pre-calculation result generation method according to the first aspect.

In a fourth aspect, the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the pre-computation result generation method according to the first aspect.

According to the above technical solutions, the pre-calculation result generation method and device, electronic device, and storage medium provided by the embodiments of the invention predict the probability that each dimension combination will be queried, based on users' data query behavior and the heat (popularity) of each dimension combination, and ignore dimension combinations that users do not query or query only with small probability. As a result, pre-calculation results are computed more efficiently, fewer results are generated, less storage space is occupied, users' query requirements are still met, and storage cost is reduced.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of a method for generating pre-computed results according to an embodiment of the invention;

FIG. 2 is a block diagram of a pre-calculation result generation apparatus according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

The following describes a pre-calculation result generation method, apparatus, electronic device, and storage medium according to embodiments of the present invention with reference to the accompanying drawings.

Fig. 1 is a flowchart illustrating a method for generating a pre-computed result according to an embodiment of the present invention. As shown in fig. 1, the method for generating a pre-calculation result provided in the embodiment of the present invention specifically includes the following steps:

S101: Acquire a feature vector set, wherein the feature vector set is determined based on the dimension combinations queried by query statements within a predetermined time.

The predetermined time is, for example, the last 10 days, the last 15 days, the last 20 days, or the last month; it may of course be set to another value as needed. The query statement is typically an SQL (Structured Query Language) query statement.

Big data multidimensional analysis generally requires ad hoc analysis, so the dimension combinations queried generally differ from one query to the next. That is, the data items to be queried are typically different each time. The query statements within the predetermined time therefore usually contain multiple dimension combinations. After these dimension combinations are obtained, each can be converted into a corresponding feature vector, and these feature vectors constitute the feature vector set in this embodiment.

A dimension refers to a column of a data table, each column representing a dimension. For example, in a data table, there are provinces, companies, channels, paid users, and subscription types in columns, and thus, province is a dimension, company is a dimension, channel is a dimension, paid user is a dimension, subscription type is a dimension, and so on, that is, each column in the data table represents a dimension.

In one embodiment of the present invention, obtaining the feature vector set comprises: acquiring the query statements within the predetermined time; acquiring the dimension combinations queried in the query statements; and converting the dimension combinations queried in the query statements into feature vectors, and obtaining the feature vector set from the converted feature vectors. That is, the SQL query statements input by users within the predetermined time are analyzed, the dimension combinations in the SQL query statements are extracted and converted into feature vectors, and the feature vector set is obtained.
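The extraction step can be illustrated with a minimal sketch. The text does not say how the SQL is analyzed, so this sketch assumes that the dimension combination of a query is simply its GROUP BY column list and pulls it out with a naive regular expression. The function name `extract_dimensions` and the sample query are hypothetical, and a production system would more likely use a real SQL parser.

```python
import re
from typing import List

# Assumption: the queried dimension combination equals the GROUP BY column list.
# The lazy group stops at the next clause keyword, a semicolon, or end of input.
_GROUP_BY = re.compile(
    r"\bGROUP\s+BY\s+(.+?)(?=\bHAVING\b|\bORDER\s+BY\b|\bLIMIT\b|;|$)",
    re.IGNORECASE | re.DOTALL,
)


def extract_dimensions(sql: str) -> List[str]:
    """Return the lower-cased GROUP BY columns of a query, or [] if there are none."""
    match = _GROUP_BY.search(sql)
    if match is None:
        return []
    return [col.strip().lower() for col in match.group(1).split(",") if col.strip()]


query = (
    "SELECT province, channel, COUNT(*) FROM orders "
    "GROUP BY province, channel ORDER BY province"
)
print(extract_dimensions(query))
```

Subqueries, expressions in the GROUP BY list, and quoted identifiers are deliberately not handled here.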

In this example, it is assumed that the data table has N dimensions; for a data table with N dimensions, there are C(N, N) kinds of dimension combinations. For each dimension combination, a feature signature hash function H may be used to calculate the feature vector V = H(Dim1, Dim2, Dim3, …, DimN). Further, the feature vector set may be obtained by converting all dimension combinations of the data table. For example, if a dimension combination is (company, province, channel, null, null), the feature vector after conversion is 11100. For another dimension combination, (company, province, null, paid user, subscription type), the converted feature vector is 11011. Here null indicates that the dimension is not queried, and the corresponding value in the converted feature vector is 0.
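The two conversions in the example above can be reproduced with a short sketch. It assumes that H simply emits one bit per table column in a fixed column order, which matches the 11100 and 11011 examples but is not stated as the only possible form of H. The column order `DIMENSIONS` and the function name are illustrative.

```python
from typing import Optional, Sequence

# Hypothetical fixed column order, following the example dimensions in the text.
DIMENSIONS = ("company", "province", "channel", "paid_user", "subscription_type")


def to_feature_vector(
    combination: Sequence[Optional[str]],
    dimensions: Sequence[str] = DIMENSIONS,
) -> str:
    """Return one bit per table dimension: '1' if queried, '0' if null."""
    queried = {d for d in combination if d is not None}
    unknown = queried - set(dimensions)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return "".join("1" if d in queried else "0" for d in dimensions)


v1 = to_feature_vector(["company", "province", "channel", None, None])
v2 = to_feature_vector(["company", "province", None, "paid_user", "subscription_type"])
print(v1, v2)  # 11100 11011
```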

S102: Input the feature vector set into a query heat model to obtain the score of each feature vector in the feature vector set, wherein the query heat model is obtained by pre-training based on static features of a data table and historical query statements. The score of each feature vector is output by the query heat model according to the input feature vector set.

In an embodiment of the present invention, before the feature vector set is input into the query heat model to obtain the score of each feature vector in the feature vector set, the query heat model needs to be trained. Specifically, static features and historical query statements of a data table are obtained, and the query heat model is trained on the basis of the static features and the historical query statements of the data table.

In this example, the obtaining static features and historical query statements of the data table, and training the query heat model based on the static features and the historical query statements of the data table includes: acquiring static characteristics of a data table, wherein the static characteristics of the data table comprise table dimension, a dimension base number, a table row number and table capacity, and obtaining an associated weight parameter of the table dimension based on the dimension base number, the table row number and the table capacity; obtaining historical query sentences, obtaining historical dimension combinations queried in the historical query sentences, converting the historical dimension combinations into historical feature vectors, and obtaining the importance probabilities of the historical feature vectors based on the query frequency of the historical feature vectors in the historical query sentences and the number of the historical query sentences; and training the query heat model based on the associated weight parameters of the table dimensions and the importance probabilities of the historical feature vectors.

As a specific example, distributed computation is performed on the data table to obtain and store its static features, such as the table dimensions, the dimension cardinality (dimension base number), the table row number, and the table capacity.

A table dimension is a column of the table; for example, province is a dimension.

The dimension cardinality (dimension base number) is the number of distinct values of a dimension; for example, with 32 provinces in China, the dimension cardinality of province is 32.

The table row number is the total number of rows in the table.

The table capacity is the storage space occupied by the data of the data table, for example a data volume of 10 TB.

It can be seen that the dimensions and their cardinalities are important factors affecting the pre-calculation results. Taking the province dimension as an example, if users may query any of the 32 provinces, at least 32 results must be pre-computed. The larger a dimension's cardinality, the more desirable it is to avoid pre-computing that dimension.
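To make the effect of cardinality concrete, the sketch below computes an upper bound on the number of result rows for a dimension combination as the product of the cardinalities of its dimensions. The province value of 32 comes from the text; the channel cardinality of 10 and the function name are assumptions made only for illustration.

```python
from math import prod
from typing import Dict, Sequence

# 32 is the province cardinality from the text; 10 for channel is hypothetical.
CARDINALITIES: Dict[str, int] = {"province": 32, "channel": 10}


def max_result_rows(combination: Sequence[str], cardinalities: Dict[str, int]) -> int:
    """Upper bound on pre-computed rows: one row per combination of distinct values."""
    return prod(cardinalities[d] for d in combination)


print(max_result_rows(["province"], CARDINALITIES))             # 32
print(max_result_rows(["province", "channel"], CARDINALITIES))  # 320
```

Each added high-cardinality dimension multiplies the bound, which is why such dimensions are the first candidates to leave out of pre-calculation.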

An associated weight parameter of each dimension is obtained from the collected static features of the data table; for example, the associated weight parameter δ of a dimension is δ = dimension cardinality / (table row number) × table capacity.

According to the user's historical query data (namely, the historical query statements), the importance probability of the feature vector V corresponding to each historical query statement is calculated as p = f(x) = (query frequency x of V) / (total historical query frequency N).
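The importance probability is a relative frequency, as the following minimal sketch shows. It assumes the historical query statements have already been converted to bit-string feature vectors; the function name and the sample history are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable


def importance_probabilities(history: Iterable[str]) -> Dict[str, float]:
    """p = x / N: query count x of each feature vector over the total count N."""
    counts = Counter(history)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {vector: count / total for vector, count in counts.items()}


# Hypothetical history: one feature vector per historical query statement.
history = ["11100", "11100", "11011", "11100"]
probs = importance_probabilities(history)
print(probs)  # {'11100': 0.75, '11011': 0.25}
```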

The query heat model K(V) is then trained using, for example, a logistic regression algorithm and the gradient descent method. K(V) gives the score with which the feature vector V will be queried. In a specific example, the score is between 0 and 1.
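A minimal sketch of such a model follows, written in plain Python so that it is self-contained. It makes several assumptions that the text does not fix: the inputs are the bits of the feature vector, the training targets are the importance probabilities p, the loss is cross-entropy, and the associated weight parameter δ is left out entirely. The function names, learning rate, and epoch count are illustrative.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(bits: str, weights: List[float], bias: float) -> float:
    """K(V): a score in (0, 1) for a bit-string feature vector."""
    z = bias + sum(w for w, b in zip(weights, bits) if b == "1")
    return sigmoid(z)


def train_heat_model(
    samples: Dict[str, float], lr: float = 0.5, epochs: int = 2000
) -> Tuple[List[float], float]:
    """Fit logistic regression by batch gradient descent on cross-entropy loss."""
    if not samples:
        raise ValueError("no training samples")
    n = len(next(iter(samples)))
    m = len(samples)
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for bits, p in samples.items():
            err = score(bits, weights, bias) - p  # gradient of the loss w.r.t. z
            grad_b += err
            for i, b in enumerate(bits):
                if b == "1":
                    grad_w[i] += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


# Hypothetical training data: feature vector -> importance probability p.
samples = {"11100": 0.75, "11011": 0.25}
weights, bias = train_heat_model(samples)
for vector in samples:
    print(vector, round(score(vector, weights, bias), 3))
```

Because the output passes through a sigmoid, every score lies strictly between 0 and 1, in line with the range given in the text.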

For example, assume the feature vector of a certain query statement SQL1 is V1 and K(V1) = 0.9. At this point, the query heat model K(V) can give, for each feature vector, the expected score of that feature vector being queried.

S103: Determine a feature vector score expectation value according to the score of each feature vector.

In a specific example, obtaining a feature vector score expectation value according to the score of each feature vector includes: converting the scores of the characteristic vectors into a histogram, wherein the histogram represents the probability distribution of the characteristic vectors in the preset time; and obtaining the feature vector score expectation value of the feature vector set in the preset time based on the probability distribution of the feature vectors in the preset time.

As a specific example, denote the predetermined time as Tn. Within Tn, the feature vectors corresponding to the query statements issued by users form the feature vector set Sn[V], and the score of each feature vector is calculated using the query heat model K(V). The scores are converted into a histogram Hn, which represents the probability distribution of the feature vectors within Tn. The mathematical expectation of the feature vector set Sn[V] within Tn is then calculated as Fn, that is, the feature vector score expectation value Fn.
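One possible reading of this step is sketched below: the histogram Hn is taken to be the relative frequency of each distinct feature vector within Tn, and Fn is the frequency-weighted mean of the model scores. The text does not spell out how the histogram is built, so this interpretation, the function name, and the sample numbers are all assumptions.

```python
from collections import Counter
from typing import Dict, Sequence


def expected_score(window_vectors: Sequence[str], scores: Dict[str, float]) -> float:
    """Fn = sum over distinct V of K(V) * P(V), with P(V) the relative frequency in Tn."""
    counts = Counter(window_vectors)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no queries in the window")
    return sum(scores[vector] * count / total for vector, count in counts.items())


# Hypothetical window Tn: three queries on 11100 and one on 11011.
window = ["11100", "11100", "11100", "11011"]
model_scores = {"11100": 0.9, "11011": 0.2}  # hypothetical K(V) outputs
fn = expected_score(window, model_scores)
print(round(fn, 3))  # 0.725
```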

S104: Screen target feature vectors meeting the query expectation from the feature vector set based on the feature vector score expectation value, and perform pre-calculation based on the target feature vectors.

In this example, the screening out the target feature vector from the feature vector set that satisfies the query expectation based on the feature vector score expectation includes: judging whether the score of each feature vector is smaller than the expected score value of the feature vector; and deleting the feature vectors with the scores smaller than the expected score values of the feature vectors from the feature vector set, and taking the feature vectors reserved in the feature vector set as target feature vectors meeting the query expectation.

In this example, the pre-calculating based on the target feature vector includes: converting the target feature vector into a target feature combination; and performing pre-calculation based on the target feature combination to obtain and store a pre-calculation result.

Specifically, let Fn+1 = Fn, where Fn+1 denotes the feature vector score expectation value for a future time period, and discard from the feature vector set S[V] the feature vectors with K(V) < Fn+1. The feature vectors remaining in S[V] form Sn+1[V], the dimension combinations that meet the users' query expectation. Sn+1[V] represents the feature vectors expected to be queried by users in the future time period.
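The screening rule is a simple threshold filter, sketched here with hypothetical scores. Feature vectors whose score is below the expectation value are dropped, and the rest are kept as targets; the function name and sample data are illustrative.

```python
from typing import Dict, List


def select_targets(scores: Dict[str, float], expectation: float) -> List[str]:
    """Keep feature vectors with K(V) >= Fn+1; discard those with K(V) < Fn+1."""
    return [vector for vector, s in scores.items() if s >= expectation]


# Hypothetical K(V) outputs and expectation value Fn+1 = Fn.
candidate_scores = {"11100": 0.9, "11011": 0.2, "10000": 0.75}
targets = select_targets(candidate_scores, expectation=0.725)
print(targets)  # ['11100', '10000']
```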

The pre-calculation results are then computed according to the feature vectors Sn+1[V] expected to be queried by users. In this way, dimension combinations that users do not query, or query only with small probability, are ignored when the pre-calculation results are stored. This meets users' ad hoc query requirements while optimizing the storage space of the pre-calculation results and reducing their number.
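The final step can be sketched as follows, assuming the same one-bit-per-column encoding as in the 11100 example and using a row count as the pre-computed metric. The metric, the column order, the function names, and the sample rows are assumptions for illustration; the text does not prescribe which metrics are pre-computed.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical fixed column order, following the example dimensions in the text.
DIMENSIONS = ("company", "province", "channel", "paid_user", "subscription_type")


def to_dimension_combination(
    bits: str, dimensions: Sequence[str] = DIMENSIONS
) -> Tuple[str, ...]:
    """Map a bit string back to the dimensions whose bit is '1'."""
    if len(bits) != len(dimensions):
        raise ValueError("feature vector length does not match the dimension count")
    return tuple(d for d, b in zip(dimensions, bits) if b == "1")


def precompute_counts(
    rows: List[Dict[str, str]], combination: Sequence[str]
) -> Dict[Tuple[str, ...], int]:
    """Group rows by the target dimensions and count them (count is an assumed metric)."""
    result: Dict[Tuple[str, ...], int] = defaultdict(int)
    for row in rows:
        result[tuple(row[d] for d in combination)] += 1
    return dict(result)


rows = [
    {"company": "A", "province": "P1", "channel": "web", "paid_user": "yes", "subscription_type": "monthly"},
    {"company": "A", "province": "P1", "channel": "app", "paid_user": "no", "subscription_type": "monthly"},
    {"company": "B", "province": "P2", "channel": "web", "paid_user": "yes", "subscription_type": "yearly"},
]
combo = to_dimension_combination("11000")
cube = precompute_counts(rows, combo)
print(combo, cube)
```

Only the combinations kept by the screening step would be passed to `precompute_counts`, so the stored results cover just the queries that are expected to occur.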

The pre-calculation result generation method provided by the embodiment of the invention predicts the probability that each dimension combination will be queried, based on users' data query behavior and the heat (popularity) of each dimension combination, and ignores dimension combinations that users do not query or query only with small probability. As a result, pre-calculation results are computed more efficiently, fewer results are generated, less storage space is occupied, users' query requirements are still met, and storage cost is reduced.

Fig. 2 is a block diagram of a pre-calculation result generation apparatus according to an embodiment of the present invention. As shown in fig. 2, the pre-calculation result generation apparatus according to an embodiment of the present invention includes: an acquisition module 210, a score determination module 220, an expected value determination module 230, and a pre-calculation module 240, wherein:

an obtaining module 210, configured to obtain a feature vector set, where the feature vector set is determined based on a combination of dimensions queried by a query statement in a predetermined time;

a score determining module 220, configured to input the feature vector set into a query heat model, so as to obtain a score of each feature vector in the feature vector set, where the query heat model is obtained by pre-training based on static features of a data table and a historical query statement;

an expected value determining module 230, configured to determine a feature vector score expected value according to the score of each feature vector;

and the pre-calculation module 240 is configured to screen a target feature vector meeting the query expectation from the feature vector set based on the feature vector score expectation, and perform pre-calculation based on the target feature vector.

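In an embodiment of the present invention, the obtaining module 210 is specifically configured to:

acquire the query statements within the predetermined time;

acquire the dimension combinations queried in the query statements;

convert the dimension combinations queried in the query statements into feature vectors, and obtain the feature vector set from the converted feature vectors.

In an embodiment of the present invention, the expected value determining module 230 is specifically configured to:

convert the scores of the feature vectors into a histogram, wherein the histogram represents the probability distribution of the feature vectors within the predetermined time;

obtain the feature vector score expectation value of the feature vector set within the predetermined time based on the probability distribution of the feature vectors within the predetermined time.

In an embodiment of the present invention, the pre-calculation module 240 is specifically configured to:

convert the target feature vectors into target feature combinations;

perform pre-calculation based on the target feature combinations to obtain and store the pre-calculation results.

In one embodiment of the present invention, the apparatus further comprises a training module (not shown in fig. 2), specifically configured to:

acquire static features of a data table and historical query statements, and train the query heat model based on the static features of the data table and the historical query statements.

Acquiring the static features of the data table and the historical query statements, and training the query heat model based on them, includes:

acquiring static features of the data table, wherein the static features comprise the table dimensions, the dimension cardinality (dimension base number), the table row number, and the table capacity, and obtaining an associated weight parameter of each table dimension based on the dimension cardinality, the table row number, and the table capacity;

obtaining historical query statements, obtaining the historical dimension combinations queried in them, converting the historical dimension combinations into historical feature vectors, and obtaining the importance probability of each historical feature vector based on its query frequency in the historical query statements and the number of historical query statements;

training the query heat model based on the associated weight parameters of the table dimensions and the importance probabilities of the historical feature vectors.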

The pre-calculation result generation apparatus provided by the embodiment of the invention predicts the probability that each dimension combination will be queried, based on users' data query behavior and the heat (popularity) of each dimension combination, and ignores dimension combinations that users do not query or query only with small probability. As a result, pre-calculation results are computed more efficiently, fewer results are generated, less storage space is occupied, users' query requirements are still met, and storage cost is reduced.

It should be noted that the specific implementation of the pre-calculation result generation apparatus in the embodiment of the present invention is similar to that of the pre-calculation result generation method; please refer to the description of the method part. Details are not repeated here to reduce redundancy.

Based on the same inventive concept, another embodiment of the present invention provides an electronic device, which specifically includes the following components, with reference to fig. 3: a processor 301, a memory 302, a communication interface 303, and a communication bus 304;

the processor 301, the memory 302 and the communication interface 303 complete mutual communication through the communication bus 304; the communication interface 303 is used for realizing information transmission between the devices;

the processor 301 is configured to call a computer program in the memory 302, and when executing the computer program, the processor implements all the steps of the above pre-calculation result generation method. For example, when executing the computer program, the processor implements the following steps: acquiring a feature vector set, wherein the feature vector set is determined based on the dimension combinations queried by query statements within a predetermined time; inputting the feature vector set into a query heat model to obtain the score of each feature vector in the feature vector set, wherein the query heat model is obtained by pre-training based on static features of a data table and historical query statements; determining a feature vector score expectation value according to the score of each feature vector; and screening target feature vectors meeting the query expectation from the feature vector set based on the feature vector score expectation value, and performing pre-calculation based on the target feature vectors.

Based on the same inventive concept, a further embodiment of the present invention provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements all the steps of the above pre-calculation result generation method. For example, when executing the computer program, the processor implements the following steps: acquiring a feature vector set, wherein the feature vector set is determined based on the dimension combinations queried by query statements within a predetermined time; inputting the feature vector set into a query heat model to obtain the score of each feature vector in the feature vector set, wherein the query heat model is obtained by pre-training based on static features of a data table and historical query statements; determining a feature vector score expectation value according to the score of each feature vector; and screening target feature vectors meeting the query expectation from the feature vector set based on the feature vector score expectation value, and performing pre-calculation based on the target feature vectors.

In addition, the logic instructions in the memory may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions may be essentially or partially implemented in the form of software products, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and include instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the pre-calculation result generation method according to the embodiments or some parts of the embodiments.

In addition, in the present invention, terms such as "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Moreover, in the present invention, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Furthermore, in the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

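1. A method for generating a pre-computed result, comprising:

acquiring a feature vector set, wherein the feature vector set is determined based on the dimension combinations queried by query statements within a predetermined time;

inputting the feature vector set into a query heat model to obtain the score of each feature vector in the feature vector set, wherein the query heat model is obtained by pre-training based on static features of a data table and historical query statements;

determining a feature vector score expectation value according to the score of each feature vector;

and screening target feature vectors meeting the query expectation from the feature vector set based on the feature vector score expectation value, and performing pre-calculation based on the target feature vectors.

2. The method of generating precomputed results according to claim 1, wherein said obtaining a set of feature vectors comprises:

acquiring the query statement in the preset time;

acquiring a dimension combination of the query in the query statement;

converting the dimension combination queried in the query statement into a feature vector, and obtaining the feature vector set from the feature vectors so converted.

3. The method of generating pre-computed results according to claim 1, wherein determining an expected feature vector score value according to the score of each feature vector comprises:

converting the scores of the feature vectors into a histogram, wherein the histogram represents the probability distribution of the feature vectors within the predetermined time;

and obtaining the feature vector score expectation value of the feature vector set within the predetermined time based on the probability distribution of the feature vectors within the predetermined time.

4. The method of generating pre-computed results according to claim 1, wherein the screening target feature vectors from the feature vector set that satisfy query expectations based on the feature vector score expectation values comprises:

judging whether the score of each feature vector is smaller than the feature vector score expectation value;

and deleting from the feature vector set the feature vectors whose scores are smaller than the feature vector score expectation value, and taking the feature vectors retained in the feature vector set as the target feature vectors meeting the query expectation.

5. The method of generating precomputed results according to claim 1, wherein said precomputing based on said target feature vector comprises:

converting the target feature vector into a target feature combination;

and performing pre-calculation based on the target feature combination to obtain and store a pre-calculation result.

6. The method of any one of claims 1 to 5, wherein before inputting the set of feature vectors into a query heat model and obtaining scores of feature vectors in the set of feature vectors, the method further comprises:

acquiring static features of a data table and historical query statements, and training the query heat model based on the static features of the data table and the historical query statements.

7. The method of claim 6, wherein the obtaining the static features of the data table and the historical query statement and training the query heat model based on the static features of the data table and the historical query statement comprises:

acquiring static features of a data table, wherein the static features of the data table comprise the table dimensions, the dimension cardinality (dimension base number), the table row number, and the table capacity, and obtaining an associated weight parameter of each table dimension based on the dimension cardinality, the table row number, and the table capacity;

obtaining historical query statements, obtaining the historical dimension combinations queried in the historical query statements, converting the historical dimension combinations into historical feature vectors, and obtaining the importance probability of each historical feature vector based on the query frequency of that historical feature vector in the historical query statements and the number of historical query statements;

and training the query heat model based on the associated weight parameters of the table dimensions and the importance probabilities of the historical feature vectors.

8. A pre-calculation result generation apparatus, comprising:

an acquisition module, configured to acquire a feature vector set, wherein the feature vector set is determined based on the dimension combinations queried by query statements within a predetermined time;

a score determining module, configured to input the feature vector set into a query heat model to obtain the score of each feature vector in the feature vector set, wherein the query heat model is obtained by pre-training based on static features of a data table and historical query statements;

an expected value determining module, configured to determine a feature vector score expectation value according to the score of each feature vector;

and a pre-calculation module, configured to screen target feature vectors meeting the query expectation from the feature vector set based on the feature vector score expectation value, and to perform pre-calculation based on the target feature vectors.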

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the pre-calculation result generation method according to any one of claims 1 to 7 when executing the computer program.

10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the pre-computation result generation method according to any one of claims 1 to 7.