CN112965991B - Pre-calculation result generation method and device, electronic equipment and storage medium

Info

Publication number
CN112965991B
CN112965991B (application CN202110252174.7A)
Authority
CN
China
Prior art keywords
feature vector
query
score
feature
calculation result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110252174.7A
Other languages
Chinese (zh)
Other versions
CN112965991A (en)
Inventor
梁乐平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Original Assignee
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Migu Cultural Technology Co Ltd and China Mobile Communications Group Co Ltd
Priority to CN202110252174.7A
Publication of CN112965991A
Application granted
Publication of CN112965991B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/242 Query formulation
    • G06F 16/2425 Iterative querying; Query formulation based on the results of a preceding query
    • G06F 16/245 Query processing
    • G06F 16/2453 Query optimisation
    • G06F 16/24534 Query rewriting; Transformation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a pre-calculation result generation method and apparatus, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring a feature vector set, wherein the feature vector set is determined based on the dimension combinations queried by query statements within a predetermined time; inputting the feature vector set into a query hotness model to obtain the score of each feature vector in the feature vector set, wherein the query hotness model is trained in advance based on static features of the data table and historical query statements; determining a feature vector score expected value according to the score of each feature vector; and screening, from the feature vector set based on the feature vector score expected value, target feature vectors that satisfy the query expectation, and performing pre-calculation based on the target feature vectors. With the embodiment of the invention, pre-calculation results are generated efficiently, and because fewer results are pre-calculated than would be obtained by computing every dimension combination, the pre-calculation results occupy less storage space and the storage cost is reduced.

Description

Pre-calculation result generation method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of big data multidimensional analysis, and in particular to a pre-calculation result generation method and apparatus, an electronic device and a storage medium.
Background
As the amount of data grows, ad hoc multidimensional analysis of big data becomes increasingly valuable. In a conventional distributed computing framework, however, it is difficult to meet the timeliness requirements of ad hoc multidimensional analysis. Relying on the computing and storage capacity of a big data platform, pre-calculating the metric indexes, that is, trading storage space for query time, can satisfy the demand for ad hoc multidimensional queries over big data. However, if the conditions, dimension combinations, metric calculation results and intermediate calculation results of all possible queries are stored, the data volume of the pre-calculation results is large, and after several days of accumulation the storage overhead becomes even larger. In addition, some pre-calculation results are never or rarely queried by users, yet they still occupy a large amount of storage space, resulting in wasted storage.
Disclosure of Invention
Based on the problems existing in the prior art, the embodiment of the invention provides a pre-calculation result generation method, a pre-calculation result generation device, electronic equipment and a storage medium.
In a first aspect, an embodiment of the present invention provides a method for generating a pre-calculation result, including:
acquiring a feature vector set, wherein the feature vector set is determined based on a dimension combination queried by a query statement in a preset time;
inputting the feature vector set into a query hotness model to obtain the score of each feature vector in the feature vector set, wherein the query hotness model is trained in advance based on static features of a data table and historical query statements;
determining a feature vector score expected value according to the score of each feature vector;
and screening target feature vectors meeting query expectations from the feature vector set based on the feature vector score expected values, and pre-calculating based on the target feature vectors.
Further, the acquiring the feature vector set includes:
acquiring a query statement within the preset time;
acquiring dimension combinations of queries in the query statement;
and converting the dimension combinations of the queries in the query statements into feature vectors, and obtaining the feature vector set from the converted feature vectors.
Further, the determining the expected value of the feature vector score according to the score of each feature vector includes:
converting the score of each feature vector into a histogram, wherein the histogram represents the probability distribution of each feature vector within the preset time;
and obtaining the expected value of the feature vector score of the feature vector set in the preset time based on the probability distribution of each feature vector in the preset time.
Further, the screening the target feature vector meeting the query expectation from the feature vector set based on the feature vector score expected value includes:
judging whether the score of each feature vector is smaller than the expected score value of the feature vector;
deleting the feature vectors with scores smaller than expected values of the feature vector scores from the feature vector set, and taking the feature vectors reserved in the feature vector set as target feature vectors meeting query expectations.
Further, the pre-computing based on the target feature vector includes:
converting the target feature vector into a target feature combination;
and pre-calculating based on the target feature combination to obtain and store a pre-calculation result.
Further, before the feature vector set is input into the query hotness model to obtain the score of each feature vector in the feature vector set, the method further includes:
and acquiring the static features of the data table and the historical query statements, and training the query hotness model based on the static features of the data table and the historical query statements.
Further, the obtaining of the static features of the data table and the historical query statements, and the training of the query hotness model based on the static features of the data table and the historical query statements, includes:
acquiring the static features of the data table, wherein the static features of the data table comprise the table dimensions, the dimension cardinality, the number of table rows and the table capacity, and obtaining an association weight parameter of a table dimension based on the dimension cardinality, the number of table rows and the table capacity;
acquiring the historical query statements, obtaining the historical dimension combinations queried in the historical query statements, converting the historical dimension combinations into historical feature vectors, and obtaining an importance probability of each historical feature vector based on the query frequency of the historical feature vector in the historical query statements and the number of historical query statements;
and training the query hotness model based on the association weight parameters of the table dimensions and the importance probabilities of the historical feature vectors.
In a second aspect, an embodiment of the present invention provides a pre-calculation result generating apparatus, including:
the acquisition module is used for acquiring a feature vector set, wherein the feature vector set is determined based on a dimension combination queried by a query statement in a preset time;
the score determining module is used for inputting the feature vector set into a query hotness model to obtain the score of each feature vector in the feature vector set, wherein the query hotness model is trained in advance based on static features of a data table and historical query statements;
the expected value determining module is used for determining an expected value of the feature vector score according to the score of each feature vector;
and the pre-calculation module is used for screening target feature vectors meeting query expectations from the feature vector set based on the feature vector score expected values and carrying out pre-calculation based on the target feature vectors.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the pre-calculation result generating method according to the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the pre-calculation result generation method according to the first aspect.
As can be seen from the above technical solutions, the pre-calculation result generation method and apparatus, electronic device and storage medium provided by the embodiments of the present invention predict the probability that each dimension combination will be queried according to the users' query behavior and the hotness of each dimension combination, and then skip the dimension combinations that users do not query or query only with small probability. The pre-calculation results are therefore generated more efficiently and, compared with pre-calculating every dimension combination, occupy less storage space while still meeting users' query requirements, which reduces the storage cost.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other drawings can be obtained from these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a pre-calculation result generation method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a pre-calculation result generating device according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention further with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
The following describes a pre-calculation result generation method, a device, an electronic apparatus and a storage medium according to an embodiment of the present invention with reference to the accompanying drawings.
Fig. 1 is a flowchart of a pre-calculation result generation method according to an embodiment of the present invention. As shown in fig. 1, the method for generating a pre-calculation result provided in the embodiment of the present invention specifically includes the following:
s101: a set of feature vectors is obtained, wherein the set of feature vectors is determined based on a combination of dimensions queried by the query statement within a predetermined time.
The predetermined time is, for example, the last 10 days, the last 15 days, the last 20 days or the last month; of course, the predetermined time may be set to other values as needed. The query statement is typically an SQL (Structured Query Language) query statement.
In big data multidimensional analysis, ad hoc multidimensional analysis is often required, so the dimension combinations being queried frequently differ; that is, the data items that need to be queried are usually different each time. Therefore, the query statements issued within the predetermined time typically cover several dimension combinations of the data that need to be queried. After these dimension combinations are obtained, they can be converted into corresponding feature vectors, and these feature vectors constitute the feature vector set in this example.
Dimensions refer to columns of a data table, each column representing a dimension. For example, a data table includes a plurality of columns including provinces, companies, channels, paid users, and subscription types, etc., so that the provinces are one dimension, the companies are one dimension, the channels are one dimension, the paid users are one dimension, the subscription types are one dimension, etc., that is, each column represents one dimension in the data table.
In one embodiment of the invention, obtaining the feature vector set includes: acquiring the query statements within the predetermined time; obtaining the dimension combinations queried in the query statements; and converting the dimension combinations into feature vectors, from which the feature vector set is obtained. That is, the SQL query statements input by users within the predetermined time are analyzed, the dimension combinations in the SQL query statements are extracted, and the dimension combinations are converted into feature vectors, so that the feature vector set is obtained.
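As a minimal, non-authoritative sketch of this extraction step (the patent does not prescribe a concrete parser), the following Python snippet pulls the queried dimension combination out of a simple SELECT ... GROUP BY statement with a regular expression; the function name and the sample table are hypothetical, and a production system would use a full SQL parser.
```python
import re

def extract_dimension_combination(sql: str) -> tuple:
    """Extract the queried dimension combination from a simple
    SELECT ... GROUP BY statement (hypothetical, regex-based helper)."""
    match = re.search(r"group\s+by\s+(.+?)(?:\s+order\s+by|\s+limit|;|$)",
                      sql, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return tuple()
    # Normalize the column list: split on commas, strip whitespace, lower-case.
    return tuple(col.strip().lower() for col in match.group(1).split(","))

sql = ("SELECT province, company, channel, SUM(amount) "
       "FROM t_order GROUP BY province, company, channel")
print(extract_dimension_combination(sql))  # ('province', 'company', 'channel')
```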
In this example, assuming that the data table has N dimensions, choosing n of them yields C(N, n) possible dimension combinations. For each dimension combination, a signature hash function H may be used to compute its feature vector V = H(Dim1, Dim2, Dim3, ..., DimN). In this way, all dimension combinations of the data table can be converted into a feature vector set. For example, assume one dimension combination is: company, province, channel, null, null; the feature vector after conversion is 11100. Another dimension combination, company, province, null, paid user, subscription type, converts to the feature vector 11011. Here null indicates that the corresponding dimension is not queried, and its value in the converted feature vector is 0.
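The 11100 / 11011 example above can be reproduced with a simple presence/absence encoding. The sketch below is only one possible form of the signature function H described here (an assumption): it fixes the dimension order of the example table and sets a position to 1 when that dimension is queried and to 0 when it is null.
```python
# Dimension order of the example table (assumed for illustration).
TABLE_DIMENSIONS = ["province", "company", "channel", "paid_user", "subscription_type"]

def to_feature_vector(dimension_combination, table_dimensions=TABLE_DIMENSIONS):
    """Encode a queried dimension combination as a 0/1 feature vector:
    1 if the table dimension appears in the combination, 0 (null) otherwise."""
    queried = {dim.lower() for dim in dimension_combination}
    return [1 if dim in queried else 0 for dim in table_dimensions]

# (company, province, channel, null, null) -> [1, 1, 1, 0, 0], i.e. "11100"
print(to_feature_vector(("company", "province", "channel")))
# (company, province, null, paid user, subscription type) -> [1, 1, 0, 1, 1], i.e. "11011"
print(to_feature_vector(("company", "province", "paid_user", "subscription_type")))
```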
S102: and inputting the feature vector set into a query hotness model to obtain the score of each feature vector in the feature vector set, wherein the query hotness model is obtained by training in advance based on the static features of the data table and the historical query sentences. The score of each feature vector is output by the query hotness model according to the input feature vector set.
In one embodiment of the present invention, the query hotness model needs to be trained before the feature vector set is input into the query hotness model to obtain the score of each feature vector in the feature vector set. Specifically, static characteristics of a data table and historical query sentences are obtained, and the query hotness model is trained based on the static characteristics of the data table and the historical query sentences.
In this example, the obtaining the static features of the data table and the historical query statement, and training the query hotness model based on the static features of the data table and the historical query statement, includes: acquiring a data table static feature, wherein the data table static feature comprises a table dimension, a dimension base, a table number and a table capacity, and acquiring an association weight parameter of the table dimension based on the dimension base, the table number and the table capacity; acquiring a history query statement, acquiring a history dimension combination of query in the history query statement, converting the history dimension combination into a history feature vector, and acquiring an important probability of the history feature vector based on the query frequency of the history feature vector in the history query statement and the number of the history query statement; and training the query hotness model based on the associated weight parameters of the table dimension and the important probability of the historical feature vector.
As a specific example, distributed computation is performed on the data table, and static features of the data table such as the table dimensions, dimension cardinality, number of table rows and table capacity are acquired and stored.
A table dimension is a column of the table; for example, province is one dimension.
The dimension cardinality is the number of distinct values of the dimension; for example, China has 32 provinces, so the cardinality of the province dimension is 32.
The number of table rows is the total number of rows in the table.
The table capacity is the storage space occupied by the data of the table, for example a data volume of 10 TB.
It will be appreciated that the dimensions and their cardinalities are important factors affecting the pre-calculation results. For the province dimension, for example, a user may query any of the 32 provinces, so at least 32 results have to be pre-calculated. The larger a dimension's cardinality, the more desirable it is to avoid pre-calculating that dimension.
Based on the collected static features of the data table, the association weight parameter of each dimension is obtained, for example the association weight parameter of a dimension delta = dimension cardinality / (number of table rows x table capacity).
According to the users' historical query data (that is, the historical query statements), the importance probability corresponding to each feature vector is calculated as p = f(x) = (query frequency x of the feature vector V) / (total number of historical queries N).
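A short sketch of these two training features, under the assumption that the table capacity is expressed as a plain number (for example in GB) and that feature vectors are represented as hashable tuples; the helper names are illustrative, not from the patent.
```python
from collections import Counter

def association_weight(dimension_cardinality: float,
                       row_count: float,
                       table_capacity: float) -> float:
    """delta = dimension cardinality / (number of table rows * table capacity)."""
    return dimension_cardinality / (row_count * table_capacity)

def importance_probabilities(history_vectors):
    """p(V) = query frequency x of feature vector V / total number N of historical queries."""
    total = len(history_vectors)
    counts = Counter(tuple(v) for v in history_vectors)
    return {vec: freq / total for vec, freq in counts.items()}

# Province dimension of a table with cardinality 32, 1e9 rows and 10 * 1024 GB of data.
delta_province = association_weight(32, 1e9, 10 * 1024)
# Three historical queries -> p = 2/3 for 11100 and 1/3 for 11011.
probs = importance_probabilities([[1, 1, 1, 0, 0], [1, 1, 1, 0, 0], [1, 1, 0, 1, 1]])
```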
A query hotness model K(V) is then trained, for example by learning with a logistic regression algorithm and gradient descent; the model gives the score that a feature vector V will be queried. In a specific example, the score lies between 0 and 1.
For example, assuming that the feature vector of a certain query SQL1 is V1 and K(V1) = 0.9, the query hotness model K(V) can give each feature vector an expected score of being queried.
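A minimal training sketch, assuming scikit-learn's LogisticRegression as the concrete learner (the patent only names logistic regression and gradient descent) and assuming that "hot" labels are derived from the importance probabilities by thresholding at their mean; the data is made up for illustration.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: historical feature vectors and their importance probabilities p(V).
X = np.array([[1, 1, 1, 0, 0],
              [1, 1, 0, 1, 1],
              [1, 0, 0, 0, 0],
              [0, 0, 1, 1, 0]])
p = np.array([0.45, 0.35, 0.15, 0.05])

# One possible labelling (assumption): a vector counts as "hot" when its importance
# probability exceeds the mean; the classifier's class-1 probability then serves as K(V).
y = (p > p.mean()).astype(int)

model = LogisticRegression().fit(X, y)

def K(v):
    """Score in [0, 1] that feature vector v will be queried."""
    return float(model.predict_proba(np.asarray(v).reshape(1, -1))[0, 1])

print(round(K([1, 1, 1, 0, 0]), 3))  # a score in [0, 1]; higher means more likely to be queried
```
The association weight parameters of the table dimensions could be appended to X as additional input features; the sketch omits them for brevity.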
S103: and determining a feature vector score expected value according to the score of each feature vector.
In a specific example, obtaining the feature vector score expected value according to the score of each feature vector includes: converting the scores of the feature vectors into a histogram, wherein the histogram represents the probability distribution of each feature vector within the predetermined time; and obtaining the feature vector score expected value of the feature vector set within the predetermined time based on that probability distribution.
As a specific example, denote the predetermined time by Tn. Within Tn, the feature vectors of the query statements issued by users form the feature vector set Sn[V], and the score of each feature vector is calculated using the query hotness model K(V). The scores are converted into a histogram Hn, where Hn represents the probability distribution of each feature vector over the predetermined time Tn, and the mathematical expectation Fn of the feature vector set Sn[V] over Tn is calculated, namely the feature vector score expected value Fn.
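The expectation Fn can be computed directly from the scores; the sketch below uses a normalized NumPy histogram over [0, 1], with the bin count chosen arbitrarily for illustration.
```python
import numpy as np

def score_expectation(scores, bins=10):
    """Convert the scores for window Tn into a normalized histogram Hn and
    return the expectation Fn of the resulting score distribution."""
    scores = np.asarray(scores, dtype=float)
    counts, edges = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    probs = counts / counts.sum()              # probability mass of each bin (Hn)
    centers = (edges[:-1] + edges[1:]) / 2.0   # representative score of each bin
    return float(np.sum(probs * centers))      # Fn = E[score]

Fn = score_expectation([0.9, 0.85, 0.8, 0.75, 0.3, 0.2])
```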
S104: and screening target feature vectors meeting query expectations from the feature vector set based on the feature vector score expected values, and pre-calculating based on the target feature vectors.
In this example, screening target feature vectors that satisfy the query expectation from the feature vector set based on the feature vector score expected value includes: judging whether the score of each feature vector is smaller than the feature vector score expected value; and deleting from the feature vector set the feature vectors whose scores are smaller than the feature vector score expected value, and taking the feature vectors retained in the feature vector set as the target feature vectors satisfying the query expectation.
In this example, pre-computing based on the target feature vector includes: converting the target feature vector into a target feature combination; and pre-calculating based on the target feature combination to obtain and store a pre-calculation result.
Specifically, taking Fn+1 = Fn, where Fn+1 denotes the feature vector score expected value for the next time period, the feature vectors with K(V) < Fn+1 are discarded from the feature vector set S[V]. At this point, the feature vector set retains the dimension combinations Sn+1[V] that meet the users' query expectation; Sn+1[V] represents the feature vectors that users are expected to query in the next time period.
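The screening step then reduces to one comparison per vector; a minimal sketch, assuming the K and Fn defined in the earlier sketches.
```python
def screen_target_vectors(feature_vectors, K, fn_plus_1):
    """Discard every feature vector V with K(V) < Fn+1; the survivors form S_{n+1}[V],
    the dimension combinations expected to be queried in the next time period."""
    return [v for v in feature_vectors if K(v) >= fn_plus_1]

# Example with the K and Fn defined above (Fn+1 is taken equal to Fn):
# target_vectors = screen_target_vectors([[1, 1, 1, 0, 0], [0, 0, 1, 1, 0]], K, Fn)
```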
The pre-calculation results are then computed according to the feature vectors Sn+1[V] that users are expected to query. In this way, the dimension combinations that users do not query or query only with small probability are ignored and only the needed pre-calculation results are stored, which largely satisfies users' ad hoc query requirements while optimizing the storage space occupied by the pre-calculation results and reducing their number.
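To make the final step concrete, the sketch below turns each retained feature vector back into its dimension combination and emits one GROUP BY aggregation per combination as the pre-calculation; the fact table name and the measure are placeholders, not from the patent.
```python
TABLE_DIMENSIONS = ["province", "company", "channel", "paid_user", "subscription_type"]

def to_dimension_combination(vector, table_dimensions=TABLE_DIMENSIONS):
    """Inverse of the encoding step: the 1-positions name the queried dimensions."""
    return [dim for dim, flag in zip(table_dimensions, vector) if flag == 1]

def precompute_sql(vector, fact_table="t_order", measure="SUM(amount) AS total_amount"):
    """Build the aggregation whose result is stored as the pre-calculation for this vector."""
    dims = ", ".join(to_dimension_combination(vector))
    return f"SELECT {dims}, {measure} FROM {fact_table} GROUP BY {dims}"

print(precompute_sql([1, 1, 1, 0, 0]))
# -> SELECT province, company, channel, SUM(amount) AS total_amount FROM t_order GROUP BY province, company, channel
```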
According to the pre-calculation result generation method provided by the embodiment of the invention, the probability that each dimension combination will be queried is predicted according to the users' query behavior and the hotness of each dimension combination, and the dimension combinations that users do not query or query only with small probability are then ignored. The pre-calculation results are therefore generated more efficiently and, compared with pre-calculating every dimension combination, occupy less storage space while still meeting users' query requirements, which reduces the storage cost.
Fig. 2 is a block diagram of a pre-calculation result generation apparatus according to an embodiment of the present invention. As shown in fig. 2, the pre-calculation result generating apparatus according to an embodiment of the present invention includes: an acquisition module 210, a score determination module 220, an expected value determination module 230, and a pre-calculation module 240, wherein:
an obtaining module 210, configured to obtain a feature vector set, where the feature vector set is determined based on a combination of dimensions queried by a query statement within a predetermined time;
the score determining module 220 is configured to input the feature vector set into a query hotness model to obtain the score of each feature vector in the feature vector set, where the query hotness model is trained in advance based on static features of a data table and historical query statements;
a desired value determining module 230, configured to determine a desired value of the feature vector score according to the score of each feature vector;
the pre-calculation module 240 is configured to screen out a target feature vector that meets the query expectation from the feature vector set based on the feature vector score expected value, and perform pre-calculation based on the target feature vector.
In one embodiment of the present invention, the obtaining module 210 is specifically configured to:
acquiring a query statement within the preset time;
acquiring dimension combinations of queries in the query statement;
and converting the dimension combinations of the queries in the query statements into feature vectors, and obtaining the feature vector set from the converted feature vectors.
In one embodiment of the present invention, the expected value determining module 230 is specifically configured to:
converting the score of each feature vector into a histogram, wherein the histogram represents the probability distribution of each feature vector within the preset time;
and obtaining the expected value of the feature vector score of the feature vector set in the preset time based on the probability distribution of each feature vector in the preset time.
In one embodiment of the present invention, the pre-calculation module 240 is specifically configured to:
judging whether the score of each feature vector is smaller than the expected score value of the feature vector;
deleting the feature vectors with scores smaller than expected values of the feature vector scores from the feature vector set, and taking the feature vectors reserved in the feature vector set as target feature vectors meeting query expectations.
In one embodiment of the present invention, the pre-calculation module 240 is specifically configured to:
converting the target feature vector into a target feature combination;
and pre-calculating based on the target feature combination to obtain and store a pre-calculation result.
In one embodiment of the present invention, further comprising: a training module (not shown in fig. 2), specifically for:
and acquiring the static features of the data table and the historical query statements, and training the query hotness model based on the static features of the data table and the historical query statements.
The obtaining of the static features of the data table and the historical query statements, and the training of the query hotness model based on the static features of the data table and the historical query statements, comprises the following steps:
acquiring the static features of the data table, wherein the static features of the data table comprise the table dimensions, the dimension cardinality, the number of table rows and the table capacity, and obtaining an association weight parameter of a table dimension based on the dimension cardinality, the number of table rows and the table capacity;
acquiring the historical query statements, obtaining the historical dimension combinations queried in the historical query statements, converting the historical dimension combinations into historical feature vectors, and obtaining an importance probability of each historical feature vector based on the query frequency of the historical feature vector in the historical query statements and the number of historical query statements;
and training the query hotness model based on the association weight parameters of the table dimensions and the importance probabilities of the historical feature vectors.
According to the pre-calculation result generating apparatus provided by the embodiment of the invention, the probability that each dimension combination will be queried is predicted according to the users' query behavior and the hotness of each dimension combination, and the dimension combinations that users do not query or query only with small probability are then ignored. The pre-calculation results are therefore generated more efficiently and, compared with pre-calculating every dimension combination, occupy less storage space while still meeting users' query requirements, which reduces the storage cost.
It should be noted that, the specific implementation manner of the pre-calculation result generating device in the embodiment of the present invention is similar to the specific implementation manner of the pre-calculation result generating method in the embodiment of the present invention, please refer to the description of the method section specifically, and in order to reduce redundancy, details are not repeated here.
Based on the same inventive concept, a further embodiment of the present invention provides an electronic device, see fig. 3, comprising in particular: a processor 301, a memory 302, a communication interface 303, and a communication bus 304;
wherein, the processor 301, the memory 302, and the communication interface 303 complete communication with each other through the communication bus 304; the communication interface 303 is used for realizing information transmission between devices;
the processor 301 is configured to invoke the computer program in the memory 302, and when executing the computer program the processor implements all the steps of the pre-calculation result generation method described above, for example the following steps: acquiring a feature vector set, wherein the feature vector set is determined based on the dimension combinations queried by query statements within a predetermined time; inputting the feature vector set into a query hotness model to obtain the score of each feature vector in the feature vector set, wherein the query hotness model is trained in advance based on static features of a data table and historical query statements; determining a feature vector score expected value according to the score of each feature vector; and screening target feature vectors satisfying the query expectation from the feature vector set based on the feature vector score expected value, and performing pre-calculation based on the target feature vectors.
Based on the same inventive concept, a further embodiment of the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements all the steps of the pre-calculation result generation method described above, for example the following steps: acquiring a feature vector set, wherein the feature vector set is determined based on the dimension combinations queried by query statements within a predetermined time; inputting the feature vector set into a query hotness model to obtain the score of each feature vector in the feature vector set, wherein the query hotness model is trained in advance based on static features of a data table and historical query statements; determining a feature vector score expected value according to the score of each feature vector; and screening target feature vectors satisfying the query expectation from the feature vector set based on the feature vector score expected value, and performing pre-calculation based on the target feature vectors.
Further, the logic instructions in the memory described above may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, may essentially be embodied in the form of a software product. The computer software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiments of the invention. Those of ordinary skill in the art can understand and implement the embodiments without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on such understanding, the above technical solution, or the part that contributes to the prior art, may essentially be embodied in the form of a software product, which may be stored in a computer readable storage medium such as a ROM/RAM, a magnetic disk or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Furthermore, in the present disclosure, terms such as "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless specifically defined otherwise.
Moreover, in the present invention, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Furthermore, in the description herein, reference to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A pre-calculation result generation method, characterized by comprising:
acquiring a feature vector set, wherein the feature vector set is determined based on a dimension combination queried by a query statement in a preset time;
inputting the feature vector set into a query hotness model to obtain the score of each feature vector in the feature vector set, wherein the query hotness model is trained in advance based on static features of a data table and historical query statements;
determining a feature vector score expected value according to the score of each feature vector;
screening target feature vectors meeting query expectations from the feature vector set based on the feature vector score expected values, and pre-calculating based on the target feature vectors;
the obtaining the feature vector set includes:
acquiring a query statement within the preset time;
acquiring dimension combinations of queries in the query statement;
converting the dimension combinations of the queries in the query statements into feature vectors, and obtaining the feature vector set from the converted feature vectors;
the determining the expected value of the feature vector score according to the score of each feature vector comprises the following steps:
converting the score of each feature vector into a histogram, wherein the histogram represents the probability distribution of each feature vector within the preset time;
and obtaining the expected value of the feature vector score of the feature vector set in the preset time based on the probability distribution of each feature vector in the preset time.
2. The pre-calculation result generation method according to claim 1, wherein the screening the target feature vector satisfying the query expectation from the feature vector set based on the feature vector score expectation value includes:
judging whether the score of each feature vector is smaller than the expected score value of the feature vector;
deleting the feature vectors with scores smaller than expected values of the feature vector scores from the feature vector set, and taking the feature vectors reserved in the feature vector set as target feature vectors meeting query expectations.
3. The pre-calculation result generation method according to claim 1, wherein the pre-calculating based on the target feature vector includes:
converting the target feature vector into a target feature combination;
and pre-calculating based on the target feature combination to obtain and store a pre-calculation result.
4. The pre-calculation result generation method according to any one of claims 1 to 3, wherein before the inputting of the feature vector set into a query hotness model to obtain the score of each feature vector in the feature vector set, the method further comprises:
acquiring static features of the data table and historical query statements, and training the query hotness model based on the static features of the data table and the historical query statements.
5. The pre-calculation result generation method according to claim 4, wherein the acquiring of the static features of the data table and the historical query statements and the training of the query hotness model based on the static features of the data table and the historical query statements comprises:
acquiring the static features of the data table, wherein the static features of the data table comprise the table dimensions, the dimension cardinality, the number of table rows and the table capacity, and obtaining an association weight parameter of a table dimension based on the dimension cardinality, the number of table rows and the table capacity;
acquiring the historical query statements, obtaining the historical dimension combinations queried in the historical query statements, converting the historical dimension combinations into historical feature vectors, and obtaining an importance probability of each historical feature vector based on the query frequency of the historical feature vector in the historical query statements and the number of historical query statements;
and training the query hotness model based on the association weight parameters of the table dimensions and the importance probabilities of the historical feature vectors.
6. A pre-calculation result generation apparatus, characterized by comprising:
the acquisition module, configured to acquire a feature vector set, wherein the feature vector set is determined based on the dimension combinations queried by query statements within a preset time; the acquiring of the feature vector set comprises: acquiring the query statements within the preset time; obtaining the dimension combinations queried in the query statements; and converting the dimension combinations into feature vectors, and obtaining the feature vector set from the converted feature vectors;
the score determining module, configured to input the feature vector set into a query hotness model to obtain the score of each feature vector in the feature vector set, wherein the query hotness model is trained in advance based on static features of a data table and historical query statements;
the expected value determining module, configured to determine a feature vector score expected value according to the score of each feature vector, which comprises: converting the scores of the feature vectors into a histogram, wherein the histogram represents the probability distribution of each feature vector within the preset time; and obtaining the feature vector score expected value of the feature vector set within the preset time based on the probability distribution of each feature vector within the preset time;
and the pre-calculation module is used for screening target feature vectors meeting query expectations from the feature vector set based on the feature vector score expected values and carrying out pre-calculation based on the target feature vectors.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the pre-calculation result generation method according to any of claims 1 to 5 when executing the computer program.
8. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the pre-calculation result generation method according to any one of claims 1 to 5.
CN202110252174.7A 2021-03-08 2021-03-08 Pre-calculation result generation method and device, electronic equipment and storage medium Active CN112965991B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110252174.7A CN112965991B (en) 2021-03-08 2021-03-08 Pre-calculation result generation method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112965991A CN112965991A (en) 2021-06-15
CN112965991B true CN112965991B (en) 2023-12-08

Family

ID=76277408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110252174.7A Active CN112965991B (en) 2021-03-08 2021-03-08 Pre-calculation result generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112965991B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505276A (en) * 2021-06-21 2021-10-15 跬云(上海)信息科技有限公司 Scoring method, device, equipment and storage medium of pre-calculation model
CN113343059A (en) * 2021-06-29 2021-09-03 北京搜狗科技发展有限公司 Data processing method and related device
CN114356965B (en) * 2022-03-18 2022-06-14 杭州湖畔网络技术有限公司 Method, system, server and storage medium for generating dynamic form


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070162354A1 (en) * 2001-09-04 2007-07-12 Inventec Corporation Real-time electronic business transaction system and method for reporting STFC/FCT data to customer

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376143A (en) * 2018-01-11 2018-08-07 上海跬智信息技术有限公司 A kind of novel OLAP precomputations model and the method for generating precomputation result
CN110399395A (en) * 2018-04-18 2019-11-01 福建天泉教育科技有限公司 Speedup query method, storage medium based on precomputation
CN110457344A (en) * 2018-05-08 2019-11-15 北京三快在线科技有限公司 The generation of precomputation model, pre-computation methods, device, equipment and storage medium
CN110347754A (en) * 2019-06-05 2019-10-18 阿里巴巴集团控股有限公司 A kind of data query method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Dynamic selection of memory-materialized cuboids based on condensed data cubes (基于浓缩数据立方的内存实化小方的动态选择); 王元珍, 张晨静, 李曲, 冯剑琳; Application Research of Computers (计算机应用研究), No. 07; full text *

Also Published As

Publication number Publication date
CN112965991A (en) 2021-06-15


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant