CN116991901A - Data control system and method based on multidimensional database query

Info

Publication number: CN116991901A
Application number: CN202311238071.0A
Authority: CN
Inventors: 刘勇
Original assignee: Shenzhen Qinsi Technology Co ltd
Current assignee: Shenzhen Qinsi Technology Co ltd
Priority date: 2023-09-25
Filing date: 2023-09-25
Publication date: 2023-11-03

Abstract

The invention discloses a data control system and method based on multidimensional database query, relating to the technical field of databases. The system comprises a data conversion module, a deep learning module, a data segmentation module and a data query module. The data conversion module converts each data point into a feature vector comprising the dimension values and the metric value of the data point. The deep learning module trains a deep learning model so that the model can predict the metric value from the dimension values. The data segmentation module inputs the multidimensional database to be queried into the model; data points whose predicted metric value differs from the special value by less than a threshold are assigned to a sparse region, and the remaining data points are assigned to a non-sparse region. The data query module preferentially queries or calculates the data of the non-sparse region when a data query or calculation on the multidimensional database is executed. The invention can predict sparse regions in a multidimensional database and improve query efficiency.

Description

Data control system and method based on multidimensional database query

Technical Field

The invention relates to the technical field of databases, in particular to a data control system and method based on multidimensional database query.

Background

Multidimensional databases are mainly used in business intelligence and data analysis, especially in settings such as data warehouses and OLAP (online analytical processing) systems. Multidimensional databases allow data to be organized along multiple dimensions, such as time, place, product type and customer type. This organization lets a user query and analyze data flexibly and intuitively, for example: "What were the total sales of all televisions sold in Shanghai last month?" or "Which commodity was purchased most by VIP customers today?".

One major challenge faced by multidimensional databases is data sparsity: in real business scenarios, not every possible combination of dimension values has corresponding data. For example, suppose 100 products are sold worldwide at 1000 sales locations over 5 years. Storing one data item for every product, location and time combination requires 100 × 1000 × 5 = 500,000 data points at yearly granularity alone, and millions at monthly or daily granularity. In reality, however, many products have never been sold at some locations, or have no recorded sales for some periods. The multidimensional database is therefore sparse, i.e., many possible dimension combinations have no corresponding data.
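The size of the fully populated cube in the example above is the product of the dimension sizes, so it depends on the time granularity chosen. A minimal sketch of that arithmetic follows; the `populated_cells` figure in the usage line is a made-up number for illustration and does not come from the source.

```python
from math import prod


def dense_cell_count(dimension_sizes):
    """Number of cells in a fully populated cube (product of dimension sizes)."""
    return prod(dimension_sizes)


def sparsity(dimension_sizes, populated_cells):
    """Fraction of cells that hold no stored metric value."""
    return 1 - populated_cells / dense_cell_count(dimension_sizes)


# 100 products x 1000 locations x 5 years
yearly_cells = dense_cell_count([100, 1000, 5])
# Same cube with time stored per month (5 years = 60 months)
monthly_cells = dense_cell_count([100, 1000, 5 * 12])

# Hypothetical: only 50,000 of the yearly cells actually hold sales records.
example_sparsity = sparsity([100, 1000, 5], populated_cells=50_000)
```

Under these assumptions the yearly cube has 500,000 cells and the monthly cube 6,000,000, which shows how quickly the number of potentially empty cells grows with finer granularity.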

This sparsity affects the query efficiency of a multidimensional database. For example, a query along a single dimension may encounter a large number of data points that hold no data, yet it is answered by traversing all data points and listing those that exist, which greatly reduces efficiency. Likewise, a calculation such as summing the data along a dimension must traverse all data points, find those that exist, and then add them, which also greatly reduces efficiency.

To process sparse data in a multidimensional database efficiently, the key is to determine where the sparse data lies. The usual approach is to traverse the multidimensional database and check whether the metric value of each data point exists; a data point whose metric value does not exist is a "blank". This process can be automated programmatically, but for a large multidimensional database the traversal consumes a great deal of computing resources and time. Moreover, because the database is constantly changing, a traversal cannot capture the pattern by which "blanks" occur and may have to be repeated. The invention instead uses a deep learning model to learn the pattern by which "blanks" occur, so that the pattern remains usable even as the database changes.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a data control system and a data control method based on multidimensional database query, so that a deep learning model is adopted to grasp the rule of existence of sparse data, and the efficiency of data query is improved.

In order to achieve the above purpose, the invention adopts the following technical scheme:

A data control method based on multidimensional database query comprises the following steps:

S1: converting each data point of the multidimensional database used for training into a feature vector, wherein the feature vector comprises dimension values representing the position of the data point and a metric value representing the measure of the data point, and for missing data points the metric value is represented by one and the same predefined special value;

S2: training a deep learning model with the dimension values of the data points as input and the metric values of the data points as labels, so that the deep learning model can predict the metric value from the dimension values;

S3: inputting the multidimensional database to be queried into the trained deep learning model; if the difference between the predicted metric value of a data point and the predefined special value set in S1 is smaller than a preset threshold, assigning the data point to a sparse region, and assigning all other data points to a non-sparse region; the multidimensional database to be queried and the multidimensional database used for training have the same dimensions and the same metric meaning;

S4: when a data query or calculation on the multidimensional database of step S3 is executed, preferentially querying or calculating the data of the non-sparse region obtained in step S3.

In some embodiments, the deep learning model is selected as an LSTM model.

In some embodiments, the multidimensional database used for training consists of partial data extracted from the multidimensional database to be queried.

In some embodiments, the multidimensional database used for training consists of data in the multidimensional database to be queried over a period of time.

The invention also discloses a data control system based on multidimensional database query, which comprises:

a data conversion module, used for converting each data point of the multidimensional database used for training into a feature vector comprising dimension values representing the position of the data point and a metric value representing the measure of the data point, wherein for missing data points the metric value is represented by one and the same predefined special value;

a deep learning module, used for training the deep learning model with the dimension values of the data points as input and the metric values of the data points as labels, so that the deep learning model can predict the metric value from the dimension values;

a data segmentation module, used for inputting the multidimensional database to be queried into the trained deep learning model; if the difference between the predicted metric value of a data point and the predefined special value set in the data conversion module is smaller than a preset threshold, the data point is assigned to a sparse region, and all other data points are assigned to a non-sparse region; the multidimensional database to be queried and the multidimensional database used for training have the same dimensions and the same metric meaning;

a data query module, used for preferentially querying or calculating the data of the non-sparse region obtained by the data segmentation module when a data query or calculation on the multidimensional database to be queried is executed.

In some embodiments, the deep learning model used by the deep learning module is an LSTM model.

The invention has the advantages compared with the prior art that:

1) Conventional methods often require traversing the entire database to find and mark missing data when processing the missing data, which is time consuming and costly in large-scale databases. According to the invention, by training the deep learning model, which data points are likely to be missing data can be directly predicted, so that the process of traversing and searching is avoided, and the efficiency is greatly improved.

2) Beyond predicting missing data, the trained model captures where data is missing, divides the database into sparse and non-sparse regions, and reveals patterns in the missing data. This offers useful guidance for understanding why data is missing and for further improving the data collection and processing flow.

3) By dividing the data into a sparse region and a non-sparse region, the invention can preferentially process the data of the non-sparse region when a data query or calculation is executed, improving data processing efficiency. Because the data of the non-sparse region is more complete and accurate, processing it first also reduces the impact of mispredictions and improves the quality of query results.

4) Different deep learning models, such as an LSTM model, can be selected for training according to specific business requirements and data characteristics. In addition, model performance can be optimized by tuning the model's parameters and hyperparameters, so that the model better fits the business requirements.

5) The invention comprises a data conversion module, a deep learning module, a data segmentation module and a data query module, which together form a systematic processing flow that enables end-to-end management of the multidimensional database and improves the efficiency and convenience of data processing.

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of the system of the present invention.

Description of the embodiments

The following describes specific embodiments of the present invention with reference to the drawings.

Schematic diagrams of the method and system of the present invention are shown in FIG. 1 and FIG. 2, respectively.

The invention discloses a data control method based on multidimensional database query, which comprises the following steps:

S3: inputting the multidimensional database to be queried into the trained deep learning model; if the difference between the predicted metric value of a data point and the predefined special value set in S1 is smaller than a preset threshold, assigning the data point to a sparse region, and assigning all other data points to a non-sparse region; the multidimensional database to be queried and the multidimensional database used for training have the same dimensions and the same metric meaning;

Wherein, in some embodiments, the deep learning model is selected as the LSTM model.

Wherein in some embodiments the multidimensional database for training is composed of partial data extracted from the multidimensional database to be queried.

Wherein in some embodiments the multidimensional database used for training is data in the multidimensional database to be queried over a period of time.

The invention relates to a data control system based on multidimensional database query, which comprises:

a data segmentation module, used for inputting the multidimensional database to be queried into the trained deep learning model; if the difference between the predicted metric value of a data point and the predefined special value set in the data conversion module is smaller than a preset threshold, the data point is assigned to a sparse region, and all other data points are assigned to a non-sparse region; the multidimensional database to be queried and the multidimensional database used for training have the same dimensions and the same metric meaning;

Wherein, in some embodiments, the deep learning model used by the deep learning module is an LSTM model.

Wherein in some embodiments the multidimensional database for training consists of data in the multidimensional database to be queried over a period of time.

In the following embodiments, we consider a specific scenario:

Consider an online e-commerce platform, whose business data can be regarded as a multidimensional database. For example, three dimensions may represent "commodity category", "time" (e.g., year, month, day) and "region", respectively, and the metric value may represent "sales". The metric value may be missing, i.e., there are no sales, for certain points in time, certain commodity categories, or certain regions.

First, each data point of the multidimensional database is converted into a feature vector. Each feature vector includes the commodity category, time and region as dimension values, and the corresponding sales as the metric value. For missing sales, the same predefined special value, such as 0, is used as the metric value.
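The conversion step can be sketched as follows. This is a minimal illustration, assuming the cube is held as a dictionary keyed by dimension-value tuples; the names `to_feature_vectors`, `dimensions` and `facts`, and all sample values, are hypothetical and not taken from the source.

```python
from itertools import product

MISSING = 0.0  # predefined special value for missing sales (the text uses 0)


def to_feature_vectors(dimensions, facts, missing=MISSING):
    """Yield (dimension_values, metric_value) for every cell of the cube.

    dimensions: dict mapping dimension name -> list of member values
    facts: dict mapping a tuple of dimension values -> recorded sales
    Cells without a recorded fact receive the special value.
    """
    names = list(dimensions)
    for coords in product(*(dimensions[name] for name in names)):
        yield coords, facts.get(coords, missing)


# Hypothetical toy cube: 2 categories x 2 months x 2 regions = 8 cells.
dimensions = {
    "category": ["tv", "phone"],
    "month": ["2023-01", "2023-02"],
    "region": ["Beijing", "Shanghai"],
}
facts = {
    ("tv", "2023-01", "Shanghai"): 120.0,
    ("phone", "2023-02", "Beijing"): 45.0,
}
vectors = list(to_feature_vectors(dimensions, facts))
```

Every cell of the cube gets a vector, so the six cells without sales records appear explicitly with the special value rather than being absent.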

A deep learning model suitable for processing sequence data may then be trained, with the commodity category, time and region as input and sales as the label, so that the deep learning model can predict sales from the commodity category, time and region.
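The input/label arrangement of this training step can be illustrated without any deep learning library. The sketch below is explicitly a stand-in: instead of the deep learning model described in the source, it fits a plain linear regressor by stochastic gradient descent on one-hot encoded dimension values. All function names, hyperparameter values and sample data are assumptions chosen for illustration.

```python
import random


def one_hot(coords, vocab):
    """Concatenate one one-hot block per dimension."""
    vec = []
    for value, members in zip(coords, vocab):
        vec.extend(1.0 if value == m else 0.0 for m in members)
    return vec


def predict(x, w, b):
    """Linear prediction: bias plus weighted sum of the encoded features."""
    return b + sum(wi * xi for wi, xi in zip(w, x))


def train(samples, vocab, epochs=200, lr=0.05, seed=0):
    """Fit weights by SGD on squared error (stand-in for the deep model)."""
    rng = random.Random(seed)
    data = [(one_hot(coords, vocab), label) for coords, label in samples]
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = predict(x, w, b) - y
            b -= lr * err
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w, b


# Dimension values are the input; sales are the label.
vocab = [["tv", "phone"], ["Beijing", "Shanghai"]]
samples = [
    (("tv", "Beijing"), 10.0),
    (("tv", "Shanghai"), 10.0),
    (("phone", "Beijing"), 0.0),   # missing cell, stored as special value 0
    (("phone", "Shanghai"), 0.0),  # missing cell, stored as special value 0
]
w, b = train(samples, vocab)
tv_pred = predict(one_hot(("tv", "Beijing"), vocab), w, b)
phone_pred = predict(one_hot(("phone", "Shanghai"), vocab), w, b)
```

On this toy data the stand-in learns that "phone" cells sit near the special value while "tv" cells do not, which is the kind of regularity the trained model is meant to capture. A real implementation would swap `train` and `predict` for the chosen deep learning model.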

After this, the multidimensional database to be queried is input into the trained deep learning model, which predicts a sales value for each queried data point. If the difference between the predicted sales and 0 is less than a preset threshold, e.g., 0.1, the data point is assigned to the sparse region; otherwise it is assigned to the non-sparse region.

It should be noted that the multidimensional database to be queried and the multidimensional database used for training should come from the same business scenario or system and have the same dimensions and the same metric meaning. For example, the multidimensional database to be queried should include only the same set of regions as the training database, such as Beijing and Shanghai (in practice, of course, e-commerce now covers essentially every city in China), and the metric in both is sales. In practice, a portion of the data (e.g., the data of the most recent two years) may be extracted from the multidimensional database to be queried to serve as the training multidimensional database.

The trained deep learning model can capture certain patterns in sales.
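The threshold rule can be sketched as a small partition function. The `predict` callable stands in for the trained model, and the predicted values in the usage example are invented for illustration.

```python
def split_by_sparsity(cells, predict, special_value=0.0, threshold=0.1):
    """Partition cells into (sparse, non_sparse) by predicted metric value.

    A cell is sparse when |prediction - special_value| < threshold.
    """
    sparse, non_sparse = [], []
    for coords in cells:
        if abs(predict(coords) - special_value) < threshold:
            sparse.append(coords)
        else:
            non_sparse.append(coords)
    return sparse, non_sparse


# Hypothetical predictions standing in for the trained model's output.
predicted_sales = {
    ("tv", "2023-01", "Shanghai"): 118.4,
    ("tv", "2023-01", "Beijing"): 0.03,
    ("phone", "2023-01", "Beijing"): 0.25,
}
sparse, non_sparse = split_by_sparsity(predicted_sales, predicted_sales.get)
```

With the example threshold of 0.1, only the cell predicted at 0.03 lands in the sparse region; the cell predicted at 0.25 stays non-sparse even though its value is small.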

Finally, when data query or calculation is executed, the data of the non-sparse area is preferentially queried or calculated. This can improve the efficiency of data processing.
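One possible reading of "preferentially" is an ordering: cells predicted non-sparse are read first, and cells predicted sparse are read afterwards or skipped. The sketch below takes that reading; the `check_sparse` flag, the dictionary-backed `store` and the sample data are assumptions, not part of the source.

```python
def total_sales(store, non_sparse, sparse, matches, check_sparse=True):
    """Sum stored sales over the cells accepted by `matches`.

    Cells predicted non-sparse are read first. Cells predicted sparse are
    read afterwards only when check_sparse is True, which guards against
    mispredictions at the cost of extra reads.
    Returns (total, number_of_cells_read).
    """
    ordered = list(non_sparse) + (list(sparse) if check_sparse else [])
    total, reads = 0.0, 0
    for coords in ordered:
        if matches(coords):
            reads += 1
            total += store.get(coords, 0.0)
    return total, reads


# Hypothetical data: two populated cells and two cells predicted sparse.
store = {("tv", "Shanghai"): 120.0, ("tv", "Beijing"): 80.0}
non_sparse = [("tv", "Shanghai"), ("tv", "Beijing")]
sparse = [("tv", "Lhasa"), ("phone", "Lhasa")]


def is_tv(coords):
    return coords[0] == "tv"


fast = total_sales(store, non_sparse, sparse, is_tv, check_sparse=False)
full = total_sales(store, non_sparse, sparse, is_tv, check_sparse=True)
```

Here both calls return the same total, but the fast path reads two cells instead of three. Whether the sparse region may be skipped entirely or only deferred is a design choice the source leaves open.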

More specifically, building the model further includes:

1) Data preprocessing:

For the two categorical features, "commodity category" and "region", one-hot encoding can be used to convert them into numerical features. If the number of categories is too large, other encoding schemes, such as label encoding or target encoding, may also be considered.
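A minimal sketch of the two simplest encodings follows, using only the standard library. The helper names and the sample region list are illustrative assumptions.

```python
def fit_vocabulary(values):
    """Label encoding: map each distinct category to a stable integer index."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def one_hot(value, vocabulary):
    """One-hot encoding: a vector with a single 1 at the category's index."""
    vec = [0] * len(vocabulary)
    vec[vocabulary[value]] = 1
    return vec


# Hypothetical sample of the "region" dimension.
regions = ["Shanghai", "Beijing", "Shanghai", "Shenzhen"]
region_vocab = fit_vocabulary(regions)
encoded = [one_hot(region, region_vocab) for region in regions]
```

The one-hot vector grows with the number of categories, which is why a more compact encoding may be preferable when a dimension has many members.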

Time can be split into sub-features such as year, month and day. Depending on the actual situation, derived features may also be created, such as quarter, week number, weekend and holiday.

Missing sales data may be set to the special value 0, but the effect these special values may have on model training should also be considered.
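The time features listed above can be derived with the standard `datetime` module. This is a sketch; the `holidays` argument and the sample holiday date are assumptions supplied by the caller, not a built-in calendar.

```python
from datetime import date


def time_features(d, holidays=frozenset()):
    """Split a date into basic and derived features."""
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "quarter": (d.month - 1) // 3 + 1,
        "week": d.isocalendar()[1],      # ISO 8601 week number
        "is_weekend": d.weekday() >= 5,  # Monday is 0, Sunday is 6
        "is_holiday": d in holidays,
    }


# Hypothetical holiday set containing a single date.
features = time_features(date(2023, 9, 25), holidays={date(2023, 10, 1)})
```

Boolean features such as `is_weekend` can be fed to a model as 0/1 values, and the holiday set would come from whatever calendar the business uses.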

2) Model selection and parameter setting:

In the above scenario, sales of each commodity may be affected by time (e.g., the day of the year, the day of the week, or the time of day). Different commodity categories and regions may also produce different sales patterns. Thus, a model that captures temporal dependencies and handles categorical features, such as an LSTM, may be used. Hyperparameters of the LSTM, such as the number of hidden units, the learning rate and the batch size, can be optimized by means such as cross-validation and grid search.

3) Training and predicting:

The processed feature data and the corresponding sales labels are input into the model for training, and the trained model is used to predict future sales. A threshold, e.g., 0.1, may be set: when the difference between the predicted value and 0 is less than the threshold, the data point is assigned to the sparse region; otherwise, it is assigned to the non-sparse region.

4) Post-processing and optimization:

When querying or calculating, the data of the non-sparse region can be processed first, which improves data processing efficiency and reduces the impact of mispredictions. As business data is continuously updated, the model needs to be periodically retrained and optimized with new data, for example by incremental learning.

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification of the technical solution and inventive concept made by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims

1. The data control method based on multidimensional database query is characterized by comprising the following steps:

2. The method of claim 1, wherein the deep learning model is selected as an LSTM model.

3. The method for data control based on multi-dimensional database query according to claim 1, wherein the multi-dimensional database for training is composed of partial data extracted from the multi-dimensional database to be queried.

4. A data control method based on multidimensional database queries according to claim 3, characterized in that the multidimensional database for training consists of data in the multidimensional database to be queried over a certain period of time.

5. A data control system based on multidimensional database queries, the system comprising:

6. The multidimensional database query-based data control system of claim 5, wherein the deep learning model used by the deep learning module is an LSTM model.

7. The multidimensional database query-based data control system of claim 5, wherein the multidimensional database for training is comprised of portions of data extracted from the multidimensional database to be queried.

8. The multidimensional database query-based data control system of claim 5, wherein the multidimensional database for training consists of data in the multidimensional database to be queried over a period of time.