CN116991901A - Data control system and method based on multidimensional database query - Google Patents

Data control system and method based on multidimensional database query

Publication number
CN116991901A
Authority
CN
China
Prior art keywords
data
multidimensional database
value
deep learning
data point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311238071.0A
Other languages
Chinese (zh)
Inventor
刘勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Qinsi Technology Co ltd
Original Assignee
Shenzhen Qinsi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Qinsi Technology Co ltd
Priority to CN202311238071.0A
Publication of CN116991901A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Abstract

The invention discloses a data control system and method based on multidimensional database query, relating to the technical field of databases. The system comprises a data conversion module, a deep learning module, a data segmentation module and a data query module. The data conversion module converts each data point into a feature vector comprising the dimension values and the metric value of the data point. The deep learning module trains a deep learning model so that the model can predict the metric value from the dimension values. The data segmentation module inputs the multidimensional database to be queried into the model; data points whose predicted metric value differs from the special value that marks missing data by less than a threshold are assigned to a sparse region, and the remaining data points are assigned to a non-sparse region. The data query module preferentially queries or calculates the data of the non-sparse region when a data query or calculation is executed on the multidimensional database. The invention can predict the sparse regions of a multidimensional database and improve query efficiency.

Description

Data control system and method based on multidimensional database query
Technical Field
The invention relates to the technical field of databases, in particular to a data control system and method based on multidimensional database query.
Background
Multidimensional databases are mainly used in business intelligence and data analysis, especially in data warehouse and OLAP (online analytical processing) systems. A multidimensional database allows data to be organized along multiple dimensions, such as time, place, product type and customer type. This organization lets a user query and analyze data flexibly and intuitively, for example: "What were the total sales of all televisions sold in Shanghai last month?" or "Which commodity was purchased most by VIP customers today?".
One major challenge faced by multidimensional databases is the sparsity of the data. In an actual business scenario, not every possible combination of dimension values has corresponding data. For example, assume 100 products are sold worldwide at 1000 sales locations over 5 years. If one data item is stored for every possible combination of product, location and time, there are 100 × 1000 × 5 = 500,000 data points even at yearly granularity, and millions at monthly or daily granularity. In reality, however, many products may never have been sold at some locations, or may have no recorded sales for some period of time. This makes the multidimensional database sparse, i.e., many possible combinations of dimension values have no corresponding data.
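The size of the cube in this example can be checked with a short calculation. The sketch below is illustrative only: the time granularities (yearly, monthly, daily) and the figure of 300,000 recorded cells are assumptions introduced here, not values from the source.

```python
from math import prod


def dense_cell_count(cardinalities):
    """Number of cells in a fully materialised cube: product of dimension sizes."""
    return prod(cardinalities)


def sparsity(recorded_cells, total_cells):
    """Fraction of cells that hold no data."""
    return 1 - recorded_cells / total_cells


products, locations, years = 100, 1000, 5

# The cell count depends heavily on the assumed time granularity.
yearly = dense_cell_count([products, locations, years])
monthly = dense_cell_count([products, locations, years * 12])
daily = dense_cell_count([products, locations, years * 365])

# Hypothetical figure: suppose only 300,000 monthly cells actually hold sales.
example_sparsity = sparsity(300_000, monthly)
```

Under these assumptions, 95% of the monthly cells would be blank, which is the kind of sparsity the rest of the description is concerned with.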
This sparsity affects the query efficiency of a multidimensional database. For example, a query over a single dimension may touch a large number of data points that hold no data; if the query is executed by traversing every data point and listing those that exist, efficiency drops sharply. Likewise, a calculation such as summing the data along a dimension must traverse all data points, pick out those that exist, and only then add them, which is equally inefficient.
To process sparse data in a multidimensional database efficiently, the key is to determine where the sparse data lies. The usual approach is to traverse the multidimensional database and check whether the metric value of each data point exists; a data point whose metric value does not exist is a "blank". This process can be automated by a program, but for a large multidimensional database the traversal consumes a great deal of computing resources and time. Moreover, the database changes constantly, and a traversal cannot capture the pattern that governs where blanks occur, so the traversal may have to be repeated. The invention therefore uses a deep learning model to learn the pattern of blanks, so that the pattern can still be grasped when the database changes.
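For reference, the conventional traversal described above might look like the following minimal sketch. It assumes the cube is held as a Python dictionary keyed by coordinate tuples; the function name, the dimension members and the sales figures are all hypothetical.

```python
from itertools import product


def find_blanks_by_traversal(dimensions, facts):
    """Visit every coordinate of the cube and collect those with no metric value.

    The cost grows with the product of the dimension sizes, which is why
    this baseline becomes expensive for large cubes.
    """
    blanks = []
    for coord in product(*dimensions):
        if coord not in facts:
            blanks.append(coord)
    return blanks


# Hypothetical 2 x 2 x 2 cube with only two recorded sales figures.
dimensions = [
    ["tv", "phone"],
    ["2023-01", "2023-02"],
    ["Shanghai", "Beijing"],
]
facts = {
    ("tv", "2023-01", "Shanghai"): 120.0,
    ("phone", "2023-02", "Beijing"): 80.0,
}
blanks = find_blanks_by_traversal(dimensions, facts)
```

Every change to `facts` would require running the scan again, which is the repetition the paragraph above refers to.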
Disclosure of Invention
The technical problem to be solved by the invention is to provide a data control system and method based on multidimensional database query that use a deep learning model to learn the pattern governing where sparse data occurs, thereby improving the efficiency of data queries.
To achieve the above purpose, the invention adopts the following technical scheme:
A data control method based on multidimensional database query comprises the following steps:
S1: converting each data point of the multidimensional database used for training into a feature vector, wherein the feature vector comprises dimension values representing the position of the data point and a metric value representing the measure of the data point, and for a missing data point the metric value is represented by a single predefined special value;
S2: training a deep learning model with the dimension values of each data point as input and the metric value of the data point as the label, so that the deep learning model can predict the metric value from the dimension values;
S3: inputting the multidimensional database to be queried into the trained deep learning model; if the difference between the predicted metric value of a data point and the predefined special value set in S1 is smaller than a preset threshold, assigning the data point to a sparse region, and assigning all other data points to a non-sparse region; wherein the multidimensional database to be queried and the multidimensional database used for training have the same dimensions and the same metric value meaning;
S4: when a data query or calculation is executed on the multidimensional database of step S3, preferentially querying or calculating the data of the non-sparse region segmented in step S3.
In some embodiments, the deep learning model is selected as an LSTM model.
In some embodiments, the multidimensional database used for training consists of partial data extracted from the multidimensional database to be queried.
In some embodiments, the multidimensional database used for training consists of data in the multidimensional database to be queried over a period of time.
The invention also discloses a data control system based on multidimensional database query, which comprises:
a data conversion module, configured to convert each data point of the multidimensional database used for training into a feature vector, wherein the feature vector comprises dimension values representing the position of the data point and a metric value representing the measure of the data point, and for a missing data point the metric value is represented by a single predefined special value;
a deep learning module, configured to train a deep learning model with the dimension values of each data point as input and the metric value of the data point as the label, so that the deep learning model can predict the metric value from the dimension values;
a data segmentation module, configured to input the multidimensional database to be queried into the trained deep learning model, to assign a data point to a sparse region if the difference between its predicted metric value and the predefined special value set in the data conversion module is smaller than a preset threshold, and to assign all other data points to a non-sparse region; wherein the multidimensional database to be queried and the multidimensional database used for training have the same dimensions and the same metric value meaning;
a data query module, configured to preferentially query or calculate the data of the non-sparse region segmented by the data segmentation module when a data query or calculation is executed on the multidimensional database to be queried.
In some embodiments, the deep learning model used by the deep learning module is an LSTM model.
In some embodiments, the multidimensional database used for training consists of partial data extracted from the multidimensional database to be queried.
In some embodiments, the multidimensional database used for training consists of data in the multidimensional database to be queried over a period of time.
Compared with the prior art, the invention has the following advantages:
1) When handling missing data, conventional methods often have to traverse the entire database to find and mark it, which is time-consuming and costly for large databases. By training a deep learning model, the invention directly predicts which data points are likely to be missing, avoiding the traversal and greatly improving efficiency.
2) Beyond predicting missing data, the trained model divides the database into sparse and non-sparse regions and thereby captures patterns in how data goes missing, which helps in understanding the causes of missing data and in improving data collection and processing.
3) Because the data is divided into a sparse region and a non-sparse region, the data of the non-sparse region can be processed first when a query or calculation is executed, improving processing efficiency. The data in the non-sparse region is also more complete and accurate, so processing it first reduces the influence of mispredictions and improves the quality of query results.
4) Different deep learning models, such as an LSTM model, can be selected for training according to specific business requirements and data characteristics. The parameters and hyperparameters of the model can also be tuned so that the model better fits the business requirements.
5) The data conversion module, deep learning module, data segmentation module and data query module form a systematic processing flow that manages the multidimensional database end to end and improves the efficiency and convenience of data processing.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the system of the present invention.
Description of the embodiments
The following describes specific embodiments of the present invention with reference to the drawings.
Schematic diagrams of the method and system of the present invention are shown in fig. 1-2, respectively.
The invention discloses a data control method based on multidimensional database query, which comprises the following steps:
S1: converting each data point of the multidimensional database used for training into a feature vector, wherein the feature vector comprises dimension values representing the position of the data point and a metric value representing the measure of the data point, and for a missing data point the metric value is represented by a single predefined special value;
S2: training a deep learning model with the dimension values of each data point as input and the metric value of the data point as the label, so that the deep learning model can predict the metric value from the dimension values;
S3: inputting the multidimensional database to be queried into the trained deep learning model; if the difference between the predicted metric value of a data point and the predefined special value set in S1 is smaller than a preset threshold, assigning the data point to a sparse region, and assigning all other data points to a non-sparse region; wherein the multidimensional database to be queried and the multidimensional database used for training have the same dimensions and the same metric value meaning;
S4: when a data query or calculation is executed on the multidimensional database of step S3, preferentially querying or calculating the data of the non-sparse region segmented in step S3.
In some embodiments, the deep learning model is an LSTM model.
In some embodiments, the multidimensional database used for training consists of partial data extracted from the multidimensional database to be queried.
In some embodiments, the multidimensional database used for training consists of the data in the multidimensional database to be queried over a period of time.
The invention also relates to a data control system based on multidimensional database query, which comprises:
a data conversion module, configured to convert each data point of the multidimensional database used for training into a feature vector, wherein the feature vector comprises dimension values representing the position of the data point and a metric value representing the measure of the data point, and for a missing data point the metric value is represented by a single predefined special value;
a deep learning module, configured to train a deep learning model with the dimension values of each data point as input and the metric value of the data point as the label, so that the deep learning model can predict the metric value from the dimension values;
a data segmentation module, configured to input the multidimensional database to be queried into the trained deep learning model, to assign a data point to a sparse region if the difference between its predicted metric value and the predefined special value set in the data conversion module is smaller than a preset threshold, and to assign all other data points to a non-sparse region; wherein the multidimensional database to be queried and the multidimensional database used for training have the same dimensions and the same metric value meaning;
a data query module, configured to preferentially query or calculate the data of the non-sparse region segmented by the data segmentation module when a data query or calculation is executed on the multidimensional database to be queried.
In some embodiments, the deep learning model used by the deep learning module is an LSTM model.
In some embodiments, the multidimensional database used for training consists of partial data extracted from the multidimensional database to be queried.
In some embodiments, the multidimensional database used for training consists of the data in the multidimensional database to be queried over a period of time.
The following embodiment considers a specific scenario.
Consider an online e-commerce platform whose business data can be regarded as a multidimensional database. For example, three dimensions may represent "commodity category", "time" (e.g., year, month, day) and "region", and the metric value may represent "sales". The metric value may be missing, i.e., there are no sales, for some combination of time, commodity category and region.
First, each data point of the multidimensional database is converted into a feature vector. Each feature vector contains the commodity category, time and region as dimension values and the corresponding sales as the metric value. For missing sales, a single predefined special value, such as 0, represents the metric value.
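The conversion step can be sketched as follows. This is a minimal illustration, assuming the cube is a dictionary keyed by (category, time, region) tuples and that each dimension member is mapped to an integer index; the names `to_feature_vectors` and `SPECIAL_VALUE` and the sample data are hypothetical.

```python
from itertools import product

SPECIAL_VALUE = 0.0  # predefined value standing in for a missing metric


def to_feature_vectors(dimensions, facts, special_value=SPECIAL_VALUE):
    """Turn every cell of the cube into [dim_index_0, ..., dim_index_n, metric]."""
    # Assumption: each dimension member is encoded by its position in the list.
    index = [{member: i for i, member in enumerate(members)}
             for members in dimensions]
    vectors = []
    for coord in product(*dimensions):
        dims = [index[d][member] for d, member in enumerate(coord)]
        # Missing cells all receive the same special value.
        vectors.append(dims + [facts.get(coord, special_value)])
    return vectors


dimensions = [
    ["tv", "phone"],           # commodity category
    ["2023-01", "2023-02"],    # time
    ["Shanghai", "Beijing"],   # region
]
facts = {("tv", "2023-01", "Shanghai"): 120.0}
vectors = to_feature_vectors(dimensions, facts)
```

The last element of each vector is the label used in training, and the leading elements are the inputs.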
A deep learning model suited to sequence data is then trained. The commodity category, time and region serve as input and the sales serve as the label, so that the model can predict sales from the commodity category, time and region.
Next, the multidimensional database to be queried is input into the trained deep learning model, which predicts a sales value for each queried data point. If the difference between the predicted sales and 0 is smaller than a preset threshold, e.g., 0.1, the data point is assigned to the sparse region; otherwise it is assigned to the non-sparse region.
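The threshold test can be sketched as below. The predictions are hand-written stand-ins for model output, and reading "difference" as the absolute difference is an assumption made here; the source does not say how negative predictions are treated.

```python
def split_regions(predictions, special_value=0.0, threshold=0.1):
    """Assign each coordinate to the sparse or non-sparse region.

    Assumption: "difference" is taken as the absolute difference between
    the predicted metric and the special value.
    """
    sparse, non_sparse = [], []
    for coord, predicted in predictions.items():
        if abs(predicted - special_value) < threshold:
            sparse.append(coord)
        else:
            non_sparse.append(coord)
    return sparse, non_sparse


# Hypothetical model outputs for four cells.
predictions = {
    ("tv", "2023-01", "Shanghai"): 118.4,
    ("tv", "2023-01", "Beijing"): 0.03,
    ("phone", "2023-01", "Beijing"): -0.05,
    ("phone", "2023-02", "Shanghai"): 0.1,
}
sparse, non_sparse = split_regions(predictions)
```

Because the comparison is strict, a prediction exactly at the threshold (0.1 here) falls into the non-sparse region.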
It should be noted that the multidimensional database to be queried and the multidimensional database used for training should come from the same business scenario or system and have the same dimensions and the same metric value meaning. For example, if the training database covers only three regions, such as Beijing and Shanghai, the database to be queried should also cover only those regions (in practice, of course, e-commerce today reaches nearly every city in China), and the metric value in both is sales. In practice, part of the data (e.g., the data of the most recent two years) may be extracted from the multidimensional database to be queried to serve as the multidimensional database for training.
The trained deep learning model can capture certain patterns in the sales.
Finally, when a data query or calculation is executed, the data of the non-sparse region is queried or calculated first, which improves the efficiency of data processing.
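One possible reading of "preferentially" is sketched below: an aggregate first visits only the cells of the non-sparse region and touches the sparse region only if the caller asks for a second pass. This interpretation, the function name `prioritized_sum` and the sample data are assumptions, not details given in the source.

```python
def prioritized_sum(facts, non_sparse, sparse, include_sparse=False):
    """Sum the metric over the non-sparse region first.

    Returns (total, lookups) so the saving in visited cells is visible.
    The optional second pass covers cells the model may have mispredicted.
    """
    total = 0.0
    lookups = 0
    for coord in non_sparse:
        total += facts.get(coord, 0.0)
        lookups += 1
    if include_sparse:
        for coord in sparse:
            total += facts.get(coord, 0.0)
            lookups += 1
    return total, lookups


facts = {
    ("tv", "2023-01", "Shanghai"): 120.0,
    ("phone", "2023-02", "Shanghai"): 80.0,
    ("tv", "2023-01", "Beijing"): 5.0,  # a cell the model wrongly marked sparse
}
non_sparse = [("tv", "2023-01", "Shanghai"), ("phone", "2023-02", "Shanghai")]
sparse = [("tv", "2023-01", "Beijing"), ("phone", "2023-01", "Beijing")]

fast_total, fast_lookups = prioritized_sum(facts, non_sparse, sparse)
full_total, full_lookups = prioritized_sum(facts, non_sparse, sparse,
                                           include_sparse=True)
```

The gap between `fast_total` and `full_total` shows the cost of a misprediction: skipping the sparse region saves lookups but can miss real data.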
More specifically, the method further includes:
1) Data preprocessing:
The two categorical features, "commodity category" and "region", can be converted into numerical features using one-hot encoding. If the number of categories is too large, other encoding schemes, such as label encoding or target encoding, may be considered.
Time can be split into sub-features such as year, month and day. Depending on the actual situation, derived features such as quarter, week number, weekend flag and holiday flag may also be created.
Missing sales may be set to the special value 0, but the effect these special values may have on model training should also be considered.
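The preprocessing described above can be sketched with the standard library alone. The helper names, the vocabularies and the choice of ISO week numbers are assumptions for illustration; a holiday flag is omitted because it would need a calendar that the source does not specify.

```python
from datetime import date


def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 vector over a fixed vocabulary."""
    return [1.0 if value == member else 0.0 for member in vocabulary]


def time_features(day):
    """Split a date into year, month, day plus derived features."""
    quarter = (day.month - 1) // 3 + 1
    iso_week = day.isocalendar()[1]           # ISO 8601 week number
    is_weekend = 1.0 if day.weekday() >= 5 else 0.0  # Saturday=5, Sunday=6
    return [float(day.year), float(day.month), float(day.day),
            float(quarter), float(iso_week), is_weekend]


def encode(category, region, day, categories, regions):
    """Concatenate one-hot category, one-hot region and time features."""
    return one_hot(category, categories) + one_hot(region, regions) \
        + time_features(day)


categories = ["tv", "phone"]
regions = ["Beijing", "Shanghai"]
features = encode("tv", "Shanghai", date(2023, 9, 23), categories, regions)
```

With large vocabularies the one-hot part grows accordingly, which is the situation in which the alternative encodings mentioned above become attractive.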
2) Model selection and parameter setting:
In the above scenario, the sales of each commodity may be affected by time (e.g., the day of the year, the day of the week or the time of day). Different commodity categories and regions may also produce different sales patterns. A model that can capture temporal dependencies and handle categorical features, such as an LSTM, may therefore be used. Hyperparameters of the LSTM, such as the number of hidden units, the learning rate and the batch size, can be tuned by cross-validation, grid search and similar methods.
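To make the role of the hidden units concrete, the following sketch implements the forward pass of a single standard LSTM cell in plain Python. It uses the textbook gate equations, not an architecture taken from the source, and the all-zero parameters are placeholders for weights that training would supply; in practice a deep learning framework would be used instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vector):
    """Compute W @ vector + b for list-based matrices."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a standard LSTM cell."""
    z = h_prev + x  # concatenate previous hidden state and current input
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]    # input gate
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]    # forget gate
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]    # output gate
    g = [math.tanh(a) for a in affine(params["W_g"], params["b_g"], z)]  # candidate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def zero_params(input_size, hidden_size):
    """Placeholder parameters; real values would come from training."""
    width = hidden_size + input_size
    params = {}
    for gate in "ifog":
        params["W_" + gate] = [[0.0] * width for _ in range(hidden_size)]
        params["b_" + gate] = [0.0] * hidden_size
    return params


def run_sequence(xs, params, hidden_size):
    """Feed a sequence of feature vectors through the cell; return the last h."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in xs:
        h, c = lstm_step(x, h, c, params)
    return h


params = zero_params(input_size=2, hidden_size=1)
h, c = lstm_step([0.3, -0.7], [0.0], [1.0], params)
last_h = run_sequence([[0.3, -0.7], [0.1, 0.2]], params, hidden_size=1)
```

The number of hidden units corresponds to `hidden_size` here; a final linear layer mapping the last hidden state to a predicted sales value is omitted for brevity.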
3) Training and prediction:
The processed feature data and the corresponding sales labels are input into the model for training. Future sales are then predicted using the trained model. A threshold, e.g., 0.1, may be set: when the difference between the predicted value and 0 is smaller than the threshold, the data point is assigned to the sparse region; otherwise it is assigned to the non-sparse region.
4) Post-processing and optimization:
When querying or calculating, the data of the non-sparse region can be processed first, which improves data processing efficiency and reduces the influence of mispredictions. As business data is continuously updated, the model needs to be retrained and optimized periodically with new data, for example by incremental learning.
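The idea of incremental learning can be illustrated with a deliberately simple stand-in. The sketch below updates a linear regressor with per-sample gradient steps on each newly arrived batch instead of retraining from scratch; the linear model replaces the LSTM purely for brevity, and the function name, learning rate and data are hypothetical.

```python
def sgd_update(weights, bias, batch, learning_rate=0.01):
    """Apply one gradient step per sample, minimising half the squared error.

    `batch` is an iterable of (features, target) pairs from newly arrived data.
    """
    for features, target in batch:
        prediction = sum(w * x for w, x in zip(weights, features)) + bias
        error = prediction - target
        # Gradient of 0.5 * error**2 with respect to each weight is error * x.
        weights = [w - learning_rate * error * x
                   for w, x in zip(weights, features)]
        bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, features):
    return sum(w * x for w, x in zip(weights, features)) + bias


# Start from an untrained model and absorb one new observation.
weights, bias = [0.0], 0.0
weights, bias = sgd_update(weights, bias, [([1.0], 2.0)], learning_rate=0.5)
updated_prediction = predict(weights, bias, [1.0])
```

The same pattern applies to a neural model: the existing parameters are kept and only nudged by the new batch, so the sparse and non-sparse regions can be refreshed without a full retraining run.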
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims (8)

1. A data control method based on multidimensional database query, characterized by comprising the following steps:
S1: converting each data point of the multidimensional database used for training into a feature vector, wherein the feature vector comprises dimension values representing the position of the data point and a metric value representing the measure of the data point, and for a missing data point the metric value is represented by a single predefined special value;
S2: training a deep learning model with the dimension values of each data point as input and the metric value of the data point as the label, so that the deep learning model can predict the metric value from the dimension values;
S3: inputting the multidimensional database to be queried into the trained deep learning model; if the difference between the predicted metric value of a data point and the predefined special value set in S1 is smaller than a preset threshold, assigning the data point to a sparse region, and assigning all other data points to a non-sparse region; wherein the multidimensional database to be queried and the multidimensional database used for training have the same dimensions and the same metric value meaning;
S4: when a data query or calculation is executed on the multidimensional database of step S3, preferentially querying or calculating the data of the non-sparse region segmented in step S3.
2. The method of claim 1, wherein the deep learning model is selected as an LSTM model.
3. The method for data control based on multi-dimensional database query according to claim 1, wherein the multi-dimensional database for training is composed of partial data extracted from the multi-dimensional database to be queried.
4. A data control method based on multidimensional database queries according to claim 3, characterized in that the multidimensional database for training consists of data in the multidimensional database to be queried over a certain period of time.
5. A data control system based on multidimensional database query, characterized in that the system comprises:
a data conversion module, configured to convert each data point of the multidimensional database used for training into a feature vector, wherein the feature vector comprises dimension values representing the position of the data point and a metric value representing the measure of the data point, and for a missing data point the metric value is represented by a single predefined special value;
a deep learning module, configured to train a deep learning model with the dimension values of each data point as input and the metric value of the data point as the label, so that the deep learning model can predict the metric value from the dimension values;
a data segmentation module, configured to input the multidimensional database to be queried into the trained deep learning model, to assign a data point to a sparse region if the difference between its predicted metric value and the predefined special value set in the data conversion module is smaller than a preset threshold, and to assign all other data points to a non-sparse region; wherein the multidimensional database to be queried and the multidimensional database used for training have the same dimensions and the same metric value meaning;
a data query module, configured to preferentially query or calculate the data of the non-sparse region segmented by the data segmentation module when a data query or calculation is executed on the multidimensional database to be queried.
6. The multidimensional database query-based data control system of claim 5, wherein the deep learning model used by the deep learning module is an LSTM model.
7. The multidimensional database query-based data control system of claim 5, wherein the multidimensional database for training is comprised of portions of data extracted from the multidimensional database to be queried.
8. The multidimensional database query-based data control system of claim 5, wherein the multidimensional database for training consists of data in the multidimensional database to be queried over a period of time.
CN202311238071.0A 2023-09-25 2023-09-25 Data control system and method based on multidimensional database query Pending CN116991901A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311238071.0A CN116991901A (en) 2023-09-25 2023-09-25 Data control system and method based on multidimensional database query

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311238071.0A CN116991901A (en) 2023-09-25 2023-09-25 Data control system and method based on multidimensional database query

Publications (1)

Publication Number Publication Date
CN116991901A true CN116991901A (en) 2023-11-03

Family

ID=88523471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311238071.0A Pending CN116991901A (en) 2023-09-25 2023-09-25 Data control system and method based on multidimensional database query

Country Status (1)

Country Link
CN (1) CN116991901A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120179698A1 (en) * 2011-01-12 2012-07-12 International Business Machines Corporation Multiple Sparse Index Intelligent Table Organization
CN103258054A (en) * 2013-05-31 2013-08-21 闫朝升 Method and device for processing data
CN104850649A (en) * 2015-05-29 2015-08-19 苏州大学张家港工业技术研究院 Method and system for sampling points of interest on map
CN107315751A (en) * 2016-04-26 2017-11-03 北京京东尚科信息技术有限公司 Multidimensional data query method and device
CN111046630A (en) * 2019-12-06 2020-04-21 中国科学院计算技术研究所 Syntax tree extraction method of JSON data
CN112990442A (en) * 2021-04-21 2021-06-18 北京瑞莱智慧科技有限公司 Data determination method and device based on spatial position and electronic equipment
CN113779160A (en) * 2021-08-24 2021-12-10 北京元年科技股份有限公司 Method and device for acquiring data of multidimensional database
CN116226202A (en) * 2023-03-14 2023-06-06 金蝶软件(中国)有限公司 Multidimensional database query method, multidimensional database query device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
用户1148526: "多维数据库概述之一", pages 1 - 7, Retrieved from the Internet <URL:http://cloud.tencent.com/developer/article/1433031> *

Similar Documents

Publication Publication Date Title
CN107515898B (en) Tire enterprise sales prediction method based on data diversity and task diversity
Chow et al. Design of a knowledge-based logistics strategy system
CN109685583B (en) Supply chain demand prediction method based on big data
CN104077407B (en) A kind of intelligent data search system and method
CN110858219A (en) Logistics object information processing method and device and computer system
CN110956278A (en) Method and system for retraining machine learning models
CN116187524A (en) Supply chain analysis model comparison method and device based on machine learning
CN115269958A (en) Internet reliability data information acquisition and analysis system
CN112001539B (en) High-precision passenger transport prediction method and passenger transport prediction system
US7899776B2 (en) Explaining changes in measures thru data mining
CN114418602A (en) Online retailer product inventory decision-making method and system based on demand prediction
CN116991901A (en) Data control system and method based on multidimensional database query
CN101334793B (en) Method for automatic recognition for dependency relationship of demand
Wan et al. Similarity-based sales forecasting using improved ConvLSTM and prophet
He Rain prediction in Australia with active learning algorithm
CN116304726A (en) Material similarity analysis method based on semantic library and knowledge graph
CN113342844A (en) Industrial intelligent search system
CN115617790A (en) Data warehouse creation method, electronic device and storage medium
Liu et al. Inventory Management of Automobile After-sales Parts Based on Data Mining
Langer et al. Gideon-TS: Efficient Exploration and Labeling of Multivariate Industrial Sensor Data.
Fan et al. An agent model for incremental rough set-based rule induction: a big data analysis in sales promotion
CN116777508B (en) Medical supply analysis management system and method based on big data
CN114185957B (en) Intelligent mining method suitable for power big data service requirements
CN117807377B (en) Multidimensional logistics data mining and predicting method and system
JP2019168820A (en) Data analysis support system and data analysis support method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination