CN109684419B - Big data-based data cube processing method and device and electronic equipment - Google Patents

Big data-based data cube processing method and device and electronic equipment

Info

Publication number
CN109684419B
Authority
CN
China
Prior art keywords
data cube
dimension
data
target data
original data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn - After Issue
Application number
CN201811547105.3A
Other languages
Chinese (zh)
Other versions
CN109684419A (en)
Inventor
董子平
李军杰
张鹏飞
杨保
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Geo Vision Tech Co ltd
Original Assignee
Beijing Geo Vision Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Geo Vision Tech Co ltd filed Critical Beijing Geo Vision Tech Co ltd
Priority to CN201811547105.3A
Publication of CN109684419A
Application granted
Publication of CN109684419B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention disclose a big data-based data cube processing method and apparatus and an electronic device. They relate to the field of big data processing and are beneficial to improving the speed of data processing on data cubes. The processing method comprises the following steps: generating each original data cube according to the dimensions and measures selected by each user and storing metadata information of each original data cube; classifying the original data cubes to obtain original data cube sets of different categories; normalizing each original data cube set to obtain a corresponding target data cube, and storing the metadata information of the target data cube set and the mapping relations among the elements of the target data cube set and the original data cube set in a database; and calculating, and storing in compressed form, all data from 1 dimension to N dimensions of the target data cube. The embodiments of the invention are mainly used for data cubes based on the big data cluster HBase.

Description

Big data-based data cube processing method and device and electronic equipment
Technical Field
The present invention relates to the field of big data processing, and in particular, to a method and an apparatus for processing a data cube based on big data, and an electronic device.
Background
A data cube (DataCube) is a type of multidimensional matrix that allows users to explore and analyze data sets from multiple angles. A data cube built by an OLAP (online analytical processing) system is composed of N dimensions and contains the cell (sub-cube block) values that satisfy the conditions; the cells hold the data to be analyzed, which are called measure values. A measure value is the data to be analyzed and displayed, that is, the indicator, on which multidimensional analysis can be performed.
The data cube is a technical architecture for data analysis and indexing; aimed at big data processing edge devices, it can index metadata in real time on any combination of keywords. After the metadata have been analyzed through the data cube, data query and retrieval can be greatly accelerated.
Traditional data cube technology is only suitable for implementations with small to medium data volumes; when the data reach large volumes (hundreds of millions, billions or even trillions of records), computation becomes slow or fails to produce a result at all.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a method, an apparatus, and an electronic device for processing a data cube based on big data, which are beneficial to improving the speed of data processing on the data cube.
In a first aspect, an embodiment of the present invention provides a method for processing a data cube based on big data, including: providing a plurality of dimensions and measures for selection by a user; generating each original data cube according to the dimensions and measures selected by each user, and storing metadata information of each original data cube; classifying each original data cube to obtain original data cube sets of different categories; normalizing each original data cube in each original data cube set to obtain target data cubes respectively, wherein each target data cube corresponds to one original data cube set, and all target data cubes form a target data cube set; storing metadata information of the target data cube set and the mapping relations among elements of the target data cube set and the original data cube set into a database; calculating all data from 1 dimension to N dimensions of the target data cube; and compressing and storing all data from 1 dimension to N dimensions of the target data cube.
In one implementation manner of the first embodiment of the present invention, the normalizing each original data cube in each original data cube set to obtain a target data cube respectively includes: calculating the union of the dimensions and the union of the measures for each original data cube in the original data cube set, and generating the target data cube according to the calculated union of the dimensions and the union of the measures.
In one implementation of the first embodiment of the present invention, storing, in a database, metadata information of the target data cube set and the mapping relations among elements of the target data cube set and the original data cube set includes: performing de-duplication on each dimension of the target data cube and putting the de-duplicated values into an array, wherein the array subscript is used as the key value of the corresponding generated dictionary tree; combining and de-duplicating the dimensions of the target data cube, and mapping the generated dimension combinations onto the single dictionary tree formed in the previous step, finally forming a dictionary tree of the dimensions; and storing one copy of the dictionary tree in memory and one copy in a column key of MySQL, respectively.
In one implementation of the first embodiment of the present invention, the calculating all data of 1 dimension to N dimensions of the target data cube includes: calculating the (N-1)-dimension data summaries of the dictionary tree by using an MR framework, calculating the (N-2)-dimension data from the (N-1)-dimension data, and repeating the iterative calculation until the 1-dimension data are obtained; the compressing and storing all data of 1 dimension to N dimensions of the target data cube includes: dividing the data into a plurality of parts by row count or data size, compressing each part of the data by using the Snappy compression algorithm, and storing the compressed data.
In one implementation manner of the first embodiment of the present invention, the processing method further includes: providing a query portal for a user to select an original data cube to view and to input query conditions; searching corresponding dictionary information according to dimension information of the original data cube queried by the user, scanning and filtering rowkeys in HBase according to key values in the dictionary, and summarizing and calculating values from the scanned data; and displaying the calculated values.
In a second aspect, a second embodiment of the present invention provides a data cube processing apparatus based on big data, including: a user selection module for providing a plurality of dimensions and metrics for selection by a user; the response module is used for generating each original data cube according to the dimension and the measure selected by each user; a first storage module storing metadata information of each original data cube; the classification module classifies each original data cube to obtain original data cube sets of different categories; the normalization module is used for carrying out normalization processing on each original data cube in each original data cube set to respectively obtain target data cubes, each target data cube corresponds to one original data cube set, and all target data cubes form a target data cube set; the second storage module is used for storing the metadata information of the target data cube set and the mapping relations among the elements of the target data cube set and the original data cube set into a database; a data calculation module for calculating all data from 1 dimension to N dimensions of the target data cube; and the third storage module is used for compressing and storing all data from 1 dimension to N dimensions of the target data cube.
In one implementation manner of the second embodiment of the present invention, the normalizing module normalizing each original data cube in each original data cube set to obtain a target data cube respectively includes: calculating the union of the dimensions and the union of the metrics for each original data cube in the original data cube set, and generating the target data cube according to the calculated union of the dimensions and the union of the metrics.
In one implementation manner of the second embodiment of the present invention, the storing, by the second storage module, of the metadata information of the target data cube set and the mapping relations among the elements of the target data cube set and the original data cube set into the database includes: performing de-duplication on each dimension of the target data cube, putting the de-duplicated values into an array, using the array subscripts as the key values of the corresponding generated dictionary tree, combining and de-duplicating the dimensions of the target data cube, mapping the generated dimension combinations onto the single dictionary tree formed in the previous step, and finally forming the dictionary tree of the dimensions, wherein one copy of the dictionary tree is stored in a memory and one copy in a column key of MySQL, respectively; the data calculation module calculating all data of 1 dimension to N dimensions of the target data cube includes: calculating the (N-1)-dimension data summaries of the dictionary tree by using an MR framework, calculating the (N-2)-dimension data from the (N-1)-dimension data, and repeating the iterative calculation until the 1-dimension data are obtained; the compressing and storing all data of 1 dimension to N dimensions of the target data cube includes: dividing the data into a plurality of parts by row count or data size, compressing each part of the data by using the Snappy compression algorithm, and storing the compressed data.
In one implementation manner of the second embodiment of the present invention, the data processing apparatus further includes a user query module, a search module and a display module: the user query module is used for providing a query entry for a user to select an original data cube to view and to input query conditions; the search module is used for searching corresponding dictionary information according to dimension information of the original data cube queried by the user, scanning and filtering rowkeys in HBase according to key values in the dictionary, and summarizing and calculating values from the scanned data; and the display module is used for displaying the calculated values.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space surrounded by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program code; and the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing the following operations: providing a plurality of dimensions and measures for selection by a user; generating each original data cube according to the dimensions and measures selected by each user, and storing metadata information of each original data cube; classifying each original data cube to obtain original data cube sets of different categories; normalizing each original data cube in each original data cube set to obtain target data cubes respectively, wherein each target data cube corresponds to one original data cube set, and all target data cubes form a target data cube set; storing metadata information of the target data cube set and the mapping relations among elements of the target data cube set and the original data cube set into a database; calculating all data from 1 dimension to N dimensions of the target data cube; and compressing and storing all data from 1 dimension to N dimensions of the target data cube.
According to the data cube processing method and apparatus based on big data and the electronic device, all original data cubes selected and generated by users are classified to obtain original data cube sets of different categories; after normalization processing, each original data cube set yields a corresponding target data cube, and all target data cubes form a target data cube set; the metadata information of the target data cube set and the mapping relations among the elements of the target data cube set and the original data cube set are stored in a database. In this way, the number of target data cubes is significantly smaller than the number of original data cubes, which reduces the consumption of overall computing, storage and network resources and improves the overall computation speed. This helps realize data processing of data cubes based on big data and improves the speed of data processing on data cubes.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required for the embodiments or for the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a method for processing a big data based data cube according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of an electronic device according to an embodiment of the present invention.
Detailed Description
It should be understood that the described embodiments are merely some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
First aspect:
FIG. 1 is a flowchart of a method for processing a big data-based data cube according to an embodiment of the present invention. As shown in FIG. 1, the method of this embodiment may include:
step 101, providing a plurality of dimensions and metrics for selection by a user;
in this embodiment, a data cube (dataCube) is a type of multidimensional matrix that allows a user to explore and analyze a data set from multiple angles. An OLAP (online analytical processing system) built data cube is composed of N dimensions, and contains cell (subcube block) values satisfying the conditions, and the cells contain data to be analyzed, which is called a metric value. The metric value is the data to be analyzed and displayed, namely the index, and multidimensional analysis can be performed on the data. The present embodiment provides a wide variety of Dimension and Measure, and the user can select the Dimension and flexibly customize the Measure calculation rule at will.
Step 102, generating each original data cube according to the dimension and the measure selected by each user, and storing metadata information of each original data cube;
in this embodiment, each user selects dimensions and measures, and an original data cube is generated correspondingly; each user can define one or more original data cubes. Each original data cube is generated according to the dimensions and measures selected by each user, and the metadata information of each original data cube is stored in the MySQL relational database.
Step 103, classifying each original data cube to obtain an original data cube set of different categories;
in this embodiment, the original data cubes may be classified according to the categories of the different users or according to the different dimensions and metrics the users are interested in, to obtain original data cube sets of different categories. For example, according to the business types of the users defining the original data cubes, the original data cubes are divided into original data cube sets of the user class, the order class and the website-behavior class.
It will be appreciated that the specific method of classifying the original data cubes may be chosen as needed, and this embodiment does not limit it.
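As an illustration of step 103, the sketch below groups user-defined cube definitions by a category label (here a business type). The CubeDef record, the category names and the field layout are assumptions made for the example, not structures taken from the patent.

```java
// Hypothetical sketch of step 103: grouping user-defined cube definitions by
// business category. CubeDef and the category labels are illustrative only.
import java.util.*;

public class CubeClassification {
    // A minimal stand-in for an original data cube definition.
    record CubeDef(String name, String category, Set<String> dims, Set<String> measures) {}

    public static void main(String[] args) {
        List<CubeDef> cubes = List.of(
            new CubeDef("c1", "order", Set.of("date", "region"),      Set.of("amount")),
            new CubeDef("c2", "order", Set.of("date", "product"),     Set.of("amount", "count")),
            new CubeDef("c3", "user",  Set.of("age_group", "region"), Set.of("active_days")));

        // Bucket each original cube into a category-keyed set.
        Map<String, List<CubeDef>> byCategory = new HashMap<>();
        for (CubeDef c : cubes) {
            byCategory.computeIfAbsent(c.category(), k -> new ArrayList<>()).add(c);
        }
        byCategory.forEach((cat, set) -> System.out.println(cat + " -> " + set.size() + " cube(s)"));
    }
}
```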
Step 104, normalizing each original data cube in each original data cube set to obtain target data cubes respectively, wherein each target data cube corresponds to one original data cube set, and all target data cubes form a target data cube set;
in this embodiment, after normalization processing, each original data cube set yields a corresponding target data cube. The number of target data cubes is significantly smaller than the number of original data cubes, so the consumption of overall computing, storage and network resources is reduced and the overall computation speed is improved. This helps realize data processing of data cubes based on big data and improves the speed of data processing on data cubes.
As an optional implementation manner of this embodiment, the normalizing each original data cube in each original data cube set to obtain a target data cube respectively includes: calculating the union of the dimensions and the union of the metrics for each original data cube in the original data cube set, and generating the target data cube according to the calculated union of the dimensions and the union of the metrics.
In this embodiment, the target data cube includes all dimensions and metrics in the corresponding original data cube set, so as to implement metadata retention in the process of normalizing the original data cube set.
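A minimal sketch of the normalization in step 104, assuming the same illustrative CubeDef layout as above: the target data cube simply takes the union of the dimensions and the union of the measures of the original data cubes in one category set.

```java
// A minimal sketch of step 104 (normalization): the target cube's dimensions and
// measures are the unions over one category's original cubes. Names are illustrative.
import java.util.*;

public class CubeNormalization {
    record CubeDef(String name, Set<String> dims, Set<String> measures) {}
    record TargetCube(Set<String> dims, Set<String> measures) {}

    static TargetCube normalize(List<CubeDef> categorySet) {
        Set<String> dimUnion = new TreeSet<>();
        Set<String> measureUnion = new TreeSet<>();
        for (CubeDef c : categorySet) {
            dimUnion.addAll(c.dims());          // union of dimensions
            measureUnion.addAll(c.measures());  // union of measures
        }
        return new TargetCube(dimUnion, measureUnion);
    }

    public static void main(String[] args) {
        List<CubeDef> orderCubes = List.of(
            new CubeDef("c1", Set.of("date", "region"),  Set.of("amount")),
            new CubeDef("c2", Set.of("date", "product"), Set.of("amount", "count")));
        TargetCube t = normalize(orderCubes);
        System.out.println("dims=" + t.dims() + " measures=" + t.measures());
    }
}
```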
Step 105, storing metadata information of the target data cube set, mapping relations among elements of the target data cube set and the original data cube set into a database;
in this embodiment, when the data of an original data cube is searched for, it can be looked up in the target data cube according to the mapping relations among the elements of the target data cube set and the original data cube set.
In an optional implementation manner of this embodiment, storing, in the database, metadata information of the target data cube set and the mapping relations among elements of the target data cube set and the original data cube set includes: performing de-duplication on each dimension of the target data cube and putting the de-duplicated values into an array, wherein the array subscript is used as the key value of the corresponding generated dictionary tree; combining and de-duplicating the dimensions of the target data cube, and mapping the generated dimension combinations onto the single dictionary tree formed in the previous step, finally forming a dictionary tree of the dimensions; and storing one copy of the dictionary tree in memory and one copy in a column key of MySQL, respectively.
In this embodiment, each dimension of the target data cube is first de-duplicated and the de-duplicated values are put into an array, with the array subscript used as the key value of the corresponding generated dictionary tree. The dimensions of the target data cube are then combined and de-duplicated: first, every N-1 of the N dimensions are combined and de-duplicated and the generated dimension combinations are mapped onto the single dictionary tree formed in the previous step; then every N-2 of the N dimensions are combined and de-duplicated and mapped in the same way; and so on, until the dictionary tree of the dimensions is finally formed. When the data of an original data cube are queried by dimension, the dictionary tree uses the common prefixes of the strings to reduce query time, minimizes unnecessary string comparisons and improves query efficiency.
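The following is a minimal sketch of such a dictionary tree (trie) keyed by array subscripts: each distinct dimension value is replaced by its index in the de-duplicated array, and dimension combinations are inserted as index paths. The Node layout and the example dimension values are illustrative assumptions.

```java
// A minimal sketch of the dictionary tree in step 105: de-duplicated dimension
// values live in arrays, the array subscript is the trie key, and dimension
// combinations are inserted as index paths. Layout is an illustrative assumption.
import java.util.*;

public class DimensionTrie {
    static class Node {
        Map<Integer, Node> children = new HashMap<>();
        boolean terminal; // marks a complete dimension combination
    }

    final Node root = new Node();

    void insert(List<Integer> indexPath) {
        Node cur = root;
        for (int key : indexPath) {
            cur = cur.children.computeIfAbsent(key, k -> new Node());
        }
        cur.terminal = true;
    }

    boolean contains(List<Integer> indexPath) {
        Node cur = root;
        for (int key : indexPath) {
            cur = cur.children.get(key);
            if (cur == null) return false;
        }
        return cur.terminal;
    }

    public static void main(String[] args) {
        // De-duplicated values of two dimensions; the subscript is the trie key.
        String[] regions = {"north", "south", "east"};   // indexes 0, 1, 2
        String[] products = {"book", "pen"};             // indexes 0, 1

        DimensionTrie trie = new DimensionTrie();
        int regionIdx = Arrays.asList(regions).indexOf("south");   // 1
        int productIdx = Arrays.asList(products).indexOf("pen");   // 1
        trie.insert(List.of(regionIdx, productIdx));

        System.out.println(trie.contains(List.of(1, 1)));  // true
        System.out.println(trie.contains(List.of(0, 1)));  // false
    }
}
```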
In this embodiment, the algorithm for computing the distinct values of a dimension is based on the cardinality-statistics HyperLogLog algorithm, with a secondary optimization and improvement for this scenario: the hashed input stream is divided into m sub-streams, an observable value is kept for each sub-stream, and a counter is generated from the average of the additional observed values; a data set with cardinality Nmax is estimated using only loglog(Nmax) + O(1) bits, and the data can be computed to the precision specified by the user. Using only about 1.5 KB of space, the method can de-duplicate billions of distinct data elements while keeping the precision within 98%, and above 99.2% in most scenarios.
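Below is a simplified HyperLogLog-style sketch of the distinct-count idea described above: the hashed stream is split into m sub-streams, each register keeps the largest observed rank, and the registers are combined through a harmonic mean. It is a generic re-implementation for illustration, not the patent's secondarily optimized variant, and it omits the small- and large-range corrections of the full algorithm.

```java
// A simplified HyperLogLog-style sketch: m = 2^b registers, each keeping the
// maximum observed rank; the estimate uses the standard harmonic-mean formula.
// Illustrative only; range corrections are omitted for brevity.
import java.util.stream.IntStream;

public class HllSketch {
    final int b = 10;                 // 2^10 = 1024 registers
    final int m = 1 << b;
    final int[] registers = new int[m];

    void add(String value) {
        long h = fnv1a64(value);
        int idx = (int) (h & (m - 1));                    // low b bits pick the sub-stream
        long rest = h >>> b;
        int rank = Long.numberOfTrailingZeros(rest) + 1;  // position of the first 1-bit
        registers[idx] = Math.max(registers[idx], rank);
    }

    double estimate() {
        double alpha = 0.7213 / (1 + 1.079 / m);          // standard bias-correction constant
        double sum = 0;
        for (int r : registers) sum += Math.pow(2, -r);
        return alpha * m * m / sum;
    }

    static long fnv1a64(String s) {                       // simple 64-bit hash for the demo
        long h = 0xcbf29ce484222325L;
        for (byte bt : s.getBytes()) { h ^= bt & 0xff; h *= 0x100000001b3L; }
        return h;
    }

    public static void main(String[] args) {
        HllSketch hll = new HllSketch();
        IntStream.range(0, 1_000_000).forEach(i -> hll.add("user-" + i));
        System.out.printf("estimated distinct values: %.0f%n", hll.estimate());
    }
}
```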
One copy of the dictionary tree is stored in memory and is used to calculate all data from 1 dimension to N dimensions of the target data cube; the other copy is stored in a column key of MySQL and is used for subsequent user queries of the original data cubes, saving the time users would spend redefining original data cubes and the time the system would spend rebuilding the original and target data cubes.
Step 106, calculating all data from 1 dimension to N dimensions of the target data cube;
in an alternative implementation of this embodiment, the calculating all data of 1 dimension to N dimensions of the target data cube includes: calculating the (N-1)-dimension data summaries of the dictionary tree by using the MR (Map/Reduce) framework, calculating the (N-2)-dimension data from the (N-1)-dimension data, and repeating the iterative calculation until the 1-dimension data are obtained;
in this embodiment, map/Reduce is a distributed computing model for large-scale data processing, map/Reduce is a programming model (programming model), and is a related implementation for processing and generating large-scale data sets (processing and generating large data sets). The user defines a map function to process a key/value pair to generate a batch of intermediate key/value pairs, and a reduce function to merge all these intermediate values with the same key. The computation of the N-1 to 1-dimensional data of the dictionary tree using the MR framework in this embodiment is also based on this principle, and the time cost of the entire dimension reduction operation is log (N).
Optionally, when the data are calculated in this step, input dirty data are processed synchronously; the processing types include filtering (discarding data that do not meet the requirements) and converting (replacing data that do not meet the requirements with a default value), and users can add processing rules.
In this embodiment, the rules for judging whether data are dirty mainly include: integer fields must contain integers, otherwise the data are dirty; floating-point fields must contain decimal numbers, otherwise the data are dirty; fields with a limited range must contain values within the specified range, for example, gender can only be male or female; and date-and-time fields must contain a date and time, otherwise the data are dirty.
This embodiment provides diversified solutions for data fault tolerance: during data operations it automatically detects whether data are dirty, intelligently converts dirty data when they are detected, and lets users flexibly add processing rules.
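A minimal sketch of the dirty-data rules listed above, assuming string-valued fields: integer and floating-point fields must parse, bounded fields must belong to their allowed value set, and date fields must parse against an ISO date pattern; the 'convert' strategy substitutes a default instead of discarding the row. Rule wiring and defaults are illustrative.

```java
// Minimal dirty-data checks: integers/floats must parse, bounded fields must be
// in the allowed set, dates must parse; "convert" replaces dirty values with a default.
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Set;

public class DirtyDataRules {
    static boolean isInteger(String v) { try { Long.parseLong(v);    return true; } catch (NumberFormatException e) { return false; } }
    static boolean isFloat(String v)   { try { Double.parseDouble(v); return true; } catch (NumberFormatException e) { return false; } }
    static boolean inRange(String v, Set<String> allowed) { return allowed.contains(v); }
    static boolean isDate(String v)    { try { LocalDate.parse(v);   return true; } catch (DateTimeParseException e) { return false; } }

    // "Convert" strategy: replace a dirty value with a default instead of dropping the row.
    static String convertOrDefault(String v, java.util.function.Predicate<String> rule, String dflt) {
        return rule.test(v) ? v : dflt;
    }

    public static void main(String[] args) {
        System.out.println(isInteger("42") + " " + isInteger("42.5"));              // true false
        System.out.println(inRange("male", Set.of("male", "female")));              // true
        System.out.println(isDate("2018-12-17") + " " + isDate("yesterday"));       // true false
        System.out.println(convertOrDefault("abc", DirtyDataRules::isInteger, "0")); // 0
    }
}
```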
Step 107, compressing and storing all data from 1 dimension to N dimensions of the target data cube.
In this embodiment, optionally, the compressing and storing all data from 1 dimension to N dimensions of the target data cube includes: splitting the data into multiple parts by row count or data size. For example, if the data total one million rows and 1280 MB in size, two splitting schemes are possible: split the data into ten parts of one hundred thousand rows each and compress each part independently, obtaining ten pieces of compressed data that each cover one hundred thousand rows; or split the data into ten parts of 128 MB each and compress each part independently, obtaining ten pieces of compressed data that each cover 128 MB of original data (the number of rows in each part may be fewer or more than one hundred thousand). Each part is compressed with the Snappy compression algorithm, and the compressed data are stored to HBase.
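The sketch below illustrates the split-then-compress idea of step 107: rows are split into parts of a fixed row count and each part is compressed independently. The patent uses the Snappy algorithm; GZIP from java.util.zip is used here only as a stand-in so the example runs with the JDK alone, and a Snappy codec would be substituted in a real deployment.

```java
// Split-then-compress sketch for step 107: fixed row-count parts, each compressed
// independently. GZIP stands in for Snappy purely so the demo runs with the JDK.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class SplitAndCompress {
    static byte[] compress(List<String> part) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(String.join("\n", part).getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) rows.add("2018-12,north,book," + (i % 100));

        int rowsPerPart = 100_000;                       // split by row count
        for (int start = 0; start < rows.size(); start += rowsPerPart) {
            List<String> part = rows.subList(start, Math.min(start + rowsPerPart, rows.size()));
            byte[] packed = compress(part);              // each part compressed independently
            System.out.printf("part %d: %d rows -> %d bytes%n",
                    start / rowsPerPart, part.size(), packed.length);
        }
    }
}
```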
In this embodiment, a storage scheme based on hybrid 'row + column' storage is implemented, and the row data are compressed with the Snappy algorithm, which references the LZSS coding scheme. Compared with conventional row storage, performance can be improved three to four times, and in some data scenarios the compressed data is as small as 20.6% of the original data.
Compared with simple row storage, the hybrid 'row + column' storage compression developed in this scheme benefits from the much smaller range of values within a single field and achieves a higher compression ratio; in some analysis subjects 79.4% of the storage space can be saved, greatly reducing storage usage.
Compared with simple column storage, the hybrid 'row + column' storage compression developed in this scheme effectively solves the problem of rapidly presenting data by range under a primary key (single or composite primary key): the target range can be located at the millisecond level, and data can be filtered and extracted rapidly.
In an optional implementation manner of the embodiment of the present invention, the data processing method further includes:
step 201, providing a query entry for a user to select an original data cube to be checked and input query conditions;
in this embodiment, the query entry lists the user-defined original data cubes for the user to select; when the user wants to view a previously defined original data cube, the user only needs to select it through the query entry. In addition, the query entry also provides a plurality of dimensions and metrics for the user to define a new original data cube; when the user wants to query the information of a new original data cube, its dimensions and metrics are selected through the query entry.
Step 202, searching corresponding dictionary information according to dimension information of the original data cube queried by the user, scanning and filtering rowkeys in HBase according to key values in the dictionary, and summarizing and calculating values from the scanned data;
in this embodiment, the user selects the original data cube to view through the query entry; the system locates the target data cube according to the dictionary information and the original data cube being queried, determines the corresponding dimension and measure fields in the target data cube from the dimension and measure fields contained in the original data cube, and then queries the dictionary information again with the target data cube and the dimension and measure information to determine the table information of the data stored in HBase. After the table information in HBase is determined, the data found in the table are returned to the user. When data in the HBase table are queried, the query is performed by rowkey, and the rowkey has been specially optimized for the business scenario, which improves query efficiency.
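A sketch of that query path, assuming a live HBase cluster: the dictionary lookup yields the key values that prefix the rowkeys of interest, the table is scanned with that prefix, and the measure column is summed client-side. The table, column family and qualifier names, the rowkey layout and the long-encoded measure values are all assumptions for the example.

```java
// Sketch of step 202's scan-and-summarize path against HBase. Table/family/qualifier
// names, the rowkey prefix layout and long-encoded values are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class CubeQuery {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("target_cube_2d"))) {

            // Rowkey prefix built from the dictionary key values of the queried dimensions,
            // e.g. "<dimComboId>|<dateIdx>|<regionIdx>" (assumed layout).
            Scan scan = new Scan();
            scan.setRowPrefixFilter(Bytes.toBytes("3|20181217|"));

            long total = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    byte[] v = r.getValue(Bytes.toBytes("m"), Bytes.toBytes("amount"));
                    if (v != null) total += Bytes.toLong(v);   // summarize the measure
                }
            }
            System.out.println("amount total = " + total);
        }
    }
}
```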
Step 203, displaying the calculated value.
In this embodiment, a real-time visual presentation is provided.
In a second aspect, a second embodiment of the present invention provides a data cube processing apparatus based on big data, including: a user selection module for providing a plurality of dimensions and metrics for selection by a user; the response module is used for generating each original data cube according to the dimension and the measure selected by each user; a first storage module storing metadata information of each original data cube; the classification module classifies each original data cube to obtain original data cube sets of different categories; the normalization module is used for carrying out normalization processing on each original data cube in each original data cube set to respectively obtain target data cubes, each target data cube corresponds to one original data cube set, and all target data cubes form a target data cube set; the second storage module is used for storing the metadata information of the target data cube set and the mapping relations among the elements of the target data cube set and the original data cube set into a database; a data calculation module for calculating all data from 1 dimension to N dimensions of the target data cube; and the third storage module is used for compressing and storing all data from 1 dimension to N dimensions of the target data cube.
In an optional implementation manner of the second embodiment of the present invention, the normalizing module normalizing each original data cube in each original data cube set to obtain a target data cube respectively includes: calculating the union of the dimensions and the union of the metrics for each original data cube in the original data cube set, and generating the target data cube according to the calculated union of the dimensions and the union of the metrics.
In an optional implementation manner of the second embodiment of the present invention, the storing, by the second storage module, of the metadata information of the target data cube set and the mapping relations among the elements of the target data cube set and the original data cube set into the database includes: performing de-duplication on each dimension of the target data cube, putting the de-duplicated values into an array, using the array subscripts as the key values of the corresponding generated dictionary tree, combining and de-duplicating the dimensions of the target data cube, mapping the generated dimension combinations onto the single dictionary tree formed in the previous step, and finally forming the dictionary tree of the dimensions, wherein one copy of the dictionary tree is stored in a memory and one copy in a column key of MySQL, respectively; the data calculation module calculating all data of 1 dimension to N dimensions of the target data cube includes: calculating the (N-1)-dimension data summaries of the dictionary tree by using an MR framework, calculating the (N-2)-dimension data from the (N-1)-dimension data, and repeating the iterative calculation until the 1-dimension data are obtained; the compressing and storing all data of 1 dimension to N dimensions of the target data cube includes: dividing the data into a plurality of parts by row count or data size, compressing each part of the data by using the Snappy compression algorithm, and storing the compressed data.
In an optional implementation manner of the second embodiment of the present invention, the data processing apparatus further includes a user query module, a search module and a display module: the user query module is used for providing a query entry for a user to select an original data cube to view and to input query conditions; the search module is used for searching corresponding dictionary information according to dimension information of the original data cube queried by the user, scanning and filtering rowkeys in HBase according to key values in the dictionary, and summarizing and calculating values from the scanned data; and the display module is used for displaying the calculated values.
The device of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and its implementation principle and technical effects are similar, and are not described here again.
In a third aspect, an electronic device is provided in a third embodiment of the present invention. FIG. 2 is a schematic structural diagram of an embodiment of the electronic device, which can implement the flow of the embodiment shown in FIG. 1. As shown in FIG. 2, the electronic device may include: a shell 31, a processor 32, a memory 33, a circuit board 34 and a power supply circuit 35, wherein the circuit board 34 is arranged in a space surrounded by the shell 31, and the processor 32 and the memory 33 are arranged on the circuit board 34; the power supply circuit 35 supplies power to each circuit or device of the electronic apparatus; the memory 33 stores executable program code; and the processor 32 executes a program corresponding to the executable program code by reading the executable program code stored in the memory 33, for performing a big data-based data cube processing method according to any of the foregoing embodiments.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (9)

1. A method for processing a big data-based data cube, comprising:
providing a plurality of dimensions and metrics for selection by a user;
generating each original data cube according to the dimension and the measure selected by each user, and storing metadata information of each original data cube;
classifying each original data cube to obtain an original data cube set of different categories;
carrying out normalization processing on each original data cube in each original data cube set to obtain target data cubes respectively, wherein the normalization processing comprises the steps of calculating the union of dimensions and the union of metrics for each original data cube in the original data cube set, generating target data cubes according to the calculated union of dimensions and the union of metrics, each target data cube corresponds to one original data cube set, and all target data cubes form a target data cube set;
storing metadata information of a target data cube set and mapping relations among elements of the target data cube set and the original data cube set into a database, wherein de-duplication is carried out on each dimension of the target data cube, the algorithm selected for calculating the non-duplicate values of a dimension is based on the cardinality-statistics HyperLogLog algorithm with a secondary optimization and improvement of the algorithm, m sub-character strings are formed by dividing the hashed input stream, an observable value is kept for each of the m sub-input streams, and a counter is generated by utilizing the average value of the additional observed values;
calculating all data from 1 dimension to N dimensions of the target data cube, wherein diversified solutions for data fault tolerance are provided, input dirty data are processed synchronously, and the processing types include filtering and converting;
and compressing and storing all data of 1 dimension to N dimensions of the target data cube, wherein the data are divided into a plurality of parts by row count or data size, so that storage compression based on a hybrid 'row + column' layout is realized.
2. The processing method according to claim 1, wherein storing the metadata information of the target data cube set, the mapping relationship between the respective elements of the target data cube set and the original data cube set in the database comprises:
performing de-duplication on each dimension of the target data cube, and putting the de-duplicated dimension into an array, wherein the subscript corresponding to the array is used as a key value corresponding to the generated dictionary tree;
combining and de-duplicating the dimensions of the target data cube, and mapping the generated dimension combinations onto the single dictionary tree formed in the previous step to finally form a dictionary tree of the dimensions;
and storing one copy of the dictionary tree in a memory and one copy in a column key of MySQL, respectively.
3. The processing method according to claim 2, characterized in that:
the computing all data of 1 dimension to N dimensions of the target data cube includes: calculating the (N-1)-dimension data summaries of the dictionary tree by using an MR framework, calculating the (N-2)-dimension data from the (N-1)-dimension data, and repeating the iterative calculation until the 1-dimension data are obtained;
the compressing and storing all data of 1 dimension to N dimensions of the target data cube includes: dividing the data into a plurality of parts by row count or data size, compressing each part of the data by using the Snappy compression algorithm, and storing the compressed data.
4. The processing method according to claim 3, further comprising:
providing a query portal for a user to select an original data cube to view and to input query conditions;
searching corresponding dictionary information according to dimension information of an original data cube queried by a user, scanning and filtering rowkeys in HBase according to key values in the dictionary, and summarizing and calculating values from the scanned data;
and displaying the calculated values.
5. A big data based data cube processing apparatus, comprising:
a user selection module for providing a plurality of dimensions and metrics for selection by a user;
the response module is used for generating each original data cube according to the dimension and the measure selected by each user;
a first storage module storing metadata information of each original data cube;
the classification module classifies each original data cube to obtain original data cube sets of different categories;
the normalization module is used for carrying out normalization processing on each original data cube in each original data cube set to respectively obtain target data cubes, wherein the normalization processing comprises the steps of calculating the union of dimensions and the union of metrics for each original data cube in the original data cube set, generating the target data cubes according to the calculated union of dimensions and the union of metrics, each target data cube corresponds to one original data cube set, and all target data cubes form the target data cube set;
the second storage module is used for storing the metadata information of the target data cube set and the mapping relations among the elements of the target data cube set and the original data cube set into a database, wherein de-duplication is carried out on each dimension of the target data cube, the algorithm selected for calculating the non-duplicate values of a dimension is based on the cardinality-statistics HyperLogLog algorithm with a secondary optimization and improvement of the algorithm, m sub-character strings are formed by dividing the hashed input stream, an observable value is kept for each of the m sub-input streams, and a counter is generated by utilizing the average value of the additional observation values;
the data calculation module is used for calculating all data from 1 dimension to N dimensions of the target data cube, wherein the data calculation module provides diversified solutions for data fault tolerance and synchronously processes the input dirty data, and the processing types include filtering and converting;
and the third storage module is used for compressing and storing all data from 1 dimension to N dimensions of the target data cube, wherein the data are divided into a plurality of parts by row count or data size, so that storage compression based on a hybrid 'row + column' layout is realized.
6. The processing apparatus according to claim 5, wherein the normalizing module normalizes each original data cube in each original data cube set to obtain a target data cube, respectively, comprising:
and calculating the union of the dimensions and the union of the metrics for each original data cube in the original data cube set, and generating a target data cube according to the calculated union of the dimensions and the union of the metrics.
7. The processing apparatus according to claim 5, wherein:
the second storage module storing the metadata information of the target data cube set and the mapping relations among the elements of the target data cube set and the original data cube set into a database comprises: performing de-duplication on each dimension of the target data cube, putting the de-duplicated values into an array, using the array subscripts as the key values of the corresponding generated dictionary tree, combining and de-duplicating the dimensions of the target data cube, mapping the generated dimension combinations onto the single dictionary tree formed in the previous step, and finally forming the dictionary tree of the dimensions, wherein one copy of the dictionary tree is stored in a memory and one copy in a column key of MySQL, respectively;
the data calculation module calculating all data of 1 dimension to N dimensions of the target data cube includes: calculating the (N-1)-dimension data summaries of the dictionary tree by using an MR framework, calculating the (N-2)-dimension data from the (N-1)-dimension data, and repeating the iterative calculation until the 1-dimension data are obtained;
the compressing and storing all data of 1 dimension to N dimensions of the target data cube includes: dividing the data into a plurality of parts by row count or data size, compressing each part of the data by using the Snappy compression algorithm, and storing the compressed data.
8. The processing device of claim 5, further comprising a user query module, a lookup module, and a display module:
the user query module is used for providing a query entry for a user to select an original data cube to view and to input query conditions;
the searching module is used for searching corresponding dictionary information according to dimension information of the original data cube queried by the user, scanning and filtering rowkeys in HBase according to key values in the dictionary, and summarizing and calculating values from the scanned data;
and the display module is used for displaying the calculated value.
9. An electronic device, the electronic device comprising: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space surrounded by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing the operations of:
providing a plurality of dimensions and metrics for selection by a user;
generating each original data cube according to the dimension and the measure selected by each user, and storing metadata information of each original data cube;
classifying each original data cube to obtain an original data cube set of different categories;
carrying out normalization processing on each original data cube in each original data cube set to obtain target data cubes respectively, wherein the normalization processing comprises the steps of calculating the union of dimensions and the union of metrics for each original data cube in the original data cube set, generating target data cubes according to the calculated union of dimensions and the union of metrics, each target data cube corresponds to one original data cube set, and all target data cubes form a target data cube set;
storing metadata information of a target data cube set and mapping relations among elements of the target data cube set and the original data cube set into a database, wherein de-duplication is carried out on each dimension of the target data cube, the algorithm selected for calculating the non-duplicate values of a dimension is based on the cardinality-statistics HyperLogLog algorithm with a secondary optimization and improvement of the algorithm, m sub-character strings are formed by dividing the hashed input stream, an observable value is kept for each of the m sub-input streams, and a counter is generated by utilizing the average value of the additional observed values;
calculating all data from 1 dimension to N dimensions of the target data cube, wherein diversified solutions for data fault tolerance are provided, input dirty data are processed synchronously, and the processing types include filtering and converting;
and compressing and storing all data of 1 dimension to N dimensions of the target data cube, wherein the data are divided into a plurality of parts by row count or data size, so that storage compression based on a hybrid 'row + column' layout is realized.
CN201811547105.3A 2018-12-17 2018-12-17 Big data-based data cube processing method and device and electronic equipment Withdrawn - After Issue CN109684419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811547105.3A CN109684419B (en) 2018-12-17 2018-12-17 Big data-based data cube processing method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN109684419A CN109684419A (en) 2019-04-26
CN109684419B (en) 2023-10-03

Family

ID=66186355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811547105.3A Withdrawn - After Issue CN109684419B (en) 2018-12-17 2018-12-17 Big data-based data cube processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN109684419B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166498A1 (en) * 2011-12-25 2013-06-27 Microsoft Corporation Model Based OLAP Cube Framework
CN105843842A (en) * 2016-03-08 2016-08-10 东北大学 Multi-dimensional gathering querying and displaying system and method in big data environment
CN108829707A (en) * 2018-05-02 2018-11-16 国网浙江省电力有限公司信息通信分公司 Big data intelligent analysis system and method across business domains

Also Published As

Publication number Publication date
CN109684419A (en) 2019-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20190426
GR01 Patent grant