CN111090705B

CN111090705B - Multidimensional data processing method, device and equipment and storage medium

Info

Publication number: CN111090705B
Application number: CN201811236196.9A
Authority: CN
Inventors: 曾锐; 陈国栋; 徐乾龙
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2018-10-23
Filing date: 2018-10-23
Publication date: 2023-08-25
Anticipated expiration: 2038-10-23
Also published as: CN111090705A

Abstract

The invention provides a multidimensional data processing method, a multidimensional data processing device, multidimensional data processing equipment and a multidimensional data storage medium, wherein the multidimensional data processing method comprises the following steps: determining a data level to which the read multidimensional data belongs, and aggregating the multidimensional data according to the data level to obtain cube data belonging to a corresponding data level; storing the cube data into a local target level cache region, wherein the target level cache region is a local level cache region corresponding to a data level to which the cube data belongs; judging whether the data amount cached in the target level cache area reaches a specified threshold value, if so, transferring the data cached in the target level cache area to a level database table of a target database corresponding to the data level. The problem of large data volume when inquiring the cube data caused by the indifferent aggregation and storage of multidimensional data is avoided, and the inquiry efficiency is improved.

Description

Multidimensional data processing method, device and equipment and storage medium

Technical Field

The present invention relates to the field of big data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for multidimensional data processing.

Background

Big data processing includes multidimensional data processing. A data cube is a processing tool for large data such as multidimensional data, and supports real-time indexing on any combination of keywords of the cube data. After multidimensional data is aggregated into data cube data (cube data for short), the cube data set can be explored and analyzed from multiple angles. Because the cube data is obtained by aggregating data of multiple dimensions, a required result can be obtained directly by querying the cube data set without computing over the multidimensional data in real time, which can greatly improve the efficiency of data query and retrieval.

In the related multidimensional data processing mode, after the multidimensional data acquired from the data source are aggregated in a unified aggregation processing mode, the obtained cube data are all stored in the same database table of the database.

In the above multidimensional data processing mode, because aggregation and storage are performed in a uniform manner, all stored cube data must be traversed when cube data of interest is queried, so the amount of data queried is large and the query efficiency is low.

Disclosure of Invention

In view of this, the present invention provides a multidimensional data processing method, apparatus, device and storage medium, which avoid the problem of a large query data volume caused by uniform aggregation and storage, and help improve query efficiency.

The first aspect of the present invention provides a multidimensional data processing method, comprising:

determining a data level to which the read multidimensional data belongs, and aggregating the multidimensional data according to the data level to obtain cube data belonging to a corresponding data level;

storing the cube data into a local target level cache region, wherein the target level cache region is a local level cache region corresponding to a data level to which the cube data belongs;

judging whether the data amount cached in the target level cache area reaches a specified threshold value, if so, transferring the data cached in the target level cache area to a level database table of a target database corresponding to the data level.

According to one embodiment of the present invention, storing the cube data in a local target level cache area includes:

determining a data identifier of cube data according to a data level to which the cube data belongs and a time value on a designated time level in time dimension data of corresponding multidimensional data;

determining a target level cache area according to the data identifier of the cube data, and inquiring whether first cube data corresponding to the data identifier exists in the target level cache area;

if the first cube data exists, acquiring the first cube data, merging the cube data with the acquired first cube data, and replacing the first cube data in the target level cache area with the merged cube data.

According to one embodiment of the present invention, storing the cube data in a local target level cache area further includes:

if the first cube data does not exist in the target level cache area, searching for the data identifier in an established data identifier table, wherein the data identifier table records the data identifiers of the cube data stored in the target database;

if the data identifier is found, acquiring second cube data corresponding to the data identifier from the target database, merging the cube data with the acquired second cube data, storing the merged cube data and the data identifier into the target level cache area, and deleting the data identifier from the data identifier table;

and if the data identifier is not found, storing the cube data and the data identifier into the target level cache area.
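
The caching logic of the two embodiments above can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the dictionaries `cache` and `database`, the set `id_table`, the function names, the identifier strings and the merge rule (adding numeric measures) are all assumptions introduced for the example.

```python
def merge(a, b):
    """Hypothetical merge rule: add numeric measures key by key."""
    merged = dict(a)
    for key, value in b.items():
        merged[key] = merged.get(key, 0) + value
    return merged


def store_cube(data_id, cube, cache, id_table, database):
    """Store one piece of cube data in the level cache area for its level.

    cache    -- dict standing in for the target level cache area
    id_table -- set of identifiers whose cube data is in the target database
    database -- dict standing in for the level database table
    """
    if data_id in cache:
        # First cube data exists locally: merge and overwrite the entry.
        cache[data_id] = merge(cache[data_id], cube)
    elif data_id in id_table:
        # Second cube data exists in the database: fetch it, merge, cache
        # the result, and delete the identifier from the identifier table.
        cache[data_id] = merge(database[data_id], cube)
        id_table.discard(data_id)
    else:
        # Not found anywhere: cache the new cube data as it is.
        cache[data_id] = dict(cube)


cache = {}
id_table = {"L1|2018-08-22-09"}
database = {"L1|2018-08-22-09": {"sales": 5}}
store_cube("L1|2018-08-22-10", {"sales": 2}, cache, id_table, database)
store_cube("L1|2018-08-22-10", {"sales": 3}, cache, id_table, database)
store_cube("L1|2018-08-22-09", {"sales": 1}, cache, id_table, database)
```

In this sketch the second call merges with the locally cached entry, while the third call pulls the stored entry back from the stand-in database before merging.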

According to one embodiment of the present invention, the determining a data hierarchy to which the read multidimensional data belongs includes:

judging whether time dimension data in the multidimensional data is in a specified time interval;

if yes, determining a data level to which the multidimensional data belongs as a first level;

if not, judging whether the time dimension data in the multidimensional data is smaller than the smaller endpoint time of the specified time interval, and if so, determining the data level to which the multidimensional data belongs as a second level.

According to one embodiment of the present invention, after determining whether the time dimension data in the multi-dimensional data is smaller than the smaller end point time of the specified time interval, the method further comprises:

and if not, adjusting the specified time interval according to the time dimension data so that the time dimension data falls within the adjusted specified time interval, determining the data level to which the multidimensional data belongs as the first level, and deleting the stored cube data that belongs to the first level and whose corresponding multidimensional data is not within the adjusted specified time interval, together with the data associated with that cube data.
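
The level determination described above can be sketched as follows. The text does not say how the specified time interval is adjusted, so the sliding policy here (keep the interval width and move the larger endpoint to the new time) is an assumption, as are the class name `LevelClassifier` and the store keyed by time.

```python
from datetime import datetime

FIRST_LEVEL, SECOND_LEVEL = 1, 2


class LevelClassifier:
    """Assigns a data level from a time value and a specified interval."""

    def __init__(self, start, end):
        self.start, self.end = start, end

    def classify(self, t, first_level_store):
        if self.start <= t <= self.end:
            return FIRST_LEVEL
        if t < self.start:
            return SECOND_LEVEL
        # t is later than the interval. Assumed policy: slide the interval
        # so that t becomes its larger endpoint, keeping the width.
        width = self.end - self.start
        self.start, self.end = t - width, t
        # Delete stored first-level entries that fell out of the interval.
        stale = [k for k in first_level_store
                 if not (self.start <= k <= self.end)]
        for key in stale:
            del first_level_store[key]
        return FIRST_LEVEL


clf = LevelClassifier(datetime(2018, 7, 23), datetime(2018, 8, 22))
store = {datetime(2018, 7, 23, 12): "old", datetime(2018, 8, 1): "kept"}
in_interval = clf.classify(datetime(2018, 8, 10), store)
before_interval = clf.classify(datetime(2018, 7, 1), store)
after_interval = clf.classify(datetime(2018, 8, 23), store)
```

After the last call the interval has moved forward by one day, so the entry dated July 23 is removed from the hypothetical first-level store.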

According to one embodiment of the present invention, aggregating the multidimensional data according to the data hierarchy to obtain cube data belonging to a corresponding data hierarchy includes:

if the multidimensional data belongs to a first level, aggregating the time values on all time levels of the time dimension data in the multidimensional data with the remaining specified dimension data in the multidimensional data to obtain cube data belonging to the first level;

if the multidimensional data belongs to a second level, aggregating the time values on target time levels of the time dimension data in the multidimensional data with the remaining specified dimension data in the multidimensional data to obtain cube data belonging to the second level;

wherein the target time levels are fewer than all of the time levels of the time dimension data in the multidimensional data.
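
One possible reading of the two aggregation modes is sketched below. It assumes that a cube cell is produced for every time level that is kept, that the aggregate is a sum over a numeric measure, and that second-level data keeps only year, month and day; the record fields `time`, `commodity` and `price` and the function `aggregate` are hypothetical.

```python
from collections import defaultdict

TIME_LEVELS = ("year", "month", "day", "hour", "minute", "second")
# Assumed choice: second-level data keeps only the three coarsest levels.
SECOND_LEVEL_DEPTH = 3


def aggregate(records, level, dims=("commodity",), measure="price"):
    """Sum the measure per (time value, dimension values) cell."""
    depth = len(TIME_LEVELS) if level == 1 else SECOND_LEVEL_DEPTH
    cells = defaultdict(float)
    for rec in records:
        parts = rec["time"].split("-")
        for n in range(1, depth + 1):
            # Time value on the n coarsest time levels, e.g. "2018-08".
            time_value = "-".join(parts[:n])
            key = (time_value,) + tuple(rec[d] for d in dims)
            cells[key] += rec[measure]
    return dict(cells)


records = [
    {"time": "2018-08-22-10-10-10", "commodity": "pen", "price": 3.0},
    {"time": "2018-08-22-10-20-00", "commodity": "pen", "price": 2.0},
]
first_level_cube = aggregate(records, level=1)
second_level_cube = aggregate(records, level=2)
```

Under these assumptions the second-level cube is strictly smaller than the first-level cube for the same input, which is the property the embodiment relies on.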

A second aspect of the present invention provides a multidimensional data processing apparatus comprising:

the aggregation processing module is used for determining the data level to which the read multidimensional data belong and aggregating the multidimensional data according to the data level to obtain cube data belonging to the corresponding data level;

the local caching module is used for storing the cube data into a local target level cache region, wherein the target level cache region is a local level cache region corresponding to a data level to which the cube data belongs;

and the data transfer module is used for judging whether the data amount cached in the target level cache area reaches a specified threshold value, and transferring the data cached in the target level cache area to a level database table of a target database corresponding to the data level if the data amount cached in the target level cache area reaches the specified threshold value.

According to one embodiment of the present invention, the local cache module includes:

the identification determining unit is used for determining the data identification of the cube data according to the data level to which the cube data belongs and the time value on the designated time level in the time dimension data of the corresponding multidimensional data;

the local query unit is used for determining a target level cache area according to the data identifier of the cube data and querying whether first cube data corresponding to the data identifier exists in the target level cache area;

and the first merging unit is used for acquiring the first cube data if the first cube data exists, merging the cube data with the acquired first cube data, and modifying the first cube data in the target level cache region into merged cube data.

According to one embodiment of the present invention, the local cache module further includes:

the identification searching unit is used for searching the data identification in the established data identification table if the first cube data does not exist in the target level cache region; the data identification table is recorded with the data identification of the cube data stored in the target database;

the second merging unit is used for acquiring second cube data corresponding to the data identifier from the target database if the data identifier is found, merging the cube data with the acquired second cube data, storing the merged cube data and the data identifier into the target level cache area, and deleting the data identifier from the data identifier table;

and the caching unit is used for storing the cube data and the data identifier into the target level cache area if the data identifier is not found.

According to one embodiment of the invention, the aggregation processing module comprises:

a time judging unit for judging whether the time dimension data in the multidimensional data is in a specified time interval;

a hierarchy determining first unit, configured to determine, if yes, a data hierarchy to which the multidimensional data belongs as a first hierarchy;

and the hierarchy determining second unit is used for judging, if not, whether the time dimension data in the multidimensional data is smaller than the smaller endpoint time of the specified time interval, and if so, determining the data level to which the multidimensional data belongs as a second level.

According to one embodiment of the invention, the hierarchy determining second unit further comprises:

and the hierarchy determining subunit is used for adjusting the specified time interval according to the time dimension data if the time dimension data is not smaller than the smaller endpoint time of the specified time interval, so that the time dimension data falls within the adjusted specified time interval, determining the data level to which the multidimensional data belongs as the first level, and deleting the stored cube data that belongs to the first level and whose corresponding multidimensional data is not within the adjusted specified time interval, together with the data associated with that cube data.

According to one embodiment of the invention, the aggregation processing module further comprises:

the first aggregation processing unit is used for carrying out aggregation processing on time values on all time levels of time dimension data in the multi-dimensional data and the remaining appointed dimension data in the multi-dimensional data if the multi-dimensional data belongs to a first level to obtain cube data belonging to the first level;

the second aggregation processing unit is used for carrying out aggregation processing on the time value on the target time level of the time dimension data in the multi-dimensional data and the remaining specified dimension data in the multi-dimensional data if the multi-dimensional data belongs to a second level, so as to obtain cube data belonging to the second level.

A third aspect of the invention provides an electronic device comprising a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the multidimensional data processing method as described in the foregoing embodiments.

A fourth aspect of the invention provides a machine readable storage medium having stored thereon a program which, when executed by a processor, implements a multi-dimensional data processing method as described in the previous embodiments.

The embodiment of the invention has the following beneficial effects:

In the embodiment of the invention, the multidimensional data is classified into data levels and aggregated according to the data level to which it belongs, so that the resulting cube data is classified by level accordingly.

In addition, after the multidimensional data is aggregated into cube data, the cube data is not stored directly in the target database but is cached in the local target level cache area. Only when the amount of data in the target level cache area reaches a specified threshold is the cached data transferred to the target database, which reduces the frequency of access to the target database, allows the target database to respond to data access promptly, and improves the throughput of the system.

Drawings

FIG. 1 is a flow diagram of a multi-dimensional data processing method according to an exemplary embodiment of the present invention;

FIG. 2 is a block diagram of a multi-dimensional data processing apparatus according to an exemplary embodiment of the present invention;

FIG. 3 is a flow chart of a multi-dimensional data processing method according to a more specific embodiment of the invention;

FIG. 4 is a flow chart illustrating a method for determining a data hierarchy to which multidimensional data belongs in accordance with an exemplary embodiment of the present invention;

FIG. 5 is a block diagram of an electronic device according to an exemplary embodiment of the present invention.

Description of the embodiments

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various devices, these devices should not be limited by these terms. These terms are only used to distinguish one device from another of the same type. For example, a first device could also be termed a second device, and, similarly, a second device could also be termed a first device, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.

In order to make the description of the present invention clearer and more concise, some technical terms of the present invention are explained below:

Kafka: a high-throughput distributed publish-subscribe messaging system that can handle all of the action stream data of a consumer-scale website.

Spark Streaming: a streaming batch processing engine whose basic principle is to process input data in batches at a certain time interval; when the batch interval is shortened to the order of seconds, it can be used to process real-time data streams. Spark Streaming supports reading data from a variety of data sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis and TCP sockets; after the data is read, high-level functions such as map, reduce and join may be used to perform complex processing.

The following describes the multi-dimensional data processing method according to the embodiment of the present invention in more detail, but the method is not limited thereto. In one embodiment, referring to FIG. 1, a multi-dimensional data processing method may include the steps of:

S100: determining a data level to which the read multidimensional data belongs, and aggregating the multidimensional data according to the data level to obtain cube data belonging to a corresponding data level;

S200: storing the cube data into a local target level cache region, wherein the target level cache region is a local level cache region corresponding to a data level to which the cube data belongs;

S300: judging whether the data amount cached in the target level cache area reaches a specified threshold value, if so, transferring the data cached in the target level cache area to a level database table of a target database corresponding to the data level.

In an embodiment of the present invention, a piece of multidimensional data includes a plurality of items of information data, each of which is one dimension of the multidimensional data. For example, one piece of multidimensional data includes items such as time, name, age, commodity and price of the purchased commodity, each of which is a different dimension of the multidimensional data.

The execution body of the multidimensional data processing method in the embodiment of the invention may be an electronic device, and more specifically a processor of the electronic device; there may be one or more processors, each of which may be a general-purpose processor or a special-purpose processor. The electronic device is, for example, a computer device, and is not specifically limited.

In step S100, a data hierarchy to which the read multidimensional data belongs is determined, and the multidimensional data is aggregated according to the data hierarchy to obtain cube data belonging to a corresponding data hierarchy.

The multidimensional data may be obtained from a data source in batches, a batch comprising a plurality of pieces of multidimensional data. After the multidimensional data is acquired, the data level of each piece of multidimensional data in the current batch is determined, and the multidimensional data is aggregated according to the data level of each piece to obtain the corresponding cube data. Each piece of multidimensional data may be aggregated into one piece or a plurality of pieces of cube data, and if a plurality of pieces of cube data are obtained, their data identifiers may differ.

Specifically, the multidimensional data may be obtained in batches from a Kafka data source by Spark Streaming. Of course, the multidimensional data may be obtained in other manners, or from other data sources, which is not specifically limited. Spark Streaming acquires one batch of multidimensional data at a time, and the time interval between batches is small, which favors real-time processing of the multidimensional data.

After Spark Streaming reads multidimensional data from Kafka, it manages the message partition in Kafka from which data is currently being read and the offset of the data within that message partition. Preferably, the message partition and offset managed by Spark Streaming are persisted locally, that is, stored in a local non-volatile memory, so that after a local power failure and restart, the position currently read in Kafka can be located accurately from the message partition and the offset, thereby ensuring that the multidimensional data is consumed correctly and is not consumed repeatedly.
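
The local persistence of the message partition and offset can be illustrated with a small sketch. It does not use any Kafka or Spark API; it only shows one way to write the partition-to-offset mapping to local non-volatile storage so that a crash cannot leave a half-written file. The file name `offsets.json`, the JSON layout and the function names are assumptions.

```python
import json
import os
import tempfile


def save_offsets(path, offsets):
    """Atomically persist {partition: next_offset} to local storage."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(offsets, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # Replacing the old file in one step avoids a partially written file.
    os.replace(tmp_path, path)


def load_offsets(path):
    """Return the persisted offsets, or an empty dict on first start."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


state_dir = tempfile.mkdtemp()
offset_file = os.path.join(state_dir, "offsets.json")
before = load_offsets(offset_file)           # nothing persisted yet
save_offsets(offset_file, {"0": 128, "1": 97})
restored = load_offsets(offset_file)         # what a restart would read
```

On restart, a consumer would resume each partition from the restored offset instead of from the beginning, which is what prevents repeated consumption.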

It is to be understood that the present invention is not limited to acquiring multidimensional data in a batch manner, but may acquire multidimensional data in a single acquisition manner. In this case, each time a piece of multidimensional data is acquired, a data hierarchy to which the multidimensional data belongs is determined, and the multidimensional data is aggregated according to the data hierarchy to obtain corresponding cube data.

The multidimensional data may specifically be divided into levels as follows: multidimensional data from within the last 30 days and multidimensional data from more than 30 days ago are assigned to two different data levels. For example, if the time dimension data of a first piece of multidimensional data is 2018-08-22-10-10-10 and that of a second piece is 2018-07-22-10-10-10, the two pieces are classified into two data levels: the first piece belongs to the first level and the second piece belongs to the second level.

The multidimensional data belonging to different data levels have corresponding multidimensional data aggregation modes, the multidimensional data can be aggregated according to the multidimensional data aggregation modes corresponding to the data levels to which the multidimensional data belong, and the obtained cube data correspondingly belong to the data levels corresponding to the multidimensional data. Different multidimensional data aggregation may refer to different data being aggregated, and/or different combinations of data, and so forth.

The multidimensional data and the corresponding cube data may be categorized under the same data hierarchy, for example, when the multidimensional data belongs to the first hierarchy, the aggregated cube data also belongs to the first hierarchy.

Corresponding memory caches are allocated locally to store the cube data and serve as the level cache areas for caching the cube data; the level cache areas are of course not limited to memory caches. Two or more level cache areas may be configured locally, depending on how many data levels there are. For example, when the data levels include a first level and a second level, a first-level cache area and a second-level cache area may be provided locally. If the cube data belongs to the first level, it is stored in the first-level cache area; if the cube data belongs to the second level, it is stored in the second-level cache area.

It can be understood that the associated data associated with the corresponding cube data may be cached in each level cache area, and the associated data is not limited, for example, a data identifier of the cube data. The cube data and the corresponding association data may be stored in the same entry of the hierarchical cache.

In step S200, a target level buffer corresponding to the data level to which the cube data belongs is determined, and the cube data is stored in the target level buffer.

In this embodiment, when the cube data obtained by aggregation is cached locally, all the cube data obtained by aggregating the current batch of multidimensional data may be cached in one pass, provided that the level cache area corresponding to each piece of cube data is determined. Alternatively, each time a piece of cube data is obtained by aggregation, it may be cached locally before the next piece of multidimensional data is aggregated.

When multidimensional data is obtained in batches by Spark Streaming, the time interval between batches is small. In this case, if each batch of aggregated cube data were stored directly into the database, the database would be accessed very frequently, and a distributed database such as HBase in particular is not suited to such frequent access.

In the embodiment of the invention, the cube data obtained by aggregating the multidimensional data is not stored directly into the database, but is first stored in a local level cache area. Even if the Spark Streaming batch interval is reduced to the order of seconds or milliseconds, access to a distributed database such as HBase does not become a burden.

In step S300, it is determined whether the amount of data buffered in the target level buffer reaches a specified threshold, and if so, the data buffered in the target level buffer is transferred to a level database table of the target database corresponding to the data level.

A specified threshold is set locally for each level cache area. Each level cache area can cache a certain amount of cube data, and when the cached cube data reaches that amount, it needs to be transferred to the target database. Each time cube data is stored into the corresponding target level cache area, the amount of data cached in the target level cache area is calculated and compared with the specified threshold; when the amount reaches the specified threshold, the data cached in the target level cache area is transferred to the level database table of the target database corresponding to the data level.

The specific value of the specified threshold is not limited and may be determined as needed. Preferably, the specified threshold is set such that a data amount greater than a batch of cube data can be stored in the target level cache.

The specified threshold may be a specified data size or a specified number of pieces of cube data; accordingly, the amount of data cached in the target level cache area may be measured by the cached data size or by the number of cached pieces of cube data, which is not specifically limited. Taking a specified number of pieces as the threshold, for example, the accumulated number of pieces of cube data is counted each time cube data is stored in the target level cache area, and when that number reaches the specified number of pieces, the amount of data cached in the target level cache area has reached the specified threshold.

And if the data quantity cached in the target level cache area reaches a specified threshold value, the data cached in the target level cache area is transferred to a target database. Of course, if the amount of data buffered in the target level buffer does not reach the specified threshold, processing is continued for the next batch of multidimensional data, or processing is performed for the next piece of multidimensional data.
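
Step S300 with a piece-count threshold can be sketched as follows. The class `LevelCache`, the plain dictionary standing in for the level database table and the identifier strings are assumptions made for the example.

```python
class LevelCache:
    """Stand-in for one level cache area with a piece-count threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = {}

    def put(self, data_id, cube, level_table):
        """Cache one piece of cube data; transfer everything once full.

        Returns True if this call triggered a transfer to level_table.
        """
        self.entries[data_id] = cube
        if len(self.entries) < self.threshold:
            return False
        level_table.update(self.entries)  # store into the level table
        self.entries.clear()              # then empty the cache area
        return True


first_level_table = {}
cache_area = LevelCache(threshold=3)
flushed = [
    cache_area.put(f"1|2018-08-22-{hour:02d}", {"sales": hour},
                   first_level_table)
    for hour in (8, 9, 10)
]
```

Only the third call reaches the threshold, so the stand-in database table is written once for three pieces of cube data instead of three times.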

Here, transferring means storing the data cached in the target level cache area into the target database and then emptying the target level cache area. The timing of the emptying is not limited: the target level cache area may be emptied immediately after its data is stored into the target database, or emptied when the next piece of cube data needs to be stored into it.

The target database may be deployed on an external server. The target database may be, for example, a distributed database such as HBase, although the specific type is not limited.

In both the local cache and the target database, the cube data is stored according to the data hierarchy to which the cube data belongs, in other words, the cube data is stored in a classified manner, the cube data belonging to the same data hierarchy can be stored in the storage space of the same data hierarchy, and the cube data belonging to different data hierarchies can be stored in the storage space of different data hierarchies.

Furthermore, the data cube storage structure of the embodiments of the present invention includes both the local level cache areas and the target database, and is divided as a whole into a first-level data cube storage structure and a second-level data cube storage structure. When cube data is queried, the query condition entered by the user determines whether the first-level or the second-level cube storage structure needs to be queried. The local level cache area is searched first and, if the data is found, the result is returned directly; if not, the level database table of the target database is queried. The required cube data can thus be found more quickly, which improves the query speed for cube data.
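
The query path described above can be sketched as follows, assuming the level and data identifier have already been derived from the user's query condition. The function `query_cube` and the dictionaries standing in for the level cache areas and level database tables are hypothetical.

```python
def query_cube(level, data_id, level_caches, level_tables):
    """Look up cube data in the storage structure of one data level.

    level_caches -- {level: dict} standing in for the level cache areas
    level_tables -- {level: dict} standing in for the level database tables
    """
    cached = level_caches[level].get(data_id)
    if cached is not None:
        return cached, "cache"            # found locally, return directly
    return level_tables[level].get(data_id), "database"


level_caches = {1: {"1|2018-08-22-10": {"sales": 5}}, 2: {}}
level_tables = {1: {}, 2: {"2|2018-07-22": {"sales": 40}}}
hit = query_cube(1, "1|2018-08-22-10", level_caches, level_tables)
fallback = query_cube(2, "2|2018-07-22", level_caches, level_tables)
miss = query_cube(1, "1|2018-08-22-11", level_caches, level_tables)
```

Because each lookup touches only the structures of one level, data of the other level is never traversed.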

To ensure data consistency, before the data cached in the target level cache area is transferred to the target database, the cached data is persisted to a local non-volatile memory such as HDFS. Completion of the persistence indicates that the data received from the data source has been consumed successfully, and the message partition and the offset of the data in the message partition mentioned in the foregoing embodiment may be persisted at this point. On restart, the data of the data source can then be consumed from the correct position, while the cached data is restored from HDFS to the target level cache area and subsequently transferred to the target database.

In one embodiment, the above-described method flow may be performed by the multidimensional data processing apparatus 10. As shown in FIG. 2, the multidimensional data processing apparatus 10 mainly includes three modules: the aggregation processing module 100, the local caching module 200 and the data transfer module 300. The aggregation processing module 100 is configured to execute step S100, the local caching module 200 is configured to execute step S200, and the data transfer module 300 is configured to execute step S300.

In one embodiment, in step S200, storing the cube data in the local target level cache area may include the following steps:

S201: determining a data identifier of cube data according to a data level to which the cube data belongs and a time value on a designated time level in time dimension data of corresponding multidimensional data;

S202: determining a target level cache area according to the data identifier of the cube data, and inquiring whether first cube data corresponding to the data identifier exists in the target level cache area;

S203: if the first cube data exists, acquiring the first cube data, merging the cube data with the acquired first cube data, and replacing the first cube data in the target level cache area with the merged cube data.

The multidimensional data includes time dimension data, which can be obtained from a timestamp field of the multidimensional data. The time dimension data is divided into a plurality of time levels, and each time level has a corresponding time value. The specified time level may include one time level or two or more time levels, as required. For example, the time dimension data 2018-08-22-10-10-10 represents 10:10:10 on August 22, 2018; its time levels are year, month, day, hour, minute and second. If the specified time levels are year, month, day and hour, the time value on the specified time levels in the time dimension data of the multidimensional data is 2018-08-22-10.

In step S201, if two pieces of cube data belong to the same data level, and the time values on the specified time levels (e.g. 2018-08-22-10) in the time dimension data of their corresponding multidimensional data are the same, the two pieces of cube data have the same data identifier.

In step S202, whether first cube data corresponding to the data identifier of the cube data exists in the target level cache area is determined according to the determined data identifier; if so, the cube data needs to be merged with the first cube data. The cube data to be merged is thus determined by matching data identifiers.
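As a minimal sketch of how such a data identifier could be derived, the following combines the data level with the time value truncated to the specified time levels. The names `TIME_LEVELS`, `time_value_on_levels` and `make_data_id`, and the `level:time` string format, are assumptions made for illustration; the embodiment does not prescribe a concrete identifier format.

```python
from datetime import datetime

# Hypothetical ordering of the time levels, coarsest first.
TIME_LEVELS = ["year", "month", "day", "hour", "minute", "second"]


def time_value_on_levels(ts: datetime, specified_levels: list) -> str:
    """Return the time value restricted to the specified time levels."""
    parts = {
        "year": f"{ts.year:04d}",
        "month": f"{ts.month:02d}",
        "day": f"{ts.day:02d}",
        "hour": f"{ts.hour:02d}",
        "minute": f"{ts.minute:02d}",
        "second": f"{ts.second:02d}",
    }
    return "-".join(parts[lv] for lv in TIME_LEVELS if lv in specified_levels)


def make_data_id(data_level: str, ts: datetime, specified_levels: list) -> str:
    """Combine the data level and the truncated time value (assumed format)."""
    return f"{data_level}:{time_value_on_levels(ts, specified_levels)}"


levels = ["year", "month", "day", "hour"]
id_a = make_data_id("first", datetime(2018, 8, 22, 10, 10, 10), levels)
id_b = make_data_id("first", datetime(2018, 8, 22, 10, 45, 0), levels)
```

Here `id_a` and `id_b` are equal because both records belong to the same data level and share the time value 2018-08-22-10 on the specified time levels.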

Generally, after a certain amount of cube data is locally cached, the probability of successful matching of the data identification increases, so that the probability of finding cube data to be combined in a local target level cache area becomes larger, the query speed is faster than that of finding cube data in a database, and the overall processing time is shortened.

It can be appreciated that each piece of cube data stored in a local level cache area may have a corresponding data identifier. The data identifier may be stored in the local level cache area together with the cube data, for example in the same entry as the corresponding cube data. Alternatively, a separate identifier cache may be allocated to store each data identifier together with the address needed to locate the corresponding cube data in the level cache area; in that case, the data identifier is first looked up in the identifier cache, the corresponding address is read when the identifier is found, and the first cube data in the target level cache area is located and acquired according to that address. This is not specifically limited.

In step S203, when the first cube data exists in the target level cache, the cube data is merged with the first cube data, and the first cube data in the target level cache is modified into merged cube data.

The manner of merging is not limited; one or more items of the two pieces of cube data may be merged. Illustratively, the cube data includes: time, name, age, commodity, and price of the purchased commodity. After merging, the result may be, but is not limited to: time, name, age, commodity, and the sum of the two purchase prices. That is, the prices of the purchased commodity in the two pieces of cube data are added together, while the other items of the cube data remain unchanged.

After the cube data is merged with the first cube data, only the first cube data in the target level cache area is replaced by the merged cube data, and associated data corresponding to the first cube data, such as data identification and the like, does not need to be changed.

Because the merging of cube data is performed locally, no merge operation is needed at storage time when the data in the target level cache area is transferred to the target database, which shortens the time for storing the cube data into the database.
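Steps S202 and S203 can be sketched as follows, assuming the target level cache area is modelled as a dictionary keyed by data identifier and that merging sums the price field while leaving the other items unchanged, as in the illustration above. The function names and field names are hypothetical.

```python
def merge_cube(existing: dict, incoming: dict) -> dict:
    """Merge two cube records: sum the price, keep the other items unchanged."""
    merged = dict(existing)
    merged["price"] = existing["price"] + incoming["price"]
    return merged


def merge_into_cache(cache: dict, data_id: str, cube: dict) -> bool:
    """S202/S203: merge with first cube data if present; report whether it was."""
    first = cache.get(data_id)
    if first is None:
        return False  # no first cube data; handled by later steps
    # Only the cached cube data is replaced; the identifier (the key) is untouched.
    cache[data_id] = merge_cube(first, cube)
    return True


# Illustrative values only.
cache = {
    "first:2018-08-22-10": {
        "time": "2018-08-22-10", "name": "A", "age": 30,
        "commodity": "book", "price": 20,
    }
}
incoming = {"time": "2018-08-22-10", "name": "A", "age": 30,
            "commodity": "book", "price": 15}
was_merged = merge_into_cache(cache, "first:2018-08-22-10", incoming)
```

After the call, the cached entry keeps its identifier and carries a price of 35.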

In one embodiment, in step S200, storing the cube data in a local target level cache area further includes:

S204: if the first cube data does not exist in the target level cache area, searching for the data identifier in an established data identifier table, wherein the data identifier table records the data identifiers of the cube data stored in the target database;

S205: if the data identifier is found, acquiring second cube data corresponding to the data identifier from the target database, merging the cube data with the acquired second cube data, storing the merged cube data and the data identifier into the target level cache area, and deleting the data identifier from the data identifier table;

S206: if the data identifier is not found, storing the cube data and the data identifier into the target level cache area.

In step S204, when the first cube data does not exist in the target level cache, it is further determined whether the second cube data corresponding to the data identifier exists in the target database.

Whether second cube data exists in the target database is determined by searching for the data identifier in the established data identifier table. Because the data identifier table records the data identifiers of the cube data transferred to the target database, if the data identifier is present in the table, second cube data exists in the target database; otherwise, it does not.

In step S205, when the data identifier is found in the established data identifier table, the target level database table corresponding to the data level to which the cube data belongs is determined in the target database according to the data identifier. The second cube data corresponding to the data identifier is then looked up in the target level database table and acquired. Barring anomalies, the second cube data will necessarily be found there, because the data identifier was found in the data identifier table.

The cube data is merged with the acquired second cube data, and the merged cube data and the data identifier are stored into the target level cache area. Of course, if the cube data and the data identifier are stored in separate tables, the merged cube data and the data identifier may be stored in their respective table entries, provided that the two entries are associated.

After the second cube data corresponding to the data identifier is acquired from the target database, the data identifier is deleted from the data identifier table. When the merged cube data is subsequently transferred to the target database, the data identifier is recorded in the data identifier table again, and the merged cube data may overwrite the cube data in the entry corresponding to the data identifier in the target database.

The data identifier table is stored locally. When the data in a level cache area is transferred to the target database, the data identifier table is updated so that the data identifiers of the cube data transferred to the target database are stored in it.

In step S206, when it is determined that the cube data does not need to be merged, the cube data is directly stored in the target level cache region.

In this embodiment, the merging of cube data is performed locally, so that no merge operation is needed at storage time when the data in the target level cache area is transferred to the target database, which shortens the storage time. The merged data uploaded to the database is smaller in volume, which helps avoid network congestion. Most of the cube data required for merging can be obtained directly from the local cache, which reduces reads from the target database and avoids frequent access to it.
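The three outcomes of steps S202 to S206 can be sketched together as follows. This is an illustrative model only: the cache and the level database table are plain dictionaries, the data identifier table is a set, and merging sums a hypothetical `price` field. None of these names come from the embodiment.

```python
def merge(existing: dict, incoming: dict) -> dict:
    """Assumed merge rule: sum the price, keep the other items unchanged."""
    merged = dict(existing)
    merged["price"] = existing["price"] + incoming["price"]
    return merged


def store_cube(cube: dict, data_id: str, cache: dict,
               id_table: set, level_table: dict) -> str:
    """Store one piece of cube data and report which branch was taken."""
    if data_id in cache:                      # S202/S203: first cube data exists
        cache[data_id] = merge(cache[data_id], cube)
        return "merged_with_cache"
    if data_id in id_table:                   # S204/S205: second cube data exists
        second = level_table[data_id]         # read from the target database
        cache[data_id] = merge(second, cube)
        id_table.discard(data_id)             # re-added on the next transfer
        return "merged_with_database"
    cache[data_id] = cube                     # S206: nothing to merge with
    return "stored"


# Illustrative state only.
cache = {"id1": {"price": 1}}
level_table = {"id2": {"price": 5}}
id_table = {"id2"}

r1 = store_cube({"price": 2}, "id1", cache, id_table, level_table)
r2 = store_cube({"price": 7}, "id2", cache, id_table, level_table)
r3 = store_cube({"price": 4}, "id3", cache, id_table, level_table)
```

The database entry for `id2` is left in place in this sketch; as described above, it can be overwritten when the merged cube data is later transferred.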

In a more specific embodiment, referring to fig. 3, a specific flow of the multidimensional data processing method may be as follows:

11) Acquiring multidimensional data from a Kafka data source, for example acquiring the multidimensional data in batches through Spark Streaming, determining the data level to which the acquired multidimensional data belongs, and then executing step 12);

the data levels may include a first level and a second level, so that the data level to which the multidimensional data belongs is either the first level or the second level;

12) Judging whether the multidimensional data belongs to the first level; if so, aggregating the multidimensional data according to the aggregation mode of the first level to obtain cube data belonging to the first level, and if not, aggregating the multidimensional data according to the aggregation mode of the second level to obtain cube data belonging to the second level; the subsequent step 13) may be executed each time one piece of cube data is obtained, or after all cube data of the batch has been obtained;

13) Judging whether first cube data to be merged exists in the target level cache area corresponding to the data level to which the cube data belongs; if so, executing step 14), otherwise executing step 15);

a corresponding data identifier can be determined for each piece of cube data before it is stored, so that the cube data to be merged can be determined by matching data identifiers; in this step, whether first cube data to be merged exists may be determined by determining the data identifier of the cube data and searching for that data identifier in the target level cache area; when the data identifier of the cube data exists in the target level cache area, first cube data to be merged (corresponding to the data identifier) exists in the target level cache area;

14) Acquiring the first cube data from the target level cache area, merging the cube data obtained by aggregation in step 12) with the first cube data, modifying the first cube data in the target level cache area into the merged cube data, and then executing step 18);

15) Judging whether second cube data to be merged exists in the target database; if so, executing step 16), otherwise executing step 17);

when second cube data corresponding to the data identifier of the cube data exists in the target database, second cube data to be merged exists in the target database;

16) Acquiring the second cube data from the target database, merging the cube data obtained by aggregation in step 12) with the second cube data, caching the merged cube data in the target level cache area, and then executing step 18);

17) Caching the cube data obtained by aggregation in step 12) in the target level cache area, and then executing step 18);

18) Judging whether the amount of data cached in the target level cache area reaches a specified threshold, and if so, transferring the data cached in the target level cache area to the corresponding level database table of the target database; then returning to step 11).
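Step 18) can be sketched as follows, under the assumption that the amount of cached data is measured as the number of cached entries and that the level database table and the data identifier table are modelled as a dictionary and a set. The function name `flush_if_full` is hypothetical.

```python
def flush_if_full(cache: dict, level_table: dict,
                  id_table: set, threshold: int) -> bool:
    """Transfer the cache to the level database table once the threshold is reached."""
    if len(cache) < threshold:
        return False
    for data_id, cube in cache.items():
        level_table[data_id] = cube   # overwrites any older entry with this identifier
        id_table.add(data_id)         # record that the database now holds this identifier
    cache.clear()
    return True


# Illustrative state only.
cache = {"id1": {"price": 3}, "id2": {"price": 12}}
level_table = {"id2": {"price": 5}}
id_table = set()

flushed_early = flush_if_full(cache, level_table, id_table, threshold=3)
cache["id3"] = {"price": 4}
flushed_late = flush_if_full(cache, level_table, id_table, threshold=3)
```

Because merging already happened in the cache, the transfer is a plain write per identifier with no merge at storage time.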

In one embodiment, in step S100, determining the data hierarchy to which the read multidimensional data belongs may include the steps of:

S101: judging whether the time dimension data in the multidimensional data is within a specified time interval;

S102: if so, determining the data level to which the multidimensional data belongs as a first level;

S103: if not, judging whether the time dimension data in the multidimensional data is smaller than the smaller endpoint time of the specified time interval, and if so, determining the data level to which the multidimensional data belongs as a second level.

The specified time interval may be a user-defined time interval and is not specifically limited; for example, it may be the last 30 days. Assuming that the current time is 2018 (year)-08 (month)-21 (day)-10 (hour)-10 (minute)-10 (second), the specified time interval may be 2018-07-22-10-10-10 to 2018-08-21-10-10-10. Of course, the larger endpoint time of the specified time interval is not necessarily the current time.

If the time dimension data in the multidimensional data is 2018-07-25-10-10-10, the time dimension data is in a specified time interval, and the data hierarchy to which the multidimensional data belongs is determined to be a first hierarchy.

If the time dimension data in the multidimensional data is 2018-07-21-10-10-10, the time dimension data is not within the specified time interval, and it is further judged whether the time dimension data is smaller than the smaller endpoint time of the specified time interval. Because 2018-07-21-10-10-10 is smaller than 2018-07-22-10-10-10, the data level to which the multidimensional data belongs is determined to be the second level.

In the embodiment of the invention, the smaller endpoint time is the left endpoint value of the specified time interval, and the larger endpoint time is the right endpoint value of the specified time interval.
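The decision in steps S101 to S103 can be sketched as follows, using the dates from the examples above. The sketch assumes that both endpoints belong to the interval, which the embodiment does not state explicitly, and returns `None` when the time is later than the larger endpoint time, a case not covered by steps S101 to S103. The function name is hypothetical.

```python
from datetime import datetime


def determine_level(ts: datetime, start: datetime, end: datetime):
    """Return "first", "second", or None if ts is later than the larger endpoint."""
    if start <= ts <= end:      # S101/S102: within the specified time interval
        return "first"
    if ts < start:              # S103: earlier than the smaller endpoint time
        return "second"
    return None                 # later than the larger endpoint time


start = datetime(2018, 7, 22, 10, 10, 10)
end = datetime(2018, 8, 21, 10, 10, 10)

level_inside = determine_level(datetime(2018, 7, 25, 10, 10, 10), start, end)
level_before = determine_level(datetime(2018, 7, 21, 10, 10, 10), start, end)
level_after = determine_level(datetime(2018, 8, 23, 10, 10, 10), start, end)
```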

In one embodiment, in step S103, after determining whether the time dimension data in the multi-dimensional data is smaller than the smaller end point time of the specified time interval, the method further includes:

S1031: if not, adjusting the specified time interval according to the time dimension data so that the time dimension data is within the adjusted specified time interval, determining the data level to which the multidimensional data belongs as the first level, and deleting the stored cube data that belongs to the first level and whose corresponding multidimensional data is not within the adjusted specified time interval, together with the data related to that cube data.

If the time dimension data in the multidimensional data is not smaller than the smaller endpoint time of the specified time interval, then, since it is also not within the interval, the time dimension data is larger than the larger endpoint time of the specified time interval, and the specified time interval needs to be adjusted so that the time dimension data is within the adjusted specified time interval. The specified time interval may be adjusted, for example, by increasing the smaller endpoint and the larger endpoint by the same value while keeping the interval length unchanged.

Specifically, if the time dimension data in the multidimensional data is 2018-08-23-10-10-10, which is larger than the larger endpoint time 2018-08-21-10-10-10 of the specified time interval, the specified time interval can be adjusted from 2018-07-22-10-10-10 to 2018-08-21-10-10-10 into 2018-07-24-10-10-10 to 2018-08-23-10-10-10. The data level to which the multidimensional data belongs is then determined to be the first level for the subsequent aggregation calculation.

After the specified time interval is adjusted, the cube data stored in the local level cache area and the target database that belongs to the first level and whose corresponding multidimensional data is not within the adjusted specified time interval is deleted, together with the data related to that cube data. That is, cube data of the first level whose corresponding time dimension data is not within 2018-07-24-10-10-10 to 2018-08-23-10-10-10 is deleted from the local level cache area and the target database, so that all cube data of the first level stored in the local level cache area and the target database falls within the adjusted specified time interval 2018-07-24-10-10-10 to 2018-08-23-10-10-10.

Deleting the cube data that no longer falls within the specified time interval after the adjustment ensures that the cube data of the first level stored in the local level cache area and the target database is always within the current specified time interval, which reduces the storage space required by the cube data and facilitates quick searching.
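The adjustment and the subsequent deletion can be sketched as follows, using the dates from the example above. The sketch assumes that the first-level store is a dictionary keyed by the timestamp of the underlying multidimensional data and that the new larger endpoint is set to the incoming time; the function names are hypothetical.

```python
from datetime import datetime


def adjust_interval(start: datetime, end: datetime, ts: datetime):
    """Shift both endpoints by the same amount so that ts becomes the larger endpoint."""
    shift = ts - end
    return start + shift, end + shift


def purge_outside(first_level_store: dict, start: datetime, end: datetime) -> list:
    """Delete first-level entries outside [start, end]; return the deleted keys."""
    stale = [ts for ts in first_level_store if not (start <= ts <= end)]
    for ts in stale:
        del first_level_store[ts]
    return stale


start = datetime(2018, 7, 22, 10, 10, 10)
end = datetime(2018, 8, 21, 10, 10, 10)
new_start, new_end = adjust_interval(start, end, datetime(2018, 8, 23, 10, 10, 10))

# Illustrative first-level entries only.
store = {
    datetime(2018, 7, 23, 9, 0, 0): {"price": 5},   # earlier than new_start after the shift
    datetime(2018, 8, 1, 12, 0, 0): {"price": 8},   # still inside the interval
}
deleted = purge_outside(store, new_start, new_end)
```

In a full implementation, the same deletion would be applied to both the local level cache area and the first-level database table.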

In one embodiment, in step S100, aggregating the multidimensional data according to the data hierarchy to obtain cube data belonging to the corresponding data hierarchy may include the following steps:

S104: if the multidimensional data belongs to the first level, aggregating the time values on all time levels of the time dimension data in the multidimensional data and the remaining specified dimension data in the multidimensional data to obtain cube data belonging to the first level;

S105: if the multidimensional data belongs to the second level, aggregating the time value on a target time level of the time dimension data in the multidimensional data and the remaining specified dimension data in the multidimensional data to obtain cube data belonging to the second level.

When multidimensional data belonging to the first level is aggregated, the time values on all time levels of the time dimension data participate in the aggregation. For example, when multidimensional data within 2018-07-24-10-10-10 to 2018-08-23-10-10-10 is aggregated, the time values on the year, month, day, hour, minute and second levels all participate in the aggregation, so that more accurate results can be obtained in subsequent queries.

When multidimensional data belonging to the second level is aggregated, not all time values on the time levels participate in the aggregation; for example, only the time values on target time levels such as year, month and day participate, so that cube data with the same year, month and day can be merged into one piece of cube data. For data earlier than 2018-07-24-10-10-10, users are generally not interested in aggregate statistics at the hour, minute and second levels, so precise statistics are unnecessary; this can greatly reduce the storage space required by cube data and improve the query speed.
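The difference between the two aggregation modes can be sketched as follows. The sketch assumes each record is a `(timestamp, commodity, price)` tuple, that the price is summed, and that the target time levels of the second level are year, month and day; these choices and the function names are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime


def time_key(ts: datetime, level: str) -> str:
    """First level keeps all time levels; second level keeps year, month and day."""
    if level == "first":
        return ts.strftime("%Y-%m-%d-%H-%M-%S")
    return ts.strftime("%Y-%m-%d")


def aggregate(records, level: str) -> dict:
    """Group by (time key, commodity) and sum the price."""
    totals = defaultdict(int)
    for ts, commodity, price in records:
        totals[(time_key(ts, level), commodity)] += price
    return dict(totals)


# Illustrative records only.
records = [
    (datetime(2018, 8, 22, 10, 10, 10), "book", 20),
    (datetime(2018, 8, 22, 10, 10, 11), "book", 15),
    (datetime(2018, 8, 22, 10, 10, 10), "pen", 3),
]
first_level_cubes = aggregate(records, "first")
second_level_cubes = aggregate(records, "second")
```

With these records, the first-level aggregation keeps three separate entries, while the second-level aggregation collapses the two book purchases of the same day into one entry.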

In a more specific embodiment, referring to fig. 4, when a batch of multidimensional data is obtained from a data source, the process of determining the data hierarchy to which the multidimensional data belongs may specifically include the following steps:

21) Judging whether a specified time interval exists; if so, executing step 23), otherwise executing step 22);

the specified time interval may be set when a batch of multidimensional data is first received after each device start-up;

22) Acquiring the time dimension data of the first piece of multidimensional data of the batch; generating a specified time interval according to the acquired time dimension data and a preset time dimension data length, for example taking the acquired time dimension data as the larger endpoint time of the specified time interval and taking the preset time dimension data length as the interval length of the specified time interval; acquiring the remaining multidimensional data of the batch, and then executing step 23);

23) Judging whether the time dimension data of the multidimensional data is within the specified time interval; if so, executing step 27), otherwise executing step 24);

24) Judging whether the time dimension data of the multidimensional data is smaller than the smaller endpoint time of the specified time interval; if so, executing step 25), otherwise executing step 26);

25) Determining the data level to which the multidimensional data belongs as the second level; if all multidimensional data of the current batch has been processed, ending, waiting for the next batch of data to arrive and then executing step 21); otherwise, returning to step 23);

26) Adjusting the specified time interval according to the time dimension data of the multidimensional data, for example setting the larger endpoint time of the specified time interval to the time dimension data of the multidimensional data without changing the interval length, and then executing step 27);

27) Determining the data level to which the multidimensional data belongs as the first level; if all multidimensional data of the current batch has been processed, ending, waiting for the next batch of data to arrive and then returning to step 21); otherwise, returning to step 23).
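Steps 21) to 27) can be sketched for one batch as follows. The sketch assumes a 30-day interval length, inclusive endpoints, and that only timestamps are passed in; the function name `classify_batch` is hypothetical.

```python
from datetime import datetime, timedelta


def classify_batch(timestamps, interval, length: timedelta):
    """Return (levels, interval) after classifying every timestamp of one batch."""
    levels = []
    for ts in timestamps:
        if interval is None:                 # steps 21) and 22): create the interval
            interval = (ts - length, ts)
        start, end = interval
        if start <= ts <= end:               # step 23) then 27)
            levels.append("first")
        elif ts < start:                     # step 24) then 25)
            levels.append("second")
        else:                                # step 26): move the larger endpoint to ts
            interval = (ts - length, ts)
            levels.append("first")           # step 27)
    return levels, interval


# Illustrative batch only.
batch = [
    datetime(2018, 8, 21, 10, 10, 10),   # creates the initial interval
    datetime(2018, 7, 25, 10, 10, 10),   # inside the interval
    datetime(2018, 7, 21, 10, 10, 10),   # earlier than the smaller endpoint
    datetime(2018, 8, 23, 10, 10, 10),   # later than the larger endpoint
    datetime(2018, 7, 23, 10, 10, 10),   # earlier than the adjusted smaller endpoint
]
levels, interval = classify_batch(batch, None, timedelta(days=30))
```

The returned interval would be passed in again for the next batch, so that step 21) finds an existing specified time interval.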

The present invention also provides a multidimensional data processing apparatus. With continued reference to fig. 2, the multidimensional data processing apparatus 10 may comprise:

the aggregation processing module 100, configured to determine the data level to which the read multidimensional data belongs, and aggregate the multidimensional data according to the data level to obtain cube data belonging to the corresponding data level;

the local caching module 200, configured to store the cube data into a local target level cache area, where the target level cache area is the local level cache area corresponding to the data level to which the cube data belongs;

and the data transfer module 300, configured to determine whether the amount of data cached in the target level cache area reaches a specified threshold, and if so, transfer the data cached in the target level cache area to the level database table of the target database corresponding to the data level.

In one embodiment, the local cache module includes:

In one embodiment, the local cache module further includes:

In one embodiment, the aggregation processing module includes:

In one embodiment, the hierarchy determining second unit further comprises:

In one embodiment, the aggregation processing module further comprises:

For the implementation of the functions and roles of each unit in the above apparatus, refer to the implementation of the corresponding steps in the above method; details are not repeated here.

Since the apparatus embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for relevant details. The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units.

The invention also provides an electronic device, which comprises a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the multidimensional data processing method as described in the foregoing embodiments.

Embodiments of the multidimensional data processing apparatus of the present invention may be applied to an electronic device. Taking software implementation as an example, the apparatus in a logical sense is formed by the processor of the electronic device in which it is located reading the corresponding computer program instructions from the non-volatile memory into memory and running them. In terms of hardware, fig. 5 is a hardware structure diagram of the electronic device in which the multidimensional data processing apparatus 10 according to an exemplary embodiment of the present invention is located. In addition to the processor 510, the memory 530, the interface 520 and the non-volatile memory 540 shown in fig. 5, the electronic device in which the apparatus 10 is located may generally include other hardware according to its actual functions, which is not described herein.

The present invention also provides a machine-readable storage medium having stored thereon a program which, when executed by a processor, causes an electronic device to implement a multi-dimensional data processing method as described in the previous embodiments.

The present invention may take the form of a computer program product embodied on one or more storage media (including, but not limited to, magnetic disk storage, CD-ROM, optical storage, etc.) having program code embodied therein. Machine-readable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of machine-readable storage media include, but are not limited to: phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

The foregoing describes merely preferred embodiments of the invention and is not intended to limit the invention; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A method of multidimensional data processing, comprising:

determining the data level to which the read multidimensional data belongs, comprising: judging whether time dimension data in the multidimensional data is within a specified time interval; if so, determining the data level to which the multidimensional data belongs as a first level; if not, judging whether the time dimension data in the multidimensional data is smaller than the smaller endpoint time of the specified time interval; if so, determining the data level to which the multidimensional data belongs as a second level; and if not, adjusting the specified time interval according to the time dimension data so that the time dimension data is within the adjusted specified time interval, and determining the data level to which the multidimensional data belongs as the first level;

the specified time interval has a smaller end point time and a larger end point time, the smaller end point time is a left end point value of the specified time interval, and the larger end point time is a right end point value of the specified time interval;

aggregating the multidimensional data according to the data hierarchy to obtain cube data belonging to the corresponding data hierarchy;

2. The method of multidimensional data processing as recited in claim 1, wherein storing the cube data in a local target level cache includes:

3. The method of multidimensional data processing as recited in claim 2, wherein storing the cube data in a local target level cache, further comprising:

4. The multi-dimensional data processing method of claim 1, wherein after adjusting the specified time interval in accordance with the time dimension data, the method further comprises:

and deleting the stored cube data which belongs to the first level and whose corresponding multidimensional data is not within the adjusted specified time interval, together with the data related to the cube data.

5. The method of multidimensional data processing according to claim 1, wherein aggregating the multidimensional data according to the data hierarchy to obtain cube data belonging to a corresponding data hierarchy comprises:

6. A multi-dimensional data processing apparatus, comprising:

the aggregation processing module comprises:

A hierarchy determining second unit, configured to determine whether time dimension data in the multidimensional data is smaller than a smaller end time of the specified time interval, determine a data hierarchy to which the multidimensional data belongs as a second hierarchy if the time dimension data is smaller than the smaller end time of the specified time interval, and adjust the specified time interval according to the time dimension data so that the time dimension data is within the adjusted specified time interval if the time dimension data is not smaller than the smaller end time of the specified time interval, and determine the data hierarchy to which the multidimensional data belongs as a first hierarchy;

7. The multi-dimensional data processing device of claim 6, wherein the local caching module comprises:

8. The multi-dimensional data processing device of claim 7, wherein the local caching module further comprises:

9. The multi-dimensional data processing device of claim 6, wherein the hierarchy determining second unit further comprises:

and the hierarchy determining subunit is used for deleting the stored cube data which belongs to the first hierarchy and is not in the adjusted specified time interval in the corresponding multidimensional data and the data related to the cube data after the specified time interval is adjusted according to the time dimension data.

10. The multi-dimensional data processing device of claim 6, wherein the aggregate processing module further comprises:

11. An electronic device, comprising a processor and a memory; the memory stores a program that can be called by the processor; wherein the processor, when executing the program, implements the multi-dimensional data processing method according to any one of claims 1-5.

12. A machine readable storage medium having stored thereon a program which, when executed by a processor, implements a multi-dimensional data processing method according to any of claims 1-5.