CN113961582A - Data processing method, device, equipment and storage medium - Google Patents


Info

Publication number
CN113961582A
Authority
CN
China
Prior art keywords
data
result
sql
analysis
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010700369.9A
Other languages
Chinese (zh)
Inventor
张向阳
高阳
黄皎
罗川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd
Priority to CN202010700369.9A
Publication of CN113961582A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/244 Grouping and aggregation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching

Abstract

The invention discloses a data processing method, apparatus, device and storage medium. The method comprises the following steps: acquiring a query factor corresponding to a query request; generating query SQL based on the acquired query factor; obtaining an analysis result of the query SQL, and determining Cube (data cube) metadata corresponding to the query request based on the analysis result; and obtaining result data corresponding to the query request from a result database based on the Cube metadata corresponding to the query request. Embodiments of the invention support obtaining the result data of multi-dimensional analysis without requiring business personnel to write different SQL statements for different dimensions, are simple to operate, and help improve the timeliness of multi-dimensional analysis.

Description

Data processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data analysis, and in particular, to a data processing method, apparatus, device, and storage medium.
Background
With the advent of the big data era, the demand for data analysis keeps growing. In the related art, multi-dimensional analysis of data is often based on a data cube (Cube): a Cube is established for a data set and represents that data set as described by a plurality of dimensions, each dimension reflecting one business perspective on the data set. Multi-dimensional analysis of data is performed in one of the following modes:
in the first mode, data is written into a database, and query analysis is performed by writing SQL (Structured Query Language) statements;
in the second mode, a third-party tool, such as Kylin, is used to perform index calculation.
For the first mode, business personnel need to write different SQL statements for different dimensions, which involves a heavy workload and demands a high level of expertise from the business personnel. In addition, multi-dimensional analysis of massive data is prone to performance problems, so performance and timeliness are difficult to guarantee.
For the second mode, because the operators provided by the third-party tool Kylin are limited, complex index algorithms are poorly supported, and incremental calculation on batch data is not supported.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data processing method, apparatus, device, and storage medium, which aim to solve the problems that multi-dimensional analysis of data is cumbersome because SQL statements must be written by hand, or that complex index algorithms are difficult to support.
The technical solution of the embodiments of the present invention is implemented as follows:
An embodiment of the present invention provides a data processing method, including:
acquiring a query factor corresponding to a query request;
generating query SQL based on the acquired query factor;
acquiring an analysis result of the query SQL, and determining data cube metadata corresponding to the query request based on the analysis result;
and obtaining result data corresponding to the query request from a result database based on the data cube metadata corresponding to the query request.
An embodiment of the present invention further provides a data processing apparatus, including:
an acquisition module, configured to acquire a query factor corresponding to a query request;
a generation module, configured to generate query SQL based on the acquired query factor;
a determination module, configured to acquire an analysis result of the query SQL and determine data cube metadata corresponding to the query request based on the analysis result;
and a query module, configured to obtain result data corresponding to the query request from a result database based on the data cube metadata corresponding to the query request.
An embodiment of the present invention further provides a data processing apparatus, including: a processor and a memory for storing a computer program capable of running on the processor, wherein the processor, when running the computer program, is adapted to perform the steps of the method according to any of the embodiments of the present invention.
The embodiment of the invention also provides a storage medium, wherein a computer program is stored on the storage medium, and when the computer program is executed by a processor, the steps of the method of any embodiment of the invention are realized.
According to the technical solution provided by the embodiments of the present invention, query SQL is generated based on the acquired query factor; an analysis result of the query SQL is obtained, and Cube metadata corresponding to the query request is determined based on the analysis result; and result data corresponding to the query request is obtained from a result database based on the Cube metadata corresponding to the query request. The solution supports obtaining the result data of multi-dimensional analysis without requiring business personnel to write different SQL statements for different dimensions, is simple to operate, and helps improve the timeliness of multi-dimensional analysis. In addition, metric indexes corresponding to different algorithms can be generated based on the query factor, thereby supporting multi-dimensional analysis requirements.
Drawings
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating multi-dimensional analysis of source data based on Cube according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of creating Cube of source data according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating SQL execution processing according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 6A and FIG. 6B are schematic structural diagrams of a data processing apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
An embodiment of the present invention provides a data processing method. As shown in FIG. 1, the method includes:
Step 101, acquiring a query factor corresponding to a query request;
Step 102, generating query SQL based on the acquired query factor;
Step 103, acquiring an analysis result of the query SQL, and determining Cube metadata corresponding to the query request based on the analysis result;
Step 104, obtaining result data corresponding to the query request from a result database based on the Cube metadata corresponding to the query request.
Here, a Cube refers to a data set oriented to the same business topic. For example, for each dimension combination of the data of a business topic, an aggregation operation is performed on a metric, and the result of the aggregation operation is stored as a materialized view (also called a Cuboid). The Cuboids of all dimension combinations of the business topic, taken as a whole, are called a Cube; that is, a Cube is a set of materialized views aggregated by dimension.
In the embodiments of the present invention, the query factor includes: dimensions, metrics, and operators. Here, a dimension is an angle from which data is observed, such as time, city or network element, and is generally a set of discrete values; a metric is numerical information that can be precisely quantified, such as response delay or network traffic, and is generally a continuous value; an operator is the algorithm of an aggregation operation performed on a metric along a dimension, for example, accumulation, average, maximum or minimum.
The data processing method provided by the embodiments of the present invention supports obtaining the result data of multi-dimensional analysis without requiring business personnel to write different SQL statements for different dimensions, is simple to operate, and helps improve the timeliness of multi-dimensional analysis. In addition, metric indexes corresponding to different algorithms can be generated based on dimensions, metrics and operators, thereby supporting multi-dimensional analysis requirements. For example, with time as the dimension, data traffic as the metric, and the accumulation function as the operator, data traffic values at different levels of the time dimension, such as week, month and quarter, can be queried.
In some embodiments, the data processing method further comprises:
performing multi-dimensional analysis on the source data based on the data cube to obtain an analysis result;
storing the analysis results in the results database.
In the embodiments of the present invention, multi-dimensional analysis is performed based on the Cube, so the analysis results have corresponding aggregation relations. The query request carries the dimension, the metric and the operator; query SQL is generated from them and then parsed to obtain the Cube metadata corresponding to the query request, that is, the fact table, dimension table, dimension, metric and operator corresponding to the query request are determined, so that the mapped result data can be located in the result database based on the Cube metadata. Here, a fact table is a table that stores fact records, such as system logs or sales records; the records of a fact table grow continuously, so it is usually much larger than other tables. A dimension table, also called a lookup table, is a table associated with the fact table; it stores the attribute values of dimensions and can be joined with the fact table. Using a dimension table is equivalent to extracting attributes that recur frequently in the fact table, normalizing them, and managing them in a separate table. Common dimension tables include a date table (storing attributes such as the week, month and quarter corresponding to a date) and a place table (storing attributes such as country, province/state and city).
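The relationship between a fact table and a dimension table can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names (fact_traffic, dim_date), column names and sample rows are hypothetical and chosen only for illustration; they do not come from the embodiments.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one fact table and one dimension table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Dimension (lookup) table: attributes of each date.
        CREATE TABLE dim_date (
            date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT
        );
        -- Fact table: one row per traffic record, referencing dim_date.
        CREATE TABLE fact_traffic (date_id INTEGER, city TEXT, traffic_mb REAL);

        INSERT INTO dim_date VALUES
            (1, '2020-01-15', '2020-01', '2020Q1'),
            (2, '2020-02-10', '2020-02', '2020Q1'),
            (3, '2020-04-03', '2020-04', '2020Q2');
        INSERT INTO fact_traffic VALUES
            (1, 'Suzhou', 10.0), (1, 'Beijing', 5.0),
            (2, 'Suzhou', 7.5), (3, 'Suzhou', 2.5);
        """
    )
    return conn


def traffic_by_quarter(conn):
    """Join the fact table with the date dimension and sum traffic per quarter."""
    return conn.execute(
        """
        SELECT d.quarter, SUM(f.traffic_mb)
        FROM fact_traffic AS f
        JOIN dim_date AS d ON f.date_id = d.date_id
        GROUP BY d.quarter
        ORDER BY d.quarter
        """
    ).fetchall()
```

Because the quarter attribute lives only in the dimension table, the fact table stays narrow, and the same fact rows can be regrouped by day, month or quarter simply by changing the grouped column.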
In some embodiments, a user may input the dimensions, metrics, and operators corresponding to a query request through an input unit on a display interface of the data processing device, for example, by filling in a form or by dragging graphical elements. The data processing device receives the query request input by the user and generates query SQL based on the dimensions, metrics, and operators corresponding to the query request.
In some embodiments, generating a query SQL based on the obtained query factor comprises:
encapsulating the metric and the operator into a metric index;
generating query SQL based on the metric index and the dimension.
Here, the metric index indicates how a metric of the result data is derived from a fact table or a dimension table; query SQL can then be generated based on the metric index and the dimension. For example, if the metric is data traffic, the operator is an accumulation algorithm, and the dimension is year, query SQL for computing the cumulative annual traffic is generated.
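The generation of query SQL from a metric index and a dimension might be sketched as follows. This is only an illustrative sketch: the MetricIndex class, the OPERATOR_LIBRARY mapping, the generate_query_sql function and the table and column names are all hypothetical, and the exact SQL produced by the embodiments is not specified in the text.

```python
from dataclasses import dataclass

# Hypothetical operator library: operator name -> SQL function template.
OPERATOR_LIBRARY = {
    "SUM": "SUM({column})",
    "AVG": "AVG({column})",
    "MAX": "MAX({column})",
    "MIN": "MIN({column})",
}


@dataclass(frozen=True)
class MetricIndex:
    """A metric (column) packaged together with the operator applied to it."""

    metric: str
    operator: str

    def to_sql(self):
        template = OPERATOR_LIBRARY[self.operator.upper()]
        alias = f"{self.operator.lower()}_{self.metric}"
        return f"{template.format(column=self.metric)} AS {alias}"


def generate_query_sql(table, dimensions, metric_indexes):
    """Build a GROUP BY query from dimensions and metric indexes.

    Identifiers are interpolated as-is; a real system would validate them
    against known metadata before building the statement.
    """
    dims = ", ".join(dimensions)
    metrics = ", ".join(index.to_sql() for index in metric_indexes)
    return f"SELECT {dims}, {metrics} FROM {table} GROUP BY {dims}"
```

For the example in the text (metric: data traffic, operator: accumulation, dimension: year), calling generate_query_sql("traffic", ["year"], [MetricIndex("data_traffic", "SUM")]) yields a statement that sums data_traffic grouped by year.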
In some embodiments, parsing the query SQL to determine the Cube metadata corresponding to the query request includes: determining the fact table, dimension, metric and operator corresponding to the query request. Accordingly, obtaining the result data corresponding to the query request from the result database based on the Cube metadata includes:
determining, based on the fact table, dimension, metric and operator corresponding to the query request, the result table to which the query request is mapped, so as to obtain the result data.
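One way to picture the mapping from Cube metadata to a result table is a lookup keyed by fact table, dimensions, metric and operator. The sketch below assumes the query SQL has already been parsed into these four elements; the CubeKey and ResultCatalog names and the result table name are hypothetical.

```python
from typing import FrozenSet, NamedTuple, Optional


class CubeKey(NamedTuple):
    """The parsed elements of a query request that identify a result table."""

    fact_table: str
    dimensions: FrozenSet[str]
    metric: str
    operator: str


def make_key(fact_table, dimensions, metric, operator):
    # Dimension order does not matter, so dimensions are stored as a frozenset.
    return CubeKey(fact_table, frozenset(dimensions), metric, operator.upper())


class ResultCatalog:
    """Maps Cube metadata to the name of the result table that holds the data."""

    def __init__(self):
        self._tables = {}

    def register(self, key: CubeKey, result_table: str) -> None:
        self._tables[key] = result_table

    def resolve(self, key: CubeKey) -> Optional[str]:
        # Returns None when no result table has been computed for this key.
        return self._tables.get(key)
```

With such a catalog, a query request never scans the source data: it is resolved to a precomputed result table, and only that table is read.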
In practical applications, the result data may be exported or displayed on an interface, for example, in the form of a table or a chart.
In some embodiments, as shown in FIG. 2, performing multi-dimensional analysis on the source data based on the Cube to obtain an analysis result includes:
Step 201, generating task metadata based on the Cube metadata of the Cube of the source data;
Step 202, generating a scheduling task based on the task metadata;
Step 203, parsing the scheduling task to generate execution SQL;
Step 204, processing the execution SQL to obtain the analysis result of the multi-dimensional analysis.
Here, generating the task metadata based on the Cube metadata of the Cube of the source data includes:
creating the Cube of the source data based on initial parameters, and storing the metadata of the Cube;
and generating the metadata of the task to be scheduled based on the metadata of the Cube.
In some embodiments, creating the Cube of the source data based on the initial parameters includes at least one of:
determining that an analysis project is missing, and creating the corresponding analysis project;
determining that the model of the Cube is missing under the analysis project, creating the corresponding model, and creating the Cube based on that model;
and determining that an operator is missing, creating the corresponding operator, and storing it in an operator library.
As shown in FIG. 3, creating a Cube for source data includes:
Step 301, judging whether the analysis project exists; if so, executing step 303, and if not, executing step 302;
here, the user may trigger the creation of the Cube through a REST (Representational State Transfer) interface. The data processing device determines whether the analysis project exists, and if so, skips step 302 and directly executes step 303.
Step 302, creating the analysis project;
the data processing device creates the analysis project, where the analysis project (project) gives a brief description of the entire data analysis, and the description may include the name of the project, an introduction to the project, the creator of the project, and so on.
Step 303, judging whether the model of the Cube is missing under the analysis project; if so, executing step 304, and if not, executing step 305;
here, the model (model) is used to describe the Cube, and the description may include: name, profile, subject, fact table, dimension columns, metric columns, etc. If the model already exists, step 304 is skipped and step 305 is executed directly.
Step 304, creating the model;
the data processing device creates the model from the initial parameters of the model.
Step 305, generating the Cube based on the model;
here, for each combination of dimensions, the metrics are aggregated and the result of the operation is saved as a materialized view called a Cuboid. The Cuboids of all dimension combinations, taken as a whole, are called the Cube. The description of the Cube may include: name, owning model, calculation engine, storage engine, dimensions, metrics, etc.
Step 306, judging whether an operator is missing; if so, executing step 307, and otherwise executing step 308;
Step 307, adding the operator, and storing it in the operator library;
if the data processing device determines that an operator is missing when the Cube is created, it may add the operator according to a set rule and identify it; for example, the operator name may serve as the unique identifier of the operator: if an operator with that name already exists in the operator library, it is not added again; otherwise, it is added to the operator library. The rule of an operator may include: the operator name, a description of the operator's function, and the SQL function corresponding to the operator. Taking the SUM operator as an example, the rule includes: name: SUM; function description: accumulation; corresponding SQL function: sum(column). In practical applications, it may also be checked whether a newly added operator satisfies the set rule, and only operators that satisfy it are added to the operator library.
Step 308, ending.
According to the embodiment of the invention, whether operators are missing or not is detected in the Cube creating process, so that the algorithm can be enriched, the operator types can be expanded, and the multi-dimensional data analysis requirements can be supported.
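The creation flow of FIG. 3 (steps 301 to 308) can be approximated by the following sketch, which uses in-memory dictionaries in place of the storage engine. The MetadataStore class, the create_cube function and the rule keys (name, description, sql_function) are hypothetical names introduced for illustration, reading step 303 as "create the model only if it is missing".

```python
from dataclasses import dataclass, field

# Assumed shape of an operator rule: name, function description, SQL function.
REQUIRED_RULE_KEYS = ("name", "description", "sql_function")


@dataclass
class MetadataStore:
    """In-memory stand-in for the storage engine."""

    projects: dict = field(default_factory=dict)
    models: dict = field(default_factory=dict)
    cubes: dict = field(default_factory=dict)
    operator_library: dict = field(default_factory=dict)


def is_valid_rule(rule):
    """An operator rule is accepted only if every required field is non-empty."""
    return all(rule.get(key) for key in REQUIRED_RULE_KEYS)


def create_cube(store, project, model, cube_name, operator_rules):
    """Follow steps 301-308 and return the list of actions actually performed."""
    performed = []
    if project not in store.projects:  # steps 301-302
        store.projects[project] = {"name": project}
        performed.append("create_project")
    model_key = (project, model["name"])
    if model_key not in store.models:  # steps 303-304
        store.models[model_key] = model
        performed.append("create_model")
    # Step 305: generate the Cube from the model.
    store.cubes[(project, cube_name)] = {"name": cube_name, "model": model["name"]}
    performed.append("generate_cube")
    for rule in operator_rules:  # steps 306-307
        # The operator name is the unique identifier; existing names are skipped.
        if is_valid_rule(rule) and rule["name"] not in store.operator_library:
            store.operator_library[rule["name"]] = rule
            performed.append("add_operator:" + rule["name"])
    return performed  # step 308
```

Calling create_cube twice with the same arguments performs only the Cube generation step the second time, because the project, the model and the SUM operator already exist.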
In practical applications, after creating the Cube, the data processing device determines whether a corresponding Cube already exists; if so, the newly created Cube is discarded, and if not, the Cube metadata of the Cube is stored. Here, the Cube metadata refers to data describing the Cube, and may include: owning project, owning model, fact table, dimension tables, calculation engine, storage engine, dimension columns, metric columns, and the like.
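The Cuboids that make up a Cube (one materialized view per combination of dimensions, as described for step 305) can be enumerated with a short sketch. It assumes that every non-empty combination of dimensions is materialized and that the operator is accumulation; whether a grand-total view with no dimensions is also kept is not stated in the text, so it is omitted here. The function names and sample rows are hypothetical.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Return every non-empty combination of dimensions, widest first."""
    dims = list(dimensions)
    return [
        combo
        for size in range(len(dims), 0, -1)
        for combo in combinations(dims, size)
    ]


def materialize(rows, cuboid, metric):
    """Aggregate (sum) the metric over one combination of dimensions."""
    totals = {}
    for row in rows:
        key = tuple(row[dim] for dim in cuboid)
        totals[key] = totals.get(key, 0) + row[metric]
    return totals


def build_cube(rows, dimensions, metric):
    """A Cube as a mapping from each Cuboid to its aggregated view."""
    return {
        cuboid: materialize(rows, cuboid, metric)
        for cuboid in enumerate_cuboids(dimensions)
    }
```

With N dimensions this produces 2^N - 1 Cuboids, which is why the choice of dimensions in the model directly determines the storage and calculation cost of the Cube.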
In some embodiments, generating metadata for a task to be scheduled based on Cube metadata includes:
the data processing device packages the Cube metadata into task (job) metadata and stores the task metadata for use by the scheduling engine.
In some embodiments, generating the scheduled task based on the task metadata includes:
and generating a scheduling task based on stream processing or batch processing by using the metadata of the task to be scheduled.
Here, the scheduling engine of the data processing apparatus may generate the scheduling task based on a stream process or a batch process.
In some embodiments, the scheduling engine generates the scheduling task based on stream processing, including:
and the scheduling engine captures the tasks from the storage engine in real time and directly calls the calculation engine for processing.
In some embodiments, the scheduling engine generates the scheduled tasks based on a batch process, including:
and the scheduling engine parses the job metadata to obtain a scheduling period and a job state, encapsulates the scheduling task according to the scheduling period, and, if the job state is executable, transmits the encapsulated scheduling task to the calculation engine.
The embodiment of the invention can support two processing modes of batch processing and stream processing, thereby carrying out multi-dimensional data analysis on different data types.
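The two scheduling paths can be sketched as a single function that turns job metadata into a scheduling task. The field names (job_id, mode, state, period_seconds) and the SchedulingTask class are assumptions made for illustration; the text only states that batch jobs carry a scheduling period and a job state, and that stream jobs are passed on directly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchedulingTask:
    job_id: str
    mode: str
    period_seconds: Optional[int]


def schedule(job_metadata):
    """Turn job metadata into a scheduling task, or None if it must not run."""
    mode = job_metadata["mode"]
    if mode == "stream":
        # Stream jobs are handed to the calculation engine as soon as captured.
        return SchedulingTask(job_metadata["job_id"], "stream", None)
    if mode == "batch":
        # Batch jobs are forwarded only when their state is executable.
        if job_metadata.get("state") != "executable":
            return None
        return SchedulingTask(
            job_metadata["job_id"], "batch", job_metadata["period_seconds"]
        )
    raise ValueError(f"unknown processing mode: {mode!r}")
```

Returning None for a non-executable batch job keeps the decision in one place, so the calculation engine only ever receives tasks that are ready to run.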
In some embodiments, parsing the scheduling task to generate the execution SQL includes:
parsing the scheduling task to obtain a wide table and a result table corresponding to the Cube of the source data, wherein the wide table represents the dimensions and metrics of the source data, and the result table represents the analysis result of the multi-dimensional analysis;
generating the execution SQL based on the wide table and the result table, the execution SQL comprising table-creation SQL and summary SQL, wherein the table-creation SQL is generated based on the correspondence between the wide table and the result table, the summary SQL is generated based on the generation rule of the metric index in the result table, and the metric index is determined based on the metric and the corresponding operator.
In some embodiments, the calculation engine of the data processing device parses, from the scheduling task transmitted by the scheduling engine, parameters such as the metric columns, the dimension columns, the scheduling period and the summarizing steps, as well as the association rules between the metric columns and the dimension columns. Based on these association rules, the calculation engine joins the fact table and the plurality of dimension tables into one wide table, whose name is generated automatically according to a set naming rule. The result table is the storage table for data analysis results, and each data dimension corresponds to one result table. From the scheduling task, the calculation engine also parses the dimension indexes, the metric indexes and the operator expressions of the metric indexes of the result table, together with information such as the data source table and the calculation steps of the result table, and automatically generates the name of the result table according to a set rule.
The calculation engine generating the execution SQL based on the wide table and the result table includes:
automatically generating the table-creation SQL according to the table names, metric indexes and dimension indexes of the parsed wide table and result table;
obtaining, from the parsing of the result table, the dimension indexes, the metric indexes, the data source table and the operator expressions of the metric indexes of the result table, resolving each operator expression against the operator library to obtain the specific operator, and assembling this information into a calculation process, thereby obtaining the summary SQL.
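The two kinds of execution SQL can be sketched as string builders and checked against an in-memory SQLite database. The function names, the column types (TEXT for dimensions, REAL for metric indexes), the table names and the representation of a metric index as a (name, metric, operator) tuple are all assumptions for illustration; the embodiments do not specify the generated SQL.

```python
import sqlite3


def build_create_table_sql(result_table, dimensions, metric_indexes):
    """Table-creation SQL: one column per dimension and per metric index."""
    columns = [f"{dim} TEXT" for dim in dimensions]
    columns += [f"{name} REAL" for name, _metric, _operator in metric_indexes]
    return f"CREATE TABLE IF NOT EXISTS {result_table} ({', '.join(columns)})"


def build_summary_sql(wide_table, result_table, dimensions, metric_indexes, operator_library):
    """Summary SQL: aggregate the wide table into the result table."""
    dims = ", ".join(dimensions)
    exprs = ", ".join(
        # The operator name is resolved against the operator library.
        f"{operator_library[operator].format(column=metric)} AS {name}"
        for name, metric, operator in metric_indexes
    )
    return (
        f"INSERT INTO {result_table} "
        f"SELECT {dims}, {exprs} FROM {wide_table} GROUP BY {dims}"
    )


def run_demo():
    """Execute both statements on sample data and return the result table rows."""
    operator_library = {"SUM": "SUM({column})"}
    metric_indexes = [("total_traffic", "traffic_mb", "SUM")]
    dims = ["city", "month"]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wide_traffic (city TEXT, month TEXT, traffic_mb REAL)")
    conn.executemany(
        "INSERT INTO wide_traffic VALUES (?, ?, ?)",
        [("Suzhou", "2020-01", 10.0), ("Suzhou", "2020-01", 5.0), ("Beijing", "2020-01", 3.0)],
    )
    conn.execute(build_create_table_sql("result_city_month", dims, metric_indexes))
    conn.execute(
        build_summary_sql("wide_traffic", "result_city_month", dims, metric_indexes, operator_library)
    )
    return conn.execute("SELECT * FROM result_city_month ORDER BY city").fetchall()
```

Keeping the operator templates in a library means a new operator only requires a new library entry; the SQL builders themselves do not change.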
In some embodiments, as shown in FIG. 4, processing the execution SQL to obtain the analysis result of the multi-dimensional analysis includes:
Step 401, encapsulating executable tasks;
executable tasks (tasks) are encapsulated based on the execution SQL, the scheduling period and the summarizing steps.
Step 402, parsing the executable tasks;
the encapsulated executable tasks are parsed into a tree structure according to the execution steps of each task, where each node is a corresponding encapsulated task; this structure is called a task tree. The task tree is scanned layer by layer, with calculation starting from the root node; to make full use of the resources of the computing component, tasks in the same layer are executed concurrently, and tasks in different layers are executed serially.
In practical applications, to control task concurrency and prevent excessive concurrency from overloading the computing component, a task pool may be used to cache the tasks.
Step 403, analyzing the processing mode;
the processing mode of each task is analyzed to determine how the task is to be calculated, where the processing mode is either stream processing or batch processing.
Step 404, judging whether the task processing mode is batch processing or stream processing;
if the task processing mode is batch processing, executing step 405; if the task processing mode is stream processing, executing step 406.
Step 405, running the batch processing mode;
a batch processing interface is called to summarize, in real time, the data of the same batch of the task together with the historically calculated data.
Step 406, running the stream processing mode;
a stream processing interface is called to process the data of the new task in real time and summarize it with the historically stored data.
Step 407, executing the task;
according to the processing mode, the corresponding calculation component is selected to execute the calculation, the execution state is updated to the storage engine, and the summary result is stored in the storage engine.
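The layer-by-layer execution of the task tree described in step 402 can be sketched with Python's concurrent.futures module. Here a ThreadPoolExecutor with a bounded number of workers stands in for the task pool, which is an assumption; the TaskNode class and function names are hypothetical, and the batch/stream dispatch of steps 403 to 406 is left out.

```python
from concurrent.futures import ThreadPoolExecutor


class TaskNode:
    """One encapsulated task in the task tree."""

    def __init__(self, name, action, children=()):
        self.name = name
        self.action = action  # zero-argument callable that performs the task
        self.children = list(children)


def iter_layers(root):
    """Yield the nodes of the tree one layer at a time, starting at the root."""
    layer = [root]
    while layer:
        yield layer
        layer = [child for node in layer for child in node.children]


def run_task_tree(root, max_workers=4):
    """Run layers serially; run tasks within a layer concurrently.

    max_workers bounds the concurrency so the computing component is not
    overloaded.
    """
    layer_order = []
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for layer in iter_layers(root):
            layer_order.append([node.name for node in layer])
            futures = [(node.name, pool.submit(node.action)) for node in layer]
            # Waiting for every future here makes the next layer start only
            # after the current layer has finished.
            for name, future in futures:
                results[name] = future.result()
    return layer_order, results
```

Collecting all futures of a layer before moving on is what enforces the serial ordering between layers, while the shared pool caps how many tasks of one layer run at once.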
Here, the storage engine serves as the interaction medium between the engines and manages the metadata generated by the other engine modules and the data analysis results that are finally generated. In particular, the storage engine may store the Cube metadata, task metadata, result database, and operator library described above. Based on the processing of the execution SQL, the calculation engine may update the result tables in the result database and the task state in the task metadata. The result tables in the result database can be mapped to the Cube metadata, so that a query request can conveniently obtain the data of a result table.
The calculation engine provided by the embodiment of the invention supports incremental calculation of multiple dimensions on the source data, and obtains the summary value of the multi-dimensional measurement index in real time.
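Incremental calculation can be illustrated by merging the summary of a new batch into the stored historical summary instead of recomputing over all source data. This is a minimal sketch under the assumption that the stored summary keeps sum, count, max and min; the function names are hypothetical and the actual storage layout of the embodiments is not described.

```python
def summarize(values):
    """Summarize one batch of metric values."""
    values = list(values)
    if not values:
        raise ValueError("cannot summarize an empty batch")
    return {
        "sum": sum(values),
        "count": len(values),
        "max": max(values),
        "min": min(values),
    }


def merge_summaries(history, batch):
    """Combine the stored historical summary with the summary of a new batch."""
    if history is None:
        return dict(batch)
    return {
        "sum": history["sum"] + batch["sum"],
        "count": history["count"] + batch["count"],
        "max": max(history["max"], batch["max"]),
        "min": min(history["min"], batch["min"]),
    }


def average(summary):
    """The average is derived from sum and count rather than stored directly."""
    return summary["sum"] / summary["count"]
```

Sum, count, maximum and minimum can be merged directly, whereas two averages cannot be combined without their counts; storing sum and count keeps the average exact after any number of incremental merges.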
In the related art, the prediction of data is still limited to index prediction of a single dimension, and the diversified prediction requirements of the data are difficult to meet. Based on this, in the embodiment of the present invention, the data processing method further includes:
and predicting the result data corresponding to the query request based on historical data to obtain a corresponding predicted value.
In some embodiments, the predicting the result data corresponding to the query request based on historical data to obtain a corresponding predicted value includes:
acquiring historical data corresponding to the result data from a result database;
and importing the historical data and the result data into a prediction model, and analyzing a predicted value corresponding to the result data.
Here, creating the Cube of the source data based on the initial parameters may further include:
and configuring configuration parameters of the prediction model.
In this way, the prediction model can be trained on the historical data and the result data according to its configuration parameters; for example, a set time-series prediction algorithm is applied separately to the result tables of different dimensions to obtain prediction result tables, so that both the result tables of the multi-dimensional analysis and the predicted result data can be displayed through an interface, and index trends can be evaluated comprehensively. Here, the time-series prediction algorithm may be, for example, the Holt-Winters method (triple exponential smoothing).
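As an illustration of triple exponential smoothing, the following is a minimal sketch of the additive Holt-Winters recurrences in pure Python. It is not the prediction model of the embodiments: the function name, the simple initialization (level from the first season, trend from the first two seasons) and the choice of the additive rather than the multiplicative form are all assumptions, and the smoothing parameters alpha, beta and gamma correspond only loosely to the "configuration parameters of the prediction model" mentioned in the text.

```python
def holt_winters_additive(series, season_length, alpha, beta, gamma, horizon):
    """Forecast `horizon` future values with additive Holt-Winters smoothing."""
    m = season_length
    if m < 1 or len(series) < 2 * m:
        raise ValueError("need at least two full seasons of history")
    for name, value in (("alpha", alpha), ("beta", beta), ("gamma", gamma)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")

    # Simple initialization: level is the mean of the first season, trend is
    # the average per-step change between the first two seasons.
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    seasonal = [series[i] - level for i in range(m)]

    for t in range(m, len(series)):
        value = series[t]
        season = seasonal[t % m]
        previous_level = level
        # Level: blend the deseasonalized observation with the previous forecast.
        level = alpha * (value - season) + (1 - alpha) * (previous_level + trend)
        # Trend: blend the latest level change with the previous trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
        # Seasonal component for this phase of the season.
        seasonal[t % m] = gamma * (value - level) + (1 - gamma) * season

    n = len(series)
    return [
        level + h * trend + seasonal[(n + h - 1) % m]
        for h in range(1, horizon + 1)
    ]
```

To predict the result tables of different dimensions separately, such a function would be applied once per result table, with the metric index values ordered by time as the input series.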
In order to implement the method according to the embodiment of the present invention, an embodiment of the present invention further provides a data processing apparatus, where the data processing apparatus corresponds to the data processing method, and each step in the data processing method is also completely applicable to the embodiment of the data processing apparatus.
As shown in FIG. 5, the data processing apparatus includes: an acquisition module 501, a generation module 502, a determination module 503 and a query module 504. The acquisition module 501 is configured to acquire a query factor corresponding to a query request; the generation module 502 is configured to generate query SQL based on the acquired query factor; the determination module 503 is configured to obtain an analysis result of the query SQL, and determine Cube metadata corresponding to the query request based on the analysis result; the query module 504 is configured to obtain result data corresponding to the query request from a result database based on the Cube metadata corresponding to the query request.
In some embodiments, the data processing apparatus further includes an analysis processing module 505, configured to perform multi-dimensional analysis on the source data based on the Cube to obtain an analysis result, and to store the analysis result in the result database.
In some embodiments, the analysis processing module 505 is specifically configured to:
generate task metadata based on the Cube metadata of the Cube of the source data;
generate a scheduling task based on the task metadata;
parse the scheduling task to generate execution SQL;
and process the execution SQL to obtain the analysis result of the multi-dimensional analysis.
In some embodiments, the analysis processing module 505 generates task metadata based on Cube metadata of cubes of the source data, including:
creating a Cube of the source data based on the initial parameters, and storing metadata of the Cube;
and generating the metadata of the task to be scheduled based on the metadata of the Cube.
In some embodiments, the analysis processing module 505 generates a scheduling task based on the task metadata, including:
generating, by using the metadata of the task to be scheduled, a scheduling task based on stream processing or batch processing.
In some embodiments, the analysis processing module 505 parses the scheduling task to generate execution SQL, which includes:
parsing the scheduling task to obtain a wide table and a result table corresponding to the Cube of the source data, wherein the wide table is used for representing the dimensions and the measurements of the source data, and the result table is used for representing the analysis result of the multidimensional analysis;
generating execution SQL based on the wide table and the result table, the execution SQL comprising table-building SQL and summary SQL, wherein the table-building SQL is generated based on the correspondence between the wide table and the result table, the summary SQL is generated based on the generation rule of the metric index in the result table, and the metric index is determined based on a metric and its corresponding operator.
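As a minimal sketch of the two kinds of execution SQL, the following assumes that a result table is described by a list of dimension columns and a mapping from each metric index name to an (operator, metric column) pair, and that operators are rendered from a small operator library. The names `OPERATOR_LIBRARY`, `build_table_sql` and `build_summary_sql`, the column types, and the SQLite dialect are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical operator library: operator name -> SQL function template.
OPERATOR_LIBRARY = {"SUM": "sum({column})", "MAX": "max({column})"}


def build_table_sql(result_table, dimensions, metric_indexes):
    """Table-building SQL: one column per dimension and per metric index."""
    columns = [f"{d} TEXT" for d in dimensions]
    columns += [f"{name} REAL" for name in metric_indexes]
    return f"CREATE TABLE IF NOT EXISTS {result_table} ({', '.join(columns)})"


def build_summary_sql(wide_table, result_table, dimensions, metric_indexes):
    """Summary SQL: aggregate the wide table into the result table.

    metric_indexes maps a metric index name to (operator, metric column).
    """
    dims = ", ".join(dimensions)
    expressions = [
        OPERATOR_LIBRARY[op].format(column=col) + f" AS {name}"
        for name, (op, col) in metric_indexes.items()
    ]
    return (
        f"INSERT INTO {result_table} ({dims}, {', '.join(metric_indexes)}) "
        f"SELECT {dims}, {', '.join(expressions)} "
        f"FROM {wide_table} GROUP BY {dims}"
    )
```

Executing the table-building SQL first and the summary SQL second against a database that already contains the wide table fills the result table with one row per value combination of the chosen dimensions.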
In some embodiments, the analysis processing module 505 creates the Cube of the source data based on the initial parameters, including at least one of:
determining a missing analysis project and creating the corresponding analysis project;
determining a missing model for the Cube under the analysis project, creating the corresponding model, and creating the Cube based on the corresponding model;
and determining a missing operator, creating the corresponding operator, and storing it in an operator library.
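The handling of a missing operator may be illustrated by the following sketch, which assumes an in-memory operator library keyed by operator name; an operator is stored only when no operator of that name exists yet. The classes `Operator` and `OperatorLibrary` and the `{column}` template convention are hypothetical and serve only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str          # unique identifier of the operator, e.g. "SUM"
    description: str   # function description, e.g. "accumulation"
    sql_function: str  # SQL template, e.g. "sum({column})"


class OperatorLibrary:
    """Hypothetical in-memory stand-in for the operator library."""

    def __init__(self):
        self._operators = {}

    def add(self, operator):
        """Store the operator if it is missing; return True when it was added."""
        if operator.name in self._operators:
            return False  # already present: nothing is added
        self._operators[operator.name] = operator
        return True

    def render(self, name, column):
        """Render the SQL expression of a stored operator for one column."""
        return self._operators[name].sql_function.format(column=column)
```

For example, after adding an operator named `SUM` with the template `sum({column})`, rendering it for a column `traffic` yields the expression used in the summary SQL.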
In some embodiments, the data processing apparatus further comprises a prediction module 506, configured to predict the result data corresponding to the query request based on historical data to obtain a corresponding predicted value.
In some embodiments, the prediction module 506 is specifically configured to:
acquire historical data corresponding to the result data from the result database;
and import the historical data and the result data into a prediction model, and analyze them to obtain the predicted value corresponding to the result data.
In actual application, the obtaining module 501, the generating module 502, the determining module 503, the query module 504, the analysis processing module 505, and the prediction module 506 may be implemented by a processor in the data processing apparatus. Of course, the processor needs to run a computer program in memory to implement its functions.
It should be noted that: in the data processing apparatus provided in the above embodiment, when performing data processing, only the division of each program module is exemplified, and in practical applications, the processing may be distributed to different program modules according to needs, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the data processing apparatus and the data processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
The present invention will be described in further detail with reference to the following application examples.
As shown in fig. 6A and 6B, the data processing apparatus according to the present embodiment includes: model building engine, storage engine, visualization interface, scheduling engine, calculation engine, and prediction engine, wherein A, B, C, D, E in fig. 6A is connected to A, B, C, D, E in fig. 6B, respectively.
The following describes each module in the data processing apparatus:
First, model construction engine
The model construction engine is used for constructing, in a specified format, the information required by the whole multidimensional index analysis, including project construction, model construction, data cube (Cube) generation, prediction model construction and the like, and for transmitting the information to the other modules. The model construction engine is specifically configured for:
1) Creating a project: the project gives a brief description of the entire data analysis; the description includes the project name, project introduction, project creator, and the like.
2) Building a model: the model includes all dimension columns and measurement columns of the data, namely the viewing angles and the analysis indexes of the data. The description mainly includes the name, introduction, owning project, fact table, dimension columns, measurement columns, and the like. Wherein:
Fact table (Fact Table): a table storing fact records, such as system logs, sales records, and the like; the records of a fact table grow continuously, so its volume is usually much larger than that of other tables.
Dimension table (Dimension Table): a table corresponding to the fact table, which stores the attribute values of dimensions and can be associated with the fact table. It amounts to extracting and standardizing the attributes that recur frequently in the fact table and managing them in a single table. Common dimension tables include a date table (storing attributes such as the week, month, and quarter corresponding to a date), a place table (including attributes such as country, province/state, and city), and the like.
Dimension column: a viewing angle on the data, derived from the fact table or a dimension table; it is generally a discrete value used for analyzing the data from different angles, such as time, city, network element, and the like.
Measurement column: the index to be analyzed, derived from the fact table; it reflects the statistical result of the data under a given dimension and is generally a continuous value, such as user traffic, TCP response delay, and the like.
3) Generating the Cube: for each combination of dimensions, an aggregation operation is performed on the measurement indexes, and the result is stored as a materialized view called a Cuboid. The Cuboids of all dimension combinations, taken as a whole, are called the Cube. The description includes the name, owning model, calculation engine, storage engine, dimensions, measurements, and the like.
4) Constructing a prediction model: a prediction model is selected from existing time-series prediction algorithms such as a regression model, ARIMA, time-series decomposition, and cubic exponential smoothing (Holt-Winters), and the result tables and result table fields to be predicted are configured.
5) Submitting a job: the metadata of the Cube is packaged into job metadata and submitted to the storage engine for use by the scheduling engine.
6) Adding an operator: the user adds an operator according to the rules, with the operator name as the unique identifier; if the operator already exists it is not added, otherwise it is saved to the storage engine. The rules of an operator include the operator name, the operator function description, and the SQL function corresponding to the operator. For example, for the SUM operator: name: SUM; function description: accumulation; corresponding SQL function: sum(column).
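The relation between Cuboids and the Cube described in the list above can be made concrete with a short sketch. It assumes the source rows are dictionaries, that the only aggregation is a sum over one measurement column, and that the empty dimension combination (the grand total) is also counted as a Cuboid; the function name `build_cube` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, metric):
    """Return {dimension combination: {dimension values: summed metric}}.

    Each inner dictionary plays the role of one Cuboid; the outer dictionary
    plays the role of the Cube. For n dimensions there are 2**n combinations.
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            cuboid = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in combo)
                cuboid[key] += row[metric]
            cube[combo] = dict(cuboid)
    return cube
```

Because the number of Cuboids doubles with every added dimension, a practical system would presumably materialize only the combinations configured in the Cube metadata rather than all of them.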
Second, storage engine
The storage engine manages the metadata generated by the other engine modules and the final data analysis results, and serves as the interaction medium between the engines. The storage engine is specifically configured for:
1) Storing Cube metadata: the metadata of the Cube generated through the REQUEST interface is stored.
2) Storing job metadata: the job metadata generated through the REQUEST interface is stored for use by the scheduling engine; the scheduling engine submits the jobs that meet the scheduling conditions to the calculation engine for processing, and the calculation state is written back into the metadata after the job calculation is finished.
3) Storing result data: the result data generated by the calculation engine is stored and mapped to the Cube metadata, so that the query engine can look up the data analysis results through the mapping relation.
Third, calculation engine
The calculation engine further parses the job task, automatically generates a series of executable SQL tasks according to the parsing result, organizes the tasks into a task tree, selects a computing component to perform the calculation, and submits the calculation result to the storage engine. The calculation engine is specifically configured for:
1) Parameter parsing: parameters such as the measurement columns, dimension columns, scheduling period and summary steps, together with the association rules between the measurement columns and the dimension columns, are parsed from the job task passed in by the scheduling engine. This specifically comprises:
Wide table parsing: the wide table is a table containing all dimension indexes and measurement indexes. The association rules between the measurement columns and the dimension columns are parsed from the job information, so that the fact table and the multiple dimension tables are joined into one wide table. The wide table name is generated automatically according to a given naming rule.
Result table parsing: the result table is the storage table for data analysis results, and each data dimension corresponds to one result table. The dimension indexes, the measurement indexes and their operator expressions, the data source table, the calculation steps and other information of the result table are parsed from the job information. The name of the result table is generated automatically according to a predetermined rule.
2) Generating table-building SQL: the table-building SQL is generated automatically from the table names, measurement indexes and dimension indexes of the parsed wide table and result table.
3) Generating summary SQL: result table parsing yields the dimension indexes, measurement indexes, data source table, and operator expressions of the measurement indexes of the result table; the specific operators are obtained by looking up the operator expressions in the operator library, and this information is finally assembled into a calculation process, giving the summary SQL.
4) Packaging tasks: the table-building SQL, summary SQL, scheduling period, summary steps and other information are packaged into specific executable tasks.
5) Task parsing: the packaged tasks are parsed and arranged into a tree structure, called a task tree, according to the execution steps of each task; each node is a packaged task. The task tree is scanned layer by layer and calculation starts from the root node; to make full use of the resources of the computing component, tasks in the same layer are executed concurrently and tasks in different layers are executed serially.
6) Task pool: the task pool caches tasks and controls their concurrency, preventing excessive concurrency from overloading the computing component.
7) Processing mode parsing: the processing mode of each task is parsed to determine how the task is computed; the processing modes are stream processing and batch processing.
8) Batch processing mode: if the processing mode of the task is batch processing, the batch processing interface is called, and the data of the same batch of the task is summarized in real time together with the historically calculated data.
9) Stream processing mode: if the processing mode of the task is stream processing, the stream processing interface is called, and the data of the new task is processed in real time and summarized with the historically stored data.
10) Executing tasks: the corresponding computing component is selected according to the processing mode to execute the calculation, the execution state is updated in the storage engine, and the summary result is stored in the storage engine.
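The layer-by-layer execution of the task tree and the concurrency limit of the task pool may be sketched as follows. The sketch assumes that each packaged task is a callable attached to a tree node and uses a thread pool as a stand-in for the task pool; `TaskNode`, `run_task_tree` and `max_concurrency` are hypothetical names, and a real computing component would receive SQL tasks rather than Python callables.

```python
from concurrent.futures import ThreadPoolExecutor


class TaskNode:
    """One packaged task in the task tree (hypothetical structure)."""

    def __init__(self, name, action, children=()):
        self.name = name
        self.action = action            # callable that performs the task
        self.children = list(children)  # tasks of the next layer


def run_task_tree(root, max_concurrency=2):
    """Run tasks layer by layer: concurrent within a layer, serial across layers."""
    executed_layers = []
    layer = [root]
    # The pool size caps how many tasks run at once, as the task pool does.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        while layer:
            # list() waits for every task of this layer before moving on.
            list(pool.map(lambda node: node.action(), layer))
            executed_layers.append([node.name for node in layer])
            layer = [child for node in layer for child in node.children]
    return executed_layers
```

A typical tree under these assumptions has the wide table task at the root and one child per result table, so that all result tables are summarized concurrently once the wide table exists.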
Fourth, prediction engine
The prediction engine predicts the result table data obtained by the calculation engine: using the prediction model configured in the model construction engine and the time-series prediction algorithm set by the user, it predicts the result tables of different granularities separately and stores the results in prediction result tables.
Fifth, scheduling engine
The scheduling engine calls the calculation engine at scheduled times to execute the job tasks of batch processing; it monitors stream processing tasks and adds them to the calculation engine in real time for processing, and finally updates the state of the job in the storage engine. After each job task is executed, it calls the prediction engine to predict the result table and stores the quasi-real-time or real-time prediction result. The scheduling engine is specifically configured for:
1) Stream processing tasks: job tasks are fetched from the storage engine in real time, the calculation engine is called directly to perform the calculation, and the prediction engine is then called to predict the multidimensional indexes.
2) Batch processing tasks: the job metadata is parsed to obtain the scheduling period and the job state, and scheduling tasks are packaged according to the scheduling period; if the job state is executable, the packaged scheduling task is passed to the calculation engine, and the prediction engine is called automatically to perform index prediction when the calculation engine finishes; otherwise, no processing is performed.
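The batch branch described above may be sketched as a single scheduling tick. The job dictionary layout, the state name `executable`, and the callables `compute` and `predict` standing in for the calculation engine and the prediction engine are all assumptions of this sketch.

```python
def is_due(job, now):
    """A job is due if it never ran or its scheduling period has elapsed."""
    return job["last_run"] is None or now - job["last_run"] >= job["period_seconds"]


def dispatch_batch_job(job, now, compute, predict):
    """Run one scheduling tick for a batch job; return True if the job ran."""
    if job["state"] != "executable" or not is_due(job, now):
        return False  # otherwise, no processing is performed
    job["state"] = "running"
    result_tables = compute(job)  # calculation engine executes the job task
    job["last_run"] = now
    job["state"] = "executable"   # ready again for the next period
    for table in result_tables:
        predict(table)            # prediction engine runs after the calculation
    return True
```

In this sketch the prediction step is triggered only after the calculation has returned, which mirrors the order given for batch processing tasks.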
Sixth, visual interface
The visual interface specifically comprises:
1) Model configuration interface: configures the fact table, dimension columns, measurement columns and model construction period required for model construction, as well as the related configuration of the prediction model.
2) Multidimensional index presentation and prediction interface: lists all models configured by the user; after a model is selected, displays the trend graph of the multidimensional index result tables configured in the model together with the predicted result data, so that the index trend can be evaluated comprehensively.
Based on the hardware implementation of the program module, and in order to implement the method according to the embodiment of the present invention, an embodiment of the present invention further provides a data processing apparatus. Fig. 7 shows only an exemplary structure of the data processing apparatus, not the entire structure, and a part or the entire structure shown in fig. 7 may be implemented as necessary.
As shown in fig. 7, a data processing apparatus 700 provided in an embodiment of the present invention includes: at least one processor 701, a memory 702, a user interface 703, and at least one network interface 704. The various components in the data processing apparatus 700 are coupled together by a bus system 705. It will be appreciated that the bus system 705 is used to enable communication among the components. The bus system 705 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are all labeled as the bus system 705 in fig. 7.
The user interface 703 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen.
The memory 702 in embodiments of the present invention is used to store various types of data to support the operation of the data processing apparatus. Examples of such data include: any computer program for operating on a data processing device.
The data processing method disclosed by the embodiment of the invention can be applied to the processor 701, or implemented by the processor 701. The processor 701 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the data processing method may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 701. The Processor 701 may be a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 701 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed by the embodiment of the invention can be directly implemented by a hardware decoding processor, or can be implemented by combining hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 702, and the processor 701 may read information in the memory 702 and complete the steps of the data processing method provided by the embodiments of the present invention in combination with hardware thereof.
In an exemplary embodiment, the data processing apparatus may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, Micro Controller Units (MCUs), microprocessors, or other electronic components for performing the aforementioned methods.
It will be appreciated that the memory 702 can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory described in the embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
In an exemplary embodiment, the embodiment of the present invention further provides a storage medium, that is, a computer storage medium, which may be specifically a computer readable storage medium, for example, including a memory 702 storing a computer program, where the computer program is executable by a processor 701 of a data processing apparatus to perform the steps described in the method of the embodiment of the present invention. The computer readable storage medium may be a ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM, among others.
It should be noted that: "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In addition, the technical solutions described in the embodiments of the present invention may be arbitrarily combined without conflict.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A data processing method, comprising:
acquiring a query factor corresponding to the query request;
generating query SQL based on the obtained query factors;
acquiring an analysis result of the query SQL, and determining data cube metadata corresponding to the query request based on the analysis result;
and obtaining result data corresponding to the query request from a result database based on the data cube metadata corresponding to the query request.
2. The method of claim 1, further comprising:
performing multi-dimensional analysis on the source data based on the data cube to obtain an analysis result;
storing the analysis results in the results database.
3. The method of claim 2, wherein performing a multidimensional analysis on the source data based on the data cube to obtain an analysis result comprises:
generating task metadata based on data cube metadata of a data cube of the source data;
generating a scheduling task based on the task metadata;
analyzing the scheduling task to generate execution SQL;
and processing the execution SQL to obtain an analysis result of the multidimensional analysis.
4. The method of claim 3, wherein generating task metadata based on data cube metadata of the source data cube comprises:
creating a data cube of source data based on the initial parameters and storing metadata of the data cube;
generating task metadata to be scheduled based on the metadata of the data cube;
the generating of the scheduled task based on the task metadata includes:
and generating a scheduling task based on stream processing or batch processing by using the metadata of the task to be scheduled.
5. The method of claim 3, wherein parsing the scheduled task to generate execution SQL comprises:
analyzing the scheduling task to obtain a wide table and a result table corresponding to a data cube of the source data, wherein the wide table is used for representing the dimension and the measurement of the source data, and the result table is used for representing the analysis result of the multidimensional analysis;
generating execution SQL based on the wide table and the result table, the execution SQL comprising table-building SQL and summary SQL, wherein the table-building SQL is generated based on the correspondence between the wide table and the result table, the summary SQL is generated based on the generation rule of the metric index in the result table, and the metric index is determined based on a metric and its corresponding operator.
6. The method of claim 4, wherein creating the data cube of source data based on the initial parameters comprises at least one of:
determining missing analysis items and creating corresponding analysis items;
determining a model of a missing data cube under an analysis project, creating a corresponding model and creating the data cube based on the corresponding model;
and determining a missing operator, creating a corresponding operator and storing the corresponding operator to an operator library.
7. The method of claim 1, further comprising:
and predicting the result data corresponding to the query request based on historical data to obtain a corresponding predicted value.
8. The method of claim 7, wherein predicting the result data corresponding to the query request based on historical data to obtain a corresponding predicted value comprises:
acquiring historical data corresponding to the result data from a result database;
and importing the historical data and the result data into a prediction model, and analyzing a predicted value corresponding to the result data.
9. A data processing apparatus, comprising:
the acquisition module is used for acquiring the query factor corresponding to the query request;
the generating module is used for generating query SQL based on the acquired query factors;
the determining module is used for acquiring the analysis result of the query SQL and determining data cube metadata corresponding to the query request based on the analysis result;
and the query module is used for obtaining the result data corresponding to the query request from a result database based on the data cube metadata corresponding to the query request.
10. A data processing apparatus, characterized by comprising: a processor and a memory for storing a computer program capable of running on the processor, wherein,
the processor, when executing the computer program, is adapted to perform the steps of the method of any of claims 1 to 8.
11. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the method of any one of claims 1 to 8.
CN202010700369.9A 2020-07-20 2020-07-20 Data processing method, device, equipment and storage medium Pending CN113961582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010700369.9A CN113961582A (en) 2020-07-20 2020-07-20 Data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010700369.9A CN113961582A (en) 2020-07-20 2020-07-20 Data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113961582A true CN113961582A (en) 2022-01-21

Family

ID=79459517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010700369.9A Pending CN113961582A (en) 2020-07-20 2020-07-20 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113961582A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115185999A (en) * 2022-09-13 2022-10-14 北京达佳互联信息技术有限公司 Data processing method and device
CN115510289A (en) * 2022-09-22 2022-12-23 中电金信软件有限公司 Data cube configuration method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination