WO2016163903A1 - Method and apparatus for automated generation of a data processing topology - Google Patents

Method and apparatus for automated generation of a data processing topology

Info

Publication number
WO2016163903A1
Authority
WO
WIPO (PCT)
Prior art keywords
data processing
subtasks
data
framework
topology
Prior art date
Application number
PCT/RU2015/000222
Other languages
French (fr)
Inventor
Sergey Sergeyevich ZOBNIN
Alexander Leonidovich PYAYT
Original Assignee
Siemens Aktiengesellschaft
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft filed Critical Siemens Aktiengesellschaft
Priority to PCT/RU2015/000222 priority Critical patent/WO2016163903A1/en
Publication of WO2016163903A1 publication Critical patent/WO2016163903A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Abstract

An automated generation of a data processing topology dealing with a large amount of data is provided. A data processing topology may be generated based on definitions of data processing requirements. Based on the data processing requirements, an environmental model of a distribution of data sources can be established and a plurality of subtasks for data processing can be computed in an automated manner. The individual subtasks can be assigned to framework entities and these framework entities are deployed to computational frameworks. In this way, a data processing topology can be generated in an automatic manner without the need for manual assistance. Hence, the generation of a data processing topology can be simplified and errors during the data processing topology generation can be avoided.

Description

METHOD AND APPARATUS FOR AUTOMATED GENERATION OF A DATA
PROCESSING TOPOLOGY
The present invention relates to a method and an apparatus for automated generation of a data processing topology.
Many modern technical systems use real-time data analytics for monitoring critical infrastructure or the condition of large technological devices. For instance, gas turbines can be monitored and controlled by analyzing the data from many sensors related to the gas turbine. In such a case, measurements from many sensors, for example thousands of sensors, have to be analyzed in real time. Hence, a huge amount of data ("big data") has to be processed for further decision making. For this purpose, computational topologies have to be built for real-time processing frameworks. These computational topologies can be built from standard blocks. However, they can be built only in a manual or, at best, semi-automated manner. Hence, additional user action is required in order to prepare an appropriate configuration of the respective blocks or to adapt the source code of the standard blocks. Since monitoring of infrastructure and large technological devices requires real-time analytics for a large number of data sources ("big data"), such manual or semi-automated modification of a computational topology is very complex and requires a large amount of time. Hence, there is a need for automation of computational topology generation.
This is achieved by the features of the independent claims.
According to a first aspect, the present invention provides a method for automated generation of a data processing topology. The method comprises the steps of receiving definitions of data processing requirements; and establishing an environmental model of a distribution of data sources based on the received definitions of data processing requirements. Further, the method comprises the steps of computing a plurality of subtasks for data processing based on the received definitions of data processing requirements and the established environmental model; assigning the plurality of computed subtasks into framework entities; and deploying the framework entities to computational frameworks of a predetermined framework.
According to a further aspect, the present invention provides an apparatus for automated generation of a data processing topology. The apparatus comprises a planning unit, adapted to receive definitions of data processing requirements and to establish an environmental model for a distribution of data sources based on the received definitions of data processing requirements. The planning unit is further adapted to compute a plurality of subtasks for data processing based on the received definitions of data processing requirements and the established environmental model, to assign the plurality of computed subtasks into framework entities, and to deploy the framework entities to computational frameworks of a predetermined framework.
The present invention takes into account that the generation of a computational topology for systems dealing with real-time processing of big data is very complex. Hence, a manual or semi-automated building of computational topologies is very difficult and time-consuming. This may lead to high costs for building the computational topology, and errors may easily be introduced during the building of the computational topology.
Thus, an idea underlying the present invention is to perform an automated generation of a data processing topology, in particular a data processing topology dealing with big data. For this purpose, all necessary data processing requirements may be prepared in advance. Such data processing requirements may be, for instance, specifications of the data sources (e.g. sensors), specifications of the operations to be applied to the data (e.g. FFT, filtering, etc.), the order of the operations to be applied to the data, and the data format of the resulting output data. Based on these data processing requirements, an environmental model for resource distribution may be established. For instance, a particular model may be selected in a deterministic or a probabilistic manner. For instance, network characteristics, a traffic model, a number of data sources such as sensors, etc. may be taken into consideration. Based on the established model in combination with the received definitions of the data processing requirements, a plurality of subtasks may be computed. In particular, a plurality of subtasks may be selected and scheduled in an appropriate order to achieve the required goals. Additionally, latencies may be determined and analyzed.
Next, the computed plurality of subtasks may be assigned into framework entities. Any appropriate framework may be possible. Finally, the assigned framework entities are deployed into computational frameworks of a predetermined framework. Since all these operations may be performed in an automated manner, the building of a data processing topology can be accelerated. As there is no risk of errors caused by a user during this automated process, the reliability can be increased, too. Additionally, such an automated generation of a data processing topology can be applied to any computational environment. Hence, there are no limitations to a particular framework or programming language.
Further embodiments are subject-matter of the dependent claims.
According to an embodiment, the step for establishing an environmental model performs a deterministic or probabilistic establishing of the environmental model.
According to a further embodiment, the method further comprises a step for estimating execution latencies according to the established environmental model. The method further comprises a step for verifying whether or not the estimated execution latencies fulfill predetermined requirements.
In this way, the processing time of the individual tasks may be analyzed. Accordingly, it can be determined whether or not the processing results and/or the intermediate results are available in time. Hence, problems in a chain of a plurality of subtasks can be identified during the generation of the computational topology, and the computational topology can be adapted accordingly. According to a further embodiment, the environmental model comprises specifications of available resources, network characteristics and/or a traffic model.
In this way, the data processing topology can be adapted to these specifications.
According to a further embodiment, the step for establishing an environmental model selects an environmental model out of a plurality of predefined environmental models. Hence, the establishing of the environmental model can be performed in an easy manner.
According to a further embodiment, the step of computing a plurality of subtasks for data processing comprises selecting subtasks out of a set of predefined subtasks. By providing such a plurality of predefined subtasks and selecting the required subtasks out of this set of subtasks, the determination of the required subtasks for data processing can be performed in a very efficient manner.
According to a further embodiment, the step of computing a plurality of subtasks for data processing comprises scheduling the subtasks according to a predefined order. According to a further embodiment, the method comprises a step of providing definitions of data processing requirements in a text file. The text file may be an arbitrary text file, for example a PDDL file, an XML file or another text file. Based on such a text file, the step for receiving definitions of data processing requirements may read such a provided file.
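As a purely illustrative sketch of this embodiment, the following Java snippet reads an ordered list of operations from a hypothetical requirements file using the JDK's built-in DOM parser. The element name "operation", its "name" attribute and the file name requirements.xml are assumptions, not a format prescribed by this application.

```java
// Hypothetical sketch: reading data processing requirements from an XML file.
// The element and attribute names ("operation", "name") are illustrative
// assumptions, not a format defined by this application.
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RequirementsReader {

    /** Returns the ordered list of operation names, e.g. "resample", "filter", "fft". */
    public static List<String> readOperations(String path) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(path));
        doc.getDocumentElement().normalize();

        List<String> operations = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("operation");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element op = (Element) nodes.item(i);
            // The "name" attribute identifies a predefined submodule.
            operations.add(op.getAttribute("name"));
        }
        return operations;
    }

    public static void main(String[] args) throws Exception {
        // Example: requirements.xml lists the operations to apply in order.
        System.out.println(readOperations("requirements.xml"));
    }
}
```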
According to a further embodiment, the step for deploying the framework entities is deploying the framework entities to computational frameworks of Apache Storm, Apache Spark or another predetermined computational framework. According to a further embodiment, the method further comprises the steps of applying the deployed framework entities to a computational environment; and collecting computation metrics relating to the applied deployed framework entities. Further, the method may comprise the steps of adapting at least one of the subtasks for data processing based on the collected computation metrics, assigning the plurality of computed subtasks comprising the adapted subtasks into an amended set of framework entities; and deploying the amended set of framework entities to computational frameworks of a predetermined framework.
In this way, a previously generated data processing topology may be analyzed based on the computation metrics. Based on this analysis, an enhancement of the data processing topology may be identified and the framework entities may be adapted accordingly.
According to a further embodiment, the computation metrics comprises CPU load, memory consumption and/or latency of the computational environment.
According to a further embodiment of the apparatus for automated generation of a data processing topology, the apparatus may further comprise a memory for storing predefined environmental models and/or a memory for storing a plurality of predefined submodules.
In this way, the predefined models and/or submodules may be provided for efficiently selecting the respective data out of the data stored in the respective memory.
According to a further aspect, a measurement system is provided for analyzing a plurality of data streams. The measurement system comprises a first interface for receiving a plurality of data streams and a second interface for outputting processed data streams. The system further comprises a processor which is adapted to process the data streams received from the first interface and to forward the processed data streams to the second interface. Additionally, the measurement system comprises an apparatus for automated generation of a data processing topology according to the present invention, wherein the apparatus is adapted to set up the processor according to received definitions of data processing requirements.
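A rough, non-authoritative sketch of this aspect is given below, modelling the first interface, the processor and the second interface as plain Java types; all type and method names (StreamInput, StreamProcessor, StreamOutput, setUp) are hypothetical.

```java
// Illustrative sketch only: hypothetical Java types modelling the measurement
// system aspect (first interface -> processor -> second interface).
import java.util.function.Consumer;
import java.util.stream.DoubleStream;

public class MeasurementSystemSketch {

    /** First interface: supplies the incoming sensor data streams. */
    interface StreamInput {
        DoubleStream nextStream();
    }

    /** Second interface: receives the processed data streams. */
    interface StreamOutput extends Consumer<DoubleStream> {}

    /** Processor that is set up by the topology-generation apparatus. */
    interface StreamProcessor {
        DoubleStream process(DoubleStream in);
        void setUp(Iterable<String> scheduledSubtasks);
    }

    /** Pump one stream from the first interface through the processor to the second. */
    static void run(StreamInput in, StreamProcessor proc, StreamOutput out) {
        out.accept(proc.process(in.nextStream()));
    }

    public static void main(String[] args) {
        StreamInput in = () -> DoubleStream.of(1.0, -2.0, 3.0);
        StreamProcessor proc = new StreamProcessor() {
            @Override public DoubleStream process(DoubleStream s) { return s.map(Math::abs); }
            @Override public void setUp(Iterable<String> subtasks) { /* reconfigure chain */ }
        };
        StreamOutput out = s -> System.out.println(s.sum()); // prints 6.0
        run(in, proc, out);
    }
}
```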
According to a further aspect, the present invention provides a computer program product adapted to perform a method according to the present invention.
Further advantages and embodiments of the present invention will become more apparent by the following description in connection with the accompanying drawings, wherein:
Figure 1: shows a schematic illustration of a computational topology;
Figure 2: shows a processing structure using a generated data processing topology according to an embodiment;
Figure 3: illustrates a flowchart of a method for an automated generation of a data processing topology underlying an embodiment of the present invention;
Figure 4: shows additional tasks of another method for an automated generation of a data processing topology underlying an embodiment; and
Figure 5: shows a schematic illustration of an apparatus for an automated generation of a data processing topology according to an embodiment.
Figure 1 shows a schematic illustration of a computational topology, in particular a topology of Apache Storm (in the following referred to as "Storm"). Storm is a distributed real-time computational system. Whereas batch processing systems provide a set of general primitives for batch processing, Storm provides a set of general primitives for doing real-time computation. A Storm application is designed as a topology in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges of the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. A spout is a source 10-i of streams in a computation. A bolt may be a processing element 20-i for processing any number of input streams and producing any number of new output streams. Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, etc. The data processing topology is a network of sources 10-i (spouts) and processes 20-i (bolts), wherein each edge in the network represents a process 20-i subscribing to the output stream of some other source 10-i or process 20-i.
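To make the role of a bolt concrete, the following is a minimal sketch of a processing element 20-i written against the Apache Storm bolt API (assuming Storm 2.x); the window size and the field names "value" and "rms" are illustrative assumptions.

```java
// Minimal Storm bolt sketch (assuming Storm 2.x): collects a fixed window of
// samples and emits their root mean square (RMS) to subscribing bolts.
import java.util.ArrayList;
import java.util.List;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class RmsBolt extends BaseBasicBolt {

    private static final int WINDOW_SIZE = 256;          // illustrative window length
    private final List<Double> window = new ArrayList<>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        window.add(input.getDoubleByField("value"));
        if (window.size() == WINDOW_SIZE) {
            double sumOfSquares = 0.0;
            for (double v : window) {
                sumOfSquares += v * v;
            }
            // Emit one RMS value per full window to the subscribing bolt(s).
            collector.emit(new Values(Math.sqrt(sumOfSquares / WINDOW_SIZE)));
            window.clear();
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("rms"));
    }
}
```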
Even though the topology is described in connection with Apache Storm, the present invention is not limited to such a topology. Any other computational topology is possible, too.
The data sources 10-i may be any possible source of data. For instance, data sources 10-i may be measurements, in particular measurements from sensors. For instance, the data from data sources 10-i may be data for monitoring an infrastructure or a technological device. For example, the data may be data from sensors of a gas turbine. However, any other data source may be possible, too. Usually, the data from data sources 10-i may be data which are to be analyzed in real time for further decision making. Hence, the data from data sources 10-i have to be adapted (pre-processed) in order to output the data in a format which is appropriate for a further analysis. For this purpose, the data from data sources 10-i are to be processed by one or a plurality of successive processes 20-i. Each of these processes 20-i may be a particular subtask performing a particular data processing. For instance, subtasks which are applied by the individual processes 20-i may be a fast Fourier transform (FFT), filtering, downsampling, computing a root mean square (RMS), or computing an autocorrelation, etc. However, any appropriate further algorithm for data processing may be possible, too.
Since the data output from the data sources 10-i may be in an arbitrary format having individual sampling rate, resolution, etc., the subsequent processes 20-i have to be adapted to these individual formats. For this purpose, a conventional configuration of a data processing topology requires a manual configuration of the processes 20-i in order to adapt the processes 20-i to the data provided by the data sources 10-i.
Figure 2 shows a schematic illustration of a data processing topology. Input data are provided by a data source 10. The input data may be processed by a plurality of subsequent subtasks 21, 22, 23. For instance, one of the subtasks may adapt the sampling rate and/or the resolution of the data provided by data source 10. Another subtask may apply a filtering, a windowing, a Fourier transform or any other appropriate algorithm for processing the data. Further, one of the subtasks may compute a root mean square, determine a maximum, a minimum or an average value, etc. In particular, the output of a first subtask 21 may be forwarded to a second subtask 22, and the output of the second subtask 22 may be forwarded to a third subtask 23. However, the present invention is not limited to a sequence of three subtasks. Any other number of subsequent subtasks is possible, too. Finally, the computed output value of the last subtask is output as output data 30 for a further processing or analysis.
Figure 3 illustrates a flowchart of a method for generating a data processing topology as described in the following. First, all necessary definitions of data processing requirements are specified in step S1. For instance, a user may specify the necessary data processing requirements by means of a graphical user interface (GUI). However, any other method for specifying processing requirements is possible, too. For example, the user may create a text file and store all specifications in such a text file. In particular, it is possible to specify the necessary data processing requirements in a PDDL file or an XML file. Any other data format for storing the requirements is possible, too. In particular, the requirements may specify, for instance, global tasks which are applied to the data provided by the data sources 10-i in order to obtain the necessary output data for a subsequent analysis. For instance, such a requirement may specify a global task such as an FFT calculation for a filtered and resampled stream of data. However, any other requirement specifying the output data is also possible. In general, such a requirement as a global task may specify a fixed set of available computational blocks which form a computational topology when applied to a particular task. In particular, such a specified global task may comprise a plurality of subsequent subtasks. Each of the subtasks in such a global task may be, for instance, a predefined subtask. In particular, each of the subtasks may already be available as a prepared submodule which may be selected. For example, particular submodules for a Fourier transform, a filter application, a resampling, a modification of the resolution, etc. may be stored in a database, and the necessary submodules may be used by referring to such a database. In order to adapt the respective subtasks to the individual application, it may be possible to set up one or more parameters of such a prepared submodule.
After receiving the definitions of the data processing requirements, an appropriate environmental model for a distribution of data sources is established in a further step S2. For this purpose, the received definitions of data processing requirements may be analyzed in order to determine an appropriate environmental model in a deterministic or probabilistic manner. For instance, the data processing requirements may be analyzed in order to identify all available data sources and to determine the type of the data sources and the expected data format provided by these data sources. Accordingly, an environmental model may be estimated or determined. In particular, an appropriate model may be selected out of a plurality of predetermined models. For this purpose, a plurality of predefined models may be stored in a memory, and step S2 for establishing an environmental model may select one of these predefined models stored in such a memory. For instance, the data processing requirements may be analyzed in order to determine a traffic model, network characteristics and further features. In particular, it may be possible to determine a number of devices, a particular type of a device, number and/or type of sensors for monitoring such a device or other features.
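A minimal sketch of how step S2 might select one of several predefined environmental models in a deterministic manner is given below; the model attributes, the example figures and the threshold-style matching are assumptions for illustration, not the method prescribed here.

```java
// Hypothetical sketch of step S2: choosing a predefined environmental model
// deterministically from the requirements. The model fields (sensor capacity,
// network bandwidth) and example values are assumptions for illustration.
import java.util.List;

public class ModelSelector {

    record EnvironmentalModel(String name, int maxSensors, double bandwidthMbitPerS) {}

    /** Pick the first predefined model that can handle the declared sources. */
    static EnvironmentalModel select(List<EnvironmentalModel> predefined,
                                     int sensorCount, double expectedMbitPerS) {
        return predefined.stream()
                .filter(m -> m.maxSensors() >= sensorCount
                          && m.bandwidthMbitPerS() >= expectedMbitPerS)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no suitable model"));
    }

    public static void main(String[] args) {
        List<EnvironmentalModel> models = List.of(
                new EnvironmentalModel("edge-cluster", 200, 100.0),
                new EnvironmentalModel("plant-datacenter", 5000, 10_000.0));
        // E.g. a turbine monitored by 1000 sensors producing roughly 400 Mbit/s.
        System.out.println(select(models, 1000, 400.0).name()); // plant-datacenter
    }
}
```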
After an appropriate environmental model has been established, it may be possible to estimate expected execution latencies according to the selected environmental model. Expected latencies are necessary in order to check whether or not the constructed system can fulfill the respective requirements. If the expected latencies do not fulfill the requirements, it may be necessary to adapt the established environmental model. For instance, another environmental model may be selected. Alternatively, the selected environmental model may be adapted in such a manner that the respective requirements are fulfilled. For this purpose, a manual, semi-automated or fully automated modification of the model may be performed. For instance, the tasks or subtasks may be replaced by another task or subtask having an improved latency. Further, it may be also possible to adapt network characteristics, for instance, bandwidth, routing of a data transmission, distribution of the processing of the individual tasks/subtasks, etc.
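The latency check could, for example, look like the following sketch, which assumes per-subtask processing costs and a fixed network delay per hop taken from the environmental model; all names and figures are invented for illustration.

```java
// Hypothetical sketch of step S3: estimating the end-to-end latency of a chain
// of subtasks under a chosen environmental model and verifying it against a
// requirement. Per-subtask costs and the network delay are assumed figures.
import java.util.List;
import java.util.Map;

public class LatencyCheck {

    /** Sum subtask latencies plus one network hop between consecutive subtasks. */
    static double estimateMillis(List<String> subtasks,
                                 Map<String, Double> perSubtaskMillis,
                                 double networkHopMillis) {
        double total = 0.0;
        for (String s : subtasks) {
            total += perSubtaskMillis.getOrDefault(s, 0.0);
        }
        // One transfer between each pair of consecutive subtasks.
        total += Math.max(0, subtasks.size() - 1) * networkHopMillis;
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> costs = Map.of("resample", 2.0, "filter", 3.0, "fft", 8.0);
        double estimate = estimateMillis(List.of("resample", "filter", "fft"), costs, 1.5);
        double requiredMillis = 20.0;
        // If the estimate violates the requirement, another model (or a faster
        // subtask variant) would be selected before deployment.
        System.out.printf("estimated %.1f ms, requirement %s%n",
                estimate, estimate <= requiredMillis ? "fulfilled" : "violated");
    }
}
```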
Next, the plurality of subtasks for data processing is computed in step S4 based on the received definitions of data processing requirements and the established environmental model. For example, a global task may be split into a plurality of successive subtasks. Each successive subtask may be a subtask using the output of one or more previous subtasks as input data. For instance, a Fourier transform based on filtered and resampled data may be split into successive tasks of firstly resampling a data stream, next filtering the resampled data stream and finally calculating a Fourier transform based on the filtered data. For each of these subtasks, it may be possible to select appropriate predefined subtasks performing the respective processing operations. If necessary, one or a plurality of parameters may be adapted for using such a predefined subtask. In this way, it is possible to determine a sequence of subsequent subtasks for performing a global task in an automatic manner. In this way, the global task is divided into smaller subtasks, and the respective subtasks are scheduled in order to determine the required execution order.
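A minimal sketch of step S4, under the assumption that global tasks are decomposed via a catalogue of predefined subtasks, could look as follows; the catalogue contents and task names are hypothetical.

```java
// Hypothetical sketch of step S4: splitting a global task into an ordered chain
// of predefined subtasks. The catalogue below is an illustrative assumption.
import java.util.List;
import java.util.Map;

public class TaskPlanner {

    /** Predefined decompositions of global tasks into ordered subtasks. */
    static final Map<String, List<String>> CATALOGUE = Map.of(
            // "FFT of a filtered and resampled stream" becomes three steps,
            // each consuming the output of the previous one.
            "fft-of-filtered-resampled-stream", List.of("resample", "filter", "fft"),
            "rms-of-filtered-stream", List.of("filter", "rms"));

    static List<String> plan(String globalTask) {
        List<String> subtasks = CATALOGUE.get(globalTask);
        if (subtasks == null) {
            throw new IllegalArgumentException("unknown global task: " + globalTask);
        }
        return subtasks; // already in execution order
    }

    public static void main(String[] args) {
        System.out.println(plan("fft-of-filtered-resampled-stream"));
        // -> [resample, filter, fft]
    }
}
```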
After the plurality of subtasks has been computed and scheduled, the respective subtasks are assigned into framework entities in step S5. Finally, the framework entities are deployed to computational frameworks of a predetermined framework in step S6. The computational framework may be, for instance, Apache Storm, Apache Spark or another predetermined computational framework. After the framework entities of the computed subtasks have been deployed to computational frameworks, these computational frameworks may be applied to a computational environment, and the result may be analyzed. If necessary, further modifications may be performed in order to improve the computational efficiency, as illustrated in the flowchart of Figure 4.
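For Apache Storm (assuming Storm 2.x), steps S5 and S6 could be sketched as follows: each scheduled subtask becomes a bolt wired to its predecessor, and the resulting topology is submitted to the cluster. The pass-through spout and bolt stand in for real predefined submodules; component names and parallelism values are illustrative.

```java
// Hedged sketch of steps S5/S6 for Apache Storm (assuming Storm 2.x): each
// computed subtask becomes a bolt, the sensor source becomes a spout, and the
// chain is wired into a topology submitted to the cluster. PassThroughBolt
// stands in for real resample/filter/FFT submodules; only the wiring is shown.
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class GeneratedTopology {

    /** Stand-in spout emitting a constant sample; a real spout would read sensors. */
    public static class ConstantSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        @Override public void open(Map<String, Object> conf, TopologyContext ctx,
                                   SpoutOutputCollector collector) {
            this.collector = collector;
        }
        @Override public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(1.0));
        }
        @Override public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("value"));
        }
    }

    /** Stand-in bolt forwarding its input; real bolts would resample/filter/FFT. */
    public static class PassThroughBolt extends BaseBasicBolt {
        @Override public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getDoubleByField("value")));
        }
        @Override public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("value"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sensor-spout", new ConstantSpout(), 1);
        // One bolt per scheduled subtask, each subscribing to the previous stage.
        builder.setBolt("resample", new PassThroughBolt(), 2).shuffleGrouping("sensor-spout");
        builder.setBolt("filter", new PassThroughBolt(), 2).shuffleGrouping("resample");
        builder.setBolt("fft", new PassThroughBolt(), 2).shuffleGrouping("filter");

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("generated-topology", conf, builder.createTopology());
    }
}
```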
To enable such improvements, after applying the deployed framework entities to the computational environment in step S10, computation metrics are collected in step S11. These collected computation metrics may relate to the applied deployed framework entities. For instance, the computation metrics may measure the load of one or more central processing units (CPU), memory consumption and/or latencies in the computational environment. For instance, the computation metrics may be compared with predetermined threshold values. If one or more of the computation metrics exceed a predetermined threshold value, it may be determined that further improvement of the data processing topology is necessary. In this case, at least one of the subtasks for the data processing may be adapted based on the analysis of the collected computation metrics. For instance, a subtask may be replaced in step S12 by another subtask having improved properties with respect to the collected computation metrics. For instance, if it is determined that the memory consumption is too high, a subtask may be replaced by a subtask requiring less memory. Further, a subtask may be replaced by a subtask leading to a lower CPU load. Other modifications or replacements of a subtask may also be possible depending on the collected computation metrics.
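A sketch of the adaptation in steps S11 and S12 is given below, assuming hypothetical metric names, threshold values and a table of lower-cost subtask variants.

```java
// Hypothetical sketch of steps S11-S12: comparing collected metrics with
// thresholds and swapping a subtask for a variant with a better profile.
// Metric names, thresholds and the variant table are illustrative assumptions.
import java.util.Map;

public class TopologyTuner {

    /** Replacement variants, e.g. an FFT implementation with a smaller footprint. */
    static final Map<String, String> LOW_COST_VARIANT =
            Map.of("fft", "fft-inplace", "filter", "filter-iir");

    static String adapt(String subtask, Map<String, Double> metrics) {
        double memoryMb = metrics.getOrDefault("memoryMb", 0.0);
        double cpuLoad  = metrics.getOrDefault("cpuLoad", 0.0);
        if (memoryMb > 512.0 || cpuLoad > 0.8) {
            // Exceeding a threshold triggers replacement by a cheaper variant.
            return LOW_COST_VARIANT.getOrDefault(subtask, subtask);
        }
        return subtask; // metrics within limits, keep the deployed subtask
    }

    public static void main(String[] args) {
        Map<String, Double> collected = Map.of("memoryMb", 750.0, "cpuLoad", 0.55);
        System.out.println(adapt("fft", collected)); // -> fft-inplace
    }
}
```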
After the subtasks for data processing have been adapted in step S12, the new set of computed subtasks comprising the adapted subtasks is assigned in step S13 into an amended set of framework entities. Finally, the amended set of framework entities is deployed to computational frameworks of a predetermined framework in step S14.
In this way, a further improvement of the data processing topology can be achieved in an automatic manner without the need of a user action.
Figure 5 shows a schematic illustration of an apparatus 1 for automated generation of a data processing topology. The apparatus 1 for automated generation of a data processing topology comprises a planning unit 2 for planning the data processing topology, a topology adapting unit 3 for adapting the generated data processing topology, and a processing section 4 for processing input data of the data sources based on the generated topology in order to output the processed data.
The apparatus 1 for an automated generation of a data processing topology comprises a planning unit 2, which is adapted to receive definitions of data processing requirements and to establish the above-described environmental model for a distribution of data sources based on the received definitions of data processing requirements. Additionally, the planning unit may be adapted to compute a plurality of subtasks for data processing based on the received definitions of data processing requirements and the established environmental model. Further, the planning unit may assign the plurality of computed subtasks into framework entities and deploy the framework entities to computational frameworks of a predetermined framework.
Additionally, the apparatus 1 for automated generation of data processing topology, and in particular the planning unit 2, may comprise one or a plurality of memories for storing predefined environmental models and/or storing a plurality of predefined submodules.
After the data processing topology has been generated by planning unit 2, the generated data processing topology may be forwarded to topology adapting unit 3. Topology adapting unit 3 may assign the individual subtasks of the computational frameworks to individual elements of the computational network for executing the respective subtasks. Based on this, the data processing of the data provided by the data sources may be performed and the desired output data may be provided.
In this way, a measurement system for analyzing a plurality of data streams may be configured in an automatic manner. Measurements, in particular measurements obtained from a plurality of sensors, may be obtained and provided to the measurement system via a first interface. In particular, a large number of measurements may be performed by 1000 or more sensors. The data may be provided to a processor. The processor may process the data streams received by the first interface and forward the processed data streams to a second interface. This second interface may be adapted to output the processed data streams for a further analysis. For instance, the outputted data streams may be provided to a computer system receiving, analyzing, storing and/or displaying the data streams output by the second interface. In order to adapt the processor, the data processing topology of the measurement system may be adapted by the above-described apparatus 1 for automated generation of a data processing topology. For this purpose, apparatus 1 for automated generation of a data processing topology may set up the processor according to received definitions of data processing requirements. For instance, the measurement system may measure a plurality of sensor data of a device, in particular an industrial application. For example, the measurement system may be applied in order to monitor and analyze the status of a gas turbine or any other industrial device.

Summarizing, the present invention relates to an automated generation of a data processing topology dealing with a large amount of data. A data processing topology may be generated based on definitions of data processing requirements. Based on the data processing requirements, an environmental model of a distribution of data sources can be established and a plurality of subtasks for data processing can be computed in an automated manner. The individual subtasks can be assigned to framework entities and these framework entities are deployed to computational frameworks. In this way, a data processing topology can be generated in an automatic manner without the need for manual assistance. Hence, the generation of a data processing topology can be simplified and errors during the data processing topology generation can be avoided.

Claims

PATENT CLAIMS
1. A method for automated generation of a data processing topology, comprising the steps of:
receiving (S1) definitions of data processing requirements;
establishing (S2) an environmental model for a distribution of data sources based on the received definitions of data processing requirements;
computing (S4) a plurality of subtasks for data processing based on the received definitions of data processing requirements and the established environmental model;
assigning (S5) the plurality of computed subtasks into framework entities; and
deploying (S6) the framework entities to computational frameworks of a predetermined framework.
2. The method according to claim 1, wherein said step (S2) for establishing an environmental model performs a deterministic or a probabilistic establishing of said environmental model.
3. The method according to claim 1 or 2, wherein said method further comprises a step (S3) for estimating execution latencies according to said established environmental model, and verifying whether or not the estimated execution latencies fulfill predetermined requirements.
4. The method according to any of claims 1 to 3, wherein said environmental model comprising specifications of available resources, network characteristics and/or a traffic model.
5. The method according to any of claims 1 to 4, wherein the step (S2) for establishing an environmental model selecting an environmental model out of a plurality of predefined environmental models.
6. The method according to any of claims 1 to 5, wherein said step (S4) of computing a plurality of subtasks for data processing comprises selecting subtasks out of a set of predefined subtasks.
7. The method according to any of claims 1 to 6, wherein said step (S4) of computing a plurality of subtasks for data processing comprises scheduling the subtasks according to a predefined order.
8. The method according to any of claims 1 to 7, further comprising a step of providing definitions of data processing requirements in a PDDL-file, an XML-file or another text-file;
wherein the step (S1) for receiving definitions of data processing requirements is reading said provided file.
9. The method according to any of claims 1 to 8, wherein the step (S6) for deploying the framework entities is deploying the framework entities to computational frameworks of Apache Storm, Apache Spark or another predetermined computational framework.
10. The method according to any of claims 1 to 9, further comprising the steps of: applying (S10) the deployed framework entities to a computational environment;
collecting (S11) computation metrics relating to the applied deployed framework entities;
adapting (S12) at least one of the subtasks for data processing based on the collected computation metrics;
assigning (S13) the plurality of computed subtasks comprising the adapted subtasks into an amended set of framework entities; and
deploying (S14) the amended set of framework entities to computational frameworks of a predetermined framework.
11. The method according to claim 10, wherein said computation metrics comprise CPU load, memory consumption and/or latency of the computational environment.
12. An apparatus for automated generation of a data processing topology, comprising:
a planning unit (2), adapted to receive definitions of data processing requirements, establish an environmental model for a distribution of data sources based on the received definitions of data processing requirements, compute a plurality of subtasks for data processing based on the received definitions of data processing requirements and the established environmental model, assign the plurality of computed subtasks into framework entities, and deploy the framework entities to computational frameworks of a predetermined framework.
13. The apparatus according to claim 12, further comprising a memory for storing predefined environmental models and/or a memory for storing a plurality of predefined submodules.
14. A measurement system for analyzing a plurality of data streams, comprising: a first interface for receiving a plurality of data streams;
a second interface for outputting processed data streams;
a processor adapted to process the data streams received from said first interface and to forward the processed data streams to said second interface; and
an apparatus for automated generation of a data processing topology according to claim 12 or 13,
wherein said apparatus for automated generation of a data processing topology is adapted to set up said processor according to received definitions of data processing requirements.
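To illustrate claim 14, a toy measurement system in which two queues stand in for the first and second interfaces and the topology-generation apparatus installs the generated processing chain via configure(); none of these implementation details is prescribed by the claim.

```python
# Toy model of the measurement system of claim 14: queues play the role of the
# first and second interfaces, and configure() is where the topology-generation
# apparatus would set up the processor with the generated processing chain.
import queue


class MeasurementSystem:
    def __init__(self):
        self.inbox = queue.Queue()              # first interface: incoming data streams
        self.outbox = queue.Queue()             # second interface: processed data streams
        self.pipeline = lambda sample: sample   # processing chain, replaced via configure()

    def configure(self, pipeline):
        # called by the apparatus for automated topology generation
        self.pipeline = pipeline

    def step(self):
        # process one sample from the first interface and forward the result
        sample = self.inbox.get()
        self.outbox.put(self.pipeline(sample))
```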
15. A computer program product adapted to perform the method according to any of claims 1 to 11.
Priority Application (1)

PCT/RU2015/000222 | Priority date: 2015-04-08 | Filing date: 2015-04-08 | Title: Method and apparatus for automated generation of a data processing topology

Publication (1)

WO2016163903A1 (WO, kind code A1) | Publication date: 2016-10-13

Family ID: 54540149

Legal Events

121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 15793921 | Country of ref document: EP | Kind code of ref document: A1
NENP | Non-entry into the national phase | Ref country code: DE
122 | EP: PCT application non-entry in European phase | Ref document number: 15793921 | Country of ref document: EP | Kind code of ref document: A1