WO2016163903A1 - Method and apparatus for automated generation of a data processing topology - Google Patents

Method and apparatus for automated generation of a data processing topology

Info

Publication number
WO2016163903A1
Authority
WO
WIPO (PCT)
Prior art keywords
data processing
subtasks
data
framework
topology
Prior art date
Application number
PCT/RU2015/000222
Other languages
French (fr)
Inventor
Sergey Sergeyevich ZOBNIN
Alexander Leonidovich PYAYT
Original Assignee
Siemens Aktiengesellschaft
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft filed Critical Siemens Aktiengesellschaft
Priority to PCT/RU2015/000222 priority Critical patent/WO2016163903A1/en
Publication of WO2016163903A1 publication Critical patent/WO2016163903A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Abstract

An automated generation of a data processing topology dealing with a large amount of data is provided. A data processing topology may be generated based on definitions of data processing requirements. Based on the data processing requirements, an environmental model of a distribution of data sources can be established and a plurality of subtasks for data processing can be computed in an automated manner. The individual subtasks can be assigned to framework entities and these framework entities are deployed to computational frameworks. In this way, a data processing topology can be generated in an automatic manner without the need for manual assistance. Hence, the generation of a data processing topology can be simplified and errors during the data processing topology generation can be avoided.

Description

METHOD AND APPARATUS FOR AUTOMATED GENERATION OF A DATA
PROCESSING TOPOLOGY
The present invention relates to a method and an apparatus for automated generation of a data processing topology.
Many modern technical systems use real-time data analytics for monitoring critical infrastructure or the condition of large technological devices. For instance, gas turbines can be monitored and controlled by analyzing the data from many sensors related to the gas turbine. In such a case, measurements from many sensors, for example thousands of sensors, have to be analyzed in real time. Hence, a huge amount of data ("big data") has to be processed for further decision making. For this purpose, computational topologies have to be built for real-time processing frameworks. These computational topologies can be built from standard blocks. However, they can be built only in a manual or, at best, semi-automated manner. Hence, additional user action is required in order to prepare an appropriate configuration of the respective blocks or to adapt the source code of the standard blocks. Since monitoring of infrastructure and large technological devices requires real-time analytics for a large number of data sources ("big data"), such manual or semi-automated modification of a computational topology is very complex and requires a large amount of time. Hence, there is a need for automation of computational topology generation.
This is achieved by the features of the independent claims.
According to a first aspect, the present invention provides a method for automated generation of a data processing topology. The method comprises the steps of receiving definitions of data processing requirements; and establishing an environmental model of a distribution of data sources based on the received definitions of data processing requirements. Further, the method comprises the steps of computing a plurality of subtasks for data processing based on the received definitions of data processing requirements and the established environmental model; assigning the plurality of computed subtasks into framework entities; and deploying the framework entities to computational frameworks of a predetermined framework.
According to a further aspect, the present invention provides an apparatus for automated generation of a data processing topology. The apparatus comprises a planning unit, adapted to receive definitions of data processing requirements and to establish an environmental model for a distribution of data sources based on the received definitions of data processing requirements. The planning unit is further adapted to compute a plurality of subtasks for data processing based on the received definitions of data processing requirements and the established environmental model, to assign the plurality of computed subtasks into framework entities, and to deploy the framework entities to computational frameworks of a predetermined framework.
The present invention takes into account that the generation of a computational topology for systems dealing with real-time processing of big data is very complex. Hence, a manual or semi-automated building of computational topologies is very difficult and time-consuming. This may lead to high costs for building the computational topology, and errors may easily be introduced during the building of the computational topology.
Thus, an idea underlying the present invention is to perform an automated generation of a data processing topology, in particular a data processing topology dealing with big data. For this purpose, all necessary data processing requirements may be prepared in advance. Such data processing requirements may be, for instance, specifications of the data sources (e.g. sensors), specifications of the operations to be applied to the data (e.g. FFT, filtering, etc.), the order of the operations to be applied to the data, and the data format of the resulting output data. Based on these data processing requirements, an environmental model for resource distribution may be established. For instance, a particular model may be selected in a deterministic or a probabilistic manner. For instance, network characteristics, a traffic model, a number of data sources such as sensors, etc. may be taken into consideration. Based on the established model in combination with the received definitions of the data processing requirements, a plurality of subtasks may be computed. In particular, a plurality of subtasks may be selected and scheduled in an appropriate order to achieve the required goals. Additionally, latencies may be determined and analyzed.
Next, the computed plurality of subtasks may be assigned into framework entities. Any appropriate framework may be possible. Finally, the assigned framework entities are deployed into computational frameworks of a predetermined framework. Since all these operations may be performed in an automated manner, the building of a data processing topology can be accelerated. As there is no risk of errors caused by a user during this automated process, the reliability can be increased, too. Additionally, such an automated generation of a data processing topology can be applied to any computational environment. Hence, there are no limitations to a particular framework or programming language.
Further embodiments are subject-matter of the dependent claims.
According to an embodiment, the step for establishing an environmental model performs a deterministic or probabilistic establishing of the environmental model.
According to a further embodiment, the method further comprises a step for estimating execution latencies according to the established environmental model. The method further comprises a step for verifying whether or not the estimated execution latencies fulfill predetermined requirements.
In this way, the processing time of the individual tasks may be analyzed. Accordingly, it can be determined whether or not the processing results and/or the intermediate results are available in time. Hence, problems in a chain of a plurality of subtasks can be identified during the generation of the computational topology, and the computational topology can be adapted accordingly. According to a further embodiment, the environmental model comprises specifications of available resources, network characteristics and/or a traffic model.
In this way, the data processing topology can be adapted to these specifications.
According to a further embodiment, the step for establishing an environmental model selects an environmental model out of a plurality of predefined environmental models. Hence, the establishing of the environmental model can be performed in an easy manner.
According to a further embodiment, the step of computing a plurality of subtasks for data processing comprises selecting subtasks out of a set of predefined subtasks. By providing such a plurality of predefined subtasks and selecting the required subtasks out of this set of subtasks, the determination of the required subtasks for data processing can be performed in a very efficient manner.
According to a further embodiment, the step of computing a plurality of subtasks for data processing comprises scheduling the subtasks according to a predefined order. According to a further embodiment, the method comprises a step of providing definitions of data processing requirements in a text file. The text file may be an arbitrary text file, for example a PDDL file, an XML file or another text file. Based on such a text file, the step for receiving definitions of data processing requirements may read such a provided file.
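As a purely illustrative sketch of this embodiment, the following Java snippet reads an ordered list of operations from a hypothetical requirements file using the JDK's built-in DOM parser. The element name "operation", its "name" attribute and the file name requirements.xml are assumptions, not a format prescribed by this application.

```java
// Hypothetical sketch: reading data processing requirements from an XML file.
// The element and attribute names ("operation", "name") are illustrative
// assumptions, not a format defined by this application.
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RequirementsReader {

    /** Returns the ordered list of operation names, e.g. "resample", "filter", "fft". */
    public static List<String> readOperations(String path) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(path));
        doc.getDocumentElement().normalize();

        List<String> operations = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("operation");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element op = (Element) nodes.item(i);
            // The "name" attribute identifies a predefined submodule.
            operations.add(op.getAttribute("name"));
        }
        return operations;
    }

    public static void main(String[] args) throws Exception {
        // Example: requirements.xml lists the operations to apply in order.
        System.out.println(readOperations("requirements.xml"));
    }
}
```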
According to a further embodiment, the step for deploying the framework entities is deploying the framework entities to computational frameworks of Apache Storm, Apache Spark or another predetermined computational framework. According to a further embodiment, the method further comprises the steps of applying the deployed framework entities to a computational environment; and collecting computation metrics relating to the applied deployed framework entities. Further, the method may comprise the steps of adapting at least one of the subtasks for data processing based on the collected computation metrics, assigning the plurality of computed subtasks comprising the adapted subtasks into an amended set of framework entities; and deploying the amended set of framework entities to computational frameworks of a predetermined framework.
In this way, a previously generated data processing topology may be analyzed based on the computation metrics. Based on this analysis, an enhancement of the data processing topology may be identified and the framework entities may be adapted accordingly.
According to a further embodiment, the computation metrics comprises CPU load, memory consumption and/or latency of the computational environment.
According to a further embodiment of the apparatus for automated generation of a data processing topology, the apparatus may further comprise a memory for storing predefined environmental models and/or a memory for storing a plurality of predefined submodules.
In this way, the predefined models and/or submodules may be provided for efficiently selecting the respective data out of the data stored in the respective memory.
According to a further aspect, a measurement system is provided for analyzing a plurality of data streams. The measurement system comprises a first interface for receiving a plurality of data streams and a second interface for outputting processed data streams. The system further comprises a processor which is adapted to process the data streams received from the first interface and to forward the processed data streams to the second interface. Additionally, the measurement system comprises an apparatus for automated generation of a data processing topology according to the present invention, wherein the apparatus is adapted to set up the processor according to received definitions of data processing requirements.
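A rough, non-authoritative sketch of this aspect is given below, modelling the first interface, the processor and the second interface as plain Java types; all type and method names (StreamInput, StreamProcessor, StreamOutput, setUp) are hypothetical.

```java
// Illustrative sketch only: hypothetical Java types modelling the measurement
// system aspect (first interface -> processor -> second interface).
import java.util.function.Consumer;
import java.util.stream.DoubleStream;

public class MeasurementSystemSketch {

    /** First interface: supplies the incoming sensor data streams. */
    interface StreamInput {
        DoubleStream nextStream();
    }

    /** Second interface: receives the processed data streams. */
    interface StreamOutput extends Consumer<DoubleStream> {}

    /** Processor that is set up by the topology-generation apparatus. */
    interface StreamProcessor {
        DoubleStream process(DoubleStream in);
        void setUp(Iterable<String> scheduledSubtasks);
    }

    /** Pump one stream from the first interface through the processor to the second. */
    static void run(StreamInput in, StreamProcessor proc, StreamOutput out) {
        out.accept(proc.process(in.nextStream()));
    }

    public static void main(String[] args) {
        StreamInput in = () -> DoubleStream.of(1.0, -2.0, 3.0);
        StreamProcessor proc = new StreamProcessor() {
            @Override public DoubleStream process(DoubleStream s) { return s.map(Math::abs); }
            @Override public void setUp(Iterable<String> subtasks) { /* reconfigure chain */ }
        };
        StreamOutput out = s -> System.out.println(s.sum()); // prints 6.0
        run(in, proc, out);
    }
}
```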
According to a further aspect, the present invention provides a computer program product adapted to perform a method according to the present invention.
Further advantages and embodiments of the present invention will become more apparent by the following description in connection with the accompanying drawings, wherein:
Figure 1: shows a schematic illustration of a computational topology;
Figure 2: shows a processing structure using a generated data processing topology according to an embodiment;
Figure 3: illustrates a flowchart of a method for an automated generation of a data processing topology underlying an embodiment of the present invention;
Figure 4: shows additional tasks of another method for an automated generation of a data processing topology underlying an embodiment; and
Figure 5: shows a schematic illustration of an apparatus for an automated generation of a data processing topology according to an embodiment.
Figure 1 shows a schematic illustration of a computational topology, in particular a topology of Apache Storm (in the following referred to as "Storm"). Storm is a distributed real-time computational system. Whereas batch processing systems provide a set of general primitives for batch processing, Storm provides a set of general primitives for doing real-time computation. A Storm application is designed as a topology in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges of the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. A spout is a source 10-i of streams in a computation. A bolt may be a processing element 20-i for processing any number of input streams and producing any number of new output streams. Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, etc. The data processing topology is a network of sources 10-i (spouts) and processes 20-i (bolts), wherein each edge in the network represents a process 20-i subscribing to the output stream of some other source 10-i or process 20-i.
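To make the role of a bolt concrete, the following is a minimal sketch of a processing element 20-i written against the Apache Storm bolt API (assuming Storm 2.x); the window size and the field names "value" and "rms" are illustrative assumptions.

```java
// Minimal Storm bolt sketch (assuming Storm 2.x): collects a fixed window of
// samples and emits their root mean square (RMS) to subscribing bolts.
import java.util.ArrayList;
import java.util.List;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class RmsBolt extends BaseBasicBolt {

    private static final int WINDOW_SIZE = 256;          // illustrative window length
    private final List<Double> window = new ArrayList<>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        window.add(input.getDoubleByField("value"));
        if (window.size() == WINDOW_SIZE) {
            double sumOfSquares = 0.0;
            for (double v : window) {
                sumOfSquares += v * v;
            }
            // Emit one RMS value per full window to the subscribing bolt(s).
            collector.emit(new Values(Math.sqrt(sumOfSquares / WINDOW_SIZE)));
            window.clear();
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("rms"));
    }
}
```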
Even though the topology is described in connection with Apache Storm, the present invention is not limited to such a topology. Any other computational topology is possible, too.
The data sources 10-i may be any possible source of data. For instance, data sources 10-i may be measurements, in particular measurements from sensors. For instance, the data from data sources 10-i may be data for monitoring an infrastructure or a technological device. For example, the data may be data from sensors of a gas turbine. However, any other data source may be possible, too. Usually, the data from data sources 10-i may be data which are to be analyzed in real time for further decision making. Hence, the data from data sources 10-i have to be adapted (pre-processed) in order to output the data in a format which is appropriate for a further analysis. For this purpose, the data from data sources 10-i are to be processed by one or a plurality of successive processes 20-i. Each of these processes 20-i may be a particular subtask performing a particular data processing. For instance, subtasks which are applied by the individual processes 20-i may be a fast Fourier transform (FFT), filtering, downsampling, computing a root mean square (RMS), or computing an autocorrelation, etc. However, any appropriate further algorithm for data processing may be possible, too.
Since the data output from the data sources 10-i may be in an arbitrary format having individual sampling rate, resolution, etc., the subsequent processes 20-i have to be adapted to these individual formats. For this purpose, a conventional configuration of a data processing topology requires a manual configuration of the processes 20-i in order to adapt the processes 20-i to the data provided by the data sources 10-i.
Figure 2 shows a schematic illustration of a data processing topology. Input data are provided by a data source 10. The input data may be processed by a plurality of subsequent subtasks 21, 22, 23. For instance, one of the subtasks may adapt the sampling rate and/or the resolution of the data provided by data source 10. Another subtask may apply a filtering, a windowing, a Fourier transform or any other appropriate algorithm for processing the data. Further, one of the subtasks may compute a root mean square, determine a maximum, a minimum or an average value, etc. In particular, the output of a first subtask 21 may be forwarded to a second subtask 22, and the output of the second subtask 22 may be forwarded to a third subtask 23. However, the present invention is not limited to a sequence of three subtasks. Any other number of subsequent subtasks is possible, too. Finally, the computed output value of the last subtask is output as output data 30 for a further processing or analysis.
Figure 3 illustrates a flowchart of a method for generating a data processing topology as described in the following. First, all necessary definitions of data processing requirements are specified in step S1. For instance, a user may specify the necessary data processing requirements by means of a graphical user interface (GUI). However, any other method for specifying processing requirements is possible, too. For example, the user may create a text file and store all specifications in such a text file. In particular, it is possible to specify the necessary data processing requirements in a PDDL file or an XML file. Any other data format for storing the requirements is possible, too. In particular, the requirements may specify, for instance, global tasks which are applied to the data provided by the data sources 10-i in order to obtain the necessary output data for a subsequent analysis. For instance, such a requirement may specify a global task such as an FFT calculation for a filtered and resampled stream of data. However, any other requirement specifying the output data is also possible. In general, such a requirement as a global task may specify a fixed set of available computational blocks which form a computational topology when applied to a particular task. In particular, such a specified global task may comprise a plurality of subsequent subtasks. Each of the subtasks in such a global task may be, for instance, a predefined subtask. In particular, each of the subtasks may already be available as a prepared submodule which may be selected. For example, particular submodules for a Fourier transform, a filter application, a resampling, a modification of the resolution, etc. may be stored in a database, and the necessary submodules may be used by referring to such a database. In order to adapt the respective subtasks to the individual application, it may be possible to set up one or more parameters of such a prepared submodule.
After receiving the definitions of the data processing requirements, an appropriate environmental model for a distribution of data sources is established in a further step S2. For this purpose, the received definitions of data processing requirements may be analyzed in order to determine an appropriate environmental model in a deterministic or probabilistic manner. For instance, the data processing requirements may be analyzed in order to identify all available data sources and to determine the type of the data sources and the expected data format provided by these data sources. Accordingly, an environmental model may be estimated or determined. In particular, an appropriate model may be selected out of a plurality of predetermined models. For this purpose, a plurality of predefined models may be stored in a memory, and step S2 for establishing an environmental model may select one of these predefined models stored in such a memory. For instance, the data processing requirements may be analyzed in order to determine a traffic model, network characteristics and further features. In particular, it may be possible to determine a number of devices, a particular type of a device, number and/or type of sensors for monitoring such a device or other features.
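A minimal sketch of how step S2 might select one of several predefined environmental models in a deterministic manner is given below; the model attributes, the example figures and the threshold-style matching are assumptions for illustration, not the method prescribed here.

```java
// Hypothetical sketch of step S2: choosing a predefined environmental model
// deterministically from the requirements. The model fields (sensor capacity,
// network bandwidth) and example values are assumptions for illustration.
import java.util.List;

public class ModelSelector {

    record EnvironmentalModel(String name, int maxSensors, double bandwidthMbitPerS) {}

    /** Pick the first predefined model that can handle the declared sources. */
    static EnvironmentalModel select(List<EnvironmentalModel> predefined,
                                     int sensorCount, double expectedMbitPerS) {
        return predefined.stream()
                .filter(m -> m.maxSensors() >= sensorCount
                          && m.bandwidthMbitPerS() >= expectedMbitPerS)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no suitable model"));
    }

    public static void main(String[] args) {
        List<EnvironmentalModel> models = List.of(
                new EnvironmentalModel("edge-cluster", 200, 100.0),
                new EnvironmentalModel("plant-datacenter", 5000, 10_000.0));
        // E.g. a turbine monitored by 1000 sensors producing roughly 400 Mbit/s.
        System.out.println(select(models, 1000, 400.0).name()); // plant-datacenter
    }
}
```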
After an appropriate environmental model has been established, it may be possible to estimate expected execution latencies according to the selected environmental model. Expected latencies are necessary in order to check whether or not the constructed system can fulfill the respective requirements. If the expected latencies do not fulfill the requirements, it may be necessary to adapt the established environmental model. For instance, another environmental model may be selected. Alternatively, the selected environmental model may be adapted in such a manner that the respective requirements are fulfilled. For this purpose, a manual, semi-automated or fully automated modification of the model may be performed. For instance, the tasks or subtasks may be replaced by another task or subtask having an improved latency. Further, it may be also possible to adapt network characteristics, for instance, bandwidth, routing of a data transmission, distribution of the processing of the individual tasks/subtasks, etc.
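The latency check could, for example, look like the following sketch, which assumes per-subtask processing costs and a fixed network delay per hop taken from the environmental model; all names and figures are invented for illustration.

```java
// Hypothetical sketch of step S3: estimating the end-to-end latency of a chain
// of subtasks under a chosen environmental model and verifying it against a
// requirement. Per-subtask costs and the network delay are assumed figures.
import java.util.List;
import java.util.Map;

public class LatencyCheck {

    /** Sum subtask latencies plus one network hop between consecutive subtasks. */
    static double estimateMillis(List<String> subtasks,
                                 Map<String, Double> perSubtaskMillis,
                                 double networkHopMillis) {
        double total = 0.0;
        for (String s : subtasks) {
            total += perSubtaskMillis.getOrDefault(s, 0.0);
        }
        // One transfer between each pair of consecutive subtasks.
        total += Math.max(0, subtasks.size() - 1) * networkHopMillis;
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> costs = Map.of("resample", 2.0, "filter", 3.0, "fft", 8.0);
        double estimate = estimateMillis(List.of("resample", "filter", "fft"), costs, 1.5);
        double requiredMillis = 20.0;
        // If the estimate violates the requirement, another model (or a faster
        // subtask variant) would be selected before deployment.
        System.out.printf("estimated %.1f ms, requirement %s%n",
                estimate, estimate <= requiredMillis ? "fulfilled" : "violated");
    }
}
```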
Next, the plurality of subtasks for data processing is computed in step S4 based on the received definitions of data processing requirements and the established environmental model. For example, a global task may be split into a plurality of successive subtasks. Each successive subtask may be a subtask using the output of one or more previous subtasks as input data. For instance, a Fourier transform based on filtered and resampled data may be split into successive tasks of firstly resampling a data stream, next filtering the resampled data stream and finally calculating a Fourier transform based on the filtered data. For each of these subtasks, it may be possible to select appropriate predefined subtasks performing the respective processing operations. If necessary, one or a plurality of parameters may be adapted for using such a predefined subtask. In this way, it is possible to determine a sequence of subsequent subtasks for performing a global task in an automatic manner. In this way, the global task is divided into smaller subtasks, and the respective subtasks are scheduled in order to determine the required execution order.
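A minimal sketch of step S4, under the assumption that global tasks are decomposed via a catalogue of predefined subtasks, could look as follows; the catalogue contents and task names are hypothetical.

```java
// Hypothetical sketch of step S4: splitting a global task into an ordered chain
// of predefined subtasks. The catalogue below is an illustrative assumption.
import java.util.List;
import java.util.Map;

public class TaskPlanner {

    /** Predefined decompositions of global tasks into ordered subtasks. */
    static final Map<String, List<String>> CATALOGUE = Map.of(
            // "FFT of a filtered and resampled stream" becomes three steps,
            // each consuming the output of the previous one.
            "fft-of-filtered-resampled-stream", List.of("resample", "filter", "fft"),
            "rms-of-filtered-stream", List.of("filter", "rms"));

    static List<String> plan(String globalTask) {
        List<String> subtasks = CATALOGUE.get(globalTask);
        if (subtasks == null) {
            throw new IllegalArgumentException("unknown global task: " + globalTask);
        }
        return subtasks; // already in execution order
    }

    public static void main(String[] args) {
        System.out.println(plan("fft-of-filtered-resampled-stream"));
        // -> [resample, filter, fft]
    }
}
```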
After the plurality of subtasks has been computed and scheduled, the respective subtasks are assigned into framework entities in step S5. Finally, the framework entities are deployed to computational frameworks of a predetermined framework in step S6. The computational framework may be, for instance, Apache Storm, Apache Spark or another predetermined computational framework. After the framework entities of the computed subtasks have been deployed to computational frameworks, these computational frameworks may be applied to a computational environment, and the result may be analyzed. If necessary, further modifications may be performed in order to improve the computational efficiency, as illustrated in the flowchart of Figure 4.
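For Apache Storm (assuming Storm 2.x), steps S5 and S6 could be sketched as follows: each scheduled subtask becomes a bolt wired to its predecessor, and the resulting topology is submitted to the cluster. The pass-through spout and bolt stand in for real predefined submodules; component names and parallelism values are illustrative.

```java
// Hedged sketch of steps S5/S6 for Apache Storm (assuming Storm 2.x): each
// computed subtask becomes a bolt, the sensor source becomes a spout, and the
// chain is wired into a topology submitted to the cluster. PassThroughBolt
// stands in for real resample/filter/FFT submodules; only the wiring is shown.
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class GeneratedTopology {

    /** Stand-in spout emitting a constant sample; a real spout would read sensors. */
    public static class ConstantSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        @Override public void open(Map<String, Object> conf, TopologyContext ctx,
                                   SpoutOutputCollector collector) {
            this.collector = collector;
        }
        @Override public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(1.0));
        }
        @Override public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("value"));
        }
    }

    /** Stand-in bolt forwarding its input; real bolts would resample/filter/FFT. */
    public static class PassThroughBolt extends BaseBasicBolt {
        @Override public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getDoubleByField("value")));
        }
        @Override public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("value"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sensor-spout", new ConstantSpout(), 1);
        // One bolt per scheduled subtask, each subscribing to the previous stage.
        builder.setBolt("resample", new PassThroughBolt(), 2).shuffleGrouping("sensor-spout");
        builder.setBolt("filter", new PassThroughBolt(), 2).shuffleGrouping("resample");
        builder.setBolt("fft", new PassThroughBolt(), 2).shuffleGrouping("filter");

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("generated-topology", conf, builder.createTopology());
    }
}
```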
To enable such improvements, after applying the deployed framework entities to the computational environment in step S10, computation metrics are collected in step S11. These collected computation metrics may relate to the applied deployed framework entities. For instance, the computation metrics may measure the load of one or more central processing units (CPU), memory consumption and/or latencies in the computational environment. For instance, the computation metrics may be compared with predetermined threshold values. If one or more of the computation metrics exceed a predetermined threshold value, it may be determined that further improvement of the data processing topology is necessary. In this case, at least one of the subtasks for the data processing may be adapted based on the analysis of the collected computation metrics. For instance, a subtask may be replaced in step S12 by another subtask having improved properties with respect to the collected computation metrics. For instance, if it is determined that the memory consumption is too high, a subtask may be replaced by a subtask requiring less memory. Further, a subtask may be replaced by a subtask leading to a lower CPU load. Other modifications or replacements of a subtask may also be possible depending on the collected computation metrics.
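A sketch of the adaptation in steps S11 and S12 is given below, assuming hypothetical metric names, threshold values and a table of lower-cost subtask variants.

```java
// Hypothetical sketch of steps S11-S12: comparing collected metrics with
// thresholds and swapping a subtask for a variant with a better profile.
// Metric names, thresholds and the variant table are illustrative assumptions.
import java.util.Map;

public class TopologyTuner {

    /** Replacement variants, e.g. an FFT implementation with a smaller footprint. */
    static final Map<String, String> LOW_COST_VARIANT =
            Map.of("fft", "fft-inplace", "filter", "filter-iir");

    static String adapt(String subtask, Map<String, Double> metrics) {
        double memoryMb = metrics.getOrDefault("memoryMb", 0.0);
        double cpuLoad  = metrics.getOrDefault("cpuLoad", 0.0);
        if (memoryMb > 512.0 || cpuLoad > 0.8) {
            // Exceeding a threshold triggers replacement by a cheaper variant.
            return LOW_COST_VARIANT.getOrDefault(subtask, subtask);
        }
        return subtask; // metrics within limits, keep the deployed subtask
    }

    public static void main(String[] args) {
        Map<String, Double> collected = Map.of("memoryMb", 750.0, "cpuLoad", 0.55);
        System.out.println(adapt("fft", collected)); // -> fft-inplace
    }
}
```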
After the subtasks for data processing have been adapted in step S12, the new set of computed subtasks comprising the adapted subtasks is assigned in step S13 into an amended set of framework entities. Finally, the amended set of framework entities is deployed to computational frameworks of a predetermined framework in step S14.
In this way, a further improvement of the data processing topology can be achieved in an automatic manner without the need of a user action.
Figure 5 shows a schematic illustration of an apparatus 1 for automated generation of a data processing topology. The apparatus 1 for automated generation of a data processing topology comprises a planning unit 2 for planning the data processing topology, a topology adapting unit 3 for adapting the generated data processing topology, and a processing section 4 for processing input data of the data sources based on the generated topology in order to output the processed data.
The apparatus 1 for an automated generation of a data processing topology comprises a planning unit 2, which is adapted to receive definitions of data processing requirements and to establish the above-described environmental model for a distribution of data sources based on the received definitions of data processing requirements. Additionally, the planning unit may be adapted to compute a plurality of subtasks for data processing based on the received definitions of data processing requirements and the established environmental model. Further, the planning unit may assign the plurality of computed subtasks into framework entities and deploy the framework entities to computational frameworks of a predetermined framework.
Additionally, the apparatus 1 for automated generation of data processing topology, and in particular the planning unit 2, may comprise one or a plurality of memories for storing predefined environmental models and/or storing a plurality of predefined submodules.
After the data processing topology has been generated by planning unit 2, the generated data processing topology may be forwarded to topology adapting unit 3. Topology adapting unit 3 may assign the individual subtasks of the computational frameworks to individual elements of the computational network for executing the respective subtasks. Based on this, the data processing of the data provided by the data sources may be performed and the desired output data may be provided.
In this way, a measurement system for analyzing a plurality of data streams may be configured in an automatic manner. Measurements, in particular measurements obtained from a plurality of sensors, may be obtained and provided to the measurement system via a first interface. In particular, a large number of measurements may be performed by 1000 or more sensors. The data may be provided to a processor. The processor may process the data streams received by the first interface and forward the processed data streams to a second interface. This second interface may be adapted to output the processed data streams for a further analysis. For instance, the outputted data streams may be provided to a computer system receiving, analyzing, storing and/or displaying the data streams output by the second interface. In order to adapt the processor, the data processing topology of the measurement system may be adapted by the above-described apparatus 1 for automated generation of a data processing topology. For this purpose, apparatus 1 for automated generation of a data processing topology may set up the processor according to received definitions of data processing requirements. For instance, the measurement system may measure a plurality of sensor data of a device, in particular an industrial application. For example, the measurement system may be applied in order to monitor and analyze the status of a gas turbine or any other industrial device.

Summarizing, the present invention relates to an automated generation of a data processing topology dealing with a large amount of data. A data processing topology may be generated based on definitions of data processing requirements. Based on the data processing requirements, an environmental model of a distribution of data sources can be established and a plurality of subtasks for data processing can be computed in an automated manner. The individual subtasks can be assigned to framework entities and these framework entities are deployed to computational frameworks. In this way, a data processing topology can be generated in an automatic manner without the need for manual assistance. Hence, the generation of a data processing topology can be simplified and errors during the data processing topology generation can be avoided.

Claims

PATENT CLAIMS
1. A method for automated generation of a data processing topology, comprising the steps of:
receiving (S1) definitions of data processing requirements;
establishing (S2) an environmental model for a distribution of data sources based on the received definitions of data processing requirements;
computing (S4) a plurality of subtasks for data processing based on the received definitions of data processing requirements and the established environmental model;
assigning (S5) the plurality of computed subtasks into framework entities; and
deploying (S6) the framework entities to computational frameworks of a predetermined framework.
2. The method according to claim 1, wherein said step (S2) for establishing an environmental model performs a deterministic or a probabilistic establishing of said environmental model.
3. The method according to claim 1 or 2, wherein said method further comprises a step (S3) for estimating execution latencies according to said established environmental model, and verifying whether or not the estimated execution latencies fulfill predetermined requirements.
4. The method according to any of claims 1 to 3, wherein said environmental model comprising specifications of available resources, network characteristics and/or a traffic model.
5. The method according to any of claims 1 to 4, wherein the step (S2) for establishing an environmental model selecting an environmental model out of a plurality of predefined environmental models.
6. The method according to any of claims 1 to 5, wherein said step (S4) of computing a plurality of subtasks for data processing comprises selecting subtasks out of a set of predefined subtasks.
7. The method according to any of claims 1 to 6, wherein said step (S4) of computing a plurality of subtasks for data processing comprises scheduling the subtasks according to a predefined order.
8. The method according to any of claims 1 to 7, further comprising a step of providing definitions of data processing requirements in a PDDL-file, an XML-file or another text-file;
wherein the step (S1) for receiving definitions of data processing requirements is reading said provided file.
9. The method according to any of claims 1 to 8, wherein the step (S6) for deploying the framework entities is deploying the framework entities to computational frameworks of Apache Storm, Apache Spark or another predetermined computational framework.
10. The method according to any of claims 1 to 9, further comprising the steps of: applying (S10) the deployed framework entities to a computational environment;
collecting (S11) computation metrics relating to the applied deployed framework entities;
adapting (S12) at least one of the subtasks for data processing based on the collected computation metrics;
assigning (S13) the plurality of computed subtasks comprising the adapted subtasks into an amended set of framework entities; and
deploying (S14) the amended set of framework entities to computational frameworks of a predetermined framework.
11. The method according to claim 10, wherein said computation metrics comprise CPU load, memory consumption and/or latency of the computational environment.
12. An apparatus for automated generation of a data processing topology, comprising:
a planning unit (2), adapted to receive definitions of data processing requirements, establish an environmental model for a distribution of data sources based on the received definitions of data processing requirements, compute a plurality of subtasks for data processing based on the received definitions of data processing requirements and the established environmental model, assign the plurality of computed subtasks into framework entities, and deploy the framework entities to computational frameworks of a predetermined framework.
13. The apparatus according to claim 12, further comprising a memory for storing predefined environmental models and/or a memory for storing a plurality of predefined submodules.
14. A measurement system for analyzing a plurality of data streams, comprising: a first interface for receiving a plurality of data streams;
a second interface for outputting processed data streams;
a processor adapted to process the data streams received from said first interface and to forward the processed data streams to said second interface; and
an apparatus for automated generation of a data processing topology according to claim 12 or 13,
wherein said apparatus for automated generation of a data processing topology is adapted to set up said processor according to received definitions of data processing requirements.
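To illustrate claim 14, a toy measurement system in which two queues stand in for the first and second interfaces and the topology-generation apparatus installs the generated processing chain via configure(); none of these implementation details is prescribed by the claim.

```python
# Toy model of the measurement system of claim 14: queues play the role of the
# first and second interfaces, and configure() is where the topology-generation
# apparatus would set up the processor with the generated processing chain.
import queue


class MeasurementSystem:
    def __init__(self):
        self.inbox = queue.Queue()              # first interface: incoming data streams
        self.outbox = queue.Queue()             # second interface: processed data streams
        self.pipeline = lambda sample: sample   # processing chain, replaced via configure()

    def configure(self, pipeline):
        # called by the apparatus for automated topology generation
        self.pipeline = pipeline

    def step(self):
        # process one sample from the first interface and forward the result
        sample = self.inbox.get()
        self.outbox.put(self.pipeline(sample))
```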
15. A computer program product adapted to perform the method according to any of claims 1 to 11.
Priority Application (1)

PCT/RU2015/000222 | Priority date: 2015-04-08 | Filing date: 2015-04-08 | Title: Method and apparatus for automated generation of a data processing topology

Publication (1)

WO2016163903A1 (WO, kind code A1) | Publication date: 2016-10-13

Family ID: 54540149

Legal Events

121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 15793921 | Country of ref document: EP | Kind code of ref document: A1
NENP | Non-entry into the national phase | Ref country code: DE
122 | EP: PCT application non-entry in European phase | Ref document number: 15793921 | Country of ref document: EP | Kind code of ref document: A1