WO2013174451A1 - Method for executing processes on a worker machine of a distributed computing system and a distributed computing system


Info

Publication number
WO2013174451A1
Authority
WO
WIPO (PCT)
Application number
PCT/EP2012/059911
Other languages
French (fr)
Inventor
Sachin Agarwal
Krisantus SEMBIRING
Andreas Beyer
Original Assignee
Nec Europe Ltd.
Application filed by Nec Europe Ltd.
Priority to PCT/EP2012/059911
Priority to US14/402,409 (US20150113539A1)
Publication of WO2013174451A1

Links

Classifications

    • G06F 9/5083: Techniques for rebalancing the load in a distributed system
    • G06F 9/5088: Techniques for rebalancing the load in a distributed system involving task migration
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 2209/5019: Workload prediction
    • G06F 2209/504: Resource capping
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to a method for executing processes, preferably media processes, on a worker machine of a distributed computing system, with a plurality of worker machines, comprising the steps of a) Selecting one of the worker machines out of the plurality of worker machines for execution of a process to be executed in the distributed computing system and transferring said process to the selected worker machine, b) Executing the transferred process on the selected worker machine, and c) Removing the executed process from the selected worker machine after finishing of the execution of the process, wherein statistical information of resource usage of the process to be executed on one of the worker machines is collected and the selection of the worker machine is based on a probability resource usage qualifier, wherein the probability resource usage qualifier is extracted from combined statistical information of the process to be executed and already executed and/or executing processes on the worker machine. The invention relates also to a system and a use.

Description

METHOD FOR EXECUTING PROCESSES ON A WORKER MACHINE OF A DISTRIBUTED COMPUTING SYSTEM AND A DISTRIBUTED COMPUTING SYSTEM

The present invention relates to a method for executing processes, preferably media processes, on a worker machine of a distributed computing system, with a plurality of worker machines, comprising the steps of a) Selecting one of the worker machines out of the plurality of worker machines for execution of a process to be executed in the distributed computing system and transferring said process to the selected worker machine,
b) executing the transferred process on the selected worker machine, and
c) removing the executed process from the selected worker machine after finishing of the execution of the process.
The invention relates also to a distributed computing system for executing processes, preferably media processes, and preferably for execution with the method according to one of the claims 1-14, comprising a plurality of worker machines for execution of processes in the distributed computing system and an inserter for selecting one of the worker machines out of the plurality of worker machines for execution of a process to be executed and for transferring said process to the selected worker machine for execution.
Distributed computing systems provide computation, network and storage resources for applications, for example via the ability to spawn worker machines on demand. However, these resources, in particular each process being executed and each data transmission for the execution, may be measured and also be billed. Therefore an efficient resource usage is important for a cost-effective distributed computing system. Resources like central processing units, network bandwidth, IP addresses, storage and/or memory need to be efficiently provisioned in order to balance a certain quality of service with a minimum possible resource usage. One approach to address this problem is so-called load balancing. Load balancing distributes work load across multiple worker machines, network links, central processing units or other resources to achieve efficient resource utilization. Processes or "jobs" are distributed among worker machines for execution. To avoid an overload of a certain worker machine by deploying too many processes on that machine, load balancing simply counts the number of processes on each worker machine and deploys additional processes only if a predefined threshold for the number of processes to be executed is not exceeded. However, load balancing has certain significant drawbacks. One of the drawbacks is that load balancing assigns processes to be executed on worker machines by simply counting the number of processes on or for each worker machine. If a process terminates on one worker machine, a process to be executed is transferred to that worker machine even if, for example, it is one of a slower type of worker machines and the process to be executed would need a very long time to be completed.
Another drawback is that, if multiple worker machines for execution of a process are available, a process is transferred to one of the worker machines even if another one of the available worker machines would be more suitable for execution of this process. A further drawback is that the following situation might occur: a number of processes is executed in the distributed computing system, but with only one process per worker machine. This might result in an inefficient resource usage, for example when the processes only need two percent of the resources of each worker machine.
An even further disadvantage is that load balancing is inflexible, since different kinds of processes, for example streaming a video and a scientific process calculating floating point algorithms, would not be taken into account. This leads to a poor quality of service and inefficient resource usage.
It is therefore an objective of the present invention to provide a method and a distributed computing system for executing processes which are more flexible with regard to different kinds of processes and the variability of resource usage. It is a further objective of the present invention to provide a method and a distributed computing system for executing processes enabling a more efficient selection of worker machines for different processes with regard to resource usage while providing a certain quality-of-service level.
According to the invention the aforementioned objectives are accomplished by a method of claim 1, a system of claim 15 and a use according to claim 18. According to claim 1 the method for executing processes, preferably media processes, on a worker machine of a distributed computing system, with a plurality of worker machines, comprises the steps of a) Selecting one of the worker machines out of the plurality of worker machines for execution of a process to be executed in the distributed computing system and transferring said process to the selected worker machine,
b) executing the transferred process on the selected worker machine, and
c) removing the executed process from the at least one worker machine after finishing of the execution of the process.
According to the invention the method according to claim 1 is characterized in that statistical information of resource usage of the process to be executed on one of the worker machines is collected and that the selection of the one worker machine is based on a probability resource usage qualifier, wherein the probability resource usage qualifier is extracted from combined statistical information of the process to be executed and already executed and/or executing processes on the worker machine.
According to claim 15 the distributed computing system for executing processes, preferably media processes, and preferably for execution with the method according to one of the claims 1-14, comprises a plurality of worker machines for execution of processes in the distributed computing system and an inserter for selecting one of the worker machines out of the plurality of worker machines for execution of a process to be executed and for transferring said process to the selected worker machine for execution. According to claim 15 the distributed computing system is characterized by the inserter being operable to select one worker machine based on a probability resource usage qualifier, wherein the probability resource usage qualifier is extracted from combined statistical information of the process to be executed and already executed and/or executing processes.
According to claim 18 the method according to one of the claims 1-14 and/or the distributed computing system according to one of the claims 15-17 is used for executing media processes. The term "probability resource usage qualifier" means, preferably in the description and in particular in the claims, a value representing the probability of a usage of a resource of the distributed computing system by a process to be executed on the distributed computing system, for example the probability of a resource usage of 50% by a process on a worker machine.
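As a sketch of this definition, using the Gaussian usage model and the notation $\mu_{Total}$, $\sigma_{Total}$ introduced in the embodiments below, the qualifier of a worker machine can be written as the probability that the combined resource usage $U$ of its processes exceeds a maximum allowed usage $x$:

$$P(U > x) = Q\!\left(\frac{x - \mu_{Total}}{\sigma_{Total}}\right), \qquad Q(z) = \frac{1}{2}\operatorname{erfc}\!\left(\frac{z}{\sqrt{2}}\right).$$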
According to the invention it has first been recognized that the ability to trade off quality-of-service control with required resources is significantly enhanced by taking into account the variability of resource demands of processes across time, thus providing a dynamic and adaptive method and system.
According to the invention it has further been recognized that the present invention enables to continuously learn process characteristics from historical data and to use this data in selecting a worker machine for future processes to be executed, resulting in precise predictions of resource usage of processes.
Further features, advantages and preferred embodiments are described in the following subclaims. According to a preferred embodiment the process is classified into a process class according to statistical information of resource usage of a process to be executed, wherein processes in the same process class have similar resource usage characteristics. A faster and more efficient way and an even more precise determination of a process to be executed according to similar resource usage characteristics is provided. This further leads to an efficient and fast selection of a worker machine to execute the classified process.
According to a further preferred embodiment processes in the same process class have the same type of content data, preferably media content data, to be processed. Processes, in particular defined by the same repetitive tasks to be performed, are grouped, therefore allowing a fast and precise selection of worker machines for execution of these processes. According to a further preferred embodiment worker machines having no processes to execute are either shut down or assigned to be a slack machine, wherein a slack machine is kept idle for execution of future processes. This enhances the overall redundancy of the distributed computing system: an incoming process for execution in the distributed computing system can be executed immediately by moving this process to a slack machine. Since only a fraction of slack machines is needed, the other worker machines having no processes to execute are shut down for energy saving.
According to a further preferred embodiment a process on a first worker machine having a first probability resource usage qualifier is moved to a second worker machine having a second probability resource usage qualifier when the second probability resource usage qualifier is higher than the first probability resource usage qualifier. This enables a squeezing out of inefficient process allocations. Another advantage is that this enables an easy decision of which worker machine is to be chosen for a new process and/or for processes already executed, without reducing the quality of service.
According to a further preferred embodiment a moving of a process to another worker machine is checked when another process on a worker machine is terminated on that worker machine. When a process has terminated, i.e. is finally removed from the worker machine, probability resource usage qualifiers and/or the number of slack machines may change; therefore the overall resource allocations in the distributed computing system also change. Thus, checking a move only when another process is terminated enables, with a minimum of checking, an even more efficient resource usage in the distributed computing system. Further, worker machines may be freed of all processes so that they may be turned off or idled, saving energy.
According to a further preferred embodiment a moving of a process to another worker machine is only performed if the number of worker machines having an inefficient resource usage is below a fragment threshold. Such a fragment threshold is a constant specifying for example the number of worker machines that may be allowed to have inefficient resource or process allocations. This prevents processes moving across the worker machines in the distributed computing system too frequently. Thus, data transmission between worker machines is minimized and frequent pausing of processes while being moved from one worker machine to another is reduced.
According to a further preferred embodiment the number of slack machines is determined according to a parameter, wherein the parameter is dependent on process classes, statistical information of processes to be executed and/or executing and/or the rate of processes to be executed and/or terminated on the distributed computing system. This enables a flexible and dynamic adjustment of the number of slack machines to be provided for future execution of processes by the distributed computing system.
According to a further preferred embodiment process classes are determined by K-means clustering, hierarchical clustering, neural networks and/or support vector machines. Neural networks for example provide self-learning and therefore a self-definition of process classes based on processes already executed in the distributed computing system, whereas K-means clustering enables a fast finding of a center of statistical information forming the basis of a process class. According to a further preferred embodiment the resource usage of processes executed on the distributed computing system is collected periodically. This allows a data analysis of actual resource usage of processes in real time to discover more efficiently statistical patterns of resource usage on a per-process basis and an enhanced classification of processes into different process classes. Further, a selection of worker machines for execution of processes is more precise.
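The following minimal sketch illustrates how process classes could be derived from historical per-process statistics with K-means clustering; the feature layout, the number of classes and the use of scikit-learn are illustrative assumptions, not part of the original disclosure.

```python
# Illustrative sketch: deriving process classes from historical per-process
# resource statistics via K-means. Feature layout and class count are assumed.
import numpy as np
from sklearn.cluster import KMeans

# One row per executed process:
# [mean CPU usage, std of CPU usage, mean bandwidth (Mbit/s), std of bandwidth]
history = np.array([
    [0.10, 0.02,  5.0, 1.0],
    [0.45, 0.10, 20.0, 4.0],
    [0.42, 0.12, 22.0, 5.0],
    [0.09, 0.03,  4.5, 0.8],
    [0.47, 0.09, 21.0, 4.5],
    [0.11, 0.02,  5.2, 0.9],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(history)

# Reference statistics per process class, as later reported to the inserter.
for c in range(kmeans.n_clusters):
    members = history[kmeans.labels_ == c]
    print(f"class {c}: mean stats = {members.mean(axis=0)}")
```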
According to a further preferred embodiment the probability resource usage qualifier represents a probability of a process exceeding a predefined maximum usage on a worker machine. This enables a flexible way to avoid a usage of a worker machine above its maximum resource capability. Another advantage is that this enables a precise definition of how many resources of the worker machine should be assigned to a process or processes on a worker machine.
According to a further preferred embodiment the combined statistical information of the process to be executed and of already executed and/or executing processes on a worker machine is weighted for extraction of the probability resource usage qualifier. This enables to adapt the calculation of the probability resource usage qualifier in a flexible way. For example, currently executing and/or more recently executed processes can be preferred when calculating the probability resource usage qualifier, avoiding that old information, i.e. information of processes finished long ago, still has a significant influence on the classification into a process class and on the selection of a worker machine for execution of a process.
According to a further preferred embodiment the probability resource usage qualifier for media processes following a Gaussian distribution for resource usage is based on the corresponding Q-function. This enables a precise forecast of resource usages of media processes to be executed. Media processes include for example demultiplexing, decoding, mixing, resampling, encoding and/or multiplexing of audio/video content and perform continuous repetitive actions. For example, audio or video data composed of frames is processed and written out continuously on a frame-by-frame basis. The Q-function, which is proportional to the complementary error function, computes the probability for a variable, for example specifying CPU or memory allocation on a worker machine, wherein the resource usages of the processes executed and/or executing are Gaussian distributed with a certain mean and a certain variance, having a value greater than a predetermined maximum allowed CPU or memory allocation. Since the Q-function is easy to compute, a fast and efficient determination of the probability resource usage qualifier for media processes is provided.
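A minimal sketch of this computation, using only the Python standard library; the variable names and the 100% CPU cap are illustrative assumptions.

```python
# Sketch of the probability resource usage qualifier under the Gaussian model.
from math import erfc, sqrt

def q_function(z: float) -> float:
    """Tail probability Q(z) = P(Z > z) of a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2.0))

def exceedance_probability(means, variances, x_max=1.0):
    """Probability that the summed Gaussian usage of the given processes
    exceeds the maximum allowed usage x_max on one worker machine."""
    mu_total = sum(means)
    sigma_total = sqrt(sum(variances))
    return q_function((x_max - mu_total) / sigma_total)

# One incoming process plus two processes already running on the machine:
print(exceedance_probability([0.45, 0.30, 0.10], [0.010, 0.004, 0.001]))
```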
According to a further preferred embodiment the worker machine having the highest probability resource usage qualifier below a certain threshold is selected for execution of the process. One of the advantages is that this enables an easy determination of which worker machine is to be used for execution of incoming as well as existing processes, wherein the latter could then be moved to the worker machine with the highest probability resource usage qualifier. A further advantage is that a moving of a process to a worker machine less suitable for execution of the process is avoided.
According to a preferred embodiment of the distributed computing system according to claim 15 the distributed computing system comprises a classifier for classifying the process to be executed into a process class, wherein processes in the same process class have similar resource usages. A faster and more efficient way and an even more precise determination of a process to be executed according to similar resource usage characteristics is provided. This further leads to an efficient and fast selection of a worker machine to execute the classified process.
According to a preferred embodiment of the distributed computing system the distributed computing system comprises a model generator for generating and/or updating a model for different process classes, preferably based on data from executed and/or executing processes. The model generator generates a model from historical records of all processes already executed and enables training the model for classifying new processes by the classifier. This training of the model may then be performed periodically to update statistical information of process classes as more training data becomes available from executing and/or executed processes. There are several ways how to design and further develop the teaching of the present invention in an advantageous way. To this end it is to be referred to the patent claims subordinate to patent claim 1 and claim 15 on the one hand and to the following explanation of preferred embodiments of the invention by way of example, illustrated by the figures, on the other hand. In connection with the explanation of the preferred embodiments of the invention by the aid of the figures, generally preferred embodiments and further developments of the teaching will be explained.
In the drawings
Fig. 1 shows a distributed computing system according to a first embodiment of the present invention;
Fig. 2 shows a flowchart of a part of a method according to a second embodiment of the present invention;
Fig. 3 shows a flowchart of a part of a method according to a third embodiment of the present invention and
Fig. 4 shows a flowchart of a part of a method according to a fourth embodiment of the present invention.
Fig. 1 shows a distributed computing system according to a first embodiment of the present invention.
In Fig. 1 a schematic view of a high level component architecture of a system according to the present invention is shown. Arrows with dashed lines indicate main interactions and associations between the different components shown in Fig. 1.
Reference sign 101 denotes resources. Resources 101 may be computational, network or storage resources. Resources 101 may be part of a processing cloud offering central processing unit resources. These resources 101 include worker machines 111 and slack machines 112. Worker machines 111 are actual computing nodes which execute processes in a distributed computing system 1. Each worker machine 111 has at least one process assigned to it by an inserter 105. When all processes on a worker machine terminate or are moved to other worker machines, the worker machine 111 becomes a slack machine 112. Slack machines 112 are spawned by a slack controller to maintain a controlled redundancy in the distributed computing system 1. These can be assigned processes and made worker machines 111 when needed, or they can be shut down if more slack machines 112 are running than intended by the slack controller 109.
Reference sign 102 denotes processes that use the resources 101 to perform tasks. Processes 102 may arrive for processing at different times during their course of execution and can terminate at any time. Reference sign 103 denotes parameters, wherein the parameters 103 control the functioning of components of the distributed computing system 1. These parameters 103 are provided by a user. An incoming process 104, which is an instance of a process that needs to be run in the distributed computing system 1, arrives at the inserter 105 of the distributed computing system 1. The inserter 105 receives the incoming processes 104 and uses information from a process classifier 106 to ascertain resource usage characteristics of the incoming process 104. The inserter 105 holds a global view of all resources 101 available for each worker machine 111. The inserter 105 further maps the incoming process 104 to an appropriate worker machine 111 based on the Q-function of the sum of processes running on the worker machines 111 and further parameters.
The classifier 106 may be provided in the form of a machine learning means based on SVM or K-means clustering to map the incoming process 104 to a predefined or known process class. The classifier 106 also reports summary statistical information like mean and variance of the process class to the inserter 105. If the classifier 106 is unable to classify a process with a certain level of confidence, the classifier 106 may also signal the inability to classify the process to the inserter 105. The inserter 105 may then invoke routines to deal with a non-classifiable, i.e. unknown, type of process.
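A minimal sketch of such a classifier, here with an SVM and a confidence threshold; the two-dimensional features, the threshold value and the use of scikit-learn are illustrative assumptions.

```python
# Sketch of a classifier mapping an incoming process to a known process
# class, or signalling an unknown process when confidence is low.
import numpy as np
from sklearn.svm import SVC

# Training features per past process: [mean CPU usage, std of CPU usage]
X_train = np.array([[0.10, 0.02], [0.45, 0.10], [0.42, 0.12],
                    [0.09, 0.03], [0.47, 0.09], [0.11, 0.02]])
y_train = np.array([0, 1, 1, 0, 1, 0])   # process class labels

clf = SVC(probability=True, random_state=0).fit(X_train, y_train)

def classify(features, threshold=0.8):
    proba = clf.predict_proba([features])[0]
    if proba.max() < threshold:
        return None             # signal "unable to classify" to the inserter
    return int(proba.argmax())  # process class whose statistics are reported

print(classify([0.44, 0.11]))
```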
To obtain statistical information, a data collector 108 collects per-process resource usage statistics and generates per-process summary statistics for subsequent use by the model generator 107 and the inserter 105. To update these statistics, parallel processing methods such as incremental map-reduce may be used. The data collector 108 periodically collects these data and updates the underlying statistical information of the process classes as more statistical information becomes available and collectable from already executed processes. The collected statistical information for the process classes may be weighted appropriately, with different weights assigned to newer and older statistical data of executed processes.
To control the assignment of worker machines 111 and slack machines 112, a slack controller 109 is responsible for maintaining adequate spare capacity in the distributed computing system 1 so that new processes to be executed via the distributed computing system 1 do not have to wait for worker machines 111 to be turned on before they are deployed on them. Further a compactor 110 is provided for moving already running processes between worker machines 111 in order to "de-fragment" the worker machines 111: if the execution of a process on a worker machine 111 is terminated, processes may become distributed in a suboptimal way across the worker machines 111. The compactor 110 then moves processes from one worker machine 111 to another worker machine 111 so that some worker machines may be freed from all running processes. These worker machines 111 can then be subsequently turned off or idled. The compactor 110 decides when and how processes are moved. These decisions are based on statistical information of the processes running on the other worker machines and of the processes potentially being moved, and on parameters specifying for example the maximum number of worker machines to be shut down and/or idled.
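A minimal sketch of one possible weighting scheme for the collected statistics, so that newer resource usage reports outweigh older ones; the exponential decay and its factor are illustrative assumptions rather than the patent's method.

```python
# Sketch of exponentially weighted per-class statistics.
class WeightedStats:
    def __init__(self, decay: float = 0.9):
        self.decay = decay        # closer to 1.0 = longer memory
        self.mean = 0.0
        self.var = 0.0
        self.initialized = False

    def update(self, sample: float) -> None:
        if not self.initialized:
            self.mean, self.initialized = sample, True
            return
        delta = sample - self.mean
        self.mean += (1.0 - self.decay) * delta
        self.var = self.decay * (self.var + (1.0 - self.decay) * delta * delta)

stats = WeightedStats()
for usage in [0.30, 0.32, 0.29, 0.40]:   # periodic CPU usage reports
    stats.update(usage)
print(stats.mean, stats.var)
```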
In Figs. 2-4 the following variables and parameters are used. $V = \{v_1, v_2, \dots, v_N\}$ is the set of worker machines. The set of worker machines changes dynamically; the number of worker machines N may vary as worker machines are booted up or turned off. $P = \{p_1, p_2, \dots, p_M\}$ is the set of M processes. The set of running processes changes dynamically; M may change as processes start or terminate.

Each worker machine $v_j \in V$ executes a subset of processes $P_j$ such that $\bigcup_k P_k = P$ with $P_a \cap P_b = \emptyset \;\; \forall\, a \neq b$.
A quality-of-service constant q is introduced which is the maximum acceptable value of the Q-function for the desired quality of service, wherein the Q-function is defined as

$$Q_{v_j}(x) = P(X > x) = \frac{1}{2}\operatorname{erfc}\!\left(\frac{x - \mu_{Total}}{\sigma_{Total}\sqrt{2}}\right),$$

with x representing for example a maximum allowed CPU usage on a worker machine and X the Gaussian distributed total resource usage on worker machine $v_j$ with mean $\mu_{Total}$ and variance $\sigma^2_{Total}$. fragmentedThreshold is a constant specifying the number of worker machines that may be allowed to have inefficient process allocations, wherein fragmentedThreshold $\le$ N. This avoids a too frequent moving of processes across the worker machines. fragmentedMachineCtr is the actual number of worker machines with sub-optimum process allocations, which is initialized to 0 at start and varies during the distributed computing system's operation.
The slack s is a positive integer specifying the number of worker machines to be kept idle in anticipation of a future processing demand. The corresponding slack worker machine set is defined by $S = \{v'_1, v'_2, \dots, v'_s\}$ with $V \cap S = \emptyset$. s can be a constant or a parameter dependent on the statistical information of the processes or on an incoming process rate.
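As a sketch, the bookkeeping introduced above could be represented as follows; the names mirror the patent's notation, while the container types are illustrative assumptions (the sketches after Figs. 2-4 below build on these definitions).

```python
# Sketch of the bookkeeping defined above.
from dataclasses import dataclass, field

@dataclass
class Process:
    mean: float        # mean resource usage mu(p_i), e.g. CPU fraction
    var: float         # variance of resource usage sigma^2(p_i)

@dataclass
class Worker:
    processes: list = field(default_factory=list)   # the subset P_j

V = [Worker() for _ in range(4)]   # worker machines v_1 .. v_N
S = [Worker()]                     # slack machines; V and S are disjoint
q = 0.01                           # max acceptable Q-function value (QoS)
s = 1                              # number of worker machines kept idle
fragmented_threshold = 2           # fragmentedThreshold
fragmented_machine_ctr = 0         # fragmentedMachineCtr
```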
Fig. 2 shows a flowchart of a part of a method according to a second embodiment of the present invention. In Fig. 2 a new process is inserted into the distributed computing system 1. When a new process $p_i$ arrives, statistical information of its resource usage, for example mean $\mu(p_i)$ and standard deviation $\sigma(p_i)$ of its CPU usage or of another parameter such as network bandwidth usage, is obtained in a first step S1. If the process can be recognized based on its features, e.g. its source code, then statistical training data from previous processes is used to classify the process into a certain process class and reference statistical information from the training data is obtained. Such classification may be performed via machine learning approaches such as K-means clustering, hierarchical clustering, neural networks, SVM or the like.
If the process cannot be classified, the process $p_i$ is run on a machine $v_l \in S$, and $S = S - \{v_l\}$, $V = V \cup \{v_l\}$ is updated in a second step S2 and insertion of the process is completed. The resource usage of process $p_i$ will be recorded and its statistical information is computed as time progresses, allowing $p_i$ to be included in steps S1-S5 of Fig. 4.
In a third step S3 a worker machine $v_j$ is searched such that the processes running on this worker machine, set $P_j = \{p_{j1}, p_{j2}, \ldots, p_{jR}\}$, satisfy

$$Q\left(\frac{x - \mu_{Total}}{\sigma_{Total}}\right) \le q \qquad \text{(A)}$$

wherein

$$\mu_{Total} = \mu(p_i) + \mu(p_{j1}) + \mu(p_{j2}) + \cdots + \mu(p_{jR})$$
$$\sigma^2_{Total} = \sigma^2(p_i) + \sigma^2(p_{j1}) + \sigma^2(p_{j2}) + \cdots + \sigma^2(p_{jR})$$

$(\mu_{Total}, \sigma^2_{Total})$ are the mean and variance of the total sum of the Gaussian distributed resource usages of all processes if process $p_i$ is executed on worker machine $v_j$, if applicable along with the $R$ other processes already executed by the worker machine.
If multiple worker machines can satisfy the above condition (A), then the worker machine with the largest $Q_j$ among these is chosen in a fourth step S4. If such a worker machine $v_j$ is found where the $q$ threshold is satisfied, in a first fourth substep S4a the process $p_i$ is started in virtual machine $v_j$, the set $P_j$ is updated, and the insertion of the process is completed. If no worker machine is found, the process $p_i$ is run on a slack machine $v_l \in S$, $S = S - \{v_l\}$, $V = V \cup \{v_l\}$ is updated, and if $|S| < s$ then a new worker machine $v'$ is booted and added to $S$, $S = S \cup \{v'\}$, in a second fourth substep S4b.
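Steps S3 and S4 might be sketched as follows, reusing q_value from the earlier sketch; worker objects with processes, mu (mean) and var (variance) attributes are assumptions of this illustration, not the patent's implementation:

```python
import math

def insert_process(p, workers, slack, q, x):
    """Fig. 2, steps S3/S4 (sketch): place process p on a worker machine
    whose combined Gaussian resource usage still satisfies condition (A)."""
    candidates = []
    for w in workers:
        mu_total = p.mu + sum(r.mu for r in w.processes)
        var_total = p.var + sum(r.var for r in w.processes)
        qj = q_value(x, mu_total, math.sqrt(var_total))
        if qj <= q:  # condition (A)
            candidates.append((qj, w))
    if candidates:
        # S4a: among all feasible machines pick the one with the largest Q-value
        _, best = max(candidates, key=lambda c: c[0])
        best.processes.append(p)
    elif slack:
        # S4b: no feasible worker machine, run p on a slack machine instead
        v = slack.pop()        # S = S - {v}
        v.processes.append(p)
        workers.append(v)      # V = V + {v}
        # here a fresh machine would be booted into S if |S| < s
```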
Fig. 3 shows a flowchart of a part of a method according to a third embodiment of the present invention.
In Fig. 3 steps for removing a terminated process are shown in a flowchart.
When a process $p_i$ with mean $\mu(p_i)$ and standard deviation $\sigma(p_i)$ terminates, the following steps are performed:
In a first step S1 the worker machine $v_j$ where process $p_i$ is executing is identified.
In a second step S2 this process is removed from the worker machine, and $P_j$ is updated accordingly.
In a third step S3, $|P_j| = 0$ is checked. If the worker machine is empty and $|S| < s$, worker machine $v_j$ is added to $S$, $S = S \cup \{v_j\}$, in a first third substep S3a; otherwise, the empty worker machine $v_j$ is shut down. In a second third substep S3b the variable $fragmentedMachineCtr$ is incremented if $|P_j| \neq 0$, i.e. if the worker machine continues to run with a possibly sub-optimum process allocation.
In a fourth step S4, if $fragmentedMachineCtr > fragmentedThreshold$, steps S1-S5 of Fig. 4 are performed in order to compact the processes running on the worker machines; a sketch of this removal procedure is given below.
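A corresponding sketch of the removal steps of Fig. 3, under the same assumptions as the earlier sketches (state is a hypothetical mapping holding the fragmentedMachineCtr and fragmentedThreshold values, and compact is the Fig. 4 sketch given further below):

```python
def remove_process(p, workers, slack, s, state):
    """Fig. 3 (sketch): remove a terminated process and reclaim machines."""
    w = next(m for m in workers if p in m.processes)  # step S1
    w.processes.remove(p)                             # step S2
    if not w.processes:                               # step S3: |P_j| == 0
        workers.remove(w)
        if len(slack) < s:
            slack.append(w)                           # S3a: keep it idle
        else:
            w.shutdown()                              # S3a: turn it off
    else:
        state["fragmented_machine_ctr"] += 1          # S3b (assumed condition)
    if state["fragmented_machine_ctr"] > state["fragmented_threshold"]:
        compact(workers, slack, s, state)             # step S4: run Fig. 4
```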
Fig. 4 shows a flowchart of a part of a method according to a fourth embodiment of the present invention. In Fig. 4 a flowchart is shown for moving already running processes between different worker machines.
In a first step S1 the Q-function value $Q_j$ for each worker machine $v_j \in V$ is computed.
In a second step S2 the set of all processes in $P$ is written into a list $L$ and sorted in descending order of the corresponding Q-function values of each process.
In a third step S3, for process $p_i$ running on worker machine $v_k$, wherein $i$ is the index counter of the list $L$, starting from 1 and going up to $M$, the following three substeps S3a-S3c are performed.
In the first third substep S3a a worker machine $v_j \in V - \{v_k\}$ is searched such that for the processes in set $P_j = \{p_{j1}, p_{j2}, \ldots, p_{jR}\}$ the condition $Q\left(\frac{x - \mu_{Total}}{\sigma_{Total}}\right) \le q$ and $Q_j > Q_k$ is satisfied.
Here, $\mu_{Total} = \mu(p_i) + \mu(p_{j1}) + \mu(p_{j2}) + \cdots + \mu(p_{jR})$ and $\sigma^2_{Total} = \sigma^2(p_i) + \sigma^2(p_{j1}) + \sigma^2(p_{j2}) + \cdots + \sigma^2(p_{jR})$ are the mean and variance of the sum of the Gaussian distributions that would result should process $p_i$ be executed on worker machine $v_j$, if applicable along with the other processes already executing on that worker machine.
In the second third substep S3b, if multiple worker machines can satisfy the above conditions (step S3a), then the worker machine with the largest $Q_j$ among these is chosen. In the third third substep S3c, if such a $v_j \in V - \{v_k\}$ is found, $p_i$ is moved to $v_j$, and $P_j$ and $P_k$ are updated; else it is continued with the next process in list $L$, with repetition of steps S3, S3a, S3b and S3c. In a fourth step S4, if $|P_k| = 0$ and $|S| < s$, the emptied worker machine $v_k$ is added to $S$, $S = S \cup \{v_k\}$; otherwise, worker machine $v_k$ is shut down.
In a fifth step S5 $fragmentedMachineCtr$ is set to 0.
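Under the same assumptions, the compaction loop of Fig. 4 could be sketched as follows; again this is an illustration, not the patent's reference implementation:

```python
import math

def machine_q(w, state):
    """Q-value of worker machine w under its current total load (step S1)."""
    mu = sum(r.mu for r in w.processes)
    var = sum(r.var for r in w.processes)
    return q_value(state["x"], mu, math.sqrt(var))

def compact(workers, slack, s, state):
    """Fig. 4 (sketch): move processes so some machines can be emptied."""
    # S2: list all processes, descending by the Q-value of their machine
    todo = sorted(((p, w) for w in workers for p in w.processes),
                  key=lambda pw: machine_q(pw[1], state), reverse=True)
    for p, src in todo:                                    # S3
        best = None
        for dst in workers:                                # S3a
            if dst is src:
                continue
            mu = p.mu + sum(r.mu for r in dst.processes)
            var = p.var + sum(r.var for r in dst.processes)
            qv = q_value(state["x"], mu, math.sqrt(var))
            if qv <= state["q"] and qv > machine_q(src, state):
                if best is None or qv > best[0]:           # S3b: largest Q
                    best = (qv, dst)
        if best is not None:                               # S3c: move p
            src.processes.remove(p)
            best[1].processes.append(p)
            if not src.processes:                          # S4: src emptied
                workers.remove(src)
                if len(slack) < s:
                    slack.append(src)
                else:
                    src.shutdown()                         # assumed method
    state["fragmented_machine_ctr"] = 0                    # S5
```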
According to Figs. 1-4 all $M$ processes $p_1, p_2, \ldots, p_M$ periodically report their resource usage, preferably per unit time, to the data collector 108. This resource usage data may first be aggregated on each worker machine 111 for the processes running in that worker machine 111 and then reported to the data collector 108 periodically. The model generator 107 periodically obtains statistical information from the data collector 108. A specific media-process description parser converts process descriptions of processes to be executed into a corresponding feature set. The processes are then classified into process classes with similar features using machine learning algorithms such as SVM or K-means clustering. The average statistics, preferably mean and variance, of the members of each process class are then calculated. Newer and older data may be weighted differently while calculating these average statistics. The classifier 106 uses the model generated by the model generator 107 to report the average statistics of a process class to the inserter 105. The compactor 110 may also query the model generator 107 to obtain the mean $\mu(p_i)$ and standard deviation $\sigma(p_i)$ of a particular process $p_i$ directly if this process has been running for a long time, as opposed to relying on classifier-based statistics. In this case the mean and standard deviation are calculated directly from this process's previous resource usage information stored in the data collector 108.
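One common way to weight newer data more heavily (the patent does not fix a scheme) is an exponentially weighted update; alpha is a hypothetical smoothing parameter:

```python
def update_class_stats(old_mu, old_var, new_mu, new_var, alpha=0.2):
    """Exponentially weighted update of a process class's statistics so that
    recent resource usage reports dominate older ones (a simplification)."""
    mu = (1 - alpha) * old_mu + alpha * new_mu
    var = (1 - alpha) * old_var + alpha * new_var
    return mu, var
```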
In summary, the present invention enables using both the average and the variability of resource demands in decision making, significantly improving the ability to trade off quality-of-service control against required resources. Further, the present invention enables data analysis of the actual resource usage of processes, in particular media processes, in real time to discover statistical patterns of resource usage on a per-process basis and the clustering/classification of processes. The present invention further enables the conversion of raw resource usage data into, preferably media-, process-specific statistical information used subsequently in decision making about process placement and optimization within the distributed computing system. The present invention further enables an extraction of feature sets from process descriptions, preferably media process descriptions, as an enabler for unique classification and clustering into process categories. Even further, the present invention provides a real-time learning loop for the continuous updating of the models used to ascertain a new process's resource requirements, making the compaction component more tuned to new data as time progresses. Even further, the present invention enables squeezing out inefficient process allocations when processes terminate and worker machines are under-utilized, and the ability to control the amount of slack, i.e. redundant, machines for a fast start-up of processes based on parameters that may depend on the previous history of process arrivals into the distributed computing system.
The present invention guarantees quality of service for processes, in particular media processes, in SLA-like terms, for example "this is ... not permit a worker machine to run over 100% CPU usage over 1% of the time". The average and the variation of processes are taken into consideration. Further, the present invention provides a trade-off between the number of worker machines and the quality-of-service, and the ability to learn process characteristics from historical data and to use this to influence resource allocation decisions in distributed computing systems. The present invention further allows a more precise determination of the resource usage of processes by consideration of statistical information. The present invention allows to accurately predict the fraction of time during which a worker machine would be over its maximum capability when multiple processes are running on the worker machine. Further, the present invention has the advantage that the user is provided with a fine-grained tool, by using the q-value threshold, to determine how much resource usage should be assigned to the processes in the distributed computing system. The present invention further provides fast process start-up by maintaining slack capacity so that processes do not have to wait for worker machines to start up. Many modifications and other embodiments of the invention set forth herein will come to mind to one skilled in the art to which the invention pertains having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims

1. Method for executing processes, preferably media processes (102), on a worker machine (111) of a distributed computing system (1), with a plurality of worker machines (111, 112),
comprising the steps of
a) Selecting one of the worker machines (111) out of the plurality of worker machines (111, 112) for execution of a process (102) to be executed in the distributed computing system (1) and transferring said process (102) to the selected worker machine (111),
b) Executing the transferred process (102) on the selected worker machine (111), and
c) Removing the executed process (102) from the selected worker machine (111) after the execution of the process (102) has finished, characterized in that
statistical information of resource usage of the process (102) to be executed on one of the worker machines (111) is collected and that the selection of the one worker machine (111) is based on a probability resource usage qualifier, wherein the probability resource usage qualifier is extracted from combined statistical information of the process (102) to be executed and already executed and/or executing processes (102) on the worker machine (111).

2. The method according to claim 1, characterized in that the process (102) is classified into a process class according to statistical information of resource usage of a process (102) to be executed, wherein processes (102) in the same process class have similar resource usage characteristics.

3. The method according to claim 2, characterized in that processes (102) in the same process class have the same type of content data, preferably media content data, to be processed.

4. The method according to one of the claims 1-3, characterized in that worker machines (111, 112) having no processes (102) to execute are either shut down or assigned to be a slack machine (112), wherein a slack machine (112) is kept idle for execution of future processes (102).
5. The method according to one of the claims 1-4, characterized in that a process on a first worker machine (111) having a first probability resource usage qualifier is moved to a second worker machine (111) having a second probability resource usage qualifier when the second probability resource usage qualifier is higher than the first probability resource usage qualifier.

6. The method according to claim 5, characterized in that a moving of a process to another worker machine (111) is checked when another process (102) on a worker machine (111) is terminated on that worker machine (111).

7. The method according to one of the claims 5-6, characterized in that a moving of a process to another worker machine (111) is only performed if the number of worker machines (111) having an inefficient resource usage is below a fragment threshold.

8. The method according to one of the claims 4-7, characterized in that the number of slack machines (112) is determined according to a parameter, wherein the parameter is dependent on process classes, statistical information of processes (102) to be executed and/or executing, and/or the rate of processes to be executed and/or terminated on the distributed computing system (1).

9. The method according to one of the claims 1-8, characterized in that process classes are determined by K-means clustering, hierarchical clustering, the use of neural networks and/or the use of a support vector machine.
10. The method according to one of the claims 1-9, characterized in that the resource usage of processes executed on the distributed computing system (1) is collected periodically.

11. The method according to one of the claims 1-10, characterized in that the probability resource usage qualifier represents a probability of a process (102) exceeding a predefined maximum resource usage on a worker machine (111).

12. The method according to one of the claims 1-11, characterized in that the combined statistical information of the process (102) to be executed and already executed and/or executing processes (102) on the worker machine (111) is weighted for extraction of the probability resource usage qualifier.

13. The method according to one of the claims 1-12, characterized in that the probability resource usage qualifier for media processes (102) following a Gaussian distribution for resource usage is based on the corresponding Q-function.

14. The method according to one of the claims 1-13, characterized in that the worker machine (111) having the highest probability resource usage qualifier below a certain threshold is selected for execution of the process (102).
15. Distributed computing system (1) for executing processes (102), preferably media processes, and preferably for use with the method according to one of the claims 1-14, comprising
a plurality of worker machines (111, 112) for execution of processes (102) in the distributed computing system (1) and
an inserter (105) for selecting one of the worker machines (111, 112) out of the plurality of worker machines (111, 112) for execution of a process (102) to be executed and transferring said process (102) to the selected worker machine (111) for execution,
characterized by the inserter (105) being operable to select one worker machine (111) based on a probability resource usage qualifier, wherein the probability resource usage qualifier is extracted from combined statistical resource usage information of the process (102) to be executed and already executed and/or executing processes.
16. The distributed computing system according to claim 15, characterized by a classifier (106) for classifying the process (102) to be executed into a process class, wherein processes (102) in the same process class have similar resource usages.
17. The distributed computing system according to claim 15 or 16, characterized by a model generator (107) for generating and/or updating a model for different process classes, preferably based on data from executed and/or executing processes (102).
18. Use of the method according to one of the claims 1-14 and/or the distributed computing system according to one of the claims 15-17 for executing media processes.
PCT/EP2012/059911 2012-05-25 2012-05-25 Method for executing processes on a worker machine of a distributed computing system and a distributed computing system WO2013174451A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2012/059911 WO2013174451A1 (en) 2012-05-25 2012-05-25 Method for executing processes on a worker machine of a distributed computing system and a distributed computing system
US14/402,409 US20150113539A1 (en) 2012-05-25 2012-05-25 Method for executing processes on a worker machine of a distributed computing system and a distributed computing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2012/059911 WO2013174451A1 (en) 2012-05-25 2012-05-25 Method for executing processes on a worker machine of a distributed computing system and a distributed computing system

Publications (1)

Publication Number Publication Date
WO2013174451A1 true WO2013174451A1 (en) 2013-11-28

Family

ID=46331227

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2012/059911 WO2013174451A1 (en) 2012-05-25 2012-05-25 Method for executing processes on a worker machine of a distributed computing system and a distributed computing system

Country Status (2)

Country Link
US (1) US20150113539A1 (en)
WO (1) WO2013174451A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8990826B2 (en) * 2012-06-06 2015-03-24 General Electric Company System and method for receiving analysis requests and configuring analytics systems
EP3103070B1 (en) * 2014-02-07 2023-09-13 Cylance Inc. Application execution control utilizing ensemble machine learning for discernment
US10733023B1 (en) 2015-08-06 2020-08-04 D2Iq, Inc. Oversubscription scheduling
US9946574B2 (en) * 2016-01-04 2018-04-17 Wal-Mart Stores, Inc. System for adaptive determination of computing resources and method therefor
US10614018B2 (en) * 2016-05-28 2020-04-07 International Business Machines Corporation Managing a set of compute nodes which have different configurations in a stream computing environment
US10001983B2 (en) * 2016-07-27 2018-06-19 Salesforce.Com, Inc. Rolling version update deployment utilizing dynamic node allocation
US10860545B2 (en) 2017-03-24 2020-12-08 Microsoft Technology Licensing, Llc Measuring usage of computing resources
US10725979B2 (en) * 2017-03-24 2020-07-28 Microsoft Technology Licensing, Llc Measuring usage of computing resources by storing usage events in a distributed file system
US11561836B2 (en) * 2019-12-11 2023-01-24 Sap Se Optimizing distribution of heterogeneous software process workloads

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6560717B1 (en) * 1999-12-10 2003-05-06 Art Technology Group, Inc. Method and system for load balancing and management
US7209438B2 (en) * 2001-05-22 2007-04-24 Mitsubishi Electric Research Laboratories, Inc. Method and system for assigning circuits to a new service request in a communications network
US7093250B1 (en) * 2001-10-11 2006-08-15 Ncr Corporation Priority scheduler for database access
US8713179B2 (en) * 2005-10-04 2014-04-29 International Business Machines Corporation Grid computing accounting and statistics management system
WO2008132924A1 (en) * 2007-04-13 2008-11-06 Nec Corporation Virtual computer system and its optimization method
US8312460B1 (en) * 2007-05-22 2012-11-13 Hewlett-Packard Development Company, L.P. Allocating computer resources to workloads using utilization based probability distributions
US8918761B1 (en) * 2008-12-05 2014-12-23 Amazon Technologies, Inc. Elastic application framework for deploying software
US9207984B2 (en) * 2009-03-31 2015-12-08 Amazon Technologies, Inc. Monitoring and automatic scaling of data volumes
US8560465B2 (en) * 2009-07-02 2013-10-15 Samsung Electronics Co., Ltd Execution allocation cost assessment for computing systems and environments including elastic computing systems and environments
US8250582B2 (en) * 2009-09-25 2012-08-21 International Business Machines Corporation Chargeback reduction planning for information technology management
US8443373B2 (en) * 2010-01-26 2013-05-14 Microsoft Corporation Efficient utilization of idle resources in a resource manager
US8661136B2 (en) * 2011-10-17 2014-02-25 Yahoo! Inc. Method and system for work load balancing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060010449A1 (en) * 2004-07-12 2006-01-12 Richard Flower Method and system for guiding scheduling decisions in clusters of computers using dynamic job profiling
US20090172168A1 (en) * 2006-09-29 2009-07-02 Fujitsu Limited Program, method, and apparatus for dynamically allocating servers to target system
US20080313637A1 (en) * 2007-06-13 2008-12-18 Hee Yong Youn Prediction-based dynamic thread pool management method and agent platform using the same

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130067261A1 (en) * 2008-03-10 2013-03-14 Ted A. Carroll System and method for computer power control
US8972756B2 (en) * 2008-03-10 2015-03-03 Aptean Systems, Llc System and method for computer power control
EP3399431A1 (en) * 2017-05-05 2018-11-07 Servicenow, Inc. Shared machine learning
US10380504B2 (en) 2017-05-05 2019-08-13 Servicenow, Inc. Machine learning with distributed training
US10445661B2 (en) 2017-05-05 2019-10-15 Servicenow, Inc. Shared machine learning
US11620571B2 (en) 2017-05-05 2023-04-04 Servicenow, Inc. Machine learning with distributed training

Also Published As

Publication number Publication date
US20150113539A1 (en) 2015-04-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12729018

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 14402409

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12729018

Country of ref document: EP

Kind code of ref document: A1