GB2604640A - Performing a processing task instructed by an application - Google Patents

Performing a processing task instructed by an application

Info

Publication number
GB2604640A
Authority
GB
United Kingdom
Prior art keywords
layer
output
neural network
decision
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2103414.5A
Other versions
GB202103414D0 (en)
Inventor
Yucel Mehmet
Saá-Garriga Albert
Power Luther
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to GB2103414.5A priority Critical patent/GB2604640A/en
Publication of GB202103414D0 publication Critical patent/GB202103414D0/en
Priority to PCT/KR2022/003467 priority patent/WO2022191668A1/en
Publication of GB2604640A publication Critical patent/GB2604640A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/529Depth or shape recovery from texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/16Indexing scheme for image data processing or generation, in general involving adaptation to the client's capabilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/24Indexing scheme for image data processing or generation, in general involving graphical user interfaces [GUIs]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A computer-implemented method of performing a processing task instructed by an application executing on a computing device, wherein the processing task generates a task output based on a task input. The method obtains (302) a neural network trained to perform the processing task and comprising a plurality of intermediate output layers. The method further obtains (304) a decision-making data structure comprising data representing a plurality of states related to performance of the processing task, and data representing a plurality of selectable actions. The method obtains (306) requirements data describing requirements of the application and uses (308) the requirements data and the decision-making data structure to select an action to perform in relation to a layer of the neural network, and performs (310) the selected action to generate the task output for the application. The neural network comprises a convolutional neural network with encoder and decoder layers. The decision-making data structure comprises a reinforcement learning model, such as a Q-Learning Q-Table. The plurality of states comprises operational states of the device, such as temperature and load. The processing task comprises image processing, such as depth estimation.

Description

Performing a Processing Task Instructed by an Application

The present invention relates to performing a processing task instructed by an application.
In a majority of on-device Artificial Intelligence (AI) solutions, Machine Learning (ML) models are responsible for computational bottlenecks. Once deployed, ML architectures are typically static and operate under a fixed budget in terms of processing requirements, such as inference time. This static nature of such ML models can create a non-adaptive compute environment, leading to suboptimal performance. A conventional single model/single inference path requires full execution of the model to generate an output, and only one output is generated.
It is not usually known how long it will take to obtain the output in a particular instance. Thus, the application that has instructed the ML model to perform a processing task will have to wait for the neural network to finish, possibly causing problems such as freezing a user interface. The time when the expected output is available will be unknown as it can depend on the status of the computing device. Also, the minimum possible inference time is unknown because it will depend on factors such as device status and network size. The maximum time it may take to generate the task output is also unknown because that can also depend on the device status and network size.
Conventionally, many ML models useable to perform a processing task (e.g. depth estimation or image segmentation) are trained for specific devices, even though all these models are targeted to solve the same processing task. Training all these different models takes up resources, such as many hours of GPU time. Further, all the different models also require storage space.
Embodiments of the present invention aim to address at least one of the above problems. Embodiments can adapt neural network inference to device processor and/or application requirements, offering a new inference pipeline that works for various ML models. It can enable multiple outputs with varying accuracy and inference speed profiles. Executing a specific set of layers of the neural network can allow the processing task to be completed to meet the particular requirements of the application that instructs the performance of the processing task. For instance, in a case where the processing task comprises an image processing task, executing certain layers of the neural network (as determined by an embodiment) may result in different types of results that can meet different requirements, e.g. a "best quality" result in a 100 ms inference time, a "fast" result in a 60 ms inference time, and a "fastest" result in a 20 ms inference time. In another example, the types of results can comprise lowest quality/lowest latency, low quality/low latency, and good quality/high latency.
Embodiments may train one ML model/neural network for performing a processing task and that model can be used for various computing devices and/or applications. Execution of the model may be adapted (and does not need to be pre-adapted) to the requirements of the device processor and/or the instructing application. A decision taking algorithm can decide the best network path (i.e. which layers of the neural network to execute) according to device state and/or application requirements. Embodiments can therefore enable control over compute requirements and allow adaptive allocation of compute resources. Embodiments can selectively execute only specific layers of a trained neural network and may store (or may otherwise provide to an application) one or more outputs of one or more of the intermediate layers of the network, unlike in a conventional model where all layers of the network are executed in sequence, regardless of application requirements, to generate a single output.
Embodiments can train one model that is usable for performing a processing task and execution of the model can be adapted for different devices or application instances.
Embodiments can provide an output to the instructing application as soon as possible (if required by the application) according to the decisions taken by an embodiment. Embodiments can produce intermediate results (according to decisions learned that take into consideration the state of the device and application requirements, in particular) and may obtain the best possible output quality given a particular time budget. In embodiments the expected output may be generated in a minimum time that a device's processors allow. The minimum inference time can be the time taken to execute the first intermediate layer of the model (and provide/store the intermediate output of that layer). The maximum time required to perform the processing task will not change because that will involve executing all layers of the model. Thus, embodiments can stop execution of the processing task by the neural network when the quality of the output currently generated by execution of the model is satisfactory. Alternatively, embodiments may follow multiple inference paths (e.g. various intermediate layers to various intermediate outputs, and/or a main path through all the layers to the final output) for varying performances of the task. Embodiments may therefore be used to select execution of layers of the neural network such that execution time and/or output quality can vary according to the requirements specified by the instructing application. Embodiments can be used for a wide range of on-device AI solutions.
According to a first aspect of the present invention there is provided a computer-implemented method of performing a processing task instructed by an application executing on a computing device, wherein the processing task generates a task output based on a task input provided by the application, the method comprising: obtaining a neural network trained to perform the processing task and comprising a plurality of intermediate output layers; obtaining a decision-making data structure comprising data representing a plurality of states related to performance of the processing task, and data representing a plurality of actions selectable for each said state; obtaining requirements data describing requirements of the application, the requirements comprising at least a time budget for completing the processing task, and a target quality for the task output; using the requirements data and the decision-making data structure to select a said action to perform in relation to a said layer of the neural network, and performing the selected action in relation to the layer of the neural network based on the task input to (directly or indirectly) generate the task output for the application.
The neural network may comprise a convolutional neural network having a plurality of encoder layers and a plurality of decoder layers. At least one of the plurality of decoder layers may have at least one associated said intermediate output layer.
The action may be selected from the plurality of actions comprising: stop after providing output generated by a current (intermediate output) layer of the neural network as the task output; stop after storing the output generated by the current layer; store the output generated by the current layer and continue to execute next said layer of the neural network; execute next said layer of the neural network (without the current layer generating an output).
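For illustration, the four selectable actions might be represented as follows (a minimal Python sketch; the identifiers are assumptions, not taken from the claims):

```python
from enum import Enum, auto

class Action(Enum):
    STOP_AND_RETURN = auto()     # provide the current layer's output as the task output, then stop
    STOP_AND_STORE = auto()      # store the current layer's output, then stop
    STORE_AND_CONTINUE = auto()  # store the current layer's output and execute the next layer
    CONTINUE = auto()            # execute the next layer without generating an output
```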
The plurality of states may comprise: a remaining time (in relation to the time budget), a current output quality (in relation to the target quality), and a currently executed said layer of the neural network. The decision-making data structure may be arranged to comprise a set of the states and the actions for each said layer of the neural network.
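Similarly, a state as just described could be encoded as a simple record (again an illustrative sketch; the field names are assumptions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    layer_index: int          # the currently executed layer of the neural network
    remaining_time_ms: float  # remaining time in relation to the time budget
    current_quality: int      # current output quality in relation to the target quality
```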
The decision-making data structure may be used to select a said action from amongst the plurality of actions for a said state based on a reward value. For instance, the action of stopping after executing a current layer of the neural network may be selected as a result of a high reward value being computed when an output generated by the current layer of the neural network is determined to be within the time budget and/or meets the target quality.
The decision-making data structure may be used to select the action of stopping after executing a current layer of the neural network when an output generated by the current layer of the neural network is determined to be within the time budget and/or meets the target quality.
The decision-making data structure may comprise, or be based on, a Machine Learning model, such as a Reinforcement Learning model. The decision-making data structure may comprise a Q-Learning Q-Table. The decision-making data structure may comprise a neural network trained using the Q-Table, e.g. trained to output actions corresponding to highest rewarded actions of the Q-Table.
The plurality of states in the decision-making data structure may further comprise at least one operational state (e.g. temperature, CPU/GPU load, etc) of the device executing the application. The method may further comprise: obtaining device operational state data describing at least one operational state of the device executing the application, and using the device state data and the decision-making data structure to select a said action to perform in relation to a said layer of the neural network (e.g. by inputting the device state data to the decision-making data structure).
The method may comprise selecting two or more of the actions and performing the two or more selected actions in parallel.
The method may comprise: generating a plurality of decision-making data structures on a respective plurality of computing devices, and transferring data representing the generated plurality of decision-making data structures to the computing device to update the decision-making data structure obtained by the computing device.
The CNN may comprise a plurality of encoder layers and a plurality of decoder layers corresponding to the plurality of intermediate output layers. The processing task may comprise an image processing task, such as depth estimation or another task, such as speaker recognition.
According to another aspect of the present invention there is provided a method of generating a (convolutional) neural network substantially as described herein. The method may comprise: inputting training data to an encoder module of the network; inputting output of the encoder module to a layer of a decoder of the network; checking whether the decoder layer is a final layer in the decoder, and, if it is not then checking whether the decoder layer has an associated intermediate output layer, and if there is an associated intermediate output layer then the intermediate output layer is applied to output of the decoder layer, and storing an output of the intermediate output layer for training the network.
The method may further comprise: (i) performing backwards propagation from the final layer of the decoder; (ii) setting a counter (N) as a number of intermediate output layers of the final decoder; (iii) calculating a loss of the intermediate layer corresponding to the counter (N); (iv) performing backwards propagation from the intermediate layer corresponding to the counter (N); (v) decrementing the counter (N) by 1; (vi) checking whether the counter (N) is 0 and if it is then processing a next input of the training data, otherwise, returning to step (iii).
According to another aspect of the present invention there is provided a method of generating a decision-making data structure substantially as described herein. The method may comprise training the decision-making data structure comprising an ML data structure, such as a Q-Table, using the CNN trained for performing the processing task.
According to yet another aspect of the present invention there is provided a computing device configured to perform any of the methods described herein.
According to yet another aspect of the present invention there is provided a computing system comprising a first computing device configured to generate a neural network substantially as described here, and at least one further computing device configured to generate a decision-making data structure substantially as described herein and/or perform a processing task instructed by an application substantially as described herein.
According to another aspect of the present invention there is provided a computer readable medium storing a computer program to operate any of the methods described herein.
According to the present invention, there is provided a method and apparatus as set forth in the appended claims. Other features of the invention will be apparent from the dependent claims, and the description which follows.
For a better understanding of the invention, and to show how embodiments of the same may be carried into effect, reference will now be made, by way of example, to the accompanying diagrammatic drawings in which: Figure 1 is a block diagram of a computing device configurable to execute embodiments of the invention; Figure 2 is a flowchart illustrating steps related to embodiments of the invention; Figure 3 is a flowchart showing steps performed during an inference step of Figure 2; Figure 4 is a block diagram showing a software configuration of the computing device where the processing task comprises image depth estimation; Figure 5 is a flowchart showing steps performed by the embodiment of Figure 4; Figure 6 is a high-level diagram of the inputs and outputs of a decision-making data structure in an embodiment where a Q-Learning Q-Table is used; Figure 7 is a flowchart illustrating training of the Q-Table; Figure 8 illustrates an example of the Q-Table and related reward function; Figure 9 schematically illustrates an alternative embodiment where decisions are shared using multiple devices connected via a network; Figure 10 is a flowchart illustrating steps that can be performed to generate a neural network useable by embodiments; Figure 11 schematically illustrates an alternative embodiment where a CNN is used as the basis for the decision-making data structure, rather than the Q-Table; Figure 12 schematically illustrates an embodiment where the processing task comprises depth estimation; Figure 13 illustrates the details of the neural network for the example embodiment of Figure 12; Figure 14 is a diagram view of the architecture of Figure 13; and Figure 15 schematically compares performance of a depth estimation processing task using a known technique and according to an embodiment.
Figure 1 is a block diagram of a computing device 100 configurable to execute embodiments of the invention. The device will normally comprise, or be associated with, at least one processor 102, memory 104 and a communications interface 106. Examples of computing devices include desktop personal computers, servers, as well as mobile computing devices such as mobile telephones/smartphones and tablet computers. In some cases the device may further include a user component interface 108, such as a touchscreen. Other components and features of the device will be well-known to the skilled person and need not be described herein in detail.
Figure 2 is a flowchart illustrating steps related to embodiments of the invention. At step 202 an artificial neural network is trained to perform a processing task. Embodiments typically use a deep learning neural network, such as a convolutional neural network (CNN). The trained network will comprise an input layer and an output layer, as well as a plurality of intermediate (output/prediction) layers. The plurality of intermediate output layers will normally be interconnected in an ordered sequence. Each of the plurality of intermediate output layers can be configured to receive an input that is directly or indirectly based on the task input (i.e. from an input layer of the network that receives data representing the task input, or an output generated by a previous intermediate layer of the network) to produce an intermediate output. An example of how the neural network can be trained will be detailed below.
Neural networks can be trained to perform a wide range of different processing tasks and the term "processing task" used herein can be interpreted broadly, in general to cover any processing task that can be implemented using AI techniques. These can include tasks in the realm of image or speech processing. A non-exhaustive list of specific examples includes depth estimation, image segmentation, image classification/place recognition tasks and speaker recognition. Depth estimation is a dense prediction task and so embodiments can be produced to perform any other dense prediction task in computer vision, e.g. style transfer, salient object detection, semantic segmentation, background/foreground estimation, surface normal estimation, etc. These alternative processing tasks are compatible with adaptation of the depth estimation embodiment detailed below (i.e. the encoder/decoder architecture).
At step 204 a decision-making data structure that is usable to select, based on a set of inputs, an action to perform in relation to a layer of the trained neural network is generated. The decision-making data structure will typically be generated using an ML technique and examples are detailed below. In some embodiments the decision-making data structure comprises a Reinforcement Learning (RL) model that can predict the best combination of intermediate layers of the network to use, based on a reward (e.g. quality v performance) and, in some cases, inputs representing device configurations/states, such as CPU/GPU usage, etc. At step 206 a computing device uses the trained neural network and the decision-making data structure to perform a processing task that has been instructed by an application executing on the device. The device uses the decision-making data structure to predict the best intermediate layers to be executed based on the input (i.e. on-device inference of the RL model). Typically, the computing device will obtain (e.g. receive and store locally, or remotely access via a wireless network or the like) the trained neural network and the decision-making data structure.
The device uses the requirements of the application with the decision-making data structure to decide which layers of the trained neural network should be executed in order to fulfil those requirements. The output generated by the last layer selected using the decision-making data structure (which need not be the output layer or final intermediate layer of the network) can be returned to the application as the output of the task. The application can then use this task output for its intended purpose, e.g. to display a processed image.
Step 202 will typically be executed by a powerful computing system (e.g. desktop computer, a cloud service, etc) to generate data representing the trained neural network. That computing system will typically be owned/operated by a developer or service provider. The generated data (or a modified version thereof) can then be obtained by a separate user computing device (usually a less powerful computing device, such as a smartphone) that can perform steps 204 and 206. In some embodiments, step 204 can be at least partially performed by the powerful computing system and a simpler statistical model corresponding to the decision-making data structure is used by the user device, although it is also possible for the user device to execute the complex Deep Learning model.
Figure 3 is a flowchart showing steps performed by means of software instructions being executed by the computing device 100 during step 206 of Figure 2. It will be appreciated that at least some of the steps shown in the Figures herein may be re-ordered or omitted. One or more additional steps may be performed in some cases. Further, although the steps are shown as being performed in sequence in the Figures, in alternative embodiments at least some of them may be performed concurrently. It will also be understood that embodiments can be implemented using any suitable software, programming language, data editors, etc, and may be represented/stored/processed using any suitable data structures and formats.
At step 302 the computing device 100 can obtain the trained artificial neural network, e.g. as generated at step 202. The network will comprise a plurality of intermediate output layers usable for performing the processing task. At step 304 the device 100 can obtain the decision-making data structure, e.g. as generated at step 204.
The computing device 100 can execute at least one application, which will typically be executed based on user interaction with the device. During its execution the application can instruct (require/request) the performance of the processing task that uses the neural network.
This may be the result of user input or some other event. For instance, the application may be coded such that it detects when it needs to optimize performance/accuracy, e.g. at some point during execution the application can recognize that resources are becoming limited and then instruct the embodiment of Figure 3, e.g. when requiring a depth estimation processing task to be completed, to prioritize runtime over output quality of the task. The result of this could be, for example, that the first intermediate layer prediction output is returned as the task output and the whole neural network is not executed. As neural networks are essentially computational graphs, as long as the application is coded in a way such that it can dynamically call the whole or parts of this computational graph, it can have fine-grained control over the performance. In some cases an embodiment of the method can be served as a library and applications can use it through an API with the required device states/requirements. The instruction will typically provide task input data that is to be processed and the application will expect task output data to be returned.
At step 306 the device 100 can obtain requirements data describing the requirements of the application in relation to the performance of the processing task. The requirements can comprise at least a time budget for completing the processing task, and a target quality for the task output. The requirements may be decided during the design process of the application (e.g. which features to take into account such as the time budget, output quality, etc). The features can then be monitored by the application during execution and in some cases an interface may be provided in the application to enable a user to modify (or add/delete) at least one requirement, e.g. indirectly by allowing background apps to be hibernated, etc. The application requirements may also be fetched from the device's operating system. The requirements can be represented by data in any suitable manner, e.g. a duration in ms for the time budget, an expected end time related to the time budget, a numerical (e.g. X%) or other indicator (e.g. "low", "good" or "high" quality) related to the target quality, and so on. Data representing the requirements is transferred to the software that performs the method shown in Figure 3.
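By way of illustration only, such requirements data might be represented as a small structure like the following (the field names and types are assumptions, not taken from the description):

```python
from dataclasses import dataclass

@dataclass
class AppRequirements:
    time_budget_ms: float  # e.g. 200.0 for a 200 ms budget, or derived from an expected end time
    target_quality: str    # e.g. "low", "good" or "high"; a numerical (X%) indicator could equally be used
```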
At step 308 the device 100 uses the decision-making data structure (e.g. using a decoder component as described below in some embodiments) in combination with the requirements data to select an action to perform in relation to a layer of the trained neural network as part of performing the processing task.
At step 310 the device 100 performs the selected action (e.g. using a decoder component as described below in some embodiments) on the layer of the neural network based on the task input provided by the application. The task input will typically be initially inputted to the input layer of the neural network before being passed to the first of the plurality of intermediate layers. If directed by the decoder, based on the decision-making data structure, the output of the first intermediate layer may be passed to the next one (or more) intermediate layer(s). The output of one or more of the intermediate layers (or the output layer of the network, depending on the decision-making data structure) may be returned to the application as the task output.
Figure 4 is a block diagram showing a software configuration of the computing device 100 when operating an embodiment of the method where the processing task comprises an image processing task, such as image depth estimation. Input 400 is provided by an image processing application 402 being executed on the device and will typically comprise an image. The processing task can involve an encoder-decoder structure for mapping a colour image to a pixel-wise depth representation. The input image 400 is provided to the encoder 404, which can perform dense feature extraction and its output is passed to the decoder 406 that can predict a desired depth. The decoder comprises a plurality of decoder layers, each of which is implemented based on one of the intermediate layers of the trained neural network.
The embodiment can utilise a multi-output neural network that, given an input image and according to the current device state (CPU/GPU load, device temperature, CPU/GPU frequency, etc.) and application requirements (time budget, minimum quality), may maximize the output quality and the number of generated intermediate outputs in a given time budget. The embodiment can maximize the number of generated outputs and their quality on a given time budget, where outputs of the latest decoder layers (as selected using the decision-making data structure) will normally be preferred over earlier ones.
The decision-making data structure 408 is used by the device 100 to decide what action to perform in relation to an initial/current decoder layer. The method may stop and return the output of that intermediate output layer to the application, or it may continue to execute a subsequent/next layer (with or without storing the output of the current layer). In the example the action will be selected from a set of possible actions comprising: stop after providing the output generated by the current decoder layer of the neural network as the task output; stop after storing output generated by an intermediate output layer associated with the current decoder layer; store output generated by an intermediate output layer associated with the current decoder layer, and continue to execute the next decoder layer of the neural network; and execute the next decoder layer of the neural network (without an intermediate output layer generating an output).
As previously mentioned, embodiments can use the requirements data provided by the application 402 in combination with the decision-making data structure 408 to select the action. Optionally, embodiments may use other information/requirements. In the illustrated embodiment, information regarding the current state of the processor(s) 410 of the device 100 can also be provided and used in combination with the decision-making data structure to select the action. The application requirements and device state inputs are collectively shown in the Figure as state input 412. As detailed below, in some embodiments the decision-making data structure comprises a Q-Learning Q-Table and the states can be used in combination with the Q-Table.
Figure 5 is a flowchart showing steps performed by the embodiment of Figure 4. The flowchart schematically illustrates how the embodiment proceeds to process a task input based on decisions taken using the decision-making data structure 408, e.g. the actions selected as a result of providing state information to a Q-Table. Steps having a shaded (originally blue) background are novel to the embodiments disclosed herein compared to the non-shaded steps, which form part of a known depth estimation encoder-decoder pipeline (e.g. as disclosed in Jin Han Lee et al., From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation, arXiv:1907.10326, 2019).
The input batch 400 is initially passed to the encoder 404 and the encoder's output is then passed to the first decoder layer 406, which processes the data. The method uses a counter X (initially set to 0 or 1) for tracking the decoder layer that is currently processing data based on the task input. In the known pipeline the output of the first decoder layer would be used as an input to the second decoder layer, and the output of the second decoder layer would be used as an input to the third decoder layer, and so on, until the final layer in the decoder has been reached, at which point the final output is returned. However, according to embodiments, the decision-making data structure 408 can be used to selectively stop decoder processing after the execution of any of the decoder layers. Thus, the requirements data and the decision-making data structure may be used to select a said action to perform in relation to a decoder layer or an intermediate layer of the neural network. This is illustrated by step 504, where a question is asked whether the processing task should stop. If the answer is yes then control passes to step 506, where the output of the current layer X is returned to the application 402 as the task output. If the answer to the question is no then control passes to step 508, where a question is asked whether the current decoder layer X is the final layer in the decoder. If the answer is yes then at step 510 the output of that layer can be stored for future use and control then passes to step 522. The output will be saved in a temporary memory of the device allocated by the application (i.e. device RAM instead of the hard-drive/SSD memory of the device) to facilitate fast access by the application, which can recognise when the value of the stored output will be useful, e.g. when a similar state is reached or similar data is input. If the current layer X is not the final decoder layer then control passes to step 512. In some embodiments the save-output function might replace the previous output, average the current output with the previous one, blend them, etc. Execution of an intermediate layer might run in parallel to continued execution of the method, e.g. being output to a critical region of memory that will be controlled by a mutex system. This can allow the main network to continue with the input processing.
At step 512 a question is asked whether decoder layer X comprises an intermediate output layer of the neural network. A decoder layer and an intermediate layer are created using the same building blocks: a deconvolution layer (also called a transpose convolution or upsampling layer), but generally with different upsampling factors. Not all decoder layers necessarily have intermediate prediction layers connected to them (for instance, in Figure 14 detailed below there are only three intermediate prediction layers, but six decoder layers). Thus, step 512 checks if the current model architecture has an intermediate prediction layer connected to the current decoder layer X. If the answer to the question of step 512 is no then control passes to step 522. If the answer is yes then control passes to step 514, where a question is asked whether the output of the intermediate layer is required for the processing task according to the decision-making data structure. If the answer is yes then at step 516 the intermediate layer is applied to the data and that output is saved at step 518. Thus, decoder layer X will process the data at step 406 and, if there is an intermediate prediction layer attached to the current decoder layer and the application requires an intermediate output, the intermediate prediction layer X will process the current decoder layer's output (step 516) and save the resulting intermediate output at step 518. Following this, at step 520 a question is asked whether the processing should stop. If the answer is yes then control passes to step 506 where the output is returned to the application as the task output; otherwise, control passes to step 522. If the answer to the question of step 514 is no then control passes to step 522.
At step 522 the layer counter X is incremented by 1 and control passes to step 406, where decoder layer X is applied to the data.
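To make the control flow of Figure 5 concrete, the following Python sketch mirrors the loop described above. All names and the string action labels are illustrative assumptions, not taken from the patent; in practice, the decide callable would also receive the application requirements and device state (state input 412 of Figure 4).

```python
def run_inference(encoder, decoder_layers, aux_layers, decide, task_input):
    """Sketch of the Figure 5 loop.

    decoder_layers: ordered list of callables (the decoder layers applied at step 406)
    aux_layers:     dict mapping decoder-layer index -> intermediate output layer, where one exists
    decide:         the decision-making data structure; returns one of
                    "stop", "continue", "store_and_stop", "store_and_continue"
    """
    stored = []                                        # step 518: saved intermediate outputs
    x = encoder(task_input)
    for i, layer in enumerate(decoder_layers):
        x = layer(x)                                   # step 406: apply decoder layer X
        action = decide(i)                             # consult the decision-making data structure
        if action == "stop":                           # step 504 -> step 506: return current output
            return (aux_layers[i](x) if i in aux_layers else x), stored
        if i == len(decoder_layers) - 1:               # step 508 -> step 510: final layer, save and return
            stored.append(x)
            return x, stored
        if i in aux_layers and action in ("store_and_stop", "store_and_continue"):
            out = aux_layers[i](x)                     # step 516: apply intermediate output layer
            stored.append(out)                         # step 518: save intermediate output
            if action == "store_and_stop":             # step 520 -> step 506
                return out, stored
    return x, stored
```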
Figure 6 is a high-level diagram of the inputs and outputs of the decision-making data structure 600 in an embodiment where a Q-Learning Q-Table is used. Q-Learning (see, for example, Christopher JCH Watkins and Peter Dayan, Q-Learning, Machine Learning, 8(3-4): 279-292, 1992) is based on adaptive ML heuristics and learns based on experience as well as trial and error. A benefit of Q-Learning is that it does not require previously created training data; it is a "decision making" and "updating" ML algorithm that can be used in scenarios requiring a system to automatically learn and self-adjust without a previously generated training set. Q-Learning takes decisions or actions based on previous experiences and selects the best decision in order to reach or get closer to a specific goal. Actions lead to changes in system states. When there are no experiences, a randomly-selected action is performed. High rewards are given to good actions, whilst zero or low value rewards are given to bad actions. Experiences are stored in a data structure called the Q-Table. The Q-Table is updated when new experiences are gained from the environment. This feedback loop, with rewards and actions, allows the system to learn.
As well as the application requirements discussed above, some embodiments can optionally use other state descriptors related to the current state of the computing device 100 that is executing the application for prediction. In the illustrated example Q-Table 600 the inputs/variables for the states comprise:
* Currently-processed layer number (based on counter X of Figure 5)
* Device temperature
* Budget time (represents the remaining time according to a target time budget and is related to the budget time requirement provided by the application 402)
* Current quality reached (related to the target output quality requirement provided by the application 402)
* CPU/GPU load

It will be understood that these are merely exemplary and variations are possible. For instance, a state may relate to the condition of a hardware or software component, e.g. the temperature of a component or the outside environment; load of a processor; a value (e.g. a variable used by an application) stored in a register or other data store, and so on. The state may be obtained/computed directly by the device (e.g. by reading/processing data from its storage), or it may involve use of at least one sensor, network connection, etc. The four possible actions of the example Q-Table are the same as described above.
Figure 7 is a flowchart illustrating training of the Q-Table, which may be used when generating the decision-making data structure at step 204 of Figure 2, for example. At step 702 the Q-Table is initialised using the trained CNN. It can optimize the best possible output/quality point with given requirements (e.g. determined by a developer) and device states. Q-Table initialization may also be done randomly or in other ways.
At step 704 an action is selected from the Q-Table according to a policy. The policy is typically a compromise between exploration and exploitation. There is a random possibility of not taking the best Q value for each state (this probability decreases over time) and so all possibilities can be explored. At step 706 the chosen action is performed. At step 708 a reward value associated with performing the action is measured according to a reward function. At step 710 the Q-Table is updated with the calculated reward value for the state/performed action, and control can then pass back to step 704.
Figure 8 illustrates an example of the Q-Table 802 and related reward function 804. The format of each line/entry in a typical Q-Table is as follows: (State, Action) → Q-Value. For simplicity, Figure 8 only shows 3 input states: currently-processed layer number, remaining time in relation to budget time, and current output quality reached by the processing so far. In the table, light shaded actions 803A represent impossible states, medium shaded actions 803B represent termination points, and dark shaded actions 803C represent impossible actions. In the example the number of states is: Layer Num (3) x Budget Time (3) x Current Quality reached (3) = 27 states. The actions are: Stop; Get intermediate output & continue; Get intermediate output & stop; Continue (= 4 actions). The table size is therefore 27 x 4 = 108 elements. The function objective is to take the best action A that maximizes the cumulative reward.
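For illustration only, the example table above could be materialised as a Python dictionary keyed by (state, action) pairs; the bin labels are assumptions, but the element count matches the 27 x 4 = 108 figure in the text:

```python
from itertools import product

layer_nums = (0, 1, 2)                   # Layer Num (3)
budget_bins = ("short", "mid", "long")   # Budget Time (3 bins)
qualities = (0, 1, 2)                    # Current Quality reached (3)
actions = ("stop", "store_and_continue", "store_and_stop", "continue")

q_table = {(state, action): 0.0
           for state in product(layer_nums, budget_bins, qualities)
           for action in actions}
assert len(q_table) == 108               # 27 states x 4 actions, as in the text
```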
The Q-Learning process selects the best action based on the current state of the environment with the aim of reaching or getting closer to a specific goal. The action to be selected for a given state may be determined in various ways, e.g. using the known Algorithmic Temperature Function or Boltzmann Probability:

P(s, a) = e^(Q(s,a)/T) / Σ_b e^(Q(s,b)/T)

When there are no relevant experiences stored in the Q-Table, a randomly-selected action may be performed.
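As a concrete illustration, Boltzmann action selection over such a Q-Table might be sketched in Python as follows (the function and variable names are illustrative):

```python
import math
import random

def boltzmann_select(q_table, state, actions, temperature=1.0):
    """Boltzmann (softmax) selection: P(s, a) = exp(Q(s,a)/T) / sum_b exp(Q(s,b)/T)."""
    weights = [math.exp(q_table[(state, a)] / temperature) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```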
Performing actions leads to changes in system/environment states. The new state of the system is directly or indirectly the result of the action taken. The Q-Table is updated when new experiences are gained from the environment. A reward value can be calculated for each state based on a satisfaction formula. These rewards in effect are determined by the results of a given action. High rewards are given to good results, whilst zero or low rewards are given to bad results. The satisfaction formula represents the ideal results to be obtained by the system. The update formula updates Q-Values with new data obtained from the satisfaction formula.
Requirements for setting up a Q-Learning system normally include:
1. Define Actions (a)
   * Actions are what the agent does in the environment.
   * Example: move, sell, buy, update a particular device or application setting, etc.
2. Define Satisfaction or Reward formula (r)
   * Rewards indicate if the results are good or bad.
   * Rewards tell the learning algorithm the outcome of the actions.
   * Example: giving a robot a positive reward for moving in a straight line and a negative reward for hitting the wall; the robot will learn to move in a straight line and avoid walls.
3. Define States (s)
   * States represent the status of the system at any given time (t).
   * Example: current position in space for a robot, speed of a driving car, etc.
   * States can also be calculated values; for example the average speed of a car, the average fuel consumption of a car, the battery level of a robot, etc.
4. Define Learning Rate (α) and Discount Factor (γ)
   * The learning rate (α) defines how quickly an agent learns (longer learning could be more accurate and vice-versa).
   * The discount factor (γ) defines the importance given to future rewards: should the agent look for future high rewards or short-term smaller rewards?

An example function 804 for updating the Q-Table is:

Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]

The skilled person will appreciate that the above formula is illustrative and that alternative Q-Learning implementations can have multiple different variations of the formula. Provided that the Q-Value of the action gets updated by a suitable formula that uses the reward value obtained, it can still be a Q-Learning implementation.
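A minimal sketch of this update step in Python (the default alpha and gamma values are assumptions for illustration, not taken from the patent):

```python
def q_update(q_table, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One Q-Learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += alpha * (reward + gamma * best_next - q_table[(state, action)])
```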
An example of a reward scheme 806 is shown in Figure 8. The best reward is obtained when three outputs are produced; however, the reward is penalized for each additional millisecond and step taken. An example calculation 808 is also shown beneath the function 804.
An example of a possible pipeline path where the decisions are taken by the trained Q-Table will now be described. At the start of inference (stage 1) device resources are available and the states are as follows: Layer number = 0; Device temperature = 35; Budget time = 200 ms; Current quality = 0; CPU/GPU load = 40% free. The action chosen based on the Q-Table (MaxQ) is to continue without saving the output of the current layer (answer no at step 514 of Figure 5, leading to step 522).
Thus, X is incremented (step 522) and processing continues (stage 2) to layer 1 (step 406). Device resources are still available, but it is best to save the intermediate layer output as resources may be short in the future. The states are as follows at this point: Layer number = 1; Device temperature = 50; Budget time = 150 ms; Current quality = 0; CPU/GPU load = 35% free. The action chosen based on the Q-Table (MaxQ) is to continue and save the output of the current layer (answer yes at step 514, leading to steps 516, 518, 522).
Thus, X is incremented (step 522) and processing continues (stage 3) to layer 2. The states are as follows at this point: Layer number = 2; Device temperature = 55; Budget time = 50 ms; Current quality = 2; CPU/GPU load = 15% free. The action chosen based on the Q-Table (MaxQ) is to continue (answer no at step 514, leading to step 522).
Thus, X is incremented (step 522) and processing continues (stage 4) to layer 3. The states are as follows at this point: Layer number = 3; Device temperature = 60; Budget time = 10 ms; Current quality = 2; CPU/GPU load = 5% free. The action chosen based on the Q-Table (MaxQ) is to stop (answer yes at step 504, leading to step 506, where the output of layer 3 is returned as the task output).
In the above example use case, the quality of the task output obtained was 2, which was available since stage 2 of the inference. To demonstrate an alternative outcome, below is an example where the following state resulted at stage 4: Layer number = 3; Device temperature = 60; Budget time = 30 ms; Current quality = 2; CPU/GPU load = 10% free. In this case the action chosen based on the Q-Table (MaxQ) is to continue: answer no at step 504, leading to step 508, where X is determined to be the final layer of the network, leading to saving of the output of that layer at step 510, with that output being returned as the task output at step 506. Thus, the best quality of output (3) was obtained at stage 4. Additionally, output quality 2 was obtained at stage 2, which can mean that a second application/process could use that output at that point (no requirement to wait for quality 3).
Figure 9 schematically illustrates an alternative embodiment where decisions are shared using multiple devices 900A-900X connected via a network 902. In some cases it can be difficult to train the decision-making data structure for all possible application/device state combinations because these can result in very large numbers. However, each device can store and update its own version of the decision-making data structure, e.g. Q-Table 904, to effectively fill in "missing states" whilst training. Updated state data can be transferred to another/new device 906, e.g. directly or via a server on the communications network, which can perform a short training to update its own stored version of the Q-Table.
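The description does not specify how a received Q-Table is folded into the local one; one plausible sketch, assuming a simple weighted average for entries both devices have seen, is:

```python
def merge_q_tables(local, remote, weight=0.5):
    """Fold a Q-Table received from another device into the local one.

    The merge rule is an assumption for illustration: shared entries are
    blended; unseen entries fill in "missing states" explored elsewhere.
    """
    for key, remote_q in remote.items():
        if key in local:
            local[key] = (1.0 - weight) * local[key] + weight * remote_q
        else:
            local[key] = remote_q  # a "missing state" explored on the other device
```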
Figure 10 is a flowchart illustrating steps that can be performed in order to generate the neural network, e.g. at step 202 of Figure 2. Dotted lines/arrows 1001 indicate where the processing diverges from that of a typical CNN training pipeline (shown by solid lines/arrows).
The training data/input batch 1000 is initially passed to the encoder 404 and its output is then passed to the first decoder layer 406. The method uses a counter X (initially set to 0 or 1) for tracking a decoder/network layer that currently processes the data.
At step 1002 a question is asked whether decoder layer X is the final layer in the decoder.
If the answer is no then at step 1004 a question is asked whether the decoder layer has an intermediate output layer. If the answer is no then control passes to step 1006. If the answer is yes then control passes to step 1030, where the intermediate ("Aux" in the Figure) layer is applied to the data. Then, the output of the intermediate layer X is obtained at step 1032. Then, at step 1034, that output is added to the intermediate layer outputs and control then passes to step 1006. As further explanation, in the case of training a CNN without intermediate prediction layers, input is fed, output is obtained and this output is compared with the ground-truth annotation for the input image and then a loss is calculated based on this comparison. This loss is then backpropagated through the network and the network weights are updated. In the case of a CNN with intermediate prediction/output layers, the same process is followed, but for multiple outputs: the input is fed into the network and all outputs are obtained (intermediate and final). Each output is compared with the ground-truth annotation of the input image, losses are calculated and then summed to form the final loss, and then this loss is backpropagated through the network and the weights are updated. In step 1034 these intermediate output predictions are stored so that they can be used to train the model (as shown in steps 1016, 1018, 1020). Figure 12 (described below) shows another example training process, which indicates which intermediate output prediction affects which layer's weights during training.
If the answer to the question of step 1002 is yes then control passes to step 1008, where the output of the decoder layer X is obtained. Then, at step 1010, the output of the decoder layer's intermediate layer is obtained. Then, at step 1012, the loss of the final layer is calculated. Then, at step 1014, backwards propagation from the final layer is performed. Then, at step 1016, a counter N is set as the number of intermediate output layers of the final decoder. Then, at step 1018, the loss of intermediate layer N is calculated. Then, at step 1020, backwards propagation from the intermediate layer N is performed. Then, at step 1022, N is decremented by 1. Then, at step 1024 a question is asked whether N is 0. If the answer is yes then control passes back to step 1000, where the next input is processed. If the answer is no then control passes to step 1018.
At step 1006 the layer counter X is incremented by 1 and control passes to step 406, where decoder layer X is applied to the data.
Figure 11 schematically illustrates an alternative where a CNN is used as the basis for the decision-making data structure, rather than the Q-Table. The CNN only needs to be trained to behave like a trained Q-Table by using the Q-Table as ground truth and estimating the values of the Q-Table. The input states are normalised and then applied to a series of fully-connected (FC) layers of the CNN to produce Q values. A data structure, e.g. a simple look-up table, containing the best action for each state can then be stored and used as the decision-making data structure in some cases. Unlike the Q-Table, the CNN will need ground-truth pairs for input data and the embodiment will not be able to train itself with only the guidance of the rewards; it needs a dataset with well-defined input/output pairs. An example is as follows: 1. First, data and ground-truth annotation pairs are created. One example data point can be as follows: a. [Layer num=1, CPU Load 50%, Temp=60 Celsius, Budget time=10 ms, Current quality=Medium], [QVals = XXXX] b. This effectively means that the Q-CNN is shown what reward values it will get given such device states and actions (i.e. stop/continue to next layer, etc).
2. This can be done for the entire dataset and the CNN will learn to associate a device state with the best action to take in said device state. As the CNN required to replace the Q-Table can be less complex than the actual CNN that is trained to perform the processing task, this CNN can be a network with only a few layers. In alternative embodiments, other types of ML data structures may be used instead of a CNN, such as a simple multi-layer perceptron based model (i.e. convolutional layers can be replaced with fully connected layers).
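A hedged sketch of such a distillation in Python/PyTorch, assuming five normalised state inputs and four actions as in the example data point above (the layer sizes and names are illustrative assumptions):

```python
import torch
from torch import nn

# Small fully-connected network that learns to estimate the trained Q-Table's values:
# normalised state vectors in, one Q-value per action out.
q_net = nn.Sequential(
    nn.Linear(5, 32), nn.ReLU(),   # 5 inputs: layer num, CPU load, temperature, budget time, quality
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 4),              # 4 outputs: one Q-value per action
)

def distill_step(optimizer, states, q_targets):
    """states: (B, 5) normalised state tensor; q_targets: (B, 4) Q-values read from the Q-Table."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(q_net(states), q_targets)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example setup (illustrative): optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
```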
Figure 12 schematically illustrates a specific embodiment where the processing task comprises depth estimation and shows the training process for the neural network in this case. The embodiment can provide on-demand inference for interactive time response. If the device is busy then a lower quality/faster result (t) is obtained using the decision-making data structure. If the device is not busy then a high quality/slow result (t+3) can be obtained. Every loss term has its own computational graph (e.g. L3: its intermediate layer, decoder layers 1 and 0, plus the encoder). The embodiment can "sum" loss terms so that each term can retain its path. The embodiment can backpropagate the summed loss term, with each loss term updating its own path. Therefore: TotalLoss.backward() = L0.backward() + ... + Ln.backward(). In the Figure the "forward" lines 1202 (which include arrows) show what happens during the prediction phase. The "backprop" lines 1204 (lacking arrows) show the backpropagation of the losses and thus the training of the network. Anything related to the losses pertains to the training phase. Each loss is obtained with one prediction (intermediate or final) and then these losses are summed. When these losses are summed, each term actually updates the related layers, as shown in this Figure.
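The summed-loss training described here might be sketched in Python/PyTorch as follows. This is a sketch under the assumption that aux_layers maps decoder-layer indices (including the final one) to their prediction layers and that criterion is a suitable loss such as nn.L1Loss(); all names are illustrative, not taken from the patent.

```python
import torch

def train_step(encoder, decoder_layers, aux_layers, criterion, optimizer, images, target):
    """One training step with intermediate prediction losses (Figures 10/12).

    Each intermediate and final prediction is compared with the same ground truth;
    the loss terms are summed and backpropagated once, which is equivalent to the
    per-term backward passes of the flowchart, each term updating only its own path.
    """
    optimizer.zero_grad()
    x = encoder(images)
    losses = []
    for i, layer in enumerate(decoder_layers):
        x = layer(x)
        if i in aux_layers:                    # decoder layer with an intermediate output layer
            losses.append(criterion(aux_layers[i](x), target))
    total = torch.stack(losses).sum()          # TotalLoss = L0 + ... + Ln
    total.backward()                           # each term updates the layers on its own path
    optimizer.step()
    return total.item()
```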
Figure 13 illustrates the details of the neural network for the example embodiment of Figure 12. Vertical lines 1302 on the left indicate skip (residual) connections and horizontal lines 1304 below indicate intermediate layers.
In the Figure "Conv" means convolutional layer. For instance, Conv 3x3x3x32/s=2 means this layer takes in 3 channel input, has a kernel size of 3x3 and outputs 32 channels. It has a stride of 2. Default stride is 1. The convolution may be parameterized by (IC x KS x KS x OC), s, where IC: Input channel count, KS: Kernel size (width x height), OC: Output channel count, s: stride.
In Figure 13 "DW Conv" is the depth-wise convolution operator. For instance, DW Conv 32x3x3x32 means this layer takes in 32 channels, has a kernel size of 3x3 and outputs 32 channels. It has a default stride of 1, unless indicated (with s=X). The layer can be considered the same as convolution, except that only the number of channels is changed. It can be formed of a convolution layer, batch normalization and a ReLu activation. The layer may be parameterized by (IC, KSxKS, OC), s, where IC: Input channel count, KS: Kernel size (width x height), OC: Output channel count, s: stride. Stride is 1, unless explicitly stated.
"DW Res" is the depthwise residual bottleneck layer. For instance, DW Res 16,96,24/3x3/s=2 means it takes in 16 channels as input, generates 96 channels, and reduces to 24 channel output. It has a kernel size of 3x3 and a stride of 2. Default stride is 1, unless indicated (with s=X). The layer can comprise a depthwise convolution and a pointwise convolution (two combined are also known as depthwise-separable convolution). Pointwise convolution is convolution operator with kernel size of 1x1 and a batch norm (optionally and a ReLu activation). The layer may be parameterized by (IC, GC, OC, KSxKS,), s, and IC: Input channel count, GC: Generated Channel count, OC: Output channel count; KS: Kernel size (width x height), s= stride. Unless explicitly indicated, stride is 1.
"DW Ups" is the depthwise upsampling layer. For instance, DW Ups 160, 480"64 /5x5 / NN means it takes in 160 channels as input, generates 480 channels, and reduces to 64 channel output. It has a kernel size of 5x5. It uses nearest neighbor upsampling. Default stride is 1, unless indicated (with s=X). BIL means bilinear upsampling.The layer can comprise a depthwise convolution, interpolation (bilinear or nearest neighbor), depthwise convolution and a pointwise convolution. The layer may be parameterized by (IC, GC, OC, KSxKS, NN), s, and IC: Input channel count, GC: Generated channel count, OC: Output channel count; KS: Kernel size (width x height). NN: upsampling method (Nearest neighbor(NN) or bilinear upsampling(BIL)). S= stride, 1 unless stated.
In some embodiments the network architecture is based on a MobileNetv2 encoder + FBNet decoder layers. The known depth-in-the-wild (DIW) dataset is an example of suitable training data. ReLu activation, after nearly every convolutional operator, is an example of a suitable activation function. Some embodiments were implemented starting with 1e-4 as the learning rate, which decayed by half every epoch. Training was for 5 epochs and the Adam optimizer was used to optimize the network.
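A sketch of this training schedule, assuming PyTorch (model, criterion and dataloader are placeholder names):

    import torch

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # decay the learning rate by half every epoch
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)

    for epoch in range(5):          # training was for 5 epochs
        for x, target in dataloader:
            optimizer.zero_grad()
            loss = criterion(model(x), target)
            loss.backward()
            optimizer.step()
        scheduler.step()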
Figure 14 is a diagram view of the architecture of Figure 13. Lines 1402 indicate skip connections. The skip connection paths are listed below: * DW Res1 to Conv3 * DW Res2 to DW Ups 4 * DW Res4 to DW Ups 3 * DW Res7 to DW Ups 2 * DW Res16 to DW Ups 1 * Auxiliary layers are formed of DW Ups layers.
The intermediate layers are formed of DW Ups layers. These directly upsample to original resolution and output 1 channel depth maps.
Figure 15 schematically compares a depth estimation processing task using a known technique and according to an embodiment. Diagram 1502 shows the known RGB2Depth technique (arXiv: 2005.01616v2), whilst diagram 1504 shows an adapted version where a decision-making data structure according to an embodiment of the present invention is used. As can be seen, in the conventional technique, the neural network produces just one high quality output at the end of its execution. The instructing application will have to wait to obtain it and does not know how long this wait will be, as it does not control the device status. On the other hand, when the embodiment is used, the neural network can produce intermediate outputs according to the requirements of the instructing application. Additionally, it can ensure the best possible quality that the application budget and device status allow.
The example embodiments described above relate to depth estimation, where an encoder-decoder structure is used to map a colour image to a pixel-wise depth representation. This allows up-sampling of the layer features at each decoding stage in order to fully encode an image and transform it into an output of the same width x height dimensions. Alternative embodiments can be produced to perform other image-based processing tasks, such as style transfer, using a very similar network structure.
Any number of intermediate layers can be added, making embodiments flexible to various requirements. When producing an embodiment intended to perform a given processing task, there are two factors that need to be taken into account when designing the intermediate prediction layers and the architecture: 1) How many intermediate prediction layers would be suitable? 2) What upsampling factor (see above example) would be ideal? The answer to both questions will be related to the task itself and the network size. For instance, if there is a large network then several intermediate prediction layers may be provided to give fine-grained control over the accuracy/speed tradeoff. If there is a smaller network then fewer intermediate prediction layers will be needed because the network itself will be relatively fast anyway.
In dense prediction tasks (e.g. depth estimation, style transfer, etc.), encoder/decoder architectures are used. For an input image of size 256x256 the encoder part "encodes" that into a smaller sized vector, e.g. 32x32. The decoder then upsamples it to the original resolution. When upsampling, deconvolution is generally performed, starting from the 32x32 representation in the bottleneck of the network. Assuming there are 3 deconvolution layers in the decoder, where each layer upsamples by a factor of 2 (32x32 -> 64x64 -> 128x128 -> 256x256), the outputs of the intermediate prediction layers can be integrated using one of these deconvolution layers, but with an upsampling factor larger than 2. Thus, in the decoder, there will be 32x32 -> 64x64 -> 128x128 -> 256x256 (final output), but with intermediate predictions. In the case of one intermediate prediction layer the result would be 32x32 -> 64x64 -> 128x128 -> 256x256 (final output), or 256x256 (intermediate output). Therefore, in this case, there would be two branches performing upsampling after 64x64: one in the "main path" that upsamples by a factor of 2, and another "intermediate prediction layer" that upsamples by a factor of 4, directly reaching the original input size.
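A minimal sketch of this branch point, assuming PyTorch with the deconvolutions modelled as ConvTranspose2d (layer names and the channel count c are hypothetical):

    import torch.nn as nn

    class DecoderWithIntermediate(nn.Module):
        def __init__(self, c):
            super().__init__()
            # main path: 64x64 -> 128x128 -> 256x256, factor 2 per layer
            self.up2 = nn.ConvTranspose2d(c, c, 4, stride=2, padding=1)
            self.up3 = nn.ConvTranspose2d(c, 1, 4, stride=2, padding=1)
            # intermediate prediction layer: 64x64 -> 256x256, factor 4,
            # directly reaching the original input size
            self.intermediate = nn.ConvTranspose2d(c, 1, 8, stride=4,
                                                   padding=2)

        def forward(self, x, stop_early=False):
            if stop_early:
                return self.intermediate(x)  # faster, lower-quality output
            return self.up3(self.up2(x))     # full-quality final output

With stop_early=True the intermediate prediction layer upsamples directly by a factor of 4; otherwise the main path applies the two remaining factor-2 deconvolutions.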
Whilst the above examples use image-based inputs, it should be understood that embodiments are not limited to such. For instance, alternative embodiments can provide a solution for speaker recognition, which predicts whether speech corresponds to a specific person (i.e. a certain class). Such embodiments can process windows over the speech signal, providing multiple samples as an input sequence. As long as the transformations of the input each have an associated auxiliary layer, a network of the form shown in Figure 4 can be used. Speech may be represented in columns of frequencies, over short time windows, and presented as if it were an image. An embodiment adapted for a processing task in the form of speaker recognition can map variable length speech to fixed-size vectors (e.g. around 100-300 dimensions in size) or embeddings, and provide these as input examples (each with a speaker class) to a baseline neural network without auxiliary layers. This can be considered equivalent to the encoder 404 in Figure 4. The remainder of the neural network can be set up in a standard manner with convolutional or fully-connected layers in order to predict the class/identity of the speaker via a per-class probability prediction (the most probable speaker would be taken as the answer). This would not be a decoder in the exact sense of Figure 4, which maps an image at one resolution to another at the same resolution (generating a new image), but would instead use the encoded speech embedding and map it to a single value (discriminating between inputs). The remainder of the neural network would consist of, for example, 4 layers, ending with a Softmax output per class to give probabilities.
To adjust the amount of computation that the network performs, the output of the second layer may be obtained earlier on, which might consist of M=1000 x N=256 values (256 outputs for each of the 1000 inputs to it). For the intermediate layer (of that second layer), each set of outputs from the second layer may be aggregated via a pooling operation (taking the mean and standard deviation of each set of 256 outputs) to represent the distribution of the outputs, followed by a weighted sum + non-linearity to produce a probability. This can be similar to how a probability is produced at the output of the baseline network. A similar setup can be applied to each of the remaining layers 1 and 3 (and layer 4 produces the original output / class probability predictions). To decide which layer to output from, factors such as power usage, battery life metrics, time budget, etc., may be used to determine whether the entire network should be executed, or only some of the layers. In general, the neural network will be similar to the decoder shown in Figure 4, but instead of upsampling before each auxiliary layer to produce a higher-resolution new image, a technique such as a pooling/aggregation process may be used instead to produce a smaller-scale class prediction/probability.
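By way of a hedged sketch of such a pooled intermediate layer (the sizes follow the illustrative M=1000 x 256 example above; the class head and its names are assumptions):

    import torch
    import torch.nn as nn

    class IntermediateSpeakerHead(nn.Module):
        def __init__(self, num_speakers, feat=256):
            super().__init__()
            # weighted sum + non-linearity over the pooled statistics
            self.fc = nn.Linear(2 * feat, num_speakers)

        def forward(self, layer2_out):
            # layer2_out: (M, feat), e.g. M=1000 sets of 256 outputs.
            # Pool the mean and standard deviation of each of the 256
            # outputs to represent the distribution of the outputs.
            stats = torch.cat([layer2_out.mean(dim=0),
                               layer2_out.std(dim=0)])
            # per-class probabilities; the most probable speaker is
            # taken as the answer
            return torch.softmax(self.fc(stats), dim=0)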
Attention is directed to any papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.
All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.
Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
The invention is not restricted to the details of the foregoing embodiment(s). The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.

Claims (15)

  1. A computer-implemented method of performing a processing task instructed by an application executing on a computing device, wherein the processing task generates a task output based on a task input provided by the application, the method comprising: obtaining (302) a neural network trained to perform the processing task and comprising a plurality of intermediate output layers; obtaining (304) a decision-making data structure comprising data representing a plurality of states related to performance of the processing task, and data representing a plurality of actions selectable for each said state; obtaining (306) requirements data describing requirements of the application, the requirements comprising at least a time budget for completing the processing task, and a target quality for the task output; using (308) the requirements data and the decision-making data structure to select a said action to perform in relation to a said layer of the neural network, and performing (310) the selected action in relation to the layer of the neural network based on the task input to generate the task output for the application.
  2. A method according to claim 1, wherein the neural network comprises a convolutional neural network having a plurality of encoder layers and a plurality of decoder layers, wherein at least one of the plurality of decoder layers has an associated said intermediate output layer.
  3. A method according to claim 1 or 2, wherein the action is selected from the plurality of actions which comprise: stop (506) after providing output generated by a current layer (406) of the neural network as the task output; stop after storing (518) output generated (516) by a said intermediate output layer; store output generated (516) by a said intermediate output layer and continue (520) to execute a next said layer of the neural network; execute (522) a next said layer of the neural network without a said intermediate output layer generating an output.
  4. A method according to any preceding claim, wherein the plurality of states comprise: a remaining time in relation to the time budget; a current output quality in relation to the target quality, and a currently executed said layer of the neural network.
  5. A method according to any preceding claim, wherein the decision-making data structure is used to select a said action from amongst the plurality of actions for a said state based on a reward value.
  6. A method according to claim 5, wherein the action of stopping after executing a current layer of the neural network may be selected as a result of a high said reward value being computed when an output generated by the current layer of the neural network is determined to be within the time budget and/or meets the target quality.
  7. A method according to any preceding claim, wherein the decision-making data structure comprises a Reinforcement Learning model.
  8. A method according to claim 7, wherein the decision-making data structure comprises a Q-Learning Q-Table, or a neural network trained using a Q-Table.
  9. A method according to any preceding claim, wherein the plurality of states in the decision-making data structure further comprise at least one operational state of the device executing the application, and the method further comprises: obtaining device operational state data describing at least one operational state of the device executing the application, and using the device state data and the decision-making data structure to select a said action to perform in relation to a said layer of the neural network.
  10. A method according to claim 9, wherein the operational states comprise temperature of the computing device and a load of a CPU/GPU of the computing device.
  11. A method according to any preceding claim, comprising selecting two or more of the actions and performing the two or more selected actions in parallel.
  12. A method according to any preceding claim, comprising generating a plurality of decision-making data structures on a respective plurality of further computing devices, and transferring data representing the generated plurality of decision-making data structures to the computing device to update the decision-making data structure obtained by the computing device.
  13. A method according to any preceding claim, wherein the processing task comprises an image processing task, such as depth estimation.
  14. A computer readable medium storing a computer program to operate a method according to any preceding claim.
  15. A computing device (100) configured to perform a method according to any of claims 1 to 13.
GB2103414.5A 2021-03-12 2021-03-12 Performing a processing task instructed by an application Pending GB2604640A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB2103414.5A GB2604640A (en) 2021-03-12 2021-03-12 Performing a processing task instructed by an application
PCT/KR2022/003467 WO2022191668A1 (en) 2021-03-12 2022-03-11 Performing a processing task instructed by an application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2103414.5A GB2604640A (en) 2021-03-12 2021-03-12 Performing a processing task instructed by an application

Publications (2)

Publication Number Publication Date
GB202103414D0 GB202103414D0 (en) 2021-04-28
GB2604640A true GB2604640A (en) 2022-09-14

Family

ID=75623048

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2103414.5A Pending GB2604640A (en) 2021-03-12 2021-03-12 Performing a processing task instructed by an application

Country Status (2)

Country Link
GB (1) GB2604640A (en)
WO (1) WO2022191668A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6926203B2 (en) * 2016-11-04 2021-08-25 ディープマインド テクノロジーズ リミテッド Reinforcement learning with auxiliary tasks
CN107241213B (en) * 2017-04-28 2020-05-05 东南大学 Web service combination method based on deep reinforcement learning
KR102521054B1 (en) * 2017-10-18 2023-04-12 삼성전자주식회사 Method of controlling computing operations based on early-stop in deep neural network
EP3710993B1 (en) * 2017-11-20 2022-01-05 Google LLC Image segmentation using neural networks
KR20200063289A (en) * 2018-11-16 2020-06-05 삼성전자주식회사 Image processing apparatus and operating method for the same

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LICHUN CAO ET AL: "An Overview of Deep Reinforcement Learning", AUTOMATION, CONTROL AND ROBOTICS ENGINEERING, ACM, 2 PENN PLAZA, SUITE 701NEW YORKNY10121-0701USA, 19 July 2019 (2019-07-19), pages 1 - 9, XP058441886, ISBN: 978-1-4503-7186-5, DOI: 10.1145/3351917.3351989 *
TU XIAOHAN ET AL: "Efficient Monocular Depth Estimation for Edge Devices in Internet of Things", IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 17, no. 4, 1 September 2020 (2020-09-01), pages 2821 - 2832, XP011830975, ISSN: 1551-3203, [retrieved on 20210111], DOI: 10.1109/TII.2020.3020583 *
YEKAI WANG: "MobileDepth: Efficient Monocular Depth Prediction on Mobile Devices", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 20 November 2020 (2020-11-20), XP081818502 *
ZENG LIEKANG ET AL: "Boomerang: On-Demand Cooperative Deep Neural Network Inference for Edge Intelligence on the Industrial Internet of Things", IEEE NETWORK, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 33, no. 5, 1 September 2019 (2019-09-01), pages 96 - 103, XP011749876, ISSN: 0890-8044, [retrieved on 20191009], DOI: 10.1109/MNET.001.1800506 *

Also Published As

Publication number Publication date
WO2022191668A1 (en) 2022-09-15
GB202103414D0 (en) 2021-04-28

Similar Documents

Publication Publication Date Title
EP3446260B1 (en) Memory-efficient backpropagation through time
WO2018211139A1 (en) Training action selection neural networks using a differentiable credit function
CN111727441A (en) Neural network system implementing conditional neural processes for efficient learning
US10860895B2 (en) Imagination-based agent neural networks
JP2019534517A (en) Reinforcement learning with auxiliary tasks
US11605026B2 (en) Methods and systems for support policy learning
CN109313720A (en) The strength neural network of external memory with sparse access
EP3523760A1 (en) Reinforcement learning systems
JP2023535227A (en) Method, Apparatus, and Computing Device for Updating AI Models, and Storage Medium
US20210034971A1 (en) Method and system with neural network model updating
KR102293791B1 (en) Electronic device, method, and computer readable medium for simulation of semiconductor device
WO2020152233A1 (en) Action selection using interaction history graphs
CN113467487A (en) Path planning model training method, path planning device and electronic equipment
WO2019106132A1 (en) Gated linear networks
EP3446258B1 (en) Model-free control for reinforcement learning agents
US11188035B2 (en) Continuous control of attention for a deep learning network
CN113490955A (en) System and method for generating a pyramid level architecture
CN113822411A (en) Learning method and information processing apparatus
WO2023179609A1 (en) Data processing method and apparatus
KR102561799B1 (en) Method and system for predicting latency of deep learning model in device
KR20210035702A (en) Method of artificial neural network quantization and method of computation using artificial neural network
KR20210149393A (en) Apparatus and method for training reinforcement learning model in use of combinational optimization
GB2604640A (en) Performing a processing task instructed by an application
CN110858504A (en) Method of generating chemical structure, neural network device, and non-transitory computer-readable recording medium
WO2021211134A1 (en) A neural network system for distributed boosting for a programmable logic controller with a plurality of processing units