GB2587032A - Method for designing accelerator hardware - Google Patents

Method for designing accelerator hardware

Info

Publication number
GB2587032A
GB2587032A (application GB1913353.7A, GB201913353A)
Authority
GB
United Kingdom
Prior art keywords
algorithm
parametrizable
hardware
accelerator
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1913353.7A
Other versions
GB2587032B (en)
GB201913353D0 (en)
Inventor
Pong Chau Chun
Bhattacharya Sourav
Lee Royson
Dudziak Lukasz
S Abdelfattah Mohamed
Kim Hyeji
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to GB1913353.7A priority Critical patent/GB2587032B/en
Publication of GB201913353D0 publication Critical patent/GB201913353D0/en
Priority to KR1020200034093A priority patent/KR20210032266A/en
Priority to EP20865553.0A priority patent/EP3966747A4/en
Priority to CN202080052984.1A priority patent/CN114144794A/en
Priority to PCT/KR2020/010741 priority patent/WO2021054614A1/en
Priority to US17/015,724 priority patent/US20210081763A1/en
Publication of GB2587032A publication Critical patent/GB2587032A/en
Application granted granted Critical
Publication of GB2587032B publication Critical patent/GB2587032B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/32Circuit design at the digital level
    • G06F30/33Design verification, e.g. functional simulation or model checking
    • G06F30/3308Design verification, e.g. functional simulation or model checking using simulation
    • G06F30/331Design verification, e.g. functional simulation or model checking using simulation with hardware acceleration, e.g. by using field programmable gate array [FPGA] or emulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/34Circuit design for reconfigurable circuits, e.g. field programmable gate arrays [FPGA] or programmable logic devices [PLD]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/086Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Geometry (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Physiology (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a method of designing accelerator hardware for implementing a parametrizable algorithm, the method comprising: selecting a first paired accelerator hardware and parametrizable algorithm; obtaining a reward value using evaluated accuracy and efficiency metrics for implementing the parametrizable algorithm on the accelerator hardware; using the reward value to select a second paired accelerator hardware and parametrizable algorithm, and repeating the preceding steps until a final paired accelerator hardware and parametrizable algorithm are found. The efficiency metrics may include latency, area, or power. The accelerator hardware is preferably a field-programmable gate array (FPGA) and the parametrizable algorithm is preferably a convolutional neural network (CNN). Also claimed is a system for designing a parametrizable algorithm, the system comprising a controller configured to select a parametrizable algorithm, a cut-off module configured to determine whether the selected parametrizable algorithm meets hardware criteria, and an evaluator configured to obtain a reward value for the selected parametrizable algorithm when the hardware criteria are met. The selection of accelerator hardware and parametrizable algorithms may use neural architecture search (NAS).

Description

Method for Designing Accelerator Hardware
Field
[1] The present application generally relates to a method for designing accelerator hardware, e.g. an FPGA accelerator, in particular for use as a platform for convolutional neural networks (CNNs) or other parametrizable algorithms.
Background
[2] One example of accelerator hardware is a Field-Programmable Gate-Array (FPGA) accelerator, which is a flexible platform for implementing deep neural networks [24]. FPGAs especially shine at low-batch DNN inference tasks [13], in custom hardware (HW) configurations [9], and when tailored to specific properties of a DNN such as sparsity or custom precision [5, 30]. One of the unique strengths of FPGAs is that the HW design cycle is relatively short when compared to custom application-specific integrated circuits (ASICs).
However, this strength comes with an interesting side effect: FPGA accelerator HW is typically designed after the algorithm (a DNN in our case) is decided and locked down. This sequential design of DNN-then-accelerator is schematically illustrated in Fig. 1a. As shown, the first step S100 is to design the DNN and the second step S102 is to design the FPGA accelerator.
[003] Even if the accelerator is software-programmable, its HW is usually overoptimized for a specific DNN to maximize its efficiency. As a result, different DNNs are typically inefficient with the same HW. To circumvent this "overoptimization" problem, FPGA designs are typically configurable at the HW level. In this case, when a new DNN is discovered, the accelerator parameters can be tuned to the new DNN to maximize the HW efficiency. Even with the HW configurability, FPGA accelerators have the disadvantage of always needing to catch up to new DNNs.
[004] As schematically shown in Fig. 1b, there may be two components for designing a DNN: a controller which selects a DNN from a search space and an evaluator which implements the DNN to find its accuracy. Based on the accuracy, a reward is created which updates the controller search algorithm. Iteratively, this influences the controller to propose higher-accuracy models as the search progresses. This sequential design may be automated and may be termed neural architecture search (NAS) [8, 33]. NAS has been successful in discovering DNN models that achieve state-of-the-art accuracy on image classification [23], super-resolution [7], speech recognition [22] and machine translation [25].
[005] A further development termed FNAS is described in "Accuracy vs. Efficiency: Achieving Both Through FPGA-Implementation Aware Neural Architecture Search" by Jiang et al, published in arXiv e-prints (Jan 2019). FNAS is a HW-aware NAS which has been used in an attempt to discover DNNs that minimize latency on a given FPGA accelerator. FNAS is successful in discovering convolutional neural networks (CNNs) that are suited to a particular FPGA accelerator. Other HW-aware NAS approaches add latency to the reward function so that discovered models optimize both accuracy and inference latency, for example, when running on mobile devices [28, 29, 31].
[006] It is also noted that for CPUs and GPUs, it is the other way around: the algorithm is optimized to fit the existing HW [20], and for successful ASICs, it is necessary to build in a lot of flexibility and programmability to achieve some future-proofing [6, 18].
[007] The present applicant has recognised the need for a new method of designing FPGAs.
Summary
[8] In a first approach of the present techniques, there is provided a computer-implemented method for designing accelerator hardware for implementing a parametrizable algorithm, the method comprising selecting a first paired accelerator hardware and parametrizable algorithm; evaluating accuracy and efficiency metrics for implementing the parametrizable algorithm on the accelerator hardware; obtaining a reward value for the first paired accelerator hardware and parametrizable algorithm using the evaluated metrics; using the obtained reward value to select a second paired accelerator hardware and parametrizable algorithm, and repeating the evaluating, obtaining and using steps until a final paired accelerator hardware and parametrizable algorithm are selected.
[9] The accuracy metrics may reflect the accuracy of the parametrizable algorithm when implemented on the paired accelerator hardware. The efficiency metrics may reflect the efficiency of the accelerator hardware when implementing the parametrizable algorithm. The efficiency metrics may include at least one of latency, area and power of the accelerator hardware. By selecting a paired accelerator hardware and parametrizable algorithm and by evaluating metrics for both the accelerator hardware and parametrizable algorithm, the accelerator hardware and parametrizable algorithm may be considered to be co-designed.
The final paired accelerator hardware and parametrizable algorithm may be a pair in which both the accelerator hardware and parametrizable algorithm are optimised to work together, e.g. to perform a particular task such as image classification with specific efficiency and accuracy constraints. In other words, the method may be a method of optimizing accelerator hardware for implementing a parametrizable algorithm. An optimized pairing may have a higher reward value than other pairs. Alternatively, the final accelerator hardware and parametrizable algorithm may be the pair which is output after a certain number of iterations or after a certain time has elapsed. The method may comprise implementing (e.g. making) the final paired accelerator hardware and parametrizable algorithm.
[10] Obtaining a reward value may comprise calculating a reward function using the evaluated metrics. For example, the reward value may be obtained from a weighted sum of the evaluated metrics. The evaluated metrics may be normalized before obtaining the weighted sum. The reward function may use threshold values of the evaluated metrics to constrain the reward function. For example, the reward function may be defined as

R(m) = w1·m1 + w2·m2 + ... + wn·mn, subject to each metric mi satisfying its threshold thi,

where m is the vector of evaluated metrics, w is the vector of their weights and th is the vector of thresholds of the evaluated metrics. More specifically, when considering the evaluated metrics of latency, area and accuracy, the reward value may be calculated from

R(ar, lat, acc) = w1·(-ar) + w2·(-lat) + w3·acc

and the search solves

max over s ∈ S of R(-ar, -lat, acc)

where ar is area, lat is latency, acc is accuracy, w1, w2, w3 are the set of weights for each of area, latency and accuracy, and the optimisation is performed over the search space s ∈ S such that the evaluator output E(s) = m satisfies given constraints (e.g. latency below a certain value).
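As an illustration only, a reward of this kind might be computed as in the following sketch. The metric names, normalisation scheme, weights and thresholds are assumptions made for the example rather than values taken from the description.

```python
# Illustrative constrained weighted-sum reward: accuracy is maximised, area and
# latency are minimised, and violating a hard threshold returns a fixed penalty.
def reward(metrics, weights, thresholds):
    if metrics["latency"] > thresholds["latency"] or metrics["area"] > thresholds["area"]:
        return -1.0  # constraint violated
    # Normalise area and latency by their thresholds so the weighted terms are comparable.
    return (weights["acc"] * metrics["accuracy"]
            + weights["lat"] * (-metrics["latency"] / thresholds["latency"])
            + weights["ar"] * (-metrics["area"] / thresholds["area"]))

# Example usage with made-up numbers:
r = reward({"accuracy": 0.93, "latency": 55.0, "area": 120.0},
           {"acc": 1.0, "lat": 0.3, "ar": 0.2},
           {"latency": 80.0, "area": 220.0})
```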
[11] The accelerator hardware may be a field programmable gate array (FPGA) accelerator, an application specific integrated circuit (ASIC) or a system on chip (SoC).
Selecting the paired accelerator hardware and parametrizable algorithm may comprise selecting an accelerator hardware from an accelerator hardware sub-search space which comprises a plurality of accelerator hardware. The accelerator hardware search space may be well-defined, e.g. by at least one configurable parameter of the accelerator hardware.
For example, when selecting an FPGA, the FPGA may comprise one or more buffers (e.g. one or more of input, output and weight buffers), one or more convolution engines, and an optional pooling engine. The at least one configurable parameter may be selected from parallelisation parameters (e.g. parallel output features or parallel output pixels), buffer depths (e.g. for the input, output and weights buffers), a memory interface width parameter, a pooling engine usage parameter and a convolution engine ratio parameter. Selecting the accelerator hardware may comprise deciding a value for each configurable parameter based on a policy function and obtaining a probability of selecting the accelerator hardware having the decided value for each configurable parameter.
[012] A parametrizable algorithm may be defined as an algorithm that may be implemented using different computational blocks which may be assembled in many different ways, but still perform the same function, or may be an algorithm which has many parameters (e.g. options) which influence the operation of its constituent parts. The parametrizable algorithm may be a neural network, for example a deep neural network such as a convolutional neural network (CNN). For example, a CNN may be composed of different computational blocks selected from conv1x1, conv3x3 and pool3x3. Putting them together in different ways would still produce a CNN that performs a function or task such as image classification. Another example is GZIP compression, which is an algorithm comprising two main computational blocks: LZ77 compression and Huffman encoding. The LZ77 computational block contains parameters such as compression window size and maximum compression length. The Huffman computational block has parameters such as Huffman tree size, tree update frequency etc. These parameters affect the final result of the GZIP compression algorithm and there is usually a trade-off of compression ratio vs compression speed.
[013] Selecting the paired accelerator hardware and parametrizable algorithm may comprise selecting a CNN from a CNN sub-search space. The CNN sub-search space may comprise a plurality of CNNs and may be well-defined e.g. by at least one configurable parameter of the CNN. For example, each of the plurality of CNNs within the CNN sub-search space may have the same number of stacks (e.g. three) and each stack comprises the same number of cells (e.g. three). Within the CNN sub-search space, each cell may be limited to a maximum number of operations (e.g. seven) and/or a maximum number of connections (e.g. nine). The at least one configurable parameter may be selected from the operation and the connection of each cell within the CNN. Selecting the CNN may comprise deciding a value for each configurable parameter based on a policy function and obtaining a probability of selecting the CNN having the decided value for each configurable parameter.
[14] Selecting the paired accelerator hardware and parametrizable algorithm may comprise searching a search space which is a Cartesian product of a parametrizable algorithm sub-search space and an accelerator hardware sub-search space. The search space may be defined as S = SNN x SFPGA, where SNN is the sub-search space for the parametrizable algorithm and SFPGA is the sub-search space for the accelerator hardware, e.g. an FPGA. Such an approach may be termed a fully combined search.
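For illustration, such a combined search space can be represented simply as a list of per-decision option counts, concatenating the two sub-search spaces; the concrete counts below follow the CNN and FPGA sub-search spaces described later and are repeated here only as a sketch.

```python
# Combined search space as one list of option counts, one entry per decision.
S_NN = [3] * 5 + [2] * 21            # 5 cell operations (3 options) + 21 connections (2 options)
S_FPGA = [2, 5, 4, 3, 3, 2, 2, 6]    # FPGA parameters (see Equation (7) later)
S = S_NN + S_FPGA                    # Cartesian product: the controller picks one option per entry

def search_space_size(option_counts):
    # The number of points is the product of the per-decision option counts
    # (before filtering out invalid CNN cells).
    total = 1
    for k in option_counts:
        total *= k
    return total
```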
[15] As an alternative to the fully combined search, selecting the paired accelerator hardware and parametrizable algorithm may comprise an algorithm search phase, in which a parametrizable algorithm sub-search space comprising a plurality of parametrizable algorithms is searched to select a parametrizable algorithm, and an accelerator hardware search phase, in which an accelerator hardware sub-search space comprising a plurality of accelerator hardware is searched to select an accelerator hardware. In the algorithm search phase, the accelerator hardware parameters are fixed and vice versa. The algorithm search phase and the accelerator hardware search phase may be interleaved, whereby after selecting a parametrizable algorithm, the paired accelerator hardware for that selected parametrizable algorithm is selected. The next algorithm search phase then searches for another parametrizable algorithm which is a better match to the selected accelerator hardware. The algorithm search phase may be longer than the accelerator hardware search phase, e.g. 1000 compared to 200 iterations. Such a phased search may still be considered to be co-designing because after each respective phase a respective pair of accelerator hardware and parametrizable algorithm is selected.
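A minimal sketch of this interleaved phased search is shown below; the callback names and the default number of rounds are assumptions, with only the 1000/200 iteration split taken from the example above.

```python
# Interleaved phased search: each phase fixes one half of the pair and searches the other.
def phased_search(search_algorithm, search_hardware, initial_cnn, initial_fpga,
                  num_rounds=3, alg_iters=1000, hw_iters=200):
    """search_algorithm / search_hardware are assumed callbacks that run one
    controller-driven search phase with the other half of the pair held fixed."""
    cnn, fpga = initial_cnn, initial_fpga
    for _ in range(num_rounds):
        # Algorithm search phase: the FPGA parameters are fixed.
        cnn = search_algorithm(fixed_fpga=fpga, iterations=alg_iters)
        # Accelerator hardware search phase: the CNN is fixed.
        fpga = search_hardware(fixed_cnn=cnn, iterations=hw_iters)
    return cnn, fpga
```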
[16] The reward function may be used to update parameters and/or weights of a policy function which is used to select the paired accelerator hardware and parametrizable algorithm. Selecting may comprise determining distributions from the policy function and sampling a sequence from the distributions. When the combined search is used, the sequence may comprise parameters of both the hardware accelerator and the parametrizable algorithm. When the phased search is used, the sequence may comprise the parameters of either the hardware accelerator or the parametrizable algorithm, depending on the phase being conducted. By updating the parameters and/or weights of the policy function, the policy function may be considered to learn to choose a sequence that maximises reward. In other words, the search may be iterative and may converge on an optimal pair having the highest reward.
[17] When separate search phases are used, hardware metrics may be introduced to improve the algorithm search phase. For example, after the algorithm search phase, the method may comprise assessing whether the selected parametrizable algorithm meets hardware criteria before implementing the accelerator hardware search phase. When the hardware criteria are not met, the algorithm search phase may be repeated and when the hardware criteria are met, the accelerator hardware search phase may be implemented. The hardware criteria may include thresholds on latency and memory footprint which must not be exceeded when implementing the selected parametrizable algorithm on an accelerator hardware. The hardware criteria may be updated as the search progresses.
[18] Assessing whether the selected parametrizable algorithm meets hardware criteria may comprise implementing the selected parametrizable algorithm on a hardware device and measuring the hardware performance. It will be appreciated that this may be time consuming. Accordingly, assessing whether the selected parametrizable algorithm meets hardware criteria may comprise using a model to predict hardware performance of the selected parametrizable algorithm on a target accelerator hardware and comparing the predicted hardware performance to the hardware criteria. The modelling may be done using a statistical model. Before the comparing step, the method may comprise determining a level of confidence in the predicted hardware performance. When the level of confidence is below a confidence threshold, the selected parametrizable algorithm may be implemented on a hardware device and the hardware performance may be measured. In this case, the measured performance may be used to update the model. The model may be updated at any time, e.g. after a certain time has elapsed or after a certain number of iterations, by comparing the predicted hardware performance with measured hardware performance. When using a model to predict hardware performance as part of the co-design method, the model includes the parameters of both the accelerator hardware and the parametrizable algorithm.
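A hedged sketch of this assessment step is given below; the predictor, device and criteria interfaces are illustrative assumptions rather than parts of the described system.

```python
# Cut-off style check: predict hardware performance, fall back to a real
# measurement when the prediction is not confident, and refine the model.
def meets_hardware_criteria(candidate, predictor, device, criteria, conf_threshold=0.8):
    predicted, confidence = predictor.predict(candidate)     # e.g. {"latency": ..., "memory": ...}
    if confidence < conf_threshold:
        measured = device.run_and_measure(candidate)         # deploy and measure on real hardware
        predictor.update(candidate, measured)                # use the measurement to update the model
        predicted = measured
    return (predicted["latency"] <= criteria["latency"]
            and predicted["memory"] <= criteria["memory"])
```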
[19] It will be appreciated that assessing whether a selected parametrizable algorithm meets hardware criteria using a model to predict hardware performance may be used as a stand-alone method. Such a process may be useful in cases where there is no co-design.
There may be a fixed target hardware (e.g. a mobile phone) and the model may be iteratively refined as the search for the optimal parametrizable algorithm progresses.
[20] In a related approach of the present techniques, there is provided a non-transitory data carrier carrying processor control code to implement the methods described herein.
[21] In a further approach of the present techniques, there is provided a system for implementing the methods described herein. The system may comprise a controller which is configured to select a first paired accelerator hardware and parametrizable algorithm and an evaluator which is configured to evaluate accuracy and efficiency metrics for implementing the parametrizable algorithm on the accelerator hardware, obtain a reward value for the first paired accelerator hardware and parametrizable algorithm using the evaluated metrics; send the obtained reward value to the controller to select a second paired accelerator hardware and parametrizable algorithm, and repeat the evaluating, obtaining and sending steps until a final paired accelerator hardware and parametrizable algorithm is selected by the controller.
[22] The system may also be configured to design the parametrizable algorithm using knowledge of the hardware. For example, the system may comprise a controller which is configured to select a parametrizable algorithm, a cut-off module which is configured to determine whether the selected parametrizable algorithm meets hardware criteria and an evaluator which is configured to obtain a reward value for the selected parametrizable algorithm when the hardware criteria are met. When the hardware criteria are not met, the controller may be configured to select a different parametrizable algorithm for evaluation by the cut-off module. The cut-off module may comprise a model module which is configured to use a model to predict hardware performance of the selected parametrizable algorithm. The cut-off module may comprise a discriminator module which is configured to determine a confidence level for the predicted hardware performance. The cut-off module may comprise a deployment module which is configured to deploy the selected parametrizable algorithm to a hardware module to measure hardware performance. The measured hardware performance may be used to update the model. The deployment module may be configured to deploy the selected parametrizable algorithm to a hardware module when the confidence level is below a confidence threshold and/or after a predetermined time or number of iterations have lapsed.
[23] As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
[24] Furthermore, the present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
[25] Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object-oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
[26] Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
[27] The techniques further provide processor control code to implement the above-described methods, for example on a general-purpose computer system or on a digital signal processor (DSP). The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
[028] It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
[29] In an embodiment, the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
Brief description of drawings
[30] Implementations of the present techniques will now be described, by way of example only, with reference to the accompanying drawings, in which:
[31] Fig. 1a is a flow chart of a known method for sequentially designing a DNN and then an FPGA;
[032] Fig. 1b is a schematic illustration of a system for designing a DNN which may be used in Fig. 1a;
[33] Fig. 2a is a flow chart of a method for co-designing a CNN and an FPGA;
[34] Fig. 2b is a schematic illustration of a system for implementing the method of Fig. 2a;
[35] Fig. 2c is a schematic illustration of an example controller for use in the system of Fig. 2b;
[36] Fig. 3 schematically illustrates a well-defined CNN search space which can be used in the method of Fig. 2a;
[37] Fig. 4 is a schematic illustration of components of an FPGA accelerator;
[38] Fig. 5a plots area against resource usage for two types of accelerator architecture;
[039] Fig. 5b plots latency per image against parallelism for the types of accelerator architecture shown in Fig. 5a;
[40] Fig. 6 plots latency numbers against size and pixel_par;
[41] Fig. 7 plots some Pareto-optimal points for accuracy, latency and area;
[42] Fig. 8 plots accuracy against latency for the Pareto-optimal points shown in Fig. 7;
[043] Figs. 9a to 9d plot the accuracy-latency Pareto frontier for single and dual convolution engines at area constraints of less than 55 mm2, less than 70 mm2, less than 150 mm2 and less than 220 mm2 respectively;
[44] Fig. 10a plots accuracy against latency with a constraint imposed;
[45] Figs. 10b and 10c show two example arrangements of a CNN selected from Fig. 10a;
[046] Fig. 10d compares the execution schedule for the CNN in Fig. 10c run on its codesigned accelerator and a different accelerator;
[47] Fig. 11 plots accuracy against latency to show the overall landscape of Pareto-optimal points with respect to the parameter ratio_conv_engines;
[48] Fig. 12 illustrates an alternative architecture which may be used to implement phased searching;
[49] Fig. 13a plots accuracy against latency and highlights the top search results for an unconstrained search;
[50] Fig. 13b plots accuracy against latency and highlights the top search results for a search with one constraint;
[051] Fig. 13c plots accuracy against latency and highlights the top search results for a search with two constraints;
[52] Figs. 14a to 14c show the reward values for each of the separate, combined and phased search strategies in the unconstrained and constrained searches of Figs. 13a to 13c;
[53] Fig. 15 plots top-1 accuracy against perf/area for various points searched using the combined search;
[54] Figs. 16a and 16b show two example arrangements of a CNN selected from Fig. 15;
[55] Figs. 17 and 18 illustrate alternative architectures which may be used with the method of Fig. 2a or to perform a stand-alone search;
[56] Fig. 19 is a flowchart of a method which may be implemented on the architecture of Fig. 18; and
[57] Fig. 20 is a flowchart of an alternative method which may be implemented on the architecture of Fig. 18.
Detailed description of drawings
[58] Figure 2a schematically illustrates a method of codesigning an accelerator hardware and a parametrizable algorithm, which in this example are an FPGA accelerator and a CNN. It will be appreciated that the method may be adapted for other types of hardware, e.g. an ASIC or SoC and for other types of algorithms, e.g. other neural networks and thus the more general terminology may be used interchangeably with the more specific terminology below.
As shown in Figure 2a, a convolutional neural network (CNN) architecture is selected from a CNN search space (step S200) and an accelerator architecture is selected from an accelerator design space (step S202). The next step is to implement the selected CNN on the selected accelerator (step S204). Metrics which are indicative of accuracy of the implementation and efficiency are then evaluated (step S206). The efficiency metrics may include latency, area and power. A multi-objective reward is then obtained based on the evaluated metrics (step S208). The reward is then used to update the selection of the CNN and FPGA pair (step S210). The process then iterates through the implementation, evaluation and obtaining steps again to refine the pairing of CNN and FPGA until an optimal CNN and FPGA pair are selected.
[59] Figure 2b schematically illustrates a system for implementing the method of Figure 2a. A controller 200 selects the CNN and the FPGA from the relevant search spaces and sends them to an evaluator 202 for the implementation, evaluation and obtaining steps. The evaluator 202 then returns the multi-objective reward to the controller 200 for the controller to update the search. The method may be described as a reinforcement learning system to jointly optimise the structure of a CNN with the underlying FPGA accelerator. Thus, NAS and the configurability of an FPGA are used to codesign both the CNN and a corresponding FPGA accelerator instead of tuning a CNN to a specific FPGA accelerator (as described in FNAS) or tuning the FPGA accelerator for a newly discovered CNN [2].
[060] Figure 2c shows the detail of one arrangement of the controller. As shown in Figure 2c, the controller comprises a plurality of single long short-term memory (LSTM) cells followed by a corresponding specialized fully-connected (FC) layer; with one cell and one FC layer per output. Every proposed decision is sent to the next LSTM as an input. In this arrangement, the CNN (i.e. the operations and connections) are first proposed followed by the hardware parameters of the FPGA accelerator. Each configurable parameter of the CNN and the FPGA accelerator is treated as an output and has its own cell and FC layer.
Once all configurable parameters are complete, the resulting CNN and accelerator are sent to the evaluator for assessment.
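The controller of Figure 2c is described above only at a structural level; the following is a minimal PyTorch-style sketch of one way such a controller could be realised, with a single shared LSTM cell stepped once per decision, a dedicated fully-connected head per output producing a distribution over that decision's options, and each sampled decision embedded and fed to the next step. The class name, hidden size and embedding scheme are assumptions.

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    def __init__(self, options_per_decision, hidden=64):
        super().__init__()
        self.cell = nn.LSTMCell(hidden, hidden)
        # One fully-connected head per output (CNN operations and connections,
        # followed by the FPGA accelerator parameters).
        self.heads = nn.ModuleList([nn.Linear(hidden, k) for k in options_per_decision])
        # Embed each proposed decision so it can be fed to the next LSTM step.
        self.embeds = nn.ModuleList([nn.Embedding(k, hidden) for k in options_per_decision])

    def sample(self):
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        x = torch.zeros(1, self.cell.hidden_size)          # start input
        sequence, log_prob = [], torch.zeros(())
        for head, emb in zip(self.heads, self.embeds):
            h, c = self.cell(x, (h, c))
            dist = torch.distributions.Categorical(logits=head(h))
            choice = dist.sample()                          # one option index for this decision
            log_prob = log_prob + dist.log_prob(choice).sum()
            sequence.append(int(choice))
            x = emb(choice)                                 # the proposed decision feeds the next step
        return sequence, log_prob                           # s and log p(s|D)
```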
[61] The controller shown in Figure 2c is an extension of a traditional RL-based NAS and may be termed an RL agent. The controller is therefore based on an LSTM cell. However, the controller may implement a completely different algorithm, for example a genetic algorithm, and may thus have a different structure. The controller is responsible for taking a finite sequence of actions which translate to a model's structure. Each action may be called a decision, like the examples illustrated in Figure 2c. Each decision is selected from a finite set of options and, together with the other decisions selected by the controller in the same iteration, forms a model structure sequence s. We call the set of all possible s a search space, which can be formally defined as:

S = O1 x O2 x ... x On (1)

where Oi is the set of available options for the i-th decision. In each iteration t, the controller generates a structure sequence st.
[62] The sequence st is passed to the evaluator which evaluates the proposed structure and creates a reward rt, generated by the reward function R(s) based on the evaluated metrics. The reward is then used to update the controller such that (as t → ∞) it selects sequences st which maximize the reward function.
[063] Different approaches exist to the problem of updating the controller - in deep RL a DNN is used as a trainable component and it is updated using backpropagation. Specifically, in REINFORCE, which is used in the method outlined above in Figure 2a, the controller DNN (a single LSTM cell in our case) implements a policy function π which produces a sequence of probability distributions, one per decision, which are sampled in order to select elements from their respective O sets and therefore decide about a sequence s. The network is then updated by calculating the gradient of the product of the observed reward r and the overall probability of selecting the sequence s. Formally:

∇(r · p(s|D)) (2)

where D = (D1, D2, ..., Dn) is the set of probability distributions for each decision. Since s is generated from a sequence of independently sampled decisions s1, s2, ..., sn, the overall probability p(s|D) can be easily calculated as:

p(s|D) = Πi p(si|Di) (3)

[064] RL-based algorithms are convenient because they do not impose any restrictions on what the s elements are (what the available options are) or how the reward signal is calculated from s. Therefore, without loss of generality, we can abstract away some of the details and, in practice, identify each available option simply by its index. The sequence of indices selected by the controller is then transformed into a model and later evaluated to construct the reward signal independently from the algorithm described in this section. Please note that in general different strategies can be used without undermining the base methodology. Following this property, throughout this description we sometimes describe a search space using a shortened notation:

S = (k1, k2, ..., kn), ki ∈ N+ (4)

which should be understood as a search space S as defined in Equation 1 with |Oi| = ki, where ki is the number of options available for each parameter.
[65] An overview of the generic search algorithm is given below.
[66] The REINFORCE algorithm or a similar algorithm may be used to conduct the search in conjunction with evaluating the metrics and generating the reward function. The algorithm may comprise a policy function that takes in weights/parameters, and distributions Dt may be obtained from the policy function. A sequence st may then be sampled from the distributions. When searching the combined space, a sequence contains both FPGA parameters and CNN parameters. The sequence is then evaluated by an evaluator, e.g. by running the selected CNN on the selected FPGA, or by simulating performance as described in more detail below. Metrics mt, such as latency, accuracy, area and power, are measured by the evaluator. These metrics are used as input to a reward function R(mt). The reward function, together with the probability of selecting that sequence, is used to update the parameters/weights of the policy function. This makes the policy function learn to choose a sequence that maximizes reward.
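The algorithm listing referred to above is not reproduced in this text; as a hedged sketch only, the loop of paragraph [66] could look roughly as follows, reusing the illustrative Controller above. It uses the log-probability form of the policy-gradient update that is standard in practice; evaluate_metrics and reward_fn stand in for the evaluator and the reward function R.

```python
import torch

# CNN decisions (Equation (6)) followed by FPGA decisions (Equation (7)).
options = [3] * 5 + [2] * 21 + [2, 5, 4, 3, 3, 2, 2, 6]
controller = Controller(options)                       # Controller sketch from above
optimizer = torch.optim.Adam(controller.parameters(), lr=1e-3)

def search(evaluate_metrics, reward_fn, iterations=1500):
    best = (None, float("-inf"))
    for t in range(iterations):
        sequence, log_prob = controller.sample()       # s_t and log p(s_t|D_t)
        metrics = evaluate_metrics(sequence)           # e.g. latency, accuracy, area, power
        r = reward_fn(metrics)                         # R(m_t)
        loss = -r * log_prob                           # ascend r * log p(s|D)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if r > best[1]:
            best = (sequence, r)
    return best
```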
[067] The method shown in Figure 2a extends traditional NAS by including a number of decisions related to the design choices of an FPGA accelerator. The search space is thus defined as a Cartesian product of a neural network sub-search space (SNN) with an FPGA sub-search space (SFPGA), formally:

S = SNN x SFPGA (5)

where SNN is the search space from Equation 4 and SFPGA is the extending part related to the FPGA accelerator design.
[068] Please note that a search space defined in this way is not fundamentally different from the definition provided in Equation 4 and does not imply any changes to the search algorithm. However, since the search domain for the two parts is different, we find it helpful to explicitly distinguish between them and use that differentiation to talk about their synergy. Each sub-search space is discussed in more detail below.
[069] Figure 3 schematically illustrates a well-defined CNN search space which can be used in the method of Figure 2a. It will be appreciated that this is just one example of a well-defined search space which may be used. The search space is described in detail in "NASBench 101: Towards Reproducible Neural Architecture Search" by Ying et al published in arXiv e-prints (Feb 2019) and may be termed NASBench. Figure 3 shows the structure of the CNNs within the search space. As shown, the CNN comprises three stacks 302 each of which comprises three cells 304. Each stack uses the same cell design but operates on data with different dimensionality due to downsampling modules which are interleaved with the stacks. Specifically, each stack's input data is x2 smaller in both X and Y dimensions but contains x2 more features compared to the previous one, which is a standard practice for classification models. This skeleton is fixed with the only varying part of each model being the inner-most design of a single cell.
[070] The search space for the cell design is limited to a maximum of 7 operations (with the first and last fixed) and 9 connections. The operations are selected from the following available options: 3x3 or 1x1 convolutions, and 3x3 maximum pooling, all with stride 1, and connections are required to be "forward" (i.e. an adjacency matrix of the underlying computational graph needs to be upper-triangular). Additionally, concatenation and element-wise addition operations are inserted automatically when more than one connection is incoming to an operation. As in Equation (1), the search space is defined as a list of options (i.e. configurable parameters); in this case, the CNN search space contains 5 operations with 3 options each, and 21 connections that can be either true or false (2 options) - the 21 connections are the non-zero values of the adjacency matrix between the 7 operations.

SCNN = (3, 3, 3, 3, 3, 2, 2, ..., 2) (6)

where 3 is repeated 5 times (one per operation) and 2 is repeated 21 times (one per connection).
[071] Please note that the search space does not directly capture the requirement of having at most 9 connections and therefore contains invalid points, i.e. points in the search space for which it is impossible to create a valid model. Additionally, a point can be invalid if the output node of a cell is disconnected from the input - we discuss how to deal with these problems in the later sections.
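For illustration, a cell drawn from this search space could be decoded and checked for validity roughly as in the sketch below; the option encodings (0/1/2 for the three operations, 0/1 for connections) are assumptions.

```python
import numpy as np

OPS = ["conv1x1", "conv3x3", "maxpool3x3"]

def decode_cell(sequence):
    """sequence: 5 operation choices followed by 21 connection choices (Equation (6))."""
    ops = ["input"] + [OPS[i] for i in sequence[:5]] + ["output"]
    adj = np.zeros((7, 7), dtype=int)
    # The 21 connection decisions fill the strictly upper triangle of the 7x7
    # adjacency matrix, so all connections point "forward".
    adj[np.triu_indices(7, k=1)] = sequence[5:]
    return ops, adj

def is_valid(adj, max_edges=9):
    if adj.sum() > max_edges:             # at most 9 connections are allowed
        return False
    # The output node (index 6) must be reachable from the input node (index 0);
    # a single forward pass suffices because the graph is upper-triangular.
    reachable = {0}
    for src in range(7):
        if src in reachable:
            reachable.update(int(dst) for dst in np.nonzero(adj[src])[0])
    return 6 in reachable
```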
[072] Figure 4 is an illustration of an FPGA accelerator 400 together with its connected system-on-chip 402 and external memory 404. The FPGA accelerator 400 comprises one or more convolution engines 410, a pooling engine 412, an input buffer 414, a weights buffer 416 and an output buffer 418. A library for acceleration of DNNs on system-on-chip FPGAs such as the one shown in Figure 4 is described in "CHaiDNN v2 - HLS based Deep Neural Network Accelerator Library for Xilinx Ultrascale+ MPSoCs" by Xilinx Inc 2019 and is termed the ChaiDNN library below.
[73] The search space for the FPGA accelerator is also well-defined and is defined by the configurable parameters for each of the key components of the FPGA accelerator. As described in more detail below, the configurable parameters which define the search space include parallelisation parameters (e.g. parallel output features or parallel output pixels), buffer depths (e.g. for the input, output and weights buffers), memory interface width, pooling engine usage and convolution engine ratio.
[74] The configurable parameters of the convolution engine(s) include the parallelisation parameters "filter_par" and "pixel_par" which determine the number of output feature maps and the number of output pixels to be generated in parallel, respectively. The parameter convolution engine ratio "ratio_conv_engines" is also configurable and is newly introduced in this method. The ratio determines the number of DSPs assigned to each convolution engine. When set to 1, this means that there is a single general convolution engine which runs any type of convolution, and the value of 1 may be considered to be the default setting used in the ChaiDNN library. When set to any number below 1, there are dual convolution engines - for example, one of them specialized and tuned for 3x3 filters, and the other for 1x1 filters.
[75] The configurable parameter for pooling engine usage is "pool_enable". If this parameter is true, extra FPGA resource is used to create a standalone pooling engine.
Otherwise the pooling functionality in the convolution engines is used.
[76] In the implementation shown in Figure 4, there are three buffers: an input buffer 414, a weights buffer 416 and an output buffer 418. Each of the buffers has a configurable depth and resides in the internal block memory of the FPGA. In the current CHaiDNN implementation, the buffers need to have enough space to accommodate the input feature maps, output feature maps and weights of each layer. A bigger buffer size allows for bigger images and filters without fetching data from slower external memory. As described below, feature and filter slicing may improve the flexibility of the accelerator.
[77] The FPGA communicates with the CPU and external DDR4 memory via an AXI bus. As in the CHaiDNN library, a configurable parameter allows the memory interface width to be configured to achieve a trade-off between resource usage and performance.
[78] The following defines the FPGA accelerator search space for the parameters (filter_par, pixel_par, input, output, weights buffer depths, mem_interface_width, pool_en and ratio_conv_engines).
SFPGA = (2, 5, 4, 3, 3, 2, 2, 6) (7)

[079] Considering the evaluator in more detail, it is noted that the area and latency of the accelerator are determined by parameters in the accelerator design space. Compiling all configurations in the design space to measure area and latency online during NAS is thus unlikely to be practical, since each compile takes hours and running the CNN models simultaneously would require thousands of FPGAs. Accordingly, a fast evaluator may be useful to find efficiency metrics.
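For concreteness, the FPGA sub-search space of Equation (7) could be written out as one list of candidate values per configurable parameter, as in the sketch below. The option counts match Equation (7), but most of the concrete values are illustrative assumptions; only filter_par of 8/16, pixel_par up to 64, the 256/512-bit memory interface and ratio_conv_engines values such as 0.25 to 0.75 (or 1 for a single engine) appear elsewhere in the description.

```python
FPGA_SPACE = {
    "filter_par":           [8, 16],                             # parallel output feature maps
    "pixel_par":            [4, 8, 16, 32, 64],                  # parallel output pixels
    "input_buffer_depth":   [2048, 4096, 8192, 16384],           # assumed depths
    "output_buffer_depth":  [1024, 2048, 4096],                  # assumed depths
    "weights_buffer_depth": [1024, 2048, 4096],                  # assumed depths
    "mem_interface_width":  [256, 512],                          # bits
    "pool_enable":          [False, True],                       # standalone pooling engine or not
    "ratio_conv_engines":   [0.25, 0.33, 0.5, 0.67, 0.75, 1.0],  # DSP split; 1.0 = single engine
}

def decode_fpga(indices):
    # Map a sequence of option indices (one per parameter) to a concrete configuration.
    return {name: values[i] for (name, values), i in zip(FPGA_SPACE.items(), indices)}
```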
[80] For each accelerator architecture, step S206 of Figure 2a may be completed in stages: first by using an area model. The FPGA resource utilization in terms of CLBs, DSPs and BRAMs may be estimated by using equations to model the CLB, DSP and BRAM usage for each subcomponent. An example subcomponent is a line buffer within the convolution engine that varies based on the size of the configurable parameters "filter_par" and "pixel_par". An equation uses these two variables as input and gives the number of BRAMs.
[81] When the configurable parameter "ratio_conv_engines" is set to smaller than 1, there are two specialized convolution engines. In this case, the CLB and DSP usage of the convolution engines is decreased by 25% compared to the general convolution engine. This is a reasonable estimate of the potential area savings that can arise due to specialization, and much larger savings have been demonstrated in the literature. In addition, when the standalone pooling engine is used and the configurable parameter "pool_enable" is set to 1, a fixed amount of CLBs and DSPs is consumed.
[82] BRAMs buffer data for the convolution and pooling engines. The sizes of the input, output and weight buffers are configurable via their depths. This data is double buffered and thus consumes twice the amount of BRAMs. A fixed number of BRAMs is also dedicated to pooling (if enabled), bias, scale, mean, variance and beta. The number of BRAMs is calculated assuming that each BRAM is 36 Kbits. Based on the FPGA resource utilisation, the next step is to estimate the FPGA size in mm2 such that the area is quantified to a single number - silicon area. The area of each resource is scaled relative to a CLB. Since this data is not available for the device being used, data for similar devices is taken from "Design Tradeoffs for Hard and Soft FPGA-based Networks-on-Chip" by Abdelfattah et al, published in International Conference on Field Programmable Technology, 95-103 (2012). Account is also taken of the smaller process node (20nm vs. 40nm) and the different block properties (8 LUTs per CLB instead of 10, and 36 Kbit per BRAM instead of 9 Kbit). The table below shows the estimated block area of a device which may be used in the method.
Resource         Relative area (CLB)   Tile area (mm2)
CLB              1                     0.0044
BRAM (36 Kbit)   6                     0.026
DSP              10                    0.044
Total            64,922                286

[83] Figure 5a illustrates the area of various accelerator architectures. The lines plot the estimated resource usage by area for the configurable parameters "filter_par"=8 and "filter_par"=16. Measurements of the area have also been calculated and are shown on the graph. The figure shows that the predictions of the area model are valid with respect to the real measurements. It is noted that the model has predicted the area of accelerator architectures not currently supported by CHaiDNN, for example the smallest architecture with configurable parameters "filter_par"=8, "pixel_par"=4 is sized at 96.43 mm2 and the largest architecture with configurable parameters "filter_par"=16, "pixel_par"=64 is sized at 218.62 mm2.
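A schematic sketch of the area model of paragraphs [80] to [82] is shown below. Only the per-block tile areas come from the table above; the resource-count formulas are placeholders standing in for the per-subcomponent equations referred to in the description, and the configuration keys follow the illustrative FPGA_SPACE sketch earlier.

```python
TILE_AREA_MM2 = {"CLB": 0.0044, "BRAM": 0.026, "DSP": 0.044}   # from the table above

def estimate_resources(cfg):
    # Placeholder resource model: a real model would have one equation per
    # subcomponent, e.g. line buffers whose BRAM count depends on filter_par and pixel_par.
    dsps = cfg["filter_par"] * cfg["pixel_par"]
    clbs = 400 * cfg["filter_par"] + 50 * cfg["pixel_par"]
    brams = (2 * (cfg["input_buffer_depth"] + cfg["output_buffer_depth"]       # double buffering
                  + cfg["weights_buffer_depth"]) * 16) // (36 * 1024)          # 16-bit words, 36 Kbit BRAMs
    if cfg["pool_enable"]:
        clbs += 2000                          # assumed fixed cost of a standalone pooling engine
        dsps += 32
    if cfg["ratio_conv_engines"] < 1.0:
        # Crude stand-in for the 25% engine-level CLB/DSP saving described in [81].
        clbs, dsps = int(0.75 * clbs), int(0.75 * dsps)
    return {"CLB": clbs, "BRAM": brams, "DSP": dsps}

def estimate_area_mm2(cfg):
    resources = estimate_resources(cfg)
    return sum(count * TILE_AREA_MM2[kind] for kind, count in resources.items())
```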
[84] Once the FPGA resource utilization in terms of CLBs, DSPs and BRAMs has been estimated, the latency may be estimated as part of step S206 of Figure 2a, e.g. using a latency model. It will be appreciated that in this example utilization is estimated before latency, but the estimates may be undertaken in any order.
[085] The latency model may have two parts: 1) a latency lookup table of operations and 2) a scheduler. From the NASBench search space, 85 operations are obtained, including 3x3 and 1x1 convolutions, max pooling and element-wise addition operations of various dimensions. Each operation is run on the FPGA accelerator with different configurations, using the performance evaluation API provided by CHaiDNN to profile its latency, and the latency numbers are then stored in a lookup table. The scheduler assigns operations to parallel compute units greedily and calculates the total latency of the CNN model using the latency of the operations in the lookup table.
[86] The latency of a convolution operation depends on the parallelism factors "filter_par" and "pixel_par". Since CHaiDNN does not support the architectures "filter_par"=8, "pixel_par"=4 and "filter_par"=16, "pixel_par"=64, their latency is interpolated using the measurements from the other architectures. In the case with dual convolution engines, one of them is specialized for 3x3 filters and the other for 1x1 filters. The performance of the corresponding convolution is scaled in proportion to the number of engines available. For example, when the parameter ratio_conv_engines = 0.75, the latency of a 3x3 convolution is scaled by 1/0.75 and the latency of a 1x1 convolution is scaled by 1/0.25.
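A hedged sketch of such a latency model (a profiled lookup table plus a greedy scheduler, with the dual-engine scaling described above) might look as follows; the lookup-table keys and the engine-assignment rule are simplifying assumptions.

```python
def op_latency(op, lookup, cfg):
    # Base latency profiled for this operation at the given parallelisation.
    base = lookup[(op["type"], op["shape"], cfg["filter_par"], cfg["pixel_par"])]
    ratio = cfg["ratio_conv_engines"]
    if ratio < 1.0:                        # dual specialized engines
        if op["type"] == "conv3x3":
            return base / ratio            # e.g. ratio = 0.75 -> latency scaled by 1/0.75
        if op["type"] == "conv1x1":
            return base / (1.0 - ratio)    # the 1x1 engine gets the remaining DSPs
    return base

def total_latency(ops, lookup, cfg):
    # Greedy schedule: with a single general engine everything queues on engine 0;
    # with dual engines, 3x3 convolutions go to engine 0 and everything else to engine 1.
    free_at = [0.0, 0.0]                   # time at which each engine becomes free
    dual = cfg["ratio_conv_engines"] < 1.0
    for op in ops:                         # ops assumed to be in a valid execution order
        engine = 0 if (not dual or op["type"] == "conv3x3") else 1
        free_at[engine] += op_latency(op, lookup, cfg)
    return max(free_at)
```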
[87] In the original CHaiDNN accelerator, the data buffers must be sized to fit the entire input, output and filter tensors to achieve the highest possible throughput. However, if the image resolution increases and the CNN becomes deeper, such an allocation scheme is infeasible and restricts the flexibility of the accelerator. In the method described in Figure 2a, a scheme may be added in which slices of the input tensor are fetched from external memory into the input buffer and processed independently by the accelerator. Furthermore, output layers and filter weights are spilled to external memory when the output and weight buffers are full, hence the performance is bounded by the memory bandwidth, which depends on the configurable parameter "mem_interface_width".
[088] Some assumptions have thus been made when building the latency model due to the limitations of the current implementation of CHaiDNN. Firstly, the performance evaluation API does not support max pooling running on a standalone engine, thus its latency is modelled as 2x faster than max pooling running on the convolution engine. Secondly, the memory interface width cannot be configured independently. It is related to the DIET_CHAI_Z configuration which includes a set of parameters, and the memory interface width depends on the AXI bus which has reduced width when DIET_CHAI_Z is enabled. Without bringing all the parameters into the accelerator design space, the model assumes that the latency increases by 4% when the parameter "mem_interface_width" reduces from 512 bits to 256 bits. Lastly, the approach used in the model does not consider operation fusing, which is used by the runtime of the accelerator to optimize latency.
[089] Figure 5b plots the results of the validation of the latency model. First the latency is estimated by the model for different accelerator architectures and the results are shown as lines in Figure 5b. Then we run the model on the FPGA accelerator and measure the end-to-end latency as plotted in Figure 5b. The figure shows that the latency model is able to describe the trend of latency with respect to the level of parallelism despite the assumptions which have been made. It is noted that for Figures 5a and 5b, HW pooling is enabled, the memory interface width is 512 bits, the buffer sizes are [8192, 2048, 2048], the batch size is 2 and the clock frequency is 200MHz.
[90] Figure 6 plots the extracted latency numbers of all the convolution operations from the lookup table relative to the parameters GFLOPS (size) and pixel_par. As shown, the latency increases with data size and decreases with more parallelism in the convolution engines.
[91] As shown in Figure 2a, a reward based on these metrics, e.g. latency, size and accuracy, is generated (step S208) and this is used to update the selection of the CNN and FPGA (step S210). As an illustration of the complexity of this implementation, Figure 7 plots some Pareto-optimal points, for example as described in "Multiobjective Optimization, Interactive and Evolutionary Approaches" by Branke et al, published by Springer 2008. The CNN accuracy in NASBench is precomputed and stored in a database, and the FPGA accelerator model described above runs quickly on a desktop computer. This allows the entire codesign search space to be enumerated with 3.7 billion data points. Pareto-optimal points within the 3.7 billion points are then located by iteratively filtering dominated points from the search space. Dominated points are points which are inferior to at least one other point on all three metrics (area, latency, accuracy). The remaining (non-dominated) points are optimal in at least one of our evaluation metrics (area, latency or accuracy). For our search space, there were only 3096 Pareto-optimal model-accelerator pairs and these are shown in Figure 7.
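For illustration, the dominated-point filtering could be expressed as in the sketch below, following the definition of dominated points given above; a brute-force filter like this only conveys the definition and would not scale to billions of points without a more careful implementation.

```python
def dominates(a, b):
    # a dominates b if a is better on all three metrics: smaller area,
    # smaller latency and higher accuracy (the definition used in [91]).
    return (a["area"] < b["area"]
            and a["latency"] < b["latency"]
            and a["accuracy"] > b["accuracy"])

def pareto_front(points):
    # Keep only the points that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points)]
```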
[92] As Figure 7 shows, there is a three-way trade-off between area, latency and accuracy - to improve one, the other two must degrade. As shown in the scatter plot, the search space consists approximately of concentric accuracy-latency trade-off curves, each at a different accelerator area. By modifying the CNN, we roughly move along the concentric accuracy-latency curves. By changing the accelerator hardware, we move across a horizontal line (thus affecting both latency and area).
[93] Figure 8 compares the performance of the co-designed CNN and FPGA with models and accelerators found using other methods such as GoogLeNet [27], ResNet [11] and SqueezeNet [12]. ChaiDNN was hand-optimized to run both GoogLeNet and ResNet and, as shown in Figure 8, the latency of GoogLeNet is very close to the Pareto Front (i.e. the method described above). However, ResNet is much farther away from the Pareto Front. Even though it improves on accuracy compared to GoogLeNet, it is three times away from the Pareto Front on latency as shown in Figure 8. This demonstrates the power of co-designing the model and accelerator compared to sequential design of the model followed by the accelerator.
[94] Figures 9a to 9d plot the accuracy-latency Pareto frontier for single and dual convolution engines at different area constraints. As described above, the configurable parameter ratio_conv_engines decides whether there are single or dual engines, and the ratio of DSPs allocated to each of the dual engines. This affects the speed at which 1x1 and 3x3 convolutions run. This accelerator parameter creates an interesting trade-off with the CNN search space. First, a CNN cell needs to be easily parallelizable to benefit from the parameter ratio_conv_engines being less than 1. Second, based on the ratio of 3x3:1x1 operations in the CNN cell, a different ratio_conv_engines will be more efficient. For this parameter, we demonstrate how codesign leads to optimal results and finds the right combination of CNN and accelerator for the best accuracy and efficiency.
[95] Figures 9a to 9d show that dual engines are more efficient with tighter area constraints, while a single general engine is generally better when the area constraint is larger. This demonstrates that dual engines are indeed a useful accelerator feature -- this is a non-obvious conclusion given the interaction between CNN model parallelism, the scheduling algorithm for dual engines, and the ratio of DSPs allocated to each type of convolution engine. Arriving at this conclusion would not be possible if we were studying this accelerator feature with a single CNN model, or even a handful of hand-designed models -- dual engines may simply be unsuitable for that specific handful of hand-designed models. However, through codesign, we can search for the best model to fit a given accelerator feature among hundreds of thousands of CNN models.
[96] Having established that dual specialized engines can be a useful accelerator compute core, we take a closer look at the actual ratio of DSPs allocated to 1x1 and 3x3 convolutions. In a realistic NAS search scenario, we may constrain area for a specific FPGA device, and look for the fastest model that beats a certain accuracy threshold. Figure 10a shows the results of these constraints, when searching through the Pareto-optimal points.
The top four models found for each different ratio_conv_engines value are highlighted. The discovered points demonstrate the interdependence between the CNN model and accelerator architectures. For example, there are more conv1x1 operations in the CNN cell when the accelerator contains more compute for 1x1 convolutions, and similarly for conv3x3.
[097] Figures 10b and 10c show the CNN cells corresponding to ratio_conv_engines equal to 0.33 and 0.67 respectively. As shown, when ratio_conv_engines=0.67, the best model had three 1x1 convolutions and four 3x3s, whereas for ratio_conv_engines=0.33 the counts shifted to five 1x1s and two 3x3s.
[098] Figure 10d compares the execution schedule for the CNN in Figure 10c run on either its codesigned accelerator or a "different" accelerator, i.e. the accelerator that was codesigned for the CNN in Figure 10b. Both designs were subject to the same area constraint. As the figure shows, latency on the codesigned accelerator is much lower (48 ms vs. 72 ms), and utilization of the convolution engines is much higher, whereas on the "different" accelerator it is clear that the 1x1 engine is underutilized, while the 3x3 engine becomes the bottleneck.
[099] Figure 11 shows the overall landscape of Pareto-optimal codesigned CNN model accelerator pairs with respect to the parameter ratio_conv_engines. As the plot shows, when more DSPs are allocated for 1x1 convolutions (ratio=0.25), the Pareto-optimal designs have low accuracy. Conversely, when more compute is assigned to 3x3 convolutions (ratio=0.67), we get higher-accuracy points. Indeed, this likely follows from the fact that increased use of 3x3 convolutions leads to higher accuracy. Additionally, a single convolution engine seems to be superior for low latency designs. Furthermore, when ratio=0.5 or 0.33, we find similar points. We can continue to draw useful observations in this way to help guide the manual design of accelerators. However, as described above, the aim is to automate the search using NAS.
[100] A machine-learning task (e.g. image classification) can be represented as a DNN search space, and the hardware accelerator can be expressed through its parameters (forming an FPGA search space). As shown in Figure 2a, a reward based on metrics, e.g. latency, size and accuracy, is generated (step S208) and this is used to update the selection of the CNN and FPGA (step S210). These steps may be carried out using multiobjective optimization (MOO) of latency, accuracy and area, and different search algorithms for navigating the codesign search space as described below.
[101] As described above, there is a fundamental trade-off between the three metrics and thus there is no trivial solution to the optimization problem. Additional steps must therefore be taken in order to be able to define "better" and "worse" codesigns. Ultimately, we want a function which takes the metrics of interest and returns a scalar value, interpreted as the quality of the related codesign. We will use this function as our reward function R in the REINFORCE algorithm shown above.
[102] Two standard approaches to the MOO problem are considered. The first is to combine the three metrics using a weighted sum into one objective function as described in "Multiobjective Optimization, Interactive and Evolutionary Approaches" by Branke et al, published by Springer 2008. The second is to only consider the set of points which have all but one metric below/above a certain threshold and then optimize for the remaining metric (the ε-constraint method). We also consider hybrid approaches where either fewer metrics are constrained and/or the constrained metrics are also considered when calculating the reward function. Formally, a generic MOO reward function used in this work can be defined as:

R(m) = Σ_i w_i·m_i, subject to each metric m_i satisfying its threshold th_i (8)

where m is the vector of metrics we want to optimize for, w is the vector of their weights and th is the vector of thresholds used to constrain the function's domain.
[103] For cases where at least two metrics are summed together we normalize their values to make them more comparable between each other, as different metrics use different units and have values from different ranges. A similar effect could be achieved by adjusting their weights relatively to their absolute values but we found normalized values easier to reason about. That being said, even after normalization it is still not obvious how different metrics contribute to the objective function for a given set of weights.
[104] A small technicality we had to face is that the RL algorithms work by maximizing the reward function, but different metrics require different types of optimization (max for accuracy and min for area and latency). We deal with that by taking negative area and latency as our inputs to the reward function. Whenever we do a weighted sum, we also take care to produce positive values for all the metrics by handling negative values during their normalization.
[105] We explore three different normalization strategies which are described in more detail in "Function-Transformation Methods for Multi-objective Optimization" by Marler et al, published in Engineering Optimization 37, 6 (2005), 551-570. The first is max normalization, which is one of the most common methods and normalizes values with respect to their achievable maximum. For negative values, we consider their absolute value and process them analogously. In that case, our normalization function can be formally defined as:

N(x) = x / |x_max| (9)

[106] Another common normalization method is min-max normalization, in which both the minimum and maximum of a metric are considered. This range is then mapped linearly to the [0,1] range. The specific function has the following form:

N(x) = (x - x_min) / (x_max - x_min) (10)

[107] The third normalization method is standard deviation normalization, in which values are normalized using their standard deviation. The equation takes the form:

N(x) = x / σ_x (11)

[108] By combining the generic weighted sum equation (equation 8) with the chosen normalization function (one of equations 9 to 11, for example equation 10), the MOO problem can be defined as:

R(-ar, -lat, acc) = w1·N(-ar) + w2·N(-lat) + w3·N(acc)
max_{s∈S} R(-ar, -lat, acc) (12)

where ar is area, lat is latency, acc is accuracy, w1, w2 and w3 are the weights for area, latency and accuracy respectively, and the optimisation is performed over the search space s∈S such that the evaluator output E(s)=m satisfies the given constraints (e.g. latency below a certain value).
[109] If a search point does not meet a specified constraint, a punishment function Rv is used as feedback for the controller to deter it from searching for similar points that fall below our requirements. Since the standard reward function is positive and we want to discourage the controller from selecting invalid points, a simple solution is to make the punishment function negative. We use the same function as the standard reward function R but with two changes: 1) instead of (ar, lat, acc), we take (ar - ar_th, lat - lat_th, acc - acc_th) and 2) we take its opposite to make Rv negative, thus informing the controller that this was a bad selection.
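As a minimal illustrative sketch only, the reward of equation 12 with min-max normalization (equation 10) and the shifted, negated punishment function could be written as below. The function names, the metric ranges in bounds and the example threshold handling are assumptions, not the exact implementation used.

```python
def min_max_normalize(x, lo, hi):
    """Equation 10: map a metric value linearly onto [0, 1] given its range."""
    return (x - lo) / (hi - lo)

def reward(area, latency, accuracy, weights, bounds):
    """Weighted sum of normalized metrics (equation 12).

    Area and latency are negated so that maximizing the reward minimizes
    them; `bounds` gives the assumed (min, max) range used to normalize
    each (negated or raw) metric.
    """
    w_ar, w_lat, w_acc = weights
    n_ar = min_max_normalize(-area, *bounds["neg_area"])
    n_lat = min_max_normalize(-latency, *bounds["neg_latency"])
    n_acc = min_max_normalize(accuracy, *bounds["accuracy"])
    return w_ar * n_ar + w_lat * n_lat + w_acc * n_acc

def punishment(area, latency, accuracy, thresholds, weights, bounds):
    """Negative feedback for a point violating a constraint.

    Metrics are shifted by their thresholds and the standard reward is
    negated, so larger violations produce a more negative signal.
    """
    ar_th, lat_th, acc_th = thresholds
    return -reward(area - ar_th, latency - lat_th, accuracy - acc_th,
                   weights, bounds)

# Example usage with assumed metric ranges and area-prioritising weights.
weights = (0.5, 0.25, 0.25)
bounds = {"neg_area": (-200.0, 0.0),       # area assumed in [0, 200] mm^2
          "neg_latency": (-150.0, 0.0),    # latency assumed in [0, 150] ms
          "accuracy": (0.0, 1.0)}
print(reward(132.0, 41.8, 0.742, weights, bounds))
```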
[110] Different weights for the MOO problem may also be considered to explore how their selection affects the outcome of the search. For example, the weights may be set to be equal for each metric, e.g. 1/3, or the weights may be set to prioritise one metric, e.g. by setting w1 to 0.5 and w2 and w3 to 0.25 to prioritise area when solving the optimization problem. Each weight may be in the range [0,1] with the sum of the weights equal to 1.
[111] There are two approaches for updating the selection of the CNN and FPGA (step S210). In a first approach, both sub-search spaces may be considered together so that the algorithm is implemented directly on both spaces. Such an approach may be termed a combined search. This strategy has the ability to update both the CNN and the accelerator in each step, and is therefore able to make faster changes to adapt to the reward function. However, the combined search space (i.e. S_NN x S_FPGA) is much larger, which may make it more difficult to find the best points (i.e. best selections). Accordingly, each experiment is run for a maximum number of steps, e.g. 10,000 steps, and the metrics are evaluated so that the reward function may be calculated.
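Purely as an illustrative sketch of the combined search loop, one point of the joint space S_NN x S_FPGA could be visited per step as below. The helper names evaluate and reward_fn are assumptions, and a uniform random policy stands in for the RL controller described above.

```python
import random

MAX_STEPS = 10_000

def combined_search(cnn_space, fpga_space, evaluate, reward_fn):
    """Visit one point of the joint space S_NN x S_FPGA per step.

    `evaluate` returns (area, latency, accuracy) for a pair and `reward_fn`
    converts those metrics into a scalar.  A real controller would be
    updated with the reward after every step (e.g. a REINFORCE-style
    gradient step); here we only track the best point seen so far.
    """
    best = None
    for _ in range(MAX_STEPS):
        cnn = random.choice(cnn_space)
        fpga = random.choice(fpga_space)
        area, latency, accuracy = evaluate(cnn, fpga)
        r = reward_fn(area, latency, accuracy)
        if best is None or r > best[0]:
            best = (r, cnn, fpga)
    return best
```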
[112] When running an actual search, it is important to consider the invalid and constrained points which can be selected by the controller(s), as well as the appropriate reaction when such points are identified. This behaviour does not fit within the standard MOO formulation because MOO does not have the notion of exploration; rather, it simply provides a means of qualifying multi-dimensional points in a comparable way. However, when running a search, the reward function has additional meaning because it is directly used to guide the controller(s) towards desired outcomes. Therefore, simply ignoring invalid and constrained points can potentially lead to situations in which the controller's feedback is related to only one metric, which can later lead to the controller selecting more points which maximise that metric without considering the other two. Thus, it is preferred to provide a complementary reward function to use with invalid and constrained points whenever we use weights equal to zero for some of the metrics within the standard reward function. Otherwise, we risk a situation in which the controller(s) simply does not consider some of the metrics when learning to navigate the space.
[113] As described above, the method co-designs the FPGA and CNN, for example by use of a combined search. As an alternative to a combined search, the search may have explicitly defined specialized phases during which one part (e.g. the FPGA design) is fixed or frozen so that the search focusses on the other part (e.g. the CNN design) or vice versa. Figure 12 illustrates an alternative architecture which may be used to implement the phased searching. As shown, there are two different controllers 1200, 1220 and a single evaluator 1222. A first controller 1200 learns to optimize CNN structure and a second controller 1220 to select the best combination of options for the FPGA design.
[114] When running such a search, the number of steps for each CNN phase may be greater than the number of steps for each FPGA phase, e.g. 1000 compared to 200 steps. The two phases are interleaved and repeated multiple times, until we hit the total number of steps (e.g. 10,000 steps). This phased solution is used to find a globally optimal solution.
This divide-and-conquer technique considers the two search spaces separately which may make it easier to find better locally-optimal points (per search space). However, mutual impact between the phases is limited, which may make it more difficult to adapt the CNN and accelerator to each other optimally, e.g. to perform a particular task.
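As a sketch only, the interleaving of CNN and FPGA phases could be organised along the following lines. The phase lengths are taken from the description above, while the RandomController class and the evaluate_and_reward helper are assumptions standing in for the two RL controllers and the evaluator.

```python
import random

TOTAL_STEPS = 10_000
CNN_PHASE_STEPS = 1_000
FPGA_PHASE_STEPS = 200

class RandomController:
    """Stand-in for an RL controller: samples uniformly, ignores rewards."""
    def __init__(self, space):
        self.space = space
    def sample(self):
        return random.choice(self.space)
    def update(self, reward):
        pass  # a real controller would take a policy-gradient step here

def phased_search(cnn_ctrl, fpga_ctrl, evaluate_and_reward):
    """Interleave CNN phases (FPGA frozen) and FPGA phases (CNN frozen)."""
    cnn, fpga = cnn_ctrl.sample(), fpga_ctrl.sample()
    steps = 0
    while steps < TOTAL_STEPS:
        for _ in range(min(CNN_PHASE_STEPS, TOTAL_STEPS - steps)):   # CNN phase
            cnn = cnn_ctrl.sample()
            cnn_ctrl.update(evaluate_and_reward(cnn, fpga))
            steps += 1
        for _ in range(min(FPGA_PHASE_STEPS, TOTAL_STEPS - steps)):  # FPGA phase
            fpga = fpga_ctrl.sample()
            fpga_ctrl.update(evaluate_and_reward(cnn, fpga))
            steps += 1
    return cnn, fpga
```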
[115] Figures 13a to 13c illustrate the top search results compared to the top 100 Pareto-optimal points. Each of the Figures shows the results of the combined and phased searches described above. As a baseline, these proposed searches are compared to a separate search strategy in which the CNN search space is first searched for a CNN and then the accelerator design space is searched, e.g. the sequential search method of the prior art. There are two separate phases and not multiple interleaved phases as described above.
The search for the CNN takes place in 8,333 steps and the FPGA in 1,334 steps. Each of the top search results shown in Figures 13a to 13c maximizes the reward function for one of three experimental variations. Each experiment is repeated ten times and thus there are a maximum of ten points for each strategy. A good search algorithm would be expected to produce results in the vicinity of the top Pareto-optimal points.
[116] Figure 13a shows the results for the "unconstrained" experiment in which there are no constraints imposed in the reward function of equation 12 above. The weights are arbitrarily chosen as w(area, lat, acc) = (0.1, 0.8, 0.1). As shown in Figure 13a, this experiment may be useful to simply search for many good points to understand the co-design space. Figure 13b shows the results for the experiment in which a single constraint is imposed, namely that latency is less than 100 ms. The weights are chosen as w(area, lat, acc) = (0.1, 0, 0.9). This experiment mimics the scenario in which an end-user knows the task and real-time requirements but is not sure which FPGA device to choose; the accuracy attainable at each device size may aid such a decision. Figure 13c shows the results for the experiment in which two constraints are imposed, namely that accuracy is greater than 0.92 and the area is less than 100 mm2. The weights are chosen as w(area, lat, acc) = (0, 1, 0) to optimize latency. By imposing two constraints, the experiment becomes a single-objective optimization. Such an experiment may be useful when there is a maximum FPGA area budget and a minimum tolerated accuracy for the application.
[117] Figures 14a to 14c show the reward values for each of the separate, combined and phased search strategies in the three experimental scenarios. Figure 14a shows the results for the "unconstrained" experiment in which there are no constraints, Figure 14b shows the results for the experiment in which a single constraint is imposed, and Figure 14c shows the results for the experiment in which two constraints are imposed. Only the reward function R, and not the punishment function Rv, is shown on the plots.
[118] Figures 13a to 14c show that the separate search cannot consistently find good points within the constraints. This is because it searches for the most accurate CNN model without any context of the HW target platform. Figure 13b shows two "lucky" separate points that are superior to other searches and Figure 14b shows the higher reward. However, the plots do not show that the eight remaining points all have latencies that are much higher than the constraint. This is true for all of Figures 13a to 13c in which only a few separate points fit within the displayed axes and the rest of the points are generally high accuracy but very low efficiency. This shows the randomness of CNNs that are designed without HW context. They may or may not fall within efficiency constraints based on chance, further motivating the need for a joint co-design methodology.
[119] Figures 13a to 14c show that the phased and combined search strategies improve upon the separate search because they take the HW accelerator into account and, more importantly, they consider all variants of the hardware accelerator and all variants of the CNN simultaneously. Figures 14a to 14c show that the combined search strategy is generally better in the unconstrained experiment shown in Figure 14a, whereas the phased search strategy achieves a higher reward for both of the constrained experiments shown in Figures 14b and 14c. This is also shown in Figure 13c, in which the phased search gets close to the ideal points. However, Figure 13c also shows a shortcoming of the phased search, namely that it is more prone to missing the specified constraints, perhaps because there are only limited opportunities to switch from the CNN search phase to the FPGA search phase within the 10,000-step limit of the experiment. Increasing the number of search steps should mean that the phased search is able to find points within the constraints, but this increases the run-time of the experiment.
[120] More generally, the phased search is slower to converge compared to the combined search. This is highlighted in Figures 14a to 14c which show that the phased search goes through a few exploration phases before finding its best result. Thus, both the phased and combined searches appear to have merits relative to one another. The combined search appears to work better when the search is unconstrained and is generally faster to converge to a solution. The phased search finds better points when there are constraints but typically requires more search steps to do so.
[121] As explained above with reference to Figure 3, the CNN search space used in the analysis described above is termed NASBench. In this search space, the CNNs have been trained to perform ImageNet classification. To validate the results shown above, we use the co-design method to discover a CNN model-accelerator pair which optimises a different task, e.g. Cifar-100 image classification. It is noted that Cifar-100 image classification is almost as difficult as ImageNet classification, which is reflected by its Top-1 accuracy numbers typically being similar to those for ImageNet [19]. However, Cifar-100 has a much smaller training set (60K vs 1M images) and thus training a CNN to perform Cifar-100 image classification is approximately two orders of magnitude faster than ImageNet classification. This makes it more feasible for the infrastructure available for the experiments described in this application.
[122] All the discovered CNNs must be trained from scratch to perform such a task. Nevertheless, the same search space S_CNN which is described above may still be used.
Training follows that described in "NAS-Bench-101: Towards Reproducible Neural Architecture Search" by Ying et al, published in Feb 2019 in arXiv e-prints. There are 108 epochs of training using standard data augmentation (padding, random crop and flipping), an initial learning rate of 0.1 with cosine decay and a weight decay of 10^-4. Training each new CNN takes approximately 1 GPU-hour, so to be able to train many models we parallelize co-design NAS over six machines, each with eight Nvidia-1080 GPUs, allowing 48 models to be trained in parallel.
[123] The co-design search is run with two constraints combined into one. Specifically, latency and area are combined into a metric termed performance per area (perf/area) and this metric is constrained to a threshold value. Accuracy is then maximised under this constraint. The performance per area threshold is gradually increased according to (2, 8, 16, 30, 40) and the search is run for approximately 2300 valid points in total, starting with 300 points at the first threshold value and increasing to 1000 points for the last threshold value. This appeared to make it easier for the controller to learn the structure of high-accuracy CNNs. The combined search strategy described above is used because it is faster to converge on a solution.
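A sketch of this gradually tightening schedule is given below for illustration only. The helper names are assumptions, and only the first (300) and last (1000) per-threshold budgets are stated in the text; the intermediate budgets here are assumed so that the total is roughly 2300 valid points.

```python
# (perf/area threshold, number of valid points to collect at that threshold)
SCHEDULE = [(2, 300), (8, 300), (16, 300), (30, 400), (40, 1000)]

def constrained_search(sample_pair, evaluate, update_controller):
    """Maximise accuracy subject to a gradually tightening perf/area constraint."""
    visited = []
    for threshold, budget in SCHEDULE:
        valid = 0
        while valid < budget:
            cnn, fpga = sample_pair()
            accuracy, perf_per_area = evaluate(cnn, fpga)
            if perf_per_area >= threshold:
                update_controller(accuracy)       # reward: accuracy under constraint
                visited.append((cnn, fpga, accuracy, perf_per_area))
                valid += 1
            else:
                update_controller(-1.0)           # punishment for an invalid point
    return visited
```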
[124] Figure 15 plots the top-1 accuracy and perf/area of various points searched using the combined search. The top 10 points among the model-accelerator points visited at each threshold value are plotted. The plot also shows the ResNet and GoogLeNet cells within the CNN skeleton shown in Figure 3, each paired with its best accelerator in terms of perf/area. This is a difficult baseline to beat as we are comparing against two well-known high-accuracy CNN cells implemented on their best possible corresponding accelerator in our FPGA search space. However, as the plot shows, we find many points that exceed both the accuracy and efficiency of both the ResNet and GoogLeNet baselines.
[125] The best two points are labelled Cod-1 and Cod-2 respectively. Their performance is shown in the table below:

CNN              Accuracy (%)    Perf/Area (img/s/cm2)    Latency (ms)    Area (mm2)
ResNet Cell      72.9            12.8                     42.0            186
Cod-1            74.2 (+1.8%)    18.1 (+41%)              41.8 (-0.5%)    132 (-29%)
GoogLeNet Cell   71.5            39.3                     19.3            133
Cod-2            72.0 (+0.7%)    40.6 (+3.3%)             18.5 (-4.2%)    132 (-0.8%)

[126] Cod-1 improves upon ResNet by 1.8% accuracy while simultaneously improving perf/area by 41%. These are considerable gains on both accuracy and efficiency. Cod-2 shows more modest improvements over GoogLeNet but still beats it on both efficiency and accuracy while running 4.2% faster in terms of absolute latency.
[127] Figures 16a and 16b show the model structure of Cod-1 and Cod-2 respectively and the table below lists the HW parameters.
HW parameter             Cod-1           Cod-2
Filter_par, pixel_par    (16, 64)        (16, 64)
Buffer depths            (4K, 2K, 4K)    (8K, 2K, 2K)
Mem_interface_width      256             512
Pool_engine              False           False
Ratio_conv_engines       0.33            0.25

[128] Cod-1 manages to beat ResNet accuracy but does use an important ResNet feature: skip connections and element-wise addition, as shown by the rightmost branch of the cell in Figure 16a. On the hardware side, both Cod-1 and Cod-2 use the largest convolution engine and avoid the use of a dedicated pooling engine. However, the other HW parameters are tailored for each CNN. For example, both the input buffer size and the memory interface width are smaller for Cod-1 than Cod-2. This may be due to the fact that the Cod-1 CNN uses a larger number of smaller convolutions compared to Cod-2.
[129] It is possible that there are better points than Cod-1 and Cod-2 because the search space has approximately 3.7 billion points in total. Only approximately 1000 points were explored before finding Cod-1 and approximately 2000 points before finding Cod-2. This highlights the speed of convergence of the controller when using the combined search. It is also effective at finding good designs, especially when properly tuned with representative reward functions and search strategies as described above.
[130] Figure 17 shows an alternative system which may be used to search the CNN search space as a stand-alone improvement to the arrangement shown in Figure 1a or incorporated in the arrangement of Figure 2a. In this arrangement, the controller 1300 proposes a model architecture for the CNN which is fed to a cut-off module 1312. The cut-off module uses hardware metrics, such as thresholds on latency and memory footprint, as a cut-off to provide quick feedback to the controller 1300. If the proposed model does not meet the hardware criteria, the controller receives feedback to discourage it from proposing similarly underperforming models. This allows the controller 1300 to focus on proposing models that meet the hardware constraints. If the proposed model does meet the hardware criteria, the model is sent to the evaluator 1322 for a more detailed evaluation, e.g. to generate a reward function, as described above.
[131] The cut-off module 1312 may be dynamic so that the hardware metrics may change as the search progresses to improve the models which are located by the search. For example, if the initial latency threshold is 100ms but many models have a latency equal to 50 ms, the latency threshold may be updated on the fly (e.g. in real-time) to e.g. 60 ms. In this way, more models will be excluded from the search and the overall searching process will be expedited.
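A dynamic cut-off of this kind could be sketched as below, purely for illustration. The class name DynamicCutoff and the update rule (scaling the best latency seen so far by an assumed margin) are assumptions; the document does not prescribe a particular update policy.

```python
class DynamicCutoff:
    """Quick hardware-metric filter placed in front of the full evaluator.

    Starts from a fixed latency threshold and tightens it as the search
    finds clearly faster models, so the controller stops spending full
    evaluations on slow candidates.
    """
    def __init__(self, latency_threshold_ms=100.0, margin=1.2):
        self.threshold = latency_threshold_ms
        self.margin = margin          # headroom kept above the best latency seen
        self.best_latency = None

    def passes(self, estimated_latency_ms):
        accepted = estimated_latency_ms <= self.threshold
        if accepted:
            if self.best_latency is None or estimated_latency_ms < self.best_latency:
                self.best_latency = estimated_latency_ms
            # tighten on the fly, e.g. a best of 50 ms gives a new 60 ms cut-off
            self.threshold = min(self.threshold, self.margin * self.best_latency)
        return accepted

cutoff = DynamicCutoff()
print(cutoff.passes(50.0))   # True; threshold drops from 100 ms to 60 ms
print(cutoff.passes(80.0))   # False; rejected, negative feedback to the controller
```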
[132] As schematically illustrated, the cut-off module may simultaneously use a plurality of hardware devices, H/W 1, H/W 2, ..., H/W N, to search for models that fit all devices.
[133] Figure 18 is a more sophisticated version of the system of Figure 17 in which the cut-off module 1412 comprises a hardware runtime estimator 1430 and a validator 1432. The hardware runtime estimator 1430 is used to predict the hardware performance, e.g. latency, of a model proposed by the controller on a target hardware platform(s). This is not a trivial task because the total number of FLOPs needed to run a proposed model architecture, or its parameter size, has a non-linear relationship with latency on a specific hardware platform due to variations in on/off-chip memory utilization, memory footprint, degree of parallelism, area usage, clock speed or any other relevant task or hardware metric.
[134] The hardware runtime estimator 1430 comprises a statistical model module 1440, a discriminator 1442, a theoretical hardware model module 1444 and a deployment module 1446. The statistical model module 1440 is used to predict (i.e. estimate) the hardware metrics and send these to the discriminator 1442. Initially, the statistical model is based on a theoretical model which is computed in the theoretical hardware model module 1444 to give a baseline prediction which is sent to the statistical model module 1440. The models may suffer from poor prediction quality, particularly the initial models. Accordingly, the discriminator 1442 monitors the confidence of the results from the statistical model.
[135] When the confidence in the estimated hardware metrics is low (e.g. below a confidence threshold), the proposed architecture may be sent to a deployment module 1446 for deployment on the target hardware, e.g. one of the hardware devices H/W 1, H/W 2, ..., H/W N. The latency (or other hardware metric) is measured and this measurement is sent to the statistical model module 1440 to update the statistical model. This measurement is also sent to the discriminator 1442 to update the monitoring process within the discriminator. The actual measurement rather than the estimated value is then sent with the model to the validator 1432. When the confidence in the estimated hardware metrics is good (e.g. above a threshold), the model is sent straight to the validator 1432.
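A minimal sketch of this estimator/discriminator hand-off is given below for illustration only. The interfaces are assumptions: statistical_model.predict is assumed to return a latency estimate together with a confidence value, and deploy_on_hardware stands in for the deployment module.

```python
def estimate_or_measure(model, statistical_model, deploy_on_hardware,
                        confidence_threshold=0.8):
    """Return a latency value for `model`, preferring the cheap estimate.

    `statistical_model.predict` is assumed to return (latency_ms, confidence).
    When confidence is too low, the model is deployed on the target hardware,
    the measurement replaces the estimate and is also fed back to refit the
    statistical model.
    """
    latency, confidence = statistical_model.predict(model)
    if confidence >= confidence_threshold:
        return latency, "estimated"
    measured = deploy_on_hardware(model)         # slow but exact (deployment module)
    statistical_model.update(model, measured)    # improves future estimates
    return measured, "measured"
```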
[136] Once the validator 1432 has received the model with its estimated hardware value(s) or measured hardware value(s), the validator 1432 checks if the proposed architecture meets all the hardware metrics. In other words, the validator 1432 may compare the hardware value(s) to the defined thresholds to determine if the hardware constraints are met.
If the proposed model does meet the hardware criteria, the model is sent to the evaluator 1422 for a more detailed evaluation, e.g. to generate a reward function, as described above. Accordingly, it is clear that in this arrangement, the controller 1400 sends all proposed model architectures for the CNN to the hardware runtime estimator 1430. Specifically, as shown in the Figure, the proposed model architectures are sent to the statistical model module 1440 and the discriminator 1442.
[137] The method described in Figure 18 may be used to model the steps of implementation and evaluation in Figure 2a (step S204 and step S206). This may result in a quicker run time because it is not necessary to poll hardware for every iteration. It is also noted that the overall search procedure may be configured by providing an overall GPU time budget. Thus, at the end of the computational budget, we get the best model meeting all the requirements.
[138] Figure 19 illustrates a method for continuously updating the statistical model used in the statistical model module. The method may be carried out in the runtime estimator using one or more of the modules therein. As shown, in a first step, the proposed model of the CNN is received (step S1500), e.g. from the controller as described above. Before running the statistical model, there is a determination as to how many proposed models have previously been received (step S1502). If the process has run fewer than a threshold number, e.g. N, of iterations of the statistical model, the statistical model is applied to the received model to predict the hardware parameters, such as latency, which occur when the selected model is run on the FPGA (step S1504). The process then loops back to the start to repeat for the next received model.
[139] If there have already been more than N iterations of the statistical model, the proposed model is run on actual hardware, e.g. by using the deployment module and one of the plurality of hardware modules shown in Figure 18, to provide real measurements of the hardware parameters (step S1506). The statistical model is also applied to predict the hardware parameters (step S1508). These steps are shown as sequential but it will be appreciated that they may be performed simultaneously or in the reverse order. If there is a discrepancy between the predicted and measured parameters, the measured parameters may be used to update the statistical model (step S1510). The process then loops back to the start to repeat for the next received model.
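By way of illustration only, the flow of Figure 19 could be sketched as below. The value of N, the discrepancy tolerance, the interface of the statistical model and the decision to restart the counter after each hardware measurement (so that hardware is polled only infrequently, as discussed next) are all assumptions.

```python
N = 50   # assumed number of estimate-only iterations between hardware runs

class ModelUpdater:
    """Implements the flow of Figure 19 for successive proposed models."""
    def __init__(self, statistical_model, measure_on_hardware):
        self.statistical_model = statistical_model   # .predict() -> latency estimate
        self.measure = measure_on_hardware           # deploys and measures latency
        self.count = 0

    def handle(self, model, tolerance_ms=1.0):
        self.count += 1
        predicted = self.statistical_model.predict(model)
        if self.count < N:
            return predicted                          # estimate only (step S1504)
        measured = self.measure(model)                # real measurement (step S1506)
        if abs(measured - predicted) > tolerance_ms:  # discrepancy check
            self.statistical_model.update(model, measured)   # step S1510
        self.count = 0       # assumed: restart the count so hardware is polled
        return measured      # only occasionally
```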
[140] Such a method allows scaling and improves run times when compared to a method which always uses actual hardware to determine performance. For example, multiple threads or processes may use the statistical model to search for new CNN models, whilst a single actual hardware device is used to update the statistical model infrequently. The statistical model is likely to be more accurate and up-to-date by using the regular measurements. A statistical model only performs as well as the training data from which it was created. As the searches for new CNN models are carried out, they may move into different search spaces including data on which the original model was not trained. Therefore, updating the statistical model with measurements helps to ensure that the statistical model continues to predict representative hardware metrics which in turn are used to guide the search. Any error between the predicted and measured hardware metrics may also be used to tune the number of iterations between implementing the CNN model on the hardware. For example, when the error increases, the number of iterations between polling the hardware may be reduced and vice versa.
[141] Figure 20 shows how a similar method to that shown in Figure 19 may be used by the discriminator of Figure 18 to help the discriminator learn how to distinguish between trustworthy predictions and invalid predictions. The proposed technique may improve the awareness of the hardware within the selection process by generating a much better statistical model without impacting significantly on the run time of the selection process.
[142] As shown in steps S1600 and S1602, the discriminator receives the proposed model, e.g. from the controller, and the predicted hardware metrics, e.g. from the statistical model module. These steps are shown in a particular order but it is appreciated that the information may be received simultaneously or in a different order. The discriminator determines whether or not the predicted hardware metrics may be trusted (step S1604) and, in this method, when the discriminator determines that the predicted metrics can be trusted, there is an optional additional step of the discriminator determining whether or not the predicted metrics need to be verified (step S1606). The verification decision may be made according to different policies, e.g. after a fixed number of iterations, at random intervals or by assessing outputs of the system. If no verification is required, the predicted HW parameters are output (step S1608), e.g. to the validator to determine whether or not to pass the model to the evaluator as described above.
[143] When the discriminator determines that the predicted metrics cannot be trusted, the proposed model is run on actual hardware to obtain measurements of the hardware metrics (e.g. latency) which are of interest (step S1610). As described above in relation to Figure 19, when there is a discrepancy between the predicted and measured parameters, the measured parameters may be used to update the statistical model (step S1612). The measured HW parameters are output (step S1614), e.g. to the validator to determine whether or not to pass the model to the evaluator as described above. Similarly, when the discriminator determines that the predicted metrics need to be verified, the steps of running the proposed model on hardware (step S1610), updating the statistical model as needed (step S1612) and outputting the measured parameters (step S1614) are performed. In all cases, once the measured or predicted parameters are output, the process then loops back to the start to repeat for the next received model.
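One possible shape for this discriminator logic is sketched below for illustration only. The trust test based on a trust_score input and the random-interval verification policy are assumptions; the document allows several verification policies.

```python
import random

def discriminate(model, predicted_latency, trust_score, statistical_model,
                 run_on_hardware, trust_threshold=0.8, verify_probability=0.05):
    """Decide whether a predicted hardware metric can be passed on as-is.

    Untrusted predictions (step S1604) and the occasional trusted prediction
    picked for verification (step S1606, here a random-interval policy) are
    replaced by a real measurement (step S1610), which also refreshes the
    statistical model when it disagrees with the prediction (step S1612).
    """
    trusted = trust_score >= trust_threshold
    verify = trusted and random.random() < verify_probability
    if trusted and not verify:
        return predicted_latency                     # step S1608
    measured = run_on_hardware(model)                # step S1610
    if measured != predicted_latency:
        statistical_model.update(model, measured)    # step S1612
    return measured                                  # step S1614
```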
[144] In the description above, the terms hardware metrics and hardware parameters are used interchangeably. It may be difficult to estimate or measure certain metrics, e.g. latency, and thus proxy metrics such as FLOPs and model size may be used as estimates for the desired metrics. The statistical models described above may be trained using hardware measurements which have been previously captured for particular types of CNN. The statistical models may be built using theoretical models which approximate hardware metrics (such as latency) from model properties (such as number of parameters, FLOPs, connectivity between layers, types of operations etc.). The theoretical models may have distinct equations for each layer type (e.g. convolution, maxpool, relu, etc.) with varying accuracy/fidelity for each layer. Theoretical models may be used instead of statistical models.
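For instance, a theoretical model of the kind mentioned could approximate latency layer-by-layer from model properties. The sketch below is an assumption for illustration only: the peak-throughput figures and the per-layer formulas are not the patent's actual equations, they merely show how distinct equations per layer type could be combined.

```python
# Assumed peak throughputs of the target accelerator (illustrative values only).
PEAK_MACS_PER_S = 1.0e12     # multiply-accumulates/s available to convolutions
PEAK_ELEMS_PER_S = 5.0e10    # element operations/s for pooling / ReLU layers

def conv_latency_ms(out_h, out_w, out_c, in_c, k):
    """Rough 2D-convolution latency from its multiply-accumulate count."""
    macs = out_h * out_w * out_c * in_c * k * k
    return 1000.0 * macs / PEAK_MACS_PER_S

def elementwise_latency_ms(out_h, out_w, out_c):
    """Rough latency for pooling / ReLU style layers."""
    return 1000.0 * (out_h * out_w * out_c) / PEAK_ELEMS_PER_S

def model_latency_ms(layers):
    """Sum per-layer estimates; `layers` is a list of (kind, shape) tuples."""
    total = 0.0
    for kind, shape in layers:
        if kind == "conv":
            total += conv_latency_ms(*shape)
        else:                                    # 'pool', 'relu', ...
            total += elementwise_latency_ms(*shape)
    return total

# Example: a 3x3 convolution followed by ReLU on a 56x56x64 feature map.
print(model_latency_ms([("conv", (56, 56, 64, 64, 3)),
                        ("relu", (56, 56, 64))]))
```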
[145] In the description above, reference has been made to co-designing or designing a CNN and an FPGA accelerator. However, it will be appreciated that the method is not just applicable to CNNs but is readily extendable to any neural network using the techniques described above. The method is also more broadly applicable to any parametrizable algorithm which is beneficially implemented in hardware, e.g. compression algorithms and cryptographic algorithms. It will be appreciated that for the method to work, it is necessary to have a well-defined algorithm search space, i.e. the parametrizable algorithm must be definable by virtue of at least one configurable parameter. For example, in the method described above, the search space is defined by the use of the model described in relation to Figure 3. However, it will be appreciated that this model was merely illustrative and other models of parametrizable algorithms may be used, e.g. by setting the parameters of the neural network which are to be modelled. Similarly, it will be appreciated that the method may be applicable to other types of hardware and not just FPGAs.
[146] The controller(s), evaluator and other modules may include any suitable processing unit capable of accepting data as input, processing the input data in accordance with stored computer-executable instructions, and generating output data. The controller(s), evaluator and other modules may include any type of suitable processing unit including, but not limited to, a central processing unit, a microprocessor, a Reduced Instruction Set Computer (RISC) microprocessor, a Complex Instruction Set Computer (CISC) microprocessor, a microcontroller, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a System-on-a-Chip (SoC), a digital signal processor (DSP), and so forth. In addition, any of the functionality described as being supported by the controller(s), evaluator and other modules may be implemented, at least partially, in hardware and/or firmware across any number of devices.
[147] Certain aspects of the disclosure are described above with reference to block and flow diagrams of systems, methods, apparatuses, and/or computer program products according to example embodiments. It will be understood that one or more blocks of the block diagrams and flow diagrams, and combinations of blocks in the block diagrams and the flow diagrams, respectively, may be implemented by execution of computer-executable program instructions. Likewise, some blocks of the block diagrams and flow diagrams may not necessarily need to be performed in the order presented, or may not necessarily need to be performed at all, according to some embodiments. Further, additional components and/or operations beyond those depicted in blocks of the block and/or flow diagrams may be present in certain embodiments.
[148] Accordingly, blocks of the block diagrams and flow diagrams support combinations of means for performing the specified functions, combinations of elements or steps for performing the specified functions, and program instruction means for performing the specified functions. It will also be understood that each block of the block diagrams and flow diagrams, and combinations of blocks in the block diagrams and flow diagrams, may be implemented by special-purpose, hardware-based computer systems that perform the specified functions, elements or steps, or combinations of special-purpose hardware and computer instructions.
[149] Although specific embodiments of the disclosure have been described, one of ordinary skill in the art will recognize that numerous other modifications and alternative embodiments are within the scope of the disclosure. For example, any of the functionality and/or processing capabilities described with respect to a particular system, system component, device, or device component may be performed by any other system, device, or component. Further, while various illustrative implementations and architectures have been described in accordance with embodiments of the disclosure, one of ordinary skill in the art will appreciate that numerous other modifications to the illustrative implementations and architectures described herein are also within the scope of this disclosure. Those skilled in the art will recognise that the present techniques have a broad range of applications, and that the embodiments may take a wide range of modifications without departing from any inventive concept as defined in the appended claims.
REFERENCES
[1] Mohamed S. Abdelfattah and Vaughn Betz. 2012. Design Tradeoffs for Hard and Soft FPGA-based Networks-on-Chip. In International Conference on Field-Programmable Technology (FPT). 95-103.
[2] Mohamed S. Abdelfattah, David Han, Andrew Bitar, et al. 2018. DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration. In 28th International Conference on Field Programmable Logic and Applications (FPL). 411-4117.
[3] Juergen Branke, Kalyan Deb, Kaisa Miettinen, and Roman Slowinski. 2008. Multiobjective Optimization, Interactive and Evolutionary Approaches. Springer, Heidelberg, Germany.
[4] Han Cai, Ligeng Zhu, and Song Han. 2018. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. arXiv e-prints (Dec 2018). arXiv:1812.00332
[5] Zachariah Carmichael, Hamed F. Langroudi, Char Khazanov, et al. 2019. Performance-Efficiency Trade-off of Low-Precision Numerical Formats in Deep Neural Networks. arXiv e-prints (Mar 2019). arXiv:1903.10584
[6] Y. Chen, T. Yang, J. Emer, and V. Sze. 2019. Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 2 (June 2019), 292-308.
[7] Xiangxiang Chu, Bo Zhang, Hailong Ma, et al. 2019. Fast, Accurate and Light-weight Super-Resolution with Neural Architecture Search. arXiv e-prints (Jan 2019). arXiv:1901.07261
[8] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2018. Neural Architecture Search: A Survey. arXiv e-prints (Aug 2018). arXiv:1808.05377
[9] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, et al. 2018. A Configurable Cloud-scale DNN Processor for Real-time AI. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA '18). ACM, 1-14.
[10] Song Han, Junlong Kang, Huizi Mao, et al. 2017. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM,75-84.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv e-prints (Dec 2015). arXiv:1512.03385
[12] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, et al. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv e-prints (Feb 2016). arXiv:1602.07360
[13] Xilinx Inc. 2019. Accelerating DNNs with Xilinx Alveo Accelerator Cards. https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf
[14] Xilinx Inc. 2019. ChaiDNN v2 - HLS based Deep Neural Network Accelerator Library for Xilinx Ultrascale+ MPSoCs. https://github.com/Xilinx/CHaiDNN.
[15] Abhishek K. Jain, Xiangwei Li, Pranjul Singhai, Douglas L. Maskell, and Suhaib A. Fahmy. 2016. DeCO: A DSP Block Based FPGA Accelerator Overlay with Low Overhead Interconnect. In 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 1-8. https://doi.org/10.1109/FCCM.2016.10
[16] Weiwen Jiang, Lei Yang, Edwin Sha, et al. 2019. Hardware/Software Co-Exploration of Neural Architectures. arXiv e-prints (Jul 2019). arXiv:1907.04650
[17] Weiwen Jiang, Xinyi Zhang, Edwin H. M. Sha, et al. 2019. Accuracy vs. Efficiency: Achieving Both through FPGA-Implementation Aware Neural Architecture Search. arXiv e-prints (Jan 2019). arXiv:1901.11211
[18] Norman P. Jouppi, Cliff Young, Nishant Patil, et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, 1-12.
[19] Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images, Ph.D. Dissertation. University of Toronto.
[20] Yizhi Liu, Yao Wang, Ruofei Yu, et al. 2019. Optimizing CNN Model Inference on CPUs.
In 2019 USENIX Annual Technical Conference (USENIX ATC 19). USENIX Association, 1025-1040.
[21] R. Timothy Marler and Jasbir S. Arora. 2005. Function-transformation methods for multi-objective optimization. Engineering Optimization 37, 6 (2005), 551-570.
[22] Aditya Rawal and Risto Miikkulainen. 2018. From Nodes to Networks: Evolving Recurrent Neural Networks. arXiv e-prints (Mar 2018). arXiv:1803.04439
[23] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2018. Regularized Evolution for Image Classifier Architecture Search. arXiv e-prints (Feb 2018). arXiv:1802.01548
[24] Ahmad Shawahna, Sadiq M. Sait, and Aiman El-Maleh. 2019. FPGA-Based Accelerators of Deep Learning Networks for Learning and Classification: A Review. IEEE Access 7 (2019), 7823-7859.
[25] David R. So, Chen Liang, and Quoc V. Le. 2019. The Evolved Transformer. arXiv e-prints (Jan 2019). arXiv:1901.11117 [26] Naveen Suda, Vikas Chandra, Ganesh Dasika, et al. 2016. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '16). ACM, 16-25.
[27] Christian Szegedy, Wei Liu, Yangqing Jia, et al. 2014. Going Deeper with Convolutions. arXiv e-prints (Sep 2014). arXiv:1409.4842
[28] Mingxing Tan, Bo Chen, Ruoming Pang, et al. 2018. MnasNet: Platform-Aware Neural Architecture Search for Mobile. arXiv e-prints (Jul 2018). arXiv:1807.11626
[29] Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 6105-6114.
[30] Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, et al. 2017. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, 65-74.
[31] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, et al. 2018. FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search. arXiv e-prints (Dec 2018). arXiv:1812.03443
[32] Chris Ying, Aaron Klein, Esteban Real, et al. 2019. NAS-Bench-101: Towards Reproducible Neural Architecture Search. arXiv e-prints (Feb 2019). arXiv:1902.09635
[33] Barret Zoph and Quoc V. Le. 2016. Neural Architecture Search with Reinforcement Learning. arXiv e-prints (Nov 2016). arXiv:1611.01578

Claims (25)

  1. A method of designing accelerator hardware for implementing a parametrizable algorithm, the method comprising: selecting a first paired accelerator hardware and parametrizable algorithm; evaluating accuracy and efficiency metrics for implementing the parametrizable algorithm on the accelerator hardware; obtaining a reward value for the first paired accelerator hardware and parametrizable algorithm using the evaluated metrics; using the obtained reward value to select a second paired accelerator hardware and parametrizable algorithm; and repeating the evaluating, obtaining and using steps until a final paired accelerator hardware and parametrizable algorithm are selected.
  2. 2. The method of claim 1, wherein the efficiency metrics include at least one of latency, area and power of the accelerator hardware when implementing the parametrizable algorithm on the accelerator hardware
  3. 3. The method of any preceding claim, wherein obtaining a reward value comprises obtaining a weighted sum of the evaluated metrics.
  4. 4. The method of claim 3, comprising normalizing the evaluated metrics before obtaining the weighted sum.
  5. 5. The method of claim 3 or claim 4, wherein obtaining a reward value comprises a reward function which uses threshold values of the evaluated metrics to constrain the reward function.
  6. 6. The method of any one of claims 3 to 5, when dependent on claim 2, wherein obtaining a reward value comprises obtaining a weighted sum of area, latency and accuracy.
  7. 7. The method of any preceding claim, wherein selecting the paired accelerator hardware and parametrizable algorithm comprises selecting an FPGA from an FPGA sub-search space comprising a plurality of FPGAs having at least one configurable parameter.
  8. 8. The method of claim 7, wherein the at least one configurable parameter is selected from: one or more parallelisation parameters, one or more buffer depths, a memory interface width parameter, a pooling engine usage parameter and a convolution engine ratio parameter.
  9. 9. The method of claim 8, wherein selecting the FPGA comprises deciding a value for each configurable parameter and obtaining a probability of selecting the FPGA having the decided value for each configurable parameter.
  10. 10. The method of any preceding claim, wherein the parametrizable algorithm is a neural network.
  11. 11. The method of claim 10, wherein the neural network is a convolution neural network, CNN.
  12. 12. The method of claim 11, wherein selecting the paired accelerator hardware and parametrizable algorithm comprises selecting a CNN from a CNN sub-search space comprising a plurality of CNNs having at least one configurable parameter.
  13. 13. The method of claim 12, wherein the at least one configurable parameter is selected from: the operations and the connections within the CNN.
  14. 14. The method of claim 12 or claim 13, wherein selecting the CNN comprises deciding a value for each configurable parameter and obtaining a probability of selecting the CNN having the decided value for each configurable parameter.
  15. 15. The method of any one of the preceding claims, wherein selecting the paired accelerator hardware and parametrizable algorithm comprises searching a search space which is a Cartesian product of a parametrizable algorithm sub-search space comprising a plurality of parametrizable algorithms and an accelerator hardware sub-search space comprising a plurality of accelerator hardwares.
  16. 16. The method of any one of claims 1 to 14, wherein selecting the paired accelerator hardware and parametrizable algorithm comprises an algorithm search phase in which a parametrizable algorithm sub-search space comprising a plurality of parametrizable algorithms is searched to select a parametrizable algorithm and an accelerator hardware search phase in which an accelerator hardware sub-search space comprising a plurality of accelerator hardware is searched to select an accelerator hardware and interleaving the algorithm search phase and the accelerator hardware sub-search phases.
  17. 17. The method of claim 16, further comprising assessing, after the algorithm search phase, whether the selected parametrizable algorithm meets hardware criteria before implementing the accelerator hardware search phase.
  18. 18. The method of claim 17, wherein assessing whether the selected parametrizable algorithm meets hardware criteria may comprise using a model to predict hardware performance of the selected parametrizable algorithm on a target accelerator hardware and comparing the predicted hardware performance to the hardware criteria.
  19. 19. The method of claim 18, further comprising determining a level of confidence in the predicted hardware performance.
  20. 20. The method of claim 17 or claim 18, further comprising updating the model by comparing the predicted hardware performance with measured hardware performance.
  21. 21. A non-transitory data carrier carrying processor control code which when implemented on a system causes the system to carry out the method of any preceding claim.
  22. 22. A system for designing an accelerator hardware for implementing a parametrizable algorithm, the system comprising: a controller which is configured to select a first paired accelerator hardware and parametrizable algorithm and an evaluator which is configured to evaluate accuracy and efficiency metrics for implementing the parametrizable algorithm on the accelerator hardware, obtain a reward value for the first paired accelerator hardware and parametrizable algorithm using the evaluated metrics; send the obtained reward value to the controller to select a second paired accelerator hardware and parametrizable algorithm, and repeat the evaluating, obtaining and sending steps until a final paired accelerator hardware and parametrizable algorithm is selected by the controller.
  23. 23. A system for designing a parametrizable algorithm, the system comprising a controller which is configured to select a parametrizable algorithm, a cut-off module which is configured to determine whether the selected parametrizable algorithm meets hardware criteria and an evaluator which is configured to obtain a reward value for the selected parametrizable algorithm when the hardware criteria are met.
  24. 24. The system of claim 23, further comprising a model module which is configured to use a model to predict hardware performance of the selected parametrizable algorithm and a discriminator module which is configured to determine a confidence level for the predicted hardware performance.
  25. 25. The system of claim 23 or claim 24, further comprising a deployment module which is configured to deploy the selected parametrizable algorithm to a hardware module to measure hardware performance.
GB1913353.7A 2019-09-16 2019-09-16 Method for designing accelerator hardware Active GB2587032B (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
GB1913353.7A GB2587032B (en) 2019-09-16 2019-09-16 Method for designing accelerator hardware
KR1020200034093A KR20210032266A (en) 2019-09-16 2020-03-19 Electronic device and Method for controlling the electronic device thereof
PCT/KR2020/010741 WO2021054614A1 (en) 2019-09-16 2020-08-13 Electronic device and method for controlling the electronic device thereof
CN202080052984.1A CN114144794A (en) 2019-09-16 2020-08-13 Electronic device and method for controlling electronic device
EP20865553.0A EP3966747A4 (en) 2019-09-16 2020-08-13 Electronic device and method for controlling the electronic device thereof
US17/015,724 US20210081763A1 (en) 2019-09-16 2020-09-09 Electronic device and method for controlling the electronic device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1913353.7A GB2587032B (en) 2019-09-16 2019-09-16 Method for designing accelerator hardware

Publications (3)

Publication Number Publication Date
GB201913353D0 GB201913353D0 (en) 2019-10-30
GB2587032A true GB2587032A (en) 2021-03-17
GB2587032B GB2587032B (en) 2022-03-16

Family

ID=68315417

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1913353.7A Active GB2587032B (en) 2019-09-16 2019-09-16 Method for designing accelerator hardware

Country Status (2)

Country Link
KR (1) KR20210032266A (en)
GB (1) GB2587032B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325327B (en) * 2020-03-06 2022-03-08 四川九洲电器集团有限责任公司 Universal convolution neural network operation architecture based on embedded platform and use method
CN111652365B (en) * 2020-04-30 2022-05-17 哈尔滨工业大学 Hardware architecture for accelerating Deep Q-Network algorithm and design space exploration method thereof
CN112613598B (en) * 2020-12-10 2023-04-07 上海交通大学 FPGA simulation-based resistive neural network accelerator evaluation method
CN112507197B (en) * 2020-12-18 2024-01-19 北京百度网讯科技有限公司 Model searching method, device, electronic equipment, storage medium and program product
CN113065639B (en) * 2021-03-08 2023-06-13 深圳云天励飞技术股份有限公司 Operator fusion method, system, equipment and storage medium
CN113033784A (en) * 2021-04-18 2021-06-25 沈阳雅译网络技术有限公司 Method for searching neural network structure for CPU and GPU equipment
CN113780542B (en) * 2021-09-08 2023-09-12 北京航空航天大学杭州创新研究院 Method for constructing multi-target network structure facing FPGA
CN115099393B (en) * 2022-08-22 2023-04-07 荣耀终端有限公司 Neural network structure searching method and related device
KR102531646B1 (en) * 2022-11-14 2023-05-12 주식회사 마키나락스 Method for controlling air conditioning device based on delayed reward

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang et al.; "When Neural Architecture Search Meets Hardware Implementation: from Hardware Awareness to Co-Design", published during 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 15-17 July 2019 *

Also Published As

Publication number Publication date
KR20210032266A (en) 2021-03-24
GB2587032B (en) 2022-03-16
GB201913353D0 (en) 2019-10-30

Similar Documents

Publication Publication Date Title
GB2587032A (en) Method for designing accelerator hardware
US20210081763A1 (en) Electronic device and method for controlling the electronic device thereof
Abdelfattah et al. Best of both worlds: Automl codesign of a cnn and its hardware accelerator
Nardi et al. Practical design space exploration
Ma et al. An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks
Sun et al. Correlated multi-objective multi-fidelity optimization for HLS directives design
Zhang et al. When neural architecture search meets hardware implementation: from hardware awareness to co-design
Parsa et al. Pabo: Pseudo agent-based multi-objective bayesian hyperparameter optimization for efficient neural accelerator design
Sun et al. Power-driven DNN dataflow optimization on FPGA
Zhang et al. Dna: Differentiable network-accelerator co-search
Eleuldj Survey of deep learning neural networks implementation on FPGAs
Fan et al. Algorithm and hardware co-design for reconfigurable cnn accelerator
Vo et al. A deep learning accelerator based on a streaming architecture for binary neural networks
Shahshahani et al. A framework for modeling, optimizing, and implementing dnns on fpga using hls
Liu et al. A surrogate model assisted evolutionary algorithm for computationally expensive design optimization problems with discrete variables
Sudholt Parametrization and balancing local and global search
Prakash et al. SmartSplit: Latency-energy-memory optimisation for CNN splitting on smartphone environment
Sun et al. Fast and efficient DNN deployment via deep Gaussian transfer learning
Zhuang et al. Towards high-quality CGRA mapping with graph neural networks and reinforcement learning
Xia et al. PAI-FCNN: FPGA based inference system for complex CNN models
Chang et al. Reinforcement Learning Approach for Mapping Applications to Dataflow-Based Coarse-Grained Reconfigurable Array
Hoffmann et al. Fitness threshold accepting over extremal optimization ranks
Akhauri et al. Rhnas: Realizable hardware and neural architecture search
Rahimifar et al. A survey of machine learning to fpga tool-flows for instrumentation
Lou et al. NAF: Deeper Network/Accelerator Co-Exploration for Customizing CNNs on FPGA