CN111199509B - Method and apparatus for neural networks - Google Patents

Method and apparatus for neural networks

Info

Publication number
CN111199509B
CN111199509B (application CN201910490067.0A)
Authority
CN
China
Prior art keywords
operators
accelerator
processing engines
processing
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910490067.0A
Other languages
Chinese (zh)
Other versions
CN111199509A (en)
Inventor
S. Wang
Wei Tong
Shuqing Zeng
R. L. Millett
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GM Global Technology Operations LLC
Original Assignee
GM Global Technology Operations LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GM Global Technology Operations LLC filed Critical GM Global Technology Operations LLC
Publication of CN111199509A publication Critical patent/CN111199509A/en
Application granted granted Critical
Publication of CN111199509B publication Critical patent/CN111199509B/en
Legal status: Active

Classifications

    • G PHYSICS
        • G05 CONTROLLING; REGULATING
            • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
                • G05D 1/00 Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
                    • G05D 1/02 Control of position or course in two dimensions
                        • G05D 1/021 Control of position or course in two dimensions specially adapted to land vehicles
                            • G05D 1/0231 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
                                • G05D 1/0246 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
                    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
                        • G06F 3/0601 Interfaces specially adapted for storage systems
                            • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
                                • G06F 3/0604 Improving or facilitating administration, e.g. storage management
                            • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
                                • G06F 3/0629 Configuration or reconfiguration of storage systems
                                    • G06F 3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
                            • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
                                • G06F 3/0671 In-line storage system
                                    • G06F 3/0673 Single storage device
                • G06F 9/00 Arrangements for program control, e.g. control units
                    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
                        • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
                • G06F 18/00 Pattern recognition
                    • G06F 18/20 Analysing
                        • G06F 18/24 Classification techniques
                            • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
                                • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                        • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
                            • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
                        • G06N 3/08 Learning methods
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 1/00 General purpose image data processing
                    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 10/00 Arrangements for image or video recognition or understanding
                    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
                        • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
                        • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
                    • G06V 10/94 Hardware or software architectures specially adapted for image or video understanding
                        • G06V 10/955 Hardware or software architectures specially adapted for image or video understanding using specific electronic processors
                    • G06V 10/96 Management of image or video recognition tasks

Abstract

The invention provides a method and apparatus for neural networks. A signal processing system includes a Central Processing Unit (CPU) in communication with an accelerator and an instruction scheduler in communication with the accelerator. A first memory device includes a first instruction set configured to operate the accelerator and a second instruction set configured to operate the CPU, and a second memory device is configured to receive a data file. The accelerator includes a plurality of Processing Engines (PEs), the first instruction set includes a plurality of operators, and the instruction scheduler is configured to implement the operators in the accelerator with the PEs. The CPU analyzes the data file to extract features therefrom using the operators implemented in the accelerator.

Description

Method and apparatus for neural networks
Background
An artificial neural network is a computational framework that employs multiple machine learning algorithms together to process complex data files (e.g., visual images, audio files, etc.). Processor configurations for implementing artificial neural networks may have suboptimal performance. The performance of a computer or processor may be assessed in terms of instruction execution rate or throughput, which may be expressed in Millions of Instructions Per Second (MIPS), clock speed, bus size, resource utilization, memory size, latency, bandwidth, etc.
The artificial neural network includes an input layer, one or more hidden layers, and an output layer. Some embodiments of hardware that may be configured to execute a neural network may define a fixed configuration for each layer. This arrangement may be suboptimal because layers may have different degrees of parallelism, which are best served by different numbers of Processing Engines (PEs). Over-provisioning PEs can result in increased cost, reduced reliability, and excessive power consumption.
The hardware configuration for implementing the neural network may include a Central Processing Unit (CPU) that operates with an accelerator to process image files or other data captured on a memory device. Accelerators may utilize specialized hardware in the form of General-Purpose computing on Graphics Processing Units (GPGPUs), multi-core processors, Field Programmable Gate Arrays (FPGAs), and Application Specific Integrated Circuits (ASICs).
One embodiment of a neural network is a Convolutional Neural Network (CNN), which has proven to be an effective tool for performing image recognition, detection, and retrieval. CNNs may be scaled up and configured to support the large labeled data sets required by the learning process. Under these conditions, CNNs have been found to be successful in learning complex and robust image features. A CNN is a feed-forward artificial neural network in which individual neurons are tiled in such a way that they respond to overlapping regions in the field of view.
Disclosure of Invention
A signal processing system includes a Central Processing Unit (CPU) in communication with an accelerator and an instruction scheduler in communication with the accelerator. The first memory device includes a first instruction set configured to operate the accelerator and a second instruction set configured to operate the CPU, and the second memory device is configured to receive the data file. The accelerator includes a plurality of Processing Engines (PEs), the first instruction set includes a plurality of operators, and the instruction scheduler is configured to implement the operators in the accelerator with the PEs. The CPU uses the operators implemented in the accelerator to analyze the data file to extract features therefrom.
An aspect of the disclosure includes the data file being a bitmap image file of a field of view captured by the camera, wherein the CPU employs an accelerator to extract features from the bitmap image file.
Another aspect of the disclosure includes the signal processing system being in communication with a control system configured to perform a control action, wherein the control system is configured to perform the control action based on a feature extracted from the bitmap image file.
Another aspect of the disclosure includes the operators including a combinable rectified linear unit (ReLU) operator implemented by one of the PEs.
Another aspect of the disclosure includes operators including combinable pooling operators implemented by one of the PEs.
Another aspect of the disclosure includes that the combinable pooling operator is a max pooling operator.
Another aspect of the disclosure includes that the combinable pooling operator is an average pooling operator.
Another aspect of the disclosure includes an instruction scheduler configured to implement operators in an accelerator with PEs, including being configured to implement a single one of the PEs to process a single data kernel applied to a single input feature to implement a single output feature.
Another aspect of the disclosure includes the instruction scheduler being configured to implement multiple replicated ones of the PEs arranged in parallel to process a data kernel tile.
Another aspect of the disclosure includes the instruction scheduler being configured to implement multiple groups of replicated PEs arranged in parallel to process multiple tiles of a data kernel.
Another aspect of the present disclosure includes the operator implemented in the accelerator being a combinable rectified linear unit (ReLU) operator.
Another aspect of the present disclosure includes a first merging arrangement that includes a processing unit that processes a single data tile to effect a convolution operation in tandem with a ReLU operator.
Another aspect of the disclosure includes the operator implemented in the accelerator being a second merging arrangement that includes a processing unit that processes the plurality of data tiles to effect a convolution operation arranged in series with a processing engine configured to effect a pooling operation.
Another aspect of the disclosure includes a third merging arrangement including a first processing unit configured to process a plurality of data tiles to effect a convolution operation arranged in series with a second processing unit configured to process the plurality of data tiles to effect a convolution operation, the third merging arrangement including an intermediate data buffer.
Another aspect of the present disclosure includes a vehicle control system for a vehicle, the vehicle control system comprising: a camera arranged to capture a field of view adjacent thereto; a control system configured to control operation of an actuator of the vehicle; and a signal processing system including a controller, an accelerator, and a memory device, wherein the accelerator includes a plurality of Processing Engines (PEs). The signal processing system is in communication with the camera and the control system. The controller includes an instruction set. The controller is operable to capture an image of the field of view via the camera, wherein the image is comprised of a bitmap image file. The bitmap image file is transferred to the memory device and the accelerator implements a plurality of operators, wherein the plurality of operators are derived from the instruction set. The controller executes a plurality of operators to extract features from the bitmap image file, and the control system controls operation of an actuator of the vehicle based on the features extracted from the bitmap image file.
The above features and advantages and other features and advantages of the present teachings will be readily apparent from the following detailed description of the best modes and other embodiments for carrying out the present teachings when taken in connection with the accompanying drawings.
Drawings
One or more embodiments will now be described, by way of example, with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates a signal processing system including a Central Processing Unit (CPU) configured to execute an artificial neural network and an accelerator, wherein the signal processing system is an element of a vehicle control system including a sensing system and a control system, according to the present disclosure;
FIG. 2 schematically illustrates a portion of an instruction set that may be implemented in an accelerator using a single Processing Engine (PE) in accordance with the present disclosure;
FIG. 3-1 schematically illustrates data input and associated filters in the form of three bitmap image files, each of which is structured as a plurality of pixels arranged in an x-y matrix, according to the present disclosure;
FIG. 3-2 schematically illustrates a process of feature detection via a processing unit employing a single processing engine to process a single data kernel (PE-K unit) employing a K x K filter applied to a single Input Feature (IF) for each pixel of the bitmap image file of FIG. 3-1 to implement a single Output Feature (OF), in accordance with the present disclosure;
FIGS. 3-3 schematically illustrate a non-limiting embodiment of a process for feature detection via a processing unit that processes a data tile containing m K x K data kernels (PE-T unit), wherein the PE-T unit is made up of m duplicate PE-K units, according to the present disclosure;
FIGS. 3-4 schematically illustrate a non-limiting embodiment of a process for feature detection via a processing unit that processes a plurality of data tiles, each tile containing m K x K data kernels (PE-W unit), wherein the PE-W unit is made up of n duplicate PE-T units, according to the present disclosure;
FIG. 4-1 schematically illustrates a first configuration according to the present disclosure, wherein the input memory includes a first kernel having a 3 x 3 pixel size;
FIG. 4-2 schematically illustrates a second configuration according to the present disclosure, wherein the input memory includes a second kernel having a 4 x 4 pixel size;
FIGS. 4-3 schematically illustrate a third configuration according to the present disclosure, wherein the input memory includes a third kernel having a 5 x 5 pixel size;
FIG. 5 schematically illustrates a hardware-based processing unit for a combinable rectified linear unit (ReLU) operator that accelerates execution of a ReLU calculation by being implemented in hardware, in accordance with the present disclosure;
FIG. 6 schematically illustrates a hardware-based processing unit for combinable pooling operators that accelerate execution of pooling computations (e.g., maxpool, avgpool) by being implemented in hardware in accordance with the present disclosure;
FIG. 7 schematically illustrates a first merging arrangement according to the present disclosure, including arranging an embodiment of PE-T units (shown with reference to FIGS. 3-3) in series with a ReLU operator (shown with reference to FIG. 5) to effect a convolution operation, and then a ReLU operation to effect a result from an input;
FIG. 8 schematically illustrates a second merging arrangement according to the present disclosure, including arranging an embodiment of PE-W units (shown with reference to FIGS. 3-4) in series with PE-P operators (shown with reference to FIG. 6) to effect a convolution operation, and then a pooling operation to achieve a result from the input;
FIG. 9 schematically illustrates a third merging arrangement according to the present disclosure, including arranging a first PE-W (i) unit (shown with reference to FIGS. 3-4) in series with a second PE-W (i+1) unit (shown with reference to FIGS. 3-4), wherein an intermediate data buffer is used to store intermediate results;
FIG. 10 schematically illustrates an x-y matrix of pixels of a bitmap image divided into overlapping subsets according to the present disclosure; and
Fig. 11 schematically illustrates a flow chart for advantageously filling a memory buffer for processing by elements of the signal processing system described herein, in accordance with the present disclosure.
It should be understood that the drawings are not necessarily to scale and present a somewhat simplified representation of the various preferred features of the disclosure disclosed herein, including, for example, specific dimensions, orientations, positions and shapes. Details regarding these features will be determined in part by the particular intended application and use environment.
Detailed Description
As described and illustrated herein, the components of the disclosed embodiments can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description is not intended to limit the scope of the disclosure, as claimed, but is merely representative of possible embodiments thereof. In addition, while numerous specific details are set forth in the following description in order to provide a thorough understanding of the embodiments disclosed herein, some embodiments may be practiced without some of these details. In addition, for the sake of clarity, certain technical material that is known in the related art has not been described in detail so as not to unnecessarily obscure the present disclosure. Furthermore, the drawings are in simplified form and are not to precise scale. Furthermore, as shown and described herein, the present disclosure may be practiced in the absence of elements not specifically disclosed herein.
Referring now to the drawings wherein the showings are for the purpose of illustrating certain exemplary embodiments only and not for the purpose of limiting the same, FIG. 1 schematically illustrates a signal processing system 50 including a Central Processing Unit (CPU) 52 configured to execute an artificial neural network 40 and an accelerator 60, wherein the signal processing system 50 is an element of a vehicle control system 10 including a sensing system 15 and a vehicle controller 20. The sensing system 15 captures data that can be transferred to the signal processing system 50. The signal processing system 50 may be configured to extract features from the captured data and communicate the features to the vehicle controller 20, which may be configured to perform control actions based thereon.
In one embodiment, and as described herein, the vehicle control system 10, the sensing system 15, the signal processing system 50, and the vehicle controller 20 are disposed on a vehicle. In one embodiment, the vehicle controller 20 includes an autonomous control system including, for example, one or more actuators for an adaptive cruise control system, an autonomous steering control system, an autonomous braking control system, and the like. In one embodiment, the sensing system 15 may be a digital camera that dynamically captures a bitmap image file 16 of a field of view 18 of the digital camera. In one embodiment, the signal processing system 50 may be configured to dynamically process the signal file to extract one or more features from the bitmap image file 16 of the field of view 18 of the digital camera, which may include detecting obstacles in the path of travel of the vehicle. In one embodiment, the vehicle controller 20 may be configured to perform a control action in response to the extracted features, such as implementing autonomous braking control, autonomous speed control, or autonomous steering control based on a detected obstacle in the vehicle travel path. Vehicles may include, but are not limited to, mobile platforms in the form of commercial vehicles, industrial vehicles, agricultural vehicles, passenger vehicles, aircraft, watercraft, trains, ATVs, personal mobility devices, robots, and the like, to achieve the objects of the present disclosure. It should be appreciated that the specific implementation described above is a non-limiting illustration of one embodiment. The concepts described herein are applicable to various embodiments of the signal processing system 50 as described herein.
The signal processing system 50 includes a CPU 52, a first memory device 54, a Random Access Memory (RAM) device (e.g., dynamic RAM (DRAM) device 56), and an accelerator 60, all of which are interconnected and communicate via a communication bus 58.
The first memory device 54 includes algorithmic code in the form of a first instruction set 51, a portion of which is accessible as a CPU instruction set 53 to control the operation of the CPU 52.
The first memory device 54 also includes algorithmic code in the form of a second instruction set 57, all or a portion of which is in the form of an accelerator instruction set 55 that can be accessed to dynamically control the configuration of the accelerator 60.
The CPU 52 and accelerator 60 execute their respective CPU and accelerator instruction sets 53, 55, respectively, to extract features from the bitmap image file 16, which can be transferred to the vehicle controller 20.
In one embodiment, accelerator 60 is configured as a multi-layered array of multiple Processing Engines (PEs) 70 arranged to effect fast dense matrix-matrix operations and matrix-vector operations.
Accelerator 60 may be configured as a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), or another integrated circuit configured to support high-speed, repetitive, digitally-intensive data processing tasks.
Accelerator 60 includes a PE array 70, an instruction set fetcher 72, an instruction scheduler 74, a global memory buffer 76, and a plurality of data caches 78.
The second instruction set 57 includes the accelerator instruction set 55 for dynamically configuring accelerator 60.
Advantageously, the second instruction set 57 and instruction scheduler 74 configure the PEs of accelerator 60 to match the computational requirements. Common operators are configured in a unified format, wherein operators are implemented in a flexible hardware configuration. A mechanism for merging layers reduces data transfer, in which instruction scheduler 74 operates to populate inputs and manage the activation of instructions.
Each bitmap image file 16 is stored in DRAM 56.
CPU 52 executes a CPU instruction set 53 that includes processing data in the form of a subset of data from bitmap image files 16 stored in DRAM 56 using accelerator 60 to extract one or more features.
The artificial neural network 40 comprises a plurality of layers that may be executed as stages on the accelerator 60. Each image is processed only once per stage. In this embodiment, the structure of the artificial neural network 40 for a certain application is fixed.
However, depending on the structure of the artificial neural network 40, the accelerator 60 may be reconfigured with different combinations of layers. For example, if a layer L is a convolution operation (conv) followed by a layer L+1 with a ReLU operation, the scheduler connects PE_W and PE_R together. If layer L is a convolution operation (conv) followed by a layer L+1 that is also a convolution operation (conv), the scheduler connects PE_W in line with PE_W to form a super layer (PE_S), as sketched below.
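As a non-limiting illustration, this layer-merging rule may be sketched as follows in Python; the operator names and merge labels are assumptions layered on the PE_W/PE_R/PE_S notation above, not the patent's actual scheduler code.

def merge_layers(layer_ops):
    # conv+relu fuses into PE_W+PE_R; conv+conv chains into a super layer PE_S
    merged, i = [], 0
    while i < len(layer_ops):
        cur = layer_ops[i]
        nxt = layer_ops[i + 1] if i + 1 < len(layer_ops) else None
        if cur == "conv" and nxt == "relu":
            merged.append("PE_W+PE_R")   # conv feeding ReLU directly
            i += 2
        elif cur == "conv" and nxt == "conv":
            merged.append("PE_S")        # conv chained to conv: super layer
            i += 2
        else:
            merged.append(cur)
            i += 1
    return merged

print(merge_layers(["conv", "relu", "conv", "conv", "maxpool"]))
# prints ['PE_W+PE_R', 'PE_S', 'maxpool']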
A set of operators is defined to support dynamic reconfiguration of the Processing Engine (PE) based on calculations at different layers in the artificial neural network 40 at training. Operators may be implemented as instruction sets on hardware (FPGA, DSP, and/or ASIC) to achieve high performance and low power consumption.
The reconfiguration involves only the number of active units in the computation, which can be achieved by feeding data into the ports to activate the corresponding units. For PE_W/PE_K, reconfiguration may be avoided.
When the instruction set includes Convolutional Neural Network (CNN) operations, the instruction set may include the following. An example of a portion of an instruction set that may be implemented as hardware in accelerator 60 is shown with reference to fig. 2. The set of convolutional codes may be as shown in table 1 below.
TABLE 1
The procedure is as follows: conv(&input, i_size, &kernel, k_size, &output, o_size, stride)
The instructions are:
ld reg1, input_add;
ld reg2, i_size;
ld reg3, kernel_add;
ld reg4, k_size;
ld reg5, output_add;
ld reg6, o_size;
ld reg7, stride;
conv reg1, reg2, reg3, reg4, reg5, reg6, reg7
The convolutional code set of Table 1 may be implemented in accelerator 60 as an operation 200 (shown with reference to FIG. 2) comprising a single PE 220, a convolution instruction 202, and a plurality of data registers 210: reg1 211, reg2 212, reg3 213, reg4 214, reg5 215, reg6 216, and reg7 217. PE 220 executes the convolution instruction (conv) 202, applying the stride contained in reg7 217 to the input memory 204 and the kernel memory 206. Input memory 204 is obtained from reg1 211 and reg2 212, and kernel memory 206 is obtained from reg3 213 and reg4 214. The contents of the memory data in reg1 211 and reg2 212 originate from the bitmap image file 16. As a result of the convolution process, PE 220 generates output memory 208, which is transferred to reg5 215 and reg6 216. The hardware may be DSP elements, ALU (arithmetic logic unit) elements, or LUT (look-up table) elements executing on an ASIC (application-specific integrated circuit) chip or an FPGA chip.
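By way of a non-limiting illustration, the following Python sketch approximates the computation the conv instruction performs once the seven registers are loaded; the function and its plain-list data representation are stand-ins for the hardware datapath, not the patented implementation.

def conv(inp, i_size, kernel, k_size, stride):
    # 2-D valid convolution of an i_size x i_size input with a
    # k_size x k_size kernel; o_size mirrors reg6 in Table 1
    o_size = (i_size - k_size) // stride + 1
    out = [[0.0] * o_size for _ in range(o_size)]
    for oy in range(o_size):
        for ox in range(o_size):
            acc = 0.0
            for ky in range(k_size):        # one multiply-add (MAC)
                for kx in range(k_size):    # per kernel weight
                    acc += inp[oy * stride + ky][ox * stride + kx] * kernel[ky][kx]
            out[oy][ox] = acc
    return out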
This concept is described with reference to the convolution (conv) instruction. Other neural network operators that may be implemented in hardware in a similar manner include deconvolution (deconv) operators, rectified linear unit (ReLU) operators, average pooling (avgpool) operators, and max pooling (maxpool) operators.
When implemented as operation 200 in accelerator 60, the set of convolutional codes shown in Table 1 may advantageously be executed in a single clock cycle of CPU 52, thus reducing computational effort (shown with reference to Table 1) as compared to comparable software execution, which may involve eight steps and eight related clock cycles in one embodiment.
In addition to the convolutions that may be employed in CNN operations, other operators may include, for example, max pooling and average pooling operators, rectified linear unit (ReLU) operators, concatenation operators, and the like. Each of these operators may be implemented as hardware in accelerator 60 in a manner similar to that already described to accelerate data processing of bitmap image file 16 for feature extraction.
To improve the parallelism of the convolution process, the data of each bitmap image file is tiled to fit the available resources, and the processing loops are unrolled to achieve data parallelism. Unrolling a loop parallelizes its execution so that a number of iterations run concurrently on the hardware; the number of parallel iterations on the hardware is called the unroll factor.
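As a non-limiting illustration of loop unrolling, the following Python sketch contrasts a rolled multiply-accumulate loop with the same loop unrolled by an assumed factor of 4; the four independent accumulators stand in for iterations a hardware implementation would execute in parallel.

def dot_rolled(a, b):
    # baseline: one multiply-add per iteration
    acc = 0.0
    for i in range(len(a)):
        acc += a[i] * b[i]
    return acc

def dot_unrolled4(a, b):
    # unroll factor 4: four multiply-adds per iteration, with independent
    # accumulators so the hardware can run them concurrently
    n = len(a)
    acc0 = acc1 = acc2 = acc3 = 0.0
    for i in range(0, n - n % 4, 4):
        acc0 += a[i] * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
    for i in range(n - n % 4, n):   # remainder iterations
        acc0 += a[i] * b[i]
    return acc0 + acc1 + acc2 + acc3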
FIG. 3-1 schematically shows data input in the form of three bitmap image files 316, designated portions of which include a first portion 317, a second portion 318, and a third portion 319. The first portion 317, the second portion 318, and the third portion 319 are also referred to as "tiles". The three bitmap image files 316 pass through the filter element 320 to obtain a plurality of filtered bitmap image file portions 326, each comprising a filtered first portion 327, a filtered second portion 328, and a filtered third portion 329. Each of the bitmap image files 316 is composed of a plurality of pixels arranged in an x-y matrix.
Each of the bitmap image files 316 is an m x m x r image, where m represents the height and width of the image and r is the number of channels; an RGB image file thus has three channels, i.e., r = 3. A convolutional layer has k filters (or kernels) of size n x n x q, where n is smaller than the dimension of the image and q may be equal to or smaller than the number of channels r and may vary for each kernel. The size of the filters gives rise to locally connected structures that are each convolved with the image to produce k feature maps of size (m - n + 1) x (m - n + 1). Each map is then subsampled over p x p contiguous regions with average or max pooling, where p ranges from 2 for small images to no greater than 5 for larger inputs. A CNN includes one or more convolution and subsampling layers, which may be followed by fully connected layers.
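As a non-limiting illustration, the layer-size arithmetic above may be sketched in Python using the symbols m, r, k, n, q, and p from the text; the example values are assumptions.

def cnn_layer_shapes(m, r, k, n, q, p):
    # m x m x r input, k filters of size n x n x q, p x p pooling
    assert n < m and q <= r
    conv_out = m - n + 1              # each feature map is (m - n + 1)^2
    pool_out = conv_out // p          # after p x p pooling
    return {"feature_maps": k,
            "conv_map_size": (conv_out, conv_out),
            "pooled_map_size": (pool_out, pool_out)}

print(cnn_layer_shapes(m=28, r=3, k=16, n=5, q=3, p=2))
# {'feature_maps': 16, 'conv_map_size': (24, 24), 'pooled_map_size': (12, 12)}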
A single one of the pixels is indicated by element 315. In one embodiment, and as described herein, each of the pixels has associated digital data (including a value indicative of an intensity level of each of red, green, and blue (RGB) colors) and one or more associated weighting factors.
FIG. 3-2 schematically illustrates a process for feature detection via a processing unit employing a single Processing Engine (PE) to process a single data kernel (PE-K unit) 322 employing a K x K filter applied to a single Input Feature (IF) 321 for each pixel of the bitmap image file 316 of FIG. 3-1 to implement a single Output Feature (OF) 323, in accordance with the present disclosure. The K x K filter operates on a pixel input 324 of IF 321 and an associated weight 325. Each pixel input 324 of IF 321 includes the magnitude of the value of one of the red, green, or blue portions of the respective pixel, and each weight 325 is a corresponding weight of the respective pixel. To maximize the number of PEs, the total number of Multiply-Add Circuits (MACs) is selected to be at least (M x N), where M is related to the input and N is related to the output. In one embodiment, the PE-K unit 322 is used as a building block to form a PE tile (PE-T) unit 330 to effect a convolution operation.
FIGS. 3-3 schematically show a non-limiting embodiment of a process for feature detection via the PE-T unit 330. Each PE-T unit 330 is made up of m duplicate PE-K units 322 that are used to parallelize m IFs 331 to generate a single OF 333. Each of the IFs 331 is a tile containing a single feature for a subset of the pixels of the bitmap image file of FIG. 3-1. Each of the IFs 331 includes the magnitude of the value of one feature (i.e., the intensity of one of the red, green, or blue portions of the respective pixel of an n x n matrix subset of pixels 334 of the bitmap image file of FIG. 3-1) and a corresponding n x n matrix subset of weights 335. PE-T unit 330 further includes a register 336 for storing intermediate results. When the number of IFs M is greater than m, the PE-T unit 330 is repeated M/m times.
FIGS. 3-4 schematically illustrate a non-limiting embodiment of a process for feature detection using a PE-W unit 340 consisting of n duplicate PE-T units 330. In one embodiment, PE-W unit 340 can be configured to perform convolution operations. The PE-W unit 340 is configured to parallelize m IFs 341 and corresponding weights 345 to generate n OFs 346. The n OFs 346 may be stored for further processing. The m and n terms represent scalar values determined based on memory size and data replication overhead.
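The PE-K/PE-T/PE-W hierarchy of FIGS. 3-2 through 3-4 may be sketched as follows; this is a non-limiting Python illustration in which pe_k performs one K x K multiply-accumulate reduction, pe_t sums m such reductions across input features, and pe_w replicates pe_t for n output features.

def pe_k(tile, weights):
    # one K x K kernel applied to one IF tile (PE-K unit 322)
    k = len(tile)
    return sum(tile[y][x] * weights[y][x] for y in range(k) for x in range(k))

def pe_t(if_tiles, weight_tiles):
    # m duplicate PE-K units in parallel -> one OF value (PE-T unit 330)
    return sum(pe_k(t, w) for t, w in zip(if_tiles, weight_tiles))

def pe_w(if_tiles, weight_sets):
    # n duplicate PE-T units in parallel -> n OF values (PE-W unit 340)
    return [pe_t(if_tiles, ws) for ws in weight_sets]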
Fig. 4-1, 4-2, and 4-3 schematically illustrate dynamic reconfiguration that allows efficient use of resources when different layers of accelerator 60 have different sizes and thus require different numbers of PEs. The solution is to generate a configuration driven by the size of the input kernel k.
The minimum kernel size k may be determined to achieve the lowest-granularity PE, which is designated as an mPE_k unit. When the maximum size of the filter is Kmax, the number of mPE_k units may be calculated as ⌈Kmax²/k²⌉.
All mPE k units are connected to a preset memory range in a memory buffer 76 (shown with reference to fig. 1) for input.
Selector logic may be added to activate additional mPE_k units only for PE_K units with a larger filter size (K > k), which may be configured by filling the K-input storage selector. Thus, given a PE_K unit having a K x K filter and input memory IK = K, the operation comprises activating ⌈K²/k²⌉ mPE_k units to implement the PE_K unit.
If K is not an integer multiple of k, the input is padded with the value 0 at initialization.
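A non-limiting Python sketch of the selector arithmetic, assuming the reconstructed ⌈K²/k²⌉ relationship above; with k = 3 it reproduces the activations shown in FIGS. 4-1, 4-2, and 4-3.

import math

def active_mpe_k_units(K, k=3):
    # number of minimal-granularity k x k units a K x K filter activates
    return math.ceil(K * K / (k * k))

for K in (3, 4, 5):
    print(K, active_mpe_k_units(K))   # 3 -> 1, 4 -> 2, 5 -> 3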
FIG. 4-1 schematically shows a first configuration in which the input memory 410 includes a first kernel 412 having a 3 x 3 pixel size. In this case, selector 414 activates only the first PE 441 of the mPE_k unit 440 to generate a first result 451.
FIG. 4-2 schematically illustrates a second configuration in which the input memory 410 includes a second kernel 422 having a 4 x 4 pixel size. In this case, selector 414 activates the first PE 441 and second PE 442 of the mPE_k unit 440 to generate a second result 452.
FIGS. 4-3 schematically illustrate a third configuration in which the input memory 410 includes a third kernel 432 having a 5 x 5 pixel size. In this case, the selector 414 activates the first, second, and third PEs 441, 442, and 443 of the mPE_k unit 440 to generate a third result 453. Pre-training the building blocks of the structured CNN facilitates training the CNN to quickly reach a desired level of accuracy.
Combinability is a system design principle that deals with the interrelation of the components. Highly combinable systems provide components that can be selected and assembled in various combinations to meet specific user requirements.
FIG. 5 schematically illustrates a hardware-based processing unit for a combinable ReLU operator 500 that accelerates execution of ReLU computations by being implemented in hardware (i.e., a layer of accelerator 60) and provides an interface compatible with outputs from other layers. An element-wise activation function, such as max(0, x), may be applied using the ReLU operator 500. The ReLU operator 500 increases the nonlinear properties of the decision function of the overall CNN without affecting the receptive fields of the convolutional layer.
The ReLU operator (PE-R) 500 includes a PE 502 that is comprised of an AND gate 504 and a selector gate 506. The ReLU operator 500 generates a signal output o (S) 520.
The inputs to selector gate 506 include data input B 510 and the constant 0 512. Inputs to AND gate 504 include input B 510 and high/low input 514. The selection is driven by the sign bit of input B 510. The selector gate 506 logic is as follows:
for B < 0: C = 1 and o(S) = 0; and
for B ≥ 0: C = 0 and o(S) = B.
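A non-limiting sketch of the PE-R selector logic in Python: the sign bit of B drives the select line C, steering either the constant 0 or B itself to the output o(S).

def pe_r(b):
    c = 1 if b < 0 else 0        # sign bit of input B drives the selector
    return 0 if c == 1 else b    # o(S) = 0 for negative B, otherwise B

assert pe_r(-3.5) == 0 and pe_r(2.0) == 2.0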
FIG. 6 schematically illustrates a hardware-based processing unit for a combinable pooling operator 600 that accelerates the execution of pooling computations (e.g., maxpool, avgpool) by being implemented in hardware (i.e., a layer of accelerator 60) and provides an interface compatible with outputs from other layers. Pooling is used to calculate the maximum or average of features over an area (e.g., over one of the data tiles 317, 318, 319 of the bitmap image file 316 described with reference to FIG. 3-1).
The pooling operator reduces the variance by calculating the maximum or average of a particular feature over a region of the image. This ensures that consistent results are obtained even with small translations of image features. This operation may be used for object classification and detection.
The pooling operator 600 includes a PE 602 made up of an operator 604, a buffer 606, and an enable gate 608. The inputs include input B 610 and an output enable signal 612, which is determined by the size of the pooling kernel. Reset gate 609 has an input 607 that is the constant 0. Operator 604 is a simple operator, such as a maximum operator or an average operator. While output enable signal 612 is low ("0"), input data (including input B 610) is repeatedly fed to operator 604, which selects the data accordingly and stores it in buffer 606. When output enable signal 612 transitions high ("1"), enable gate 608 is activated, the resulting maximum or average value is transferred to output line 620, and reset gate 609 is controlled to clear buffer 606.
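A non-limiting Python sketch of this streaming behavior; the enable cadence derived from the pooling kernel size is an assumption.

def pe_p(stream, pool_size, mode="max"):
    buffer, outputs = [], []
    for i, b in enumerate(stream):
        buffer.append(b)                    # enable low: keep buffering
        if (i + 1) % pool_size == 0:        # enable high: emit and reset
            outputs.append(max(buffer) if mode == "max"
                           else sum(buffer) / len(buffer))
            buffer.clear()                  # reset gate clears the buffer
    return outputs

print(pe_p([1, 5, 2, 4, 9, 3, 0, 7], pool_size=4))   # prints [5, 9]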
The concepts described with reference to FIGS. 4, 5, and 6 may be used as building blocks to reduce the number of layers and the associated computation, thereby avoiding off-chip operations. This includes merging layers by combining convolutions with non-linearities and pooling, and generating super layers by combining multiple convolutions.
FIG. 7 schematically illustrates a first merging unit 700 that includes arranging an embodiment of the PE-T unit 330 (shown with reference to FIGS. 3-3) in series with a ReLU operator 500 (shown with reference to FIG. 5) to effect a convolution operation and then a ReLU operation to achieve a result 520 from the IF 331.
Each data point output from the convolution operation of PE-T unit 330 is fed directly to ReLU operator 500 without the need for an intermediate storage device.
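A non-limiting sketch of this first merging unit, reusing the illustrative conv() sketch given earlier and applying the ReLU stage in tandem; the fusion into a single pass, with no intermediate store, is the point of the arrangement.

def fused_conv_relu(inp, i_size, kernel, k_size, stride=1):
    # each convolution output value passes straight through ReLU
    return [[max(0.0, v) for v in row]
            for row in conv(inp, i_size, kernel, k_size, stride)]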
FIG. 8 schematically illustrates a second merging unit 800 comprising arranging an embodiment of the PE-W unit 340 (shown with reference to FIGS. 3-4) in series with the PE-P operator 600 (shown with reference to FIG. 6) to effect a convolution operation and then a pooling operation to achieve a result 620 from the input 344.
Each data point 346 output from the operation of the PE-W unit 340 is fed directly to the corresponding pooling operator 600 without the need for an intermediate storage device.
FIG. 9 schematically shows a third merging unit 900 comprising a first PE-W (i) unit 340 (shown with reference to FIGS. 3-4) arranged in series with a second PE-W (i+1) unit 340' (shown with reference to FIGS. 3-4), wherein an intermediate data buffer 910 is used for storing intermediate results.
The data output from the first PE-W (i) unit 340 is input to the buffer 910 before being processed by the second PE-W (i+1) unit 340'.
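A non-limiting sketch of this third merging unit, again reusing the illustrative conv() sketch; the local variable plays the role of intermediate data buffer 910, and the kernels and sizes are assumptions.

def super_layer(inp, i_size, kern1, k1, kern2, k2, stride=1):
    buffered = conv(inp, i_size, kern1, k1, stride)     # intermediate buffer 910
    mid_size = (i_size - k1) // stride + 1
    return conv(buffered, mid_size, kern2, k2, stride)  # second PE-W stage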
FIG. 10 schematically shows an x-y matrix of pixels of a bitmap image 1000 divided into overlapping tiles, which are subsets of the matrix of pixels. The bitmap image file 1000 is a non-limiting illustration provided for describing the concepts herein. The bitmap image file 1000 includes tiles designated by elements 1, 2, 3, 4, 5, and 6. The bitmap image file 1000 is an 8 x 8 matrix of pixels, and each of tiles 1, 2, 3, 4, 5, and 6 is a 4 x 4 matrix of pixels, with each individual tile overlapping a horizontally adjacent tile and also overlapping a vertically adjacent tile. In this embodiment, the magnitude, or step, of the overlap is 2 pixels. Each of tiles 1, 2, 3, 4, 5, and 6 may be read into a corresponding intermediate data buffer 1020, which is a 4 x 4 matrix, to generate an output that is provided to an output data buffer 1030. This arrangement facilitates a single transfer of pixel data into the output data buffer 1030 for processing.
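A non-limiting Python sketch of overlapping tiling; note that a 2-pixel step in both directions over an 8 x 8 image yields nine 4 x 4 tiles, so the six-tile labeling of FIG. 10 reflects a particular traversal that is an assumption here.

def tiles(image, tile=4, step=2):
    # yield overlapping tile x tile subsets of a square image
    n = len(image)
    for y in range(0, n - tile + 1, step):
        for x in range(0, n - tile + 1, step):
            yield [row[x:x + tile] for row in image[y:y + tile]]

image = [[y * 8 + x for x in range(8)] for y in range(8)]
print(sum(1 for _ in tiles(image)))   # prints 9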
FIG. 11 schematically illustrates an instruction scheduler routine 1100 in the form of a flow chart that may be executed by the CPU 52 to facilitate initiation of downstream instructions and/or operations, including filling a memory buffer for processing by the elements of the signal processing system 50 described above. Table 2 is provided as a key corresponding to instruction scheduler routine 1100, with the numerically labeled blocks and corresponding functions described below. The present teachings may be described in terms of functional and/or logical block components and/or various processing steps. It should be appreciated that such block components may be comprised of hardware, software, and/or firmware components that have been configured to perform the specified functions.
TABLE 2
Execution of the instruction scheduler routine 1100 may proceed as follows. The steps of the instruction scheduler routine 1100 may be performed in a suitable order, and are not necessarily limited to the order described with reference to FIG. 11. As used herein, the term "1" means a positive answer or "yes" and the term "0" means a negative answer or "no". Routine 1100 starts (1102) and fetches an instruction (1104), then evaluates whether to combine the new instruction with previous instructions (1106). If so (1106)(1), a buffer is set (1105) and the next instruction is fetched (1104). If not (1106)(0), the input buffer is filled (1108), and the instruction is executed on the data contained in the buffer (1110). The input buffers are filled (1108) in a preset order (e.g., in the order shown by tiles 1, 2, 3, 4, 5, and 6 of FIG. 10). The routine then evaluates whether another instruction needs to be executed (1112); if so (1112)(1), the process repeats starting at step 1104. If not (1112)(0), the iteration ends (1114). This operation facilitates initiation of downstream instructions and/or operations, which may be activated by input data output from upstream instructions using a preset local storage area defined for the input data at initialization.
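A non-limiting Python sketch of the scheduler loop of FIG. 11, with the Table 2 block numbers as comments; the instruction objects and the combinable() predicate are illustrative stand-ins.

def run_scheduler(program, execute, combinable):
    pending = []                                   # preset local input storage
    for instr in program:                          # 1104: fetch instruction
        if pending and combinable(pending[-1], instr):
            pending.append(instr)                  # 1106 -> 1105: set buffer
        else:
            batch = pending + [instr]              # 1108: fill input buffers
            execute(batch)                         # 1110: execute on the data
            pending = []
    # 1112 -> 1114: no further instructions, iteration ends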
The term "controller" and related terms such as control module, control unit, processor, and the like refer to one or more combinations of Application Specific Integrated Circuits (ASICs), electronic circuits, central processing units (e.g., microprocessors), and associated non-transitory memory components in the form of memory and storage devices (read-only, programmable read-only, random access, hard drives, etc.). The non-transitory memory components may store machine-readable instructions in the form of one or more software or firmware programs or routines, combinational logic circuits, input/output circuits and devices, signal conditioning and buffering circuits, and other components that may be accessed by one or more processors to provide the described functionality. The input/output circuits and devices include analog/digital converters and associated devices that monitor inputs from the sensors, where such inputs are monitored at a preset sampling frequency or in response to triggering events. Software, firmware, programs, instructions, control routines, code, algorithms, and similar terms represent a set of controller-executable instructions including calibration and lookup tables. Each controller executes a control routine to provide the desired function. The routine may be executed periodically (e.g., every 100 microseconds during ongoing operation). Alternatively, the routine may be executed in response to the occurrence of a trigger event. Communication between controllers, actuators, and/or sensors may be implemented using direct wired point-to-point links, networked communication bus links, wireless links, or other suitable communication links. Communication includes exchanging data signals in a suitable form including, for example, exchanging electrical signals via a conductive medium, exchanging electromagnetic signals via air, exchanging optical signals via an optical waveguide, and the like. The data signals may include discrete signals, analog signals, or digitized analog signals representing the inputs from the sensors, the actuator commands, and the communication between the controllers. The term "signal" refers to a physically distinguishable indicator that conveys information and may be of a suitable waveform (e.g., electrical, optical, magnetic, mechanical, or electromagnetic) capable of traversing a medium, such as DC, AC, sine wave, triangular wave, square wave, vibration, or the like.
As used herein, the terms "dynamic" and "dynamically" describe steps or processes that are performed in real-time, and are characterized by monitoring or otherwise determining the state of a parameter and periodically or periodically updating the state of the parameter during execution of the routine or between iterations of the execution of the routine. A parameter is defined as a measurable characteristic representing a physical characteristic of a device or other element that is discernable using one or more sensors and/or physical models. The parameter may have a discrete value, such as "1" or "0", or may vary infinitely in value.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions. These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The detailed description and drawings are supporting and descriptive of the present teachings, but the scope of the present teachings is limited only by the claims. While certain of the best modes and other embodiments for carrying out the present teachings have been described in detail, various alternative designs and embodiments exist for practicing the present teachings as defined in the appended claims.

Claims (10)

1. A signal processing system, comprising:
a central processing unit CPU, which communicates with the accelerator;
an instruction scheduler in communication with the accelerator;
a first memory device including a first set of instructions configured to operate the accelerator and a second set of instructions configured to operate the CPU;
a second memory device configured to receive a data file; and
the accelerator is a multi-layer array configured to implement a Convolutional Neural Network (CNN) according to the first instruction set, the accelerator comprising a plurality of Processing Engines (PEs) defined according to processing engine instructions included within the first instruction set, each of the processing engines configured to implement one of a plurality of operators, the operator associated with each of the processing engines being fixed such that each processing engine cannot execute any other operator of the plurality of operators;
wherein the instruction scheduler is configured to selectively implement one or more of the plurality of operators in the accelerator by one of activating and deactivating each of the processing engines associated with an operator;
wherein the CPU analyzes the data file with the plurality of operators implemented in the accelerator to extract features from the data file;
wherein the CPU communicates the feature to a second controller;
wherein the first memory device includes a third set of instructions configured to operate the instruction scheduler, the third set of instructions defining a plurality of layers of the convolutional neural network configured to analyze the data file, the third set of instructions directing the instruction scheduler to one of activate and deactivate processing engines applicable to each layer such that the processing engines are activated and deactivated as needed for each layer;
wherein the third instruction set defines a core size for each of the plurality of layers, and wherein the instruction scheduler is configured to activate and deactivate each processing engine according to its associated core size;
wherein the data file comprises a plurality of bitmap images, and each layer is configured to analyze a plurality of tiles selected from the bitmap images; and is also provided with
Wherein the instruction scheduler is configured to activate and deactivate each processing engine for each layer according to the data size of each tile.
2. The signal processing system of claim 1, wherein the plurality of bitmap image files are obtained from a field of view captured by a camera.
3. The signal processing system of claim 2, wherein the signal processing system is in communication with a control system arranged to perform control actions, and
wherein the control system is arranged to perform the control action based on the feature extracted from the bitmap image file.
4. The signal processing system of claim 1, wherein the plurality of operators comprises a combinable rectified linear unit (ReLU) operator implemented by one of the processing engines.
5. The signal processing system of claim 1, wherein the plurality of operators comprises combinable pooling operators implemented by one of the processing engines.
6. The signal processing system of claim 5, wherein the combinable pooling operator comprises a max pooling operator.
7. A vehicle control system comprising:
a camera arranged to capture a field of view adjacent thereto;
a control system configured to control operation of an actuator of the vehicle;
a signal processing system comprising a controller, an accelerator, and a memory device, wherein the accelerator is configured as a multi-layer array operable to implement a Convolutional Neural Network (CNN) having a plurality of layers and a plurality of Processing Engines (PEs), each of the processing engines being assigned to one or more of the plurality of layers and configured to perform one of a plurality of operators, the operator associated with each of the processing engines being fixed such that each processing engine cannot perform any other operator of the plurality of operators;
the signal processing system is in communication with the camera and the control system, the controller includes a set of instructions,
the controller is operable to:
i) Capturing an image of the field of view via the camera, wherein the image is comprised of a bitmap image file;
ii) transferring the bitmap image file to the memory device;
iii) Implementing the convolutional neural network in the accelerator to perform image recognition according to the processing of the bitmap image file via operators associated with each of the processing engines assigned to each layer;
iv) executing, via the controller, the convolutional neural network to extract features from the bitmap image file, wherein the features include obstructions disposed in the field of view, the convolutional neural network selectively implementing the operators by activating processing engines assigned to each layer and disabling processing engines not assigned to each layer, respectively;
v) controlling, via the control system, operation of the actuator of the vehicle based on the features extracted from the bitmap image file;
wherein the processing engines applicable to each layer are selectively activated and deactivated such that the processing engines are activated and deactivated as needed for each layer;
wherein a kernel size is defined for each of the plurality of layers, and wherein each processing engine is activated and deactivated for each layer according to its associated kernel size;
wherein each layer is configured to analyze a plurality of tiles selected from the bitmap image file; and is also provided with
Wherein each processing engine is activated and deactivated for each layer according to the data size of each tile.
8. The vehicle control system of claim 7, wherein the bitmap image file is the field of view captured by the camera.
9. The vehicle control system of claim 7, wherein the plurality of operators comprises a combinable rectified linear unit (ReLU) operator implemented by one of the processing engines.
10. The vehicle control system of claim 7, wherein the plurality of operators comprises combinable pooling operators implemented by one of the processing engines.
CN201910490067.0A 2018-11-16 2019-06-05 Method and apparatus for neural networks Active CN111199509B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16/193303 2018-11-16
US16/193,303 US11354888B2 (en) 2018-11-16 2018-11-16 Method and apparatus for a neural network

Publications (2)

Publication Number Publication Date
CN111199509A (en) 2020-05-26
CN111199509B (en) 2024-04-16

Family

ID=70470059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910490067.0A Active CN111199509B (en) 2018-11-16 2019-06-05 Method and apparatus for neural networks

Country Status (3)

Country Link
US (1) US11354888B2 (en)
CN (1) CN111199509B (en)
DE (1) DE102019115875A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200067631A (en) * 2018-12-04 2020-06-12 Samsung Electronics Co., Ltd. Image processing apparatus and operating method for the same
CN109800859B * 2018-12-25 2021-01-12 Shenzhen Intellifusion Technologies Co., Ltd. Neural network batch normalization optimization method and device
TWI696129B * 2019-03-15 2020-06-11 Winbond Electronics Corp. Memory chip capable of performing artificial intelligence operation and operation method thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679621A (en) * 2017-04-19 2018-02-09 Beijing DeePhi Technology Co., Ltd. Artificial neural network processing unit
CN107704267A (en) * 2016-04-29 2018-02-16 Beijing Zhongke Cambricon Technology Co., Ltd. Convolutional neural network operation instruction and method therefor
CN108268941A (en) * 2017-01-04 2018-07-10 STMicroelectronics S.r.l. Deep convolutional network heterogeneous architecture
CN108665059A (en) * 2018-05-22 2018-10-16 Suzhou Research Institute, University of Science and Technology of China Convolutional neural network acceleration system based on a field-programmable gate array

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11074495B2 (en) 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform
US20150286766A1 (en) 2012-10-25 2015-10-08 Satish Chandra Tiwari Method and system for automated design of an integrated circuit using configurable cells
US10540588B2 (en) 2015-06-29 2020-01-21 Microsoft Technology Licensing, Llc Deep neural network processing on hardware accelerators with stacked memory
US11157800B2 (en) 2015-07-24 2021-10-26 Brainchip, Inc. Neural processor based accelerator system and method
US10614354B2 (en) 2015-10-07 2020-04-07 Altera Corporation Method and apparatus for implementing layers on a convolutional neural network accelerator
US10726328B2 (en) 2015-10-09 2020-07-28 Altera Corporation Method and apparatus for designing and implementing a convolution neural net accelerator
US11074492B2 (en) 2015-10-07 2021-07-27 Altera Corporation Method and apparatus for performing different types of convolution operations with the same processing elements
US10860499B2 (en) 2016-03-22 2020-12-08 Lenovo Enterprise Solutions (Singapore) Pte. Ltd Dynamic memory management in workload acceleration
KR102459854B1 (en) 2016-05-26 2022-10-27 Samsung Electronics Co., Ltd. Accelerator for deep neural networks
KR20180012439A (en) * 2016-07-27 2018-02-06 Samsung Electronics Co., Ltd. Accelerator in convolutional neural network and operation method thereof
US10089577B2 (en) 2016-08-05 2018-10-02 Xilinx, Inc. Binary neural networks on programmable integrated circuits
US10949736B2 (en) 2016-11-03 2021-03-16 Intel Corporation Flexible neural network accelerator and methods therefor
US11562115B2 (en) 2017-01-04 2023-01-24 Stmicroelectronics S.R.L. Configurable accelerator framework including a stream switch having a plurality of unidirectional stream links
US20180336468A1 (en) * 2017-05-16 2018-11-22 Nec Laboratories America, Inc. Pruning filters for efficient convolutional neural networks for image recognition in surveillance applications
IN201811023855A (en) * 2018-06-26 2018-07-13 Hcl Technologies Ltd

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704267A (en) * 2016-04-29 2018-02-16 Beijing Zhongke Cambricon Technology Co., Ltd. Convolutional neural network operation instruction and method therefor
CN108268941A (en) * 2017-01-04 2018-07-10 STMicroelectronics S.r.l. Deep convolutional network heterogeneous architecture
EP3346423A1 (en) * 2017-01-04 2018-07-11 STMicroelectronics Srl Deep convolutional network heterogeneous architecture system and device
CN107679621A (en) * 2017-04-19 2018-02-09 Beijing DeePhi Technology Co., Ltd. Artificial neural network processing unit
CN108665059A (en) * 2018-05-22 2018-10-16 Suzhou Research Institute, University of Science and Technology of China Convolutional neural network acceleration system based on a field-programmable gate array

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zidong Du et al., "ShiDianNao: Shifting Vision Processing Closer to the Sensor," 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture, 2015, pp. 92-104. *

Also Published As

Publication number Publication date
US20200160125A1 (en) 2020-05-21
DE102019115875A1 (en) 2020-05-20
US11354888B2 (en) 2022-06-07
CN111199509A (en) 2020-05-26

Similar Documents

Publication Publication Date Title
CN111199509B (en) Method and apparatus for neural networks
TWI759361B (en) An architecture, method, computer-readable medium, and apparatus for sparse neural network acceleration
TWI775605B (en) Deep vision processor
Song et al. In-situ AI: Towards autonomous and incremental deep learning for IoT systems
US11861484B2 (en) Neural processing unit (NPU) direct memory access (NDMA) hardware pre-processing and post-processing
EP3685319B1 (en) Direct access, hardware acceleration in neural network
CN110050267B (en) System and method for data management
KR102068576B1 (en) Convolutional neural network based image processing system and method
US11586417B2 (en) Exploiting activation sparsity in deep neural networks
EP3514735B1 (en) A device and a method for image classification using a convolutional neural network
CN111221748A (en) Method and apparatus for memory access management for data processing
EP3289529A1 (en) Reducing image resolution in deep convolutional networks
WO2017117186A1 (en) Conditional parallel processing in fully-connected neural networks
CN111105030B (en) Active zero bypass and weight pruning in neural networks for vehicle sensing systems
CN107977662B Layered computation method for realizing high-speed processing of computer vision images
CN110163333A Parallel optimization method for convolutional neural networks
Kala et al. UniWiG: Unified Winograd-GEMM architecture for accelerating CNN on FPGAs
CN115204355A (en) Neural processing unit capable of reusing data and method thereof
CN111783966A Hardware device and method for a deep convolutional neural network hardware parallel accelerator
CN109542513B (en) Convolutional neural network instruction data storage system and method
Al Maashri et al. Hardware acceleration for neuromorphic vision algorithms
WO2021031351A1 (en) Data processing system and method, and medium
Huang et al. SKNetV2: Improved Selective Kernel Networks for Object Detection
JP7444870B2 (en) Data processing method and device, computer readable storage medium
US20230004788A1 (en) Hardware architecture for processing tensors with activation sparsity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant