KR20170080087A - Method and apparatus of exploiting sparse activation in neural networks to reduce power consumption of synchronous integrated circuits - Google Patents

Method and apparatus of exploiting sparse activation in neural networks to reduce power consumption of synchronous integrated circuits

Info

Publication number
KR20170080087A
KR20170080087A
Authority
KR
South Korea
Prior art keywords
activity
neuron
clock
processing element
clock gating
Prior art date
Application number
KR1020150191287A
Other languages
Korean (ko)
Other versions
KR101806833B1 (en)
Inventor
정재용 (Jaeyong Chung)
Original Assignee
인천대학교 산학협력단 (Incheon National University Industry Academic Cooperation Foundation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 인천대학교 산학협력단 (Incheon National University Industry Academic Cooperation Foundation)
Priority to KR1020150191287A priority Critical patent/KR101806833B1/en
Publication of KR20170080087A publication Critical patent/KR20170080087A/en
Application granted granted Critical
Publication of KR101806833B1 publication Critical patent/KR101806833B1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/3237Power saving characterised by the action undertaken by disabling clock generation or distribution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/04Generating or distributing clock signals or signals derived directly therefrom
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/3287Power saving characterised by the action undertaken by switching off individual functional units in the computer system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Computer Hardware Design (AREA)
  • Power Sources (AREA)

Abstract

The present invention relates to an apparatus and method for reducing the power consumption of a synchronous integrated circuit by exploiting the sparse activity of an artificial neural network. The proposed apparatus and method detect the activity of neurons in the artificial neural network and, according to the activity level, block the clock of a processing element during sparse activity, thereby reducing unnecessary dynamic power consumption.

Description

TECHNICAL FIELD The present invention relates to an apparatus and a method for reducing the power consumption of a synchronous integrated circuit that exploits the sparse activity of an artificial neural network.

More particularly, the present invention relates to a device and method that detect the activity of neurons in an artificial neural network and, according to the activity level, block the clock of a processing element during sparse activity, thereby reducing the dynamic power unnecessarily consumed in the synchronous circuit.

The recent development of neural networks, known as deep learning, has made it possible to solve problems that previously only humans could solve. Deep learning has become a practical tool for a variety of cognitive applications such as image understanding, speech recognition, search, natural language processing, and autonomous vehicle operation. Deep learning for real-world applications relies on giant neural networks that include millions of elements.

The amount of computation required to run a neural network in real time is vast, and the resulting power consumption is a serious problem in battery-powered systems. CPUs cannot deliver real-time computation performance, and mobile CPU- and FPGA-based accelerators consume more than 10 W, which is unsuitable for mobile devices such as phones and wearables. Always-on devices demand an even more stringent power budget. Therefore, research is being conducted to find computing platforms fundamentally different from von Neumann architecture-based systems.

Neuromorphic engineering, which began in the 1980s, originally referred to imitating the neuro-biological architectures of the nervous system using analog circuits, but the term "neuromorphic" is now used more broadly for such circuits and systems, and neuromorphic computing systems have become an alternative to the power and performance problems of conventional computing platforms.

However, recent large-scale neuromorphic systems such as BrainScaleS and Neurogrid were developed for a different purpose, to accelerate brain simulation and scale it toward the human level, and they are mainly applied to neuroscience research. Unlike these two systems, TrueNorth, a digital neuromorphic system proposed by IBM, aims at real-world applications as well as brain simulation. TrueNorth applies a spiking neuron model and adopts a hybrid synchronous-asynchronous design to consume less power when the neurons are inactive.

Recently, INsight has been proposed as a neuromorphic computing system specially designed for deep learning based applications. Because it adopts a fully synchronous design, the system can be implemented with current design automation tools. However, synchronous neuromorphic systems such as INsight suffer a large amount of dynamic power consumption due to the clock.

The present invention proposes a low power consumption scheme for such digital synchronous neuromorphic systems. Although primarily developed for a neuromorphic system, the invention can also be applied to various synchronous processors associated with neural networks.

SUMMARY OF THE INVENTION It is an object of the present invention to provide a low power consumption scheme for a digital synchronous neuromorphic system.

In particular, the present invention proposes a method for saving power in various synchronous neuromorphic circuits beyond spiking neural networks, and seeks to solve the problem that a synchronous circuit consumes considerable power through its clock even when it performs no useful work.

According to an aspect of the present invention, there is provided an apparatus for reducing the power consumption of a synchronous integrated circuit that exploits the sparse activity of an artificial neural network, including: an activity sensing unit for sensing the activity of neurons in the artificial neural network; and a clock gating circuit unit for performing clock gating on a processing element of the artificial neural network based on the activity detection.

Preferably, the activity sensing unit senses the sparse activity of a neuron, in which the output of the neuron is zero according to its activity, and the clock gating circuit unit blocks the clock of the processing element upon detection of the sparse activity of the neuron.

Further, the clock gating circuit unit may gate a clock input to a clock pin of a register included in a unit cell of a neuromorphic computing system.

In an embodiment of the present invention, the processing element may be a unit cell constituting a semi-systolic multiplier.

Also, a method of reducing the power consumption of a synchronous integrated circuit that exploits the sparse activity of an artificial neural network according to the present invention includes: a neuron activity sensing step of sensing neuron activity in the artificial neural network; and a clock gating step of performing clock gating on a processing element of the artificial neural network based on the activity detection.

Preferably, the neuron activity sensing step senses the sparse activity of a neuron, in which the output of the neuron is zero according to its activity, and the clock gating step blocks the clock of the processing element upon detecting the sparse activity of the neuron.

Further, the processing element may output a default value while its clock is cut off as a result of the clock gating step.

According to the present invention, unnecessary dynamic power consumed in a synchronous circuit can be reduced by detecting the neuron activity of the artificial neural network and blocking the clock of the processing element during sparse activity according to the activity level.

In particular, a synchronous circuit can reduce its dynamic power in proportion to the amount of activity, as spike-based asynchronous circuits do, thereby achieving low power consumption for synchronous neuromorphic systems.

FIG. 1 shows a conceptual diagram of a DAG (Directed Acyclic Graph) for a three-layer feed-forward neural network.
FIG. 2 shows a practical CNN (Convolutional Neural Network) called AlexNet.
FIG. 3 shows the activities of each convolutional layer of AlexNet for an input image.
FIG. 4 shows the average sparsity of the activity of each layer in a CNN.
FIG. 5 shows a configuration diagram of an embodiment of a power saving device according to the present invention.
FIG. 6 shows a flow diagram of an embodiment of a power saving method according to the present invention.
FIG. 7 shows an embodiment applying the invention to a synapse based on a semi-systolic multiplier.
FIG. 8 shows a flow diagram of another embodiment of a power saving method according to the present invention.
FIG. 9 illustrates how partial products are generated and added in a 4-bit semi-systolic multiplier.
FIG. 10 shows waveforms of the present invention.
FIG. 11 compares the dynamic power and the total power of the baseline design and the proposed design as the sparsity varies.

BRIEF DESCRIPTION OF THE DRAWINGS The above and other objects, features and advantages of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.

First, the terminology used in the present application is used only to describe specific embodiments and is not intended to limit the present invention; singular expressions may include plural expressions unless the context clearly indicates otherwise. Also, in this application, terms such as "comprise" and "have" are intended to specify the presence of stated features, numbers, steps, operations, elements, parts, or combinations thereof, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

In the following description of the present invention, detailed descriptions of known functions and configurations will be omitted when they may unnecessarily obscure the subject matter of the present invention.

The present invention relates to an apparatus and method for reducing the power consumption of a synchronous integrated circuit by exploiting the sparse activity of an artificial neural network. The proposed apparatus and method detect neuron activity in the artificial neural network and, according to the activity level, block the clock of a processing element during sparse activity, thereby reducing unnecessary dynamic power consumption.

Feed-forward neural networks are the most common form of neural network applied to real-world problems; they consist of layers of units arranged in a topological order. The first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers.

If a feed-forward network has more than one hidden layer, it is called a deep neural network. The most common layer type is the fully connected layer, in which the neurons of two adjacent layers are connected pairwise. A layer of n neurons fully connected to m neurons performs a nonlinear transformation.

Denoting the activity of the neurons in the input layer by x ∈ R^n and the output of the layer by y ∈ R^m, the nonlinear transformation can be expressed by the following Formula 1:

y = f(Wx + b)   [Formula 1]

where W ∈ R^(m×n) is the weight matrix, b ∈ R^m is the bias, and f is the activation function.

FIG. 1 shows a conceptual diagram of a DAG (Directed Acyclic Graph) for a three-layer feed-forward neural network. In FIG. 1, circles denote neurons and lines denote connections. The hidden layer and the output layer are fully connected.

Convolutional Neural Networks (CNNs) are multi-layer neural networks commonly used for visual recognition tasks, taking images as input. CNNs have made significant breakthroughs in a variety of image recognition tasks and sparked much of the interest in deep learning.

In CNNs, neurons are arranged in a three-dimensional volume whose dimensions are width, height, and depth. For example, the input neurons activated by the pixels of each channel of an input image may be arranged according to the vertical and horizontal positions of the pixels and the channels. Each layer of a CNN converts an input activity represented by a three-dimensional volume into a three-dimensional output volume.

A CNN is built from two main types of layers, the convolutional layer and the fully connected layer. In a convolutional layer, the neurons have spatially local connections, and the layer performs a convolution because neurons at the same depth share the weights of their local connections. The spatial extent of a local connection is called the receptive field. A convolutional layer may also include a pooling layer and a Local Response Normalization (LRN) layer.

FIG. 2 shows a practical CNN called AlexNet.

AlexNet has received a great deal of attention owing to its remarkable results in the ILSVRC challenge. AlexNet takes 224*224 color images as input, so the neurons of the input layer are represented by a volume of 224*224*3, and it classifies the input images into 1000 categories. AlexNet has five convolutional layers followed by three fully connected layers and contains 60M (mega) parameters and 650K (kilo) neurons. The first convolutional layer has an 11*11 receptive field and includes a pooling layer and an LRN layer. The network achieves a top-5 error rate of 18.2% on the ILSVRC-2012 dataset, which consists of 1.2M (mega) training images and 50K (kilo) validation images.

TrueNorth, the digital neuromorphic system proposed by IBM, is a synchronous-asynchronous hybrid neuromorphic system consisting of neurons and synapses. It uses digital spiking neuron models, and when rate coding is applied, the activity of a neuron is represented by the number of spikes within a given number of time steps, the time window. The neurons communicate through AER (Address Event Representation), in which the times at which spikes occur are transmitted. Because TrueNorth is event based, if no AER packets are received by a core, there is no activity in the core's synapses and neurons, as in an asynchronous system. Therefore, the switching activity of the circuit is very low when the neurons do not spike, i.e., when the neurons are inactive.

As such, asynchronous circuits consume dynamic power only while they operate, whereas synchronous circuits consume a great deal of power through the clock even when they perform no useful work. However, this may not be the case once clock gating is considered. Clock gating is a low-power circuit design technique that addresses the clock power problem and is now provided by design automation tools.

These tools automatically extract the 'enable' signal of each flip-flop, if one exists, and replace the original clock of a group of flip-flops sharing the same 'enable' with a new clock gated by that 'enable' signal. This removes a large amount of clock load from the clock tree and eliminates unnecessary switching of the clock pins. Such a transformation is very effective and is performed even at a fine-grained level of fewer than 10 flip-flops. In digital signal processing circuits, 'enable' signals are usually 'data valid' signals.
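For illustration, the following is a minimal Verilog sketch of the latch-based integrated clock gating (ICG) cell that such tools typically insert; the module and signal names are illustrative, not taken from the patent.

// A latch-based clock gating cell: the latch blocks changes of
// 'enable' while the clock is high, so the gated clock cannot glitch.
module icg_cell (
    input  wire clk,     // free-running clock
    input  wire enable,  // 'enable' extracted from the flip-flop group
    output wire gclk     // gated clock driving the flip-flop group
);
    reg enable_latched;

    // Transparent-low latch: follows 'enable' while clk is low,
    // holds its value while clk is high.
    always @(clk or enable)
        if (!clk)
            enable_latched <= enable;

    assign gclk = clk & enable_latched;
endmodule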

Nonetheless, when activity is sparse, asynchronous circuits can still be more power efficient than clock-gated synchronous circuits in a neuromorphic computing system. The Rectified Linear Unit (ReLU), f(x) = max(0, x), is the most common choice of activation function in deep neural networks. Since ReLU produces exactly zero whenever x is less than zero, the activity of a layer usually contains many zeros, that is, it is sparse. Moreover, activity tends to become sparser toward the output layer.
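The hardware consequence of this choice can be seen in a small sketch (an illustration, not part of the patent): a two's-complement ReLU maps every negative input to an exact zero word, and it is exactly these zero words that the activity detection described below can recognize.

// Illustrative two's-complement ReLU: a negative input is detected by
// its sign bit and replaced with an exact zero, making activity sparse.
module relu #(parameter W = 16) (
    input  wire signed [W-1:0] x,
    output wire        [W-1:0] y
);
    assign y = x[W-1] ? {W{1'b0}} : x;  // y = max(0, x)
endmodule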

FIG. 3 shows the activities of each convolutional layer of AlexNet for an input image.

FIG. 3(a) shows the input image, and FIGS. 3(b) to 3(f) visualize the activities of the neurons in each convolutional layer of AlexNet. The output three-dimensional volume of each layer is sliced along the depth dimension, and the slices are tiled. The first and second convolutional layers include pooling followed by LRN, and the fifth layer includes only pooling.

In FIG. 3, zero activity appears as black. The activity maps look mostly black because most of the activities are zero or small, and the images become darker toward the output layer.

FIG. 4(a) shows the average sparsity of the activity of each layer in AlexNet for the first 256 input images of the ILSVRC-2012 validation dataset.

FIG. 4(a) is a graph of the sparsity of the output activity of each layer in AlexNet over the 256 images. The sparsity increases toward the output layer, rising to about 0.8 near the output layer.

FIG. 4(b) shows the same trend for VGG-19, a recent CNN with 19 layers and 160M (mega) parameters that produced state-of-the-art results on ILSVRC.

The sparsity of activity is naturally exploited in spike-based asynchronous circuits: "no activation" is usually encoded as zero spikes, so it consumes no dynamic power at all. The same property can be exploited in synchronous circuits by applying clock gating.

FIG. 5 shows a configuration diagram of an embodiment of the power saving device according to the present invention.

In the present invention, the clock of a processing element is gated according to the activity of the neurons: the clock can be blocked while the neuron is "inactive". In this approach, it is important to keep the nonzero-activity detection logic small so that its overhead does not outweigh the gain.

To this end, the power saving apparatus 200 according to the present invention roughly includes an activity detection unit 210 that detects the activity supplied to a processing element 100, and a clock gating circuit unit 250 that gates the clock of the processing element 100 based on the sparse activity detected by the activity detection unit 210.
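As a rough sketch of this arrangement, the following Verilog assumes the neuron activity is available as a parallel word (the patent's embodiment below uses a bit-serial input instead); the module and signal names are hypothetical and chosen only for illustration.

// Activity detection (210) plus clock gating circuit (250): the clock
// reaching the processing element toggles only while the neuron
// activity word is nonzero.
module sparse_clock_gate #(parameter W = 16) (
    input  wire         clk,
    input  wire [W-1:0] activity,  // neuron activity x
    output wire         pe_clk     // clock delivered to the processing element
);
    wire active = |activity;       // sparse activity means x == 0

    // Latch-based gating keeps pe_clk glitch-free.
    reg active_latched;
    always @(clk or active)
        if (!clk)
            active_latched <= active;

    assign pe_clk = clk & active_latched;
endmodule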

Figure 6 shows a flow diagram of one embodiment of a power saving method in accordance with the present invention.

In the present invention, the power saving device shown in FIG. 5 is applied to reduce power through clock gating of the processing element according to the activity of the neurons. The activity detection unit 210 detects the activity of a neuron (S110). While the neuron is active, the clock gating circuit unit 250 supplies the clock to the processing element 100 (S120), and the result of the operation is output (S130). When sparse activity is detected, the clock gating circuit unit 250 cuts off the clock of the processing element 100 (S150), and only the default value is output (S160).

The present invention also shows how to implement such a configuration in the INsight neuromorphic computing system. INsight is a digital synchronous neuromorphic system in which synapses are the main processing elements.

FIG. 7 shows an embodiment applying the invention to a synapse based on a semi-systolic multiplier.

In this embodiment, the power saving device 200 according to the present invention is added as a simple circuit that reduces the dynamic power consumed by the clock pins of the registers in the unit cells 110a, 110b, 110c, and 110d.

Let x and y denote the activity of the neuron and the synapse weight, each expressed as an n-bit binary number, and let x_i and y_i denote the i-th bits of x and y. As the input to the synapse, the bits of x arrive serially, least significant bit (LSB) first.

The semi-systolic multiplier is based on a primitive cell consisting of two registers, a two-input AND gate, and a full adder, and an n-bit semi-systolic multiplier is composed of n unit cells.

Denoting the unit cells PC_0 to PC_{n-1}, the synapse weight is provided to the cells in parallel, and PC_i holds y_i. The multiplier is called "semi-systolic" because it requires a global wire for the input, unlike a fully systolic system in which neighboring processing elements are interconnected only through local wires. Although the global wire is a long wire that may be unsuitable for high-frequency operation, it allows a more compact layout for a neuromorphic system.
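Read literally, the description above suggests a primitive cell of the following shape; this Verilog sketch is a hedged reconstruction for illustration, and the port names are assumptions, not taken from the patent.

// One primitive cell PC_i of the bit-serial semi-systolic multiplier:
// an AND gate forms the partial product bit, a full adder accumulates
// it, and two registers hold the sum and carry between cycles.
module pc_cell (
    input  wire clk,     // (gated) clock from the clock gating circuit
    input  wire x_bit,   // serial activity bit, LSB first (global wire)
    input  wire y_bit,   // weight bit y_i held in this cell
    input  wire sum_in,  // sum register output of the neighbor PC_{i+1}
    output reg  sum_out, // sum register
    output reg  carry    // carry register fed back into the adder
);
    wire       pp = x_bit & y_bit;        // partial product bit
    wire [1:0] s  = pp + sum_in + carry;  // full adder: sum and carry

    always @(posedge clk) begin
        sum_out <= s[0];
        carry   <= s[1];
    end
endmodule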

When the power saving method according to the present invention is applied to the embodiment of FIG. 7, the flowchart of the embodiment shown in FIG. 8 may be applied.

The neuron activity x is sensed (S210). When x is not 0, the clock is supplied to each of the unit cells 110a, 110b, 110c, and 110d, and a result value reflecting the weight y is output (S230).

However, when the neuron activity x is 0, the clock to each of the unit cells 110a, 110b, 110c, and 110d is cut off (S250), and each unit cell outputs the default value 0 (S260).

FIG. 9 illustrates how partial products are generated and added in a 4-bit semi-systolic multiplier.

In the semi-systolic multiplier, each unit cell contributes the operations associated with partial products. Partial product bits lying on the same diagonal are generated as digits of the same weight in the same cycle. The multiplier generates each partial product bit individually and accumulates the partial products incrementally.

FIG. 9(a) shows how the partial products are added. One input of the full adder of PC_i is the 1-bit partial product of the current cycle. The other two inputs come from the previous cycle through flip-flops, one from the same unit cell PC_i and the other from PC_{i+1}.
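As a restatement of this accumulation (in our notation, not the patent's), the unsigned product can be written as a sum of diagonals of partial product bits:

\[
  x \cdot y \;=\; \Bigl(\sum_{i=0}^{n-1} x_i 2^i\Bigr) y
          \;=\; \sum_{k=0}^{2n-2} 2^k \sum_{i+j=k} x_i\, y_j ,
\]

where each bit x_i y_j is produced by the AND gate of one cell in one cycle, and the inner sums over i + j = k are the diagonals that the full adders accumulate.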

For signed multiplication, the multiplicand x must be sign-extended for the last four cycles, and in PC_{n-1} the sign bit of y must carry a weight of -2^(n-1), so a correction is required.

Therefore, in the last cell the compensation register (Cout register) is initialized to 1 and the AND gate is replaced with a NAND gate; the modified cell is denoted PC'.

In the semi-systolic multiplier, the registers retain their initial values until the first '1' arrives from the input. While '0's appear before the first '1', the value of the compensation register in PC'_{n-1} stays at '1' because the full adder keeps generating a carry. Therefore, while '0's appear, no new values need to be written into the registers, and clock gating is possible.

In FIG. 7, the activity sensing unit 210 is the logic added for this data-driven clock gating; only two additional combinational gates and one additional register are needed to detect the clock gating condition.
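One speculative reading of this data-driven enable logic is sketched below: a single register remembers whether a '1' has appeared in the serial input, and the clock is enabled only from that point on. The signal name 'last' follows FIG. 10; the remaining names are assumptions.

// Data-driven clock-gating enable for the bit-serial input: the clock
// stays blocked while only '0's have arrived, i.e., while the
// registers would keep their values anyway.
module activity_sense (
    input  wire clk,
    input  wire x_bit,   // serial neuron activity, LSB first
    input  wire last,    // marks the last bit of the current activity word
    output wire enable   // enable for the clock gating cell
);
    reg seen_one;
    always @(posedge clk)
        if (last)
            seen_one <= 1'b0;          // re-arm for the next activity word
        else if (x_bit)
            seen_one <= 1'b1;          // a '1' has appeared

    assign enable = x_bit | seen_one;  // roughly two gates and one register
endmodule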

Hereinafter, experimental results of applying the present invention will be described.

We implemented the baseline n-bit semi-systolic multiplier and the proposed design in Verilog, synthesized them using Synopsys Design Compiler, and used a high-Vt library characterized at 0.95 V, 125°C, and the SS corner.

Clock gating was performed automatically during synthesis. The synthesis results are summarized in Table 1 below.

[Table 1]


The baseline scheme has 2n-1 registers for the unit cells, one register having been removed from PC_0. In the proposed scheme, one additional register is used for the sign extension of the serial input, and a latch is used for the clock gating cell.

The 2n-1 registers are clock gated, but the register for sign extension is not clock gated because its enable condition differs from the others. For better timing, the flip-flops with synchronous reset were mapped to simple D flip-flops with AND gates, so the area could be reduced further with a gate-level design. Since the present invention adds only one register to the baseline scheme, the difference in cell area is small.

FIG. 10 shows waveforms of the present invention.

The clock signal of the internal registers toggles according to the input activity. In the waveforms, 'last in ph0' indicates when sign extension needs to be performed, and 'last' marks the last bit of the activity.

Inputs continue to arrive, as 'data in valid' and 'last' show, but when the connected neuron is inactive the clock signals of the internal registers do not toggle, thus reducing dynamic power.

To evaluate the invention, we randomly generated 20 sets of 1000 vectors, varying the sparsity of the vector sets from 0.0 to 1.0 in steps of 0.05. For each synthesized netlist, we performed gate-level simulation with these vector sets using Synopsys VCS and generated a Switching Activity Interchange Format (SAIF) file.

The switching activity in each SAIF file was back-annotated onto the netlist, and dynamic power analysis was performed in Design Compiler to obtain a power report for each vector set.

Table 2 below shows the results for sparsities of 0.2, 0.6, and 0.8.

[Table 2]


As shown in FIG. 4, the sparsities of the second and seventh layers of AlexNet are approximately 0.2 and 0.8, respectively, so the results can be related to individual layers.

In both the baseline scheme and the scheme according to the invention, most of the power was consumed in the sequential logic, regardless of word size, due to short-circuit current.

In the 8-bit baseline design with an input sparsity of 0.2, dynamic power accounts for 85% of the total power and static power for 15%. At a sparsity of 0.2, the 8-bit design according to the present invention achieves a total power saving of 15%, adding 4% of static power but saving 19% of dynamic power. A similar improvement was achieved at 16 bits, and a marked improvement appeared when the sparsity increased to 0.8: the 8-bit scheme according to the present invention saves 65% of the dynamic power and 54% of the total power relative to the baseline scheme.

As the word size increases to 16 bits, more clock pins are hidden behind clock gating and the improvement grows significantly: the scheme according to the invention consumes 79% less dynamic power than the baseline scheme.

FIG. 11 compares the dynamic power and the total power of the baseline design and the proposed design as the sparsity varies.

In the baseline scheme, power consumption decreases as the sparsity increases because the switching activity decreases. In the present invention, dynamic power consumption shrinks toward zero as the sparsity approaches 1.0, as in spike-based asynchronous circuits.

As such, the present invention detects sparse activity in a deep neural network and achieves low power consumption for a synchronous neuromorphic system. Furthermore, a synchronous circuit can reduce its dynamic power in proportion to the amount of activity, like a spike-based asynchronous circuit.

Although the present invention has been described with reference to a digital synapse based on a semi-systolic multiplier, it can also be applied to various other types of digital circuits.

The foregoing description merely illustrates the technical idea of the present invention, and those skilled in the art may make various changes and modifications without departing from its essential characteristics. Therefore, the embodiments of the present invention are intended to illustrate rather than limit the technical idea of the present invention, and the scope of the technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope of their equivalents should be construed as falling within the scope of the present invention.

100: processing element,
200: power saving device,
210: activity detection unit,
250: clock gating circuit unit.

Claims (7)

An activity sensing unit for sensing the activity of a neuron in an artificial neural network; and
a clock gating circuit unit for performing clock gating on a processing element of the artificial neural network based on the activity detection.
The apparatus according to claim 1,
wherein the activity sensing unit senses a sparse activity of the neuron, in which the output of the neuron is zero according to its activity, and
wherein the clock gating circuit unit blocks the clock of the processing element upon detection of the sparse activity of the neuron.
The apparatus according to claim 1,
wherein the clock gating circuit unit gates a clock input to a clock pin of a register included in a unit cell of a neuromorphic computing system.
The apparatus according to claim 1,
wherein the processing element is a unit cell constituting a semi-systolic multiplier.
A neuron activity sensing step of sensing the activity of a neuron in an artificial neural network; and
a clock gating step of performing clock gating on a processing element of the artificial neural network based on the activity detection.
The method according to claim 5,
wherein the neuron activity sensing step senses a sparse activity of the neuron, in which the output of the neuron is zero according to its activity, and
wherein the clock gating step cuts off the clock of the processing element upon detecting the sparse activity of the neuron, in the method for reducing the power consumption of a synchronous integrated circuit exploiting the sparse activity of an artificial neural network.
The method according to claim 6,
Further comprising the step of causing the processing element to output a default value while the clock to the processing element is cut off as a result of performing the clock gating step.
KR1020150191287A 2015-12-31 2015-12-31 Method and apparatus of exploiting sparse activation in neural networks to reduce power consumption of synchronous integrated circuits KR101806833B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020150191287A KR101806833B1 (en) 2015-12-31 2015-12-31 Method and apparatus of exploiting sparse activation in neural networks to reduce power consumption of synchronous integrated circuits


Publications (2)

Publication Number Publication Date
KR20170080087A (en) 2017-07-10
KR101806833B1 KR101806833B1 (en) 2017-12-11

Family

ID=59355993

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150191287A KR101806833B1 (en) 2015-12-31 2015-12-31 Method and apparatus of exploiting sparse activation in neural networks to reduce power consumption of synchronous integrated circuits

Country Status (1)

Country Link
KR (1) KR101806833B1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200094534A (en) 2019-01-30 2020-08-07 삼성전자주식회사 Neural network apparatus and method for processing multi-bits operation thereof

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4728055B2 (en) 2005-06-24 2011-07-20 エルピーダメモリ株式会社 Artificial neural circuit
JP5393589B2 (en) 2010-05-17 2014-01-22 本田技研工業株式会社 Electronic circuit
US9710749B2 (en) 2013-09-03 2017-07-18 Qualcomm Incorporated Methods and apparatus for implementing a breakpoint determination unit in an artificial nervous system
KR101925905B1 (en) 2014-04-15 2019-01-28 인텔 코포레이션 Methods, systems and computer program products for neuromorphic graph compression using associative memories

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019164250A1 (en) * 2018-02-20 2019-08-29 삼성전자 주식회사 Method and device for controlling data input and output of fully connected network
US11755904B2 (en) 2018-02-20 2023-09-12 Samsung Electronics Co., Ltd. Method and device for controlling data input and output of fully connected network
US10990354B2 (en) 2019-03-20 2021-04-27 SK Hynix Inc. Neural network accelerating device and method of controlling the same
KR20220149414A (en) * 2021-04-30 2022-11-08 주식회사 딥엑스 Npu implemented for artificial neural networks to process fusion of heterogeneous data received from heterogeneous sensors
KR20230081530A (en) * 2021-11-30 2023-06-07 충북대학교 산학협력단 Convolutional neural network accelerator minimizing memory access

Also Published As

Publication number Publication date
KR101806833B1 (en) 2017-12-11


Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant