EP3895073A1 - Realigning streams of neuron outputs in artificial neural network computations - Google Patents
Realigning streams of neuron outputs in artificial neural network computations
- Publication number
- EP3895073A1, EP19835489.6A
- Authority
- EP
- European Patent Office
- Prior art keywords
- stream
- input values
- neuron
- arithmetic unit
- neuron outputs
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
Definitions
- the present disclosure relates generally to data processing and, more particularly, to a system and method for accelerating artificial neural network (ANN) computations.
- ANNs: Artificial Neural Networks
- the human brain contains 10-20 billion neurons connected through synapses. Electrical and chemical messages are passed from neuron to neuron based on input information and their resistance to passing information.
- a neuron can be represented by a node performing a simple operation of addition coupled with a saturation function.
- a synapse can be represented by a connection between two nodes. Each of the connections can be associated with an operation of multiplication by a constant.
- the ANNs are particularly useful for solving problems that cannot be easily solved by classical computer programs.
- While forms of the ANNs may vary, they all have the same basic elements similar to the human brain.
- a typical ANN can be organized into layers. Each of the layers may include many neurons sharing similar functionality. The inputs of a layer may come from a previous layer, multiple previous layers, any other layers, or even the layer itself.
- Major architectures of ANNs include Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM) network, but other architectures of ANN can be developed for specific applications. While some operations have a natural sequence, for example a layer depending on previous layers, most of the operations can be carried out in parallel within the same layer. The ANNs can then be computed in parallel on many different computing elements similar to neurons of the brain.
- a single ANN may have hundreds of layers. Each of the layers can involve millions of connections. Thus, a single ANN may potentially require billions of simple operations like multiplications and additions.
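As a rough worked illustration of this scale, assume round numbers that are not taken from the disclosure: $L = 200$ layers with $C = 5 \times 10^{6}$ connections per layer, and one multiplication plus one addition per connection.

```latex
N_{\text{mult}} = L \cdot C = 200 \times 5 \times 10^{6} = 10^{9},
\qquad
N_{\text{ops}} = 2 \cdot L \cdot C = 2 \times 10^{9}
```

Under these assumed figures, a single pass through the network already needs about a billion multiplications and as many additions.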
- ANNs can result in a very heavy load for processing units (e.g., CPU), even ones running at high rates.
- GPUs: graphics processing units
- GPUs can be used to process large ANNs because GPUs have a much higher throughput capacity of operations in comparison to CPUs. Because this approach solves, at least partially, the throughput limitation problem, GPUs appear to be more efficient in the computations of ANNs than the CPUs.
- GPUs are not well suited to the computations of ANNs because the GPUs have been specifically designed to compute graphical images.
- the GPUs may provide a certain level of parallelism in computations.
- the GPUs constrain the computations to long pipelines, which implies latency and a lack of reactivity.
- very large GPUs can be used, which may involve excessive power consumption, a typical issue of GPUs. Since the GPUs may require more power for the computations of ANNs, the deployment of GPUs can be difficult.
- CPUs provide a very generic engine that can execute very few sequences of instructions with a minimum effort in terms of programming, but lack the computing power for ANNs.
- GPUs are slightly more parallel and require a larger programming effort than CPUs, which can be hidden behind libraries with some performance costs, but are not very well suited for ANNs.
- Field Programmable Gate Arrays (FPGAs) are professional components that can be programmed at the hardware level after they are manufactured. The FPGAs can be configured to perform computations in parallel. Therefore, FPGAs can be well suited to compute ANNs.
- One of the challenges of FPGAs is the programming, which requires a much larger effort than programming CPUs and GPUs. Adaptation of FPGAs to perform ANN computations can be more challenging than for CPUs and GPUs.
- the existing FPGA solutions do not solve the problem of massive data movement required at large scale for the actual ANN involved in real industrial applications.
- the inputs to be computed with an ANN are typically provided by an artificial intelligence (AI) framework.
- Those programs are used by the AI community to develop new ANN or global solutions based on ANN.
- FPGAs are also lacking integration in those software environments.
- a system for accelerating ANN computation may include a controller, a selector, and an arithmetic unit.
- the controller can be configured to dynamically control the selection, based on a criterion, of an input value from a stream of input values of a neuron.
- the controller can configure the selector to provide, dynamically, the selected input value to the arithmetic unit.
- the controller can provide, to the arithmetic unit, information for the selected input value.
- the information may include an offset of the input value in the stream. The offset cannot be statically computed before the computation of the neuron because the input values in the stream can be generated as a result of previous neuron computation.
- the arithmetic unit can be configured to acquire, based on the offset, a weight from a set of weights.
- the arithmetic unit can further perform a mathematical operation on the selected input value and the weight to obtain a result to be used in computing an output of the neuron.
- the mathematical operation performed by the arithmetic unit in its simplest form may include a multiplication product.
- the criterion applied by the controller may include comparison of the input value to a reference value.
- the comparison may include an equality comparison to zero.
- the comparison may also include a more-than comparison to a threshold.
- the selector may include a multiplexer.
- the selector may also include information related to a memory address. Input values not selected from the subset of input values are not provided to the arithmetic unit, therefore, a count of mathematical operations performed by the arithmetic unit on selected input values from the subset of input values is less than a count of mathematical operations required to be performed by the arithmetic unit on all input values from the subset of the input values.
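The interplay of controller, selector, and arithmetic unit described above can be sketched in software. This is only an illustration under stated assumptions, not the disclosed hardware: the names `select_nonzero` and `sum_of_products` are hypothetical, and the criterion is assumed to be the equality comparison to zero.

```python
from typing import Iterable, Iterator, Sequence, Tuple


def select_nonzero(stream: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Hypothetical controller + selector: yield (offset, value) for every
    input value that is not equal to zero. The offsets are only known at
    run time, because the stream may come from earlier neuron computations."""
    for offset, value in enumerate(stream):
        if value != 0:
            yield offset, value


def sum_of_products(stream: Iterable[float],
                    weights: Sequence[float]) -> Tuple[float, int]:
    """Hypothetical arithmetic unit: use each offset to acquire the matching
    weight, multiply, and accumulate. Returns the sum and the number of
    multiplications actually performed."""
    total = 0.0
    multiplications = 0
    for offset, value in select_nonzero(stream):
        total += value * weights[offset]
        multiplications += 1
    return total, multiplications


values = [0.0, 2.0, 0.0, 0.0, 1.5, 0.0]
weights = [0.3, -1.0, 0.8, 0.1, 2.0, 0.5]
result, count = sum_of_products(values, weights)
print(result, count)  # only 2 of the 6 possible multiplications are performed
```

Passing the offset along with the value is what lets the arithmetic unit pick the right weight even though unselected inputs never reach it.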
- a system for realigning streams of neuron outputs may include an arithmetic unit, at least one further arithmetic unit, and a synchronization module communicatively coupled to the arithmetic unit and the further arithmetic unit.
- the arithmetic unit can generate a stream of neuron outputs and the further arithmetic unit can generate a further stream of further neuron outputs.
- the synchronization module can receive the stream of neuron outputs from the arithmetic unit and the further stream of further neuron outputs from the further arithmetic unit.
- the stream of the neuron outputs and the stream of further neuron outputs can be unaligned with respect to each other.
- the synchronization module can dynamically reorder the neuron outputs and the further neuron outputs to obtain an ordered sequence.
- the synchronization module can write the ordered sequence to a local or global memory storage.
- the ordered sequence can be used as further input values for further computations required by the ANN.
- the neuron outputs can be computed by the arithmetic unit based on first input values.
- the first input values can be selected from a stream of input values.
- the further neuron outputs can be computed by the further arithmetic unit based on second input values.
- the second input values can be selected from the further stream of further input values.
- a count of the input values in the stream can be equal to a count of further input values in the further stream.
- First indexes of the first input values in the stream of input values can be different from second indexes of the second input values in the further stream of further input values.
- a method for accelerating ANN computation may further include dynamically selecting, by a controller communicatively coupled to a selector and an arithmetic unit and based on a criterion, an input value from a stream of input values of a neuron of ANN.
- the method may include dynamically configuring, by the controller, the selector to provide the selected input value to the arithmetic unit.
- the method may also include
- the method may include dynamically acquiring, by the arithmetic unit and based on the offset, a weight from a set of weights.
- the method may further include performing, by the arithmetic unit, a mathematical operation on the selected input value and the weight to obtain a result, wherein the result is to be used to compute an output of the neuron.
- FIG. 1 is a block diagram showing an example system wherein a method for acceleration of ANN computations can be implemented, according to some example embodiments.
- FIG. 2 shows an ANN, neuron, and transfer function, according to an example embodiment.
- FIG. 3 is a flow chart showing training and inference of ANN, according to some example embodiments.
- FIG. 4 is a block diagram showing a system for acceleration of ANN computations, according to some example embodiments.
- FIG. 5 is a block diagram showing a system for selecting input values for processing by ANN computations, according to an example embodiment.
- FIG. 6 is a block diagram showing a system for selecting input data for processing by ANN computations, according to another example embodiment.
- FIG. 7 is a block diagram showing a system for synchronization of outputs of neurons in ANN computations, according to another example embodiment.
- FIG. 8 is a flow chart showing steps of a method for acceleration of ANN computations, according to some example embodiments.
- FIG. 9 shows a computing system that can be used to implement
- Embodiments of this disclosure are concerned with methods and systems for acceleration of ANN computations.
- Embodiments of the present disclosure may facilitate selection of input values for processing by neurons of an ANN in order to avoid unnecessary mathematical operations in computation of outputs of neurons, thereby accelerating computations of the ANN.
- the input values equal to zero are not processed by arithmetic units configured to compute the neurons of ANN.
- the selection of input values can be also based on another criterion.
- the sequence of operations performed by the arithmetic units may depend dynamically on the stream of input values and the criterion used for selection.
- a select-all (no selection) criterion would result in an identical sequence of operations for all inputs, wherein essentially all operations of the neurons (and ANN) are performed.
- the computation of ANN will be similar to ANN computations in existing solutions which do not expose a dynamic behavior.
- module shall be construed to mean a hardware device, software, or a combination of both.
- a hardware-based module can use one or more microprocessors, FPGAs, application-specific integrated circuits (ASICs), programmable logic devices, transistor-based circuits, or various combinations thereof.
- Software-based modules can constitute computer programs, computer program procedures, computer program functions, and the like.
- a module of a system can be implemented by a computer or server, or by multiple computers or servers interconnected into a network.
- module may also refer to a subpart of a computer system, a hardware device, an integrated circuit, or a computer program.
- Technical effects of certain embodiments of the present disclosure can include configuring integrated circuits, FPGAs, or computer systems to perform ANN computations without execution of redundant and unnecessary mathematical operations, thereby accelerating the ANN computations. Further technical effects of some embodiments of the present disclosure can facilitate configuration of integrated circuits, FPGAs, or computer systems to dynamically qualify data on which
- FIG. 1 is a block diagram showing an example system 100, wherein a method for accelerating ANN computation can be implemented, according to some example embodiments.
- the system 100 can be part of a computing system, such as a personal computer, a server, a cloud-based computing resource, and the like.
- the system 100 may include one or more FPGA boards 105 and a chipset 135 including at least one CPU.
- the chipset 135 can be communicatively connected to the FPGA boards 105 via a communication interface.
- the communication interface may include a Peripheral Component Interconnect Express (PCIE) standard 130.
- the communication interface may also include an Ethernet connection 131.
- the FPGA board 105 may include an FPGA 115, a volatile memory 110, and a non-volatile memory 120.
- the volatile memory 110 may include a double data rate synchronous dynamic random-access memory (DDR SDRAM), High Bandwidth Memory (HBM), or any other type of memory.
- the volatile memory 110 may include the host memory.
- the non-volatile memory 120 may include Electrically Erasable Programmable Read-Only Memory (EEPROM), a solid-state drive (SSD), a flash memory, and so forth.
- the FPGA 115 can include blocks.
- the blocks may include a set of elementary nodes (also referred to as gates) performing basic hardware operations, such as Boolean operations.
- the blocks may further include registers retaining bit information, one or more memory storage of different sizes, and one or more digital signal processors (DSPs) to perform arithmetic computations, for example, additions and multiplications.
- Programming of FPGA 115 may include configuring each of the blocks to have an expected behavior and connecting the blocks by routing information between the blocks. Programming of FPGA 115 can be carried out using a result from a compiler taking as input a schematic description, a gate-level description, hardware languages like Verilog, SystemVerilog, or Very High Speed Integrated Circuit Hardware Description Language (VHDL), or any combination thereof.
- the non-volatile memory 120 may be configured to store instructions in a form of bit file 125 to be executed by the FPGA 115.
- the FPGA 115 can be configured by the instructions to perform one or more floating point operations including
- the volatile memory 110 may be configured to store weights W[i] for neurons of one or more ANNs, input values V[i] to be processed for the ANNs, and results of ANNs computation including any intermediate results of computations of layers of the
- FIG. 2 shows ANN 210, neuron 220, and transfer function 230, according to some example embodiments.
- the ANN 210 may include one or more input layers 240, one or more hidden layers 250, and one or more output layers 260.
- Each of the input layers, hidden layers, and output layers may include one or more (artificial) neurons 220. The number of neurons can be different for different layers.
- Each of neurons 220 may represent a calculation of a mathematical function $O = F\left(\sum_{i} V[i] \times W[i]\right)$, wherein
- V[i] are neuron input values,
- W[i] are weights assigned to the input values at the neuron, and
- F(X) is a transfer function.
- the transfer function 230 F(X) is selected to be zero for X < 0 and have a limit of zero as X approaches zero.
- the transfer function F(X) can be in the form of a sigmoid. The result of calculation of a neuron propagates as an input value of further neurons in the ANN.
- the further neurons can belong to either a next layer, a previous layer or the same layer.
- ANN 210 illustrated in FIG. 2 can be referred to as a feedforward neural network
- embodiments of the present disclosure can also be used in computations of convolutional neural networks, recurrent neural networks, long short-term memory networks, and other types of ANNs.
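The neuron calculation described for FIG. 2 can be illustrated with a short sketch, assuming the neuron output is the transfer function applied to the sum of products of input values V[i] and weights W[i]. The function names are hypothetical. The `rectifier` function is one function with the stated properties (zero for negative X, tending to zero as X approaches zero); the source does not name a specific function of this kind.

```python
import math
from typing import Callable, Sequence


def rectifier(x: float) -> float:
    """Assumed transfer function: zero for x < 0, tends to zero as x -> 0."""
    return x if x > 0 else 0.0


def sigmoid(x: float) -> float:
    """Sigmoid transfer function, mentioned in the text as an alternative form."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(values: Sequence[float],
                  weights: Sequence[float],
                  transfer: Callable[[float], float] = rectifier) -> float:
    """Apply the transfer function to the sum of products V[i] * W[i]."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    x = sum(v * w for v, w in zip(values, weights))
    return transfer(x)


print(neuron_output([1.0, 2.0], [0.5, 1.0]))   # sum is 2.5, passed through
print(neuron_output([1.0, 2.0], [0.5, -1.0]))  # sum is -1.5, clipped to 0.0
```

With a transfer function of this shape, a neuron whose weighted sum is negative emits exactly zero, which is why zero-valued inputs are common in the streams feeding later layers.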
- FIG. 3 is a flow chart showing training 310 and inference 325 of an ANN, according to some example embodiments.
- the training 310 (also known as learning) is a process of teaching ANN 305 to output a proper result based on a given set of training data 315.
- the process of training may include determining weights 320 of neurons of the ANN 305 based on training data 315.
- the training data 315 may include samples. Each of the samples may be represented as a pair of input values and an expected output.
- the training data 315 may include hundreds to millions of samples. While the training 310 is required to be performed only once, it may require a significant amount of computations and take a considerable time.
- the ANNs can be configured to solve different tasks including, for example, image recognition, speech
- the inference 325 is a process of computation of an ANN.
- the inference 325 uses the trained ANN weights 320 and new data 330 including new sets of input values. For each new set of input values, the computation of the ANN provides a new output which answers the problem that the ANN is supposed to solve.
- an ANN can be trained to recognize various animals in images.
- the ANN can be trained on millions of images of animals. Submitting a new image to the ANN would provide the information for animals in the new image (this process being known as image tagging). While the inference for each image takes fewer computations than training, the number of inferences can be large because new images can be received from billions of sources.
- the inference 325 includes multiple computations of sums of products $\sum_{i} V[i] \times W[i]$, wherein
- V[i] are new input values and W[i] are weights associated with neurons of the ANN.
- Some previous approaches for performing inference include inspection of the weights W[i] and replacing some of the weights W[i] with zero values if a value of the weight is relatively small when compared to other weights of the ANN. In FIG. 3, this process is shown as pruning 335.
- the pruning 335 generates new weights 340 that then can be used in inference 325 instead of the weights 320.
- An advantage of these approaches is that replacing the weights with zero values may allow decreasing the number of computations of the ANN, since multiplications by zero can be avoided in computations.
- A disadvantage of these approaches is that the ANN can become less accurate in producing a correct output due to lack of correspondence between the new weights 340 and training data 315 used in training of the ANN.
- Another disadvantage of these approaches is that the pruning of weights is not based on new input values and only allows avoiding operations with weights equal to zero.
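The pruning used by these previous approaches can be sketched as follows. The text only says that a weight is zeroed when it is relatively small compared to other weights, so the specific rule below (magnitude under a fraction of the largest magnitude) and the names `prune_weights` and `relative_threshold` are assumptions made for illustration.

```python
from typing import List, Sequence


def prune_weights(weights: Sequence[float],
                  relative_threshold: float = 0.1) -> List[float]:
    """Replace weights with zero when their magnitude is below
    relative_threshold times the largest magnitude (assumed rule)."""
    largest = max((abs(w) for w in weights), default=0.0)
    cutoff = relative_threshold * largest
    return [0.0 if abs(w) < cutoff else w for w in weights]


weights = [0.9, -0.02, 0.4, 0.05, -0.7]
print(prune_weights(weights))  # the two smallest-magnitude weights become 0.0
```

Note that the decision depends only on the weights, which are fixed after training. It never looks at the input values, which is the limitation the text points out.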
- the weights 320 may remain unchanged in inference 325, while
- Multiplications V[i] x W[i] are not carried out if a predetermined criterion is satisfied with respect to input value V[i]. For example, multiplication V[i] x W[i] can be skipped if the input value V[i] is substantially zero.
- the criterion used to select the operations to be done may be different from a comparison to zero, thereby allowing other operations to be avoided dynamically based on the input values V[i] and other values, for example, static values including weights.
- embodiments of present disclosure allow dynamic selection of operations to be performed.
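Because the criterion is not fixed to a comparison with zero, it can be modeled as a pluggable predicate. The sketch below is an assumption-laden illustration: the predicate signature (input value plus weight) and the names `dynamic_sum`, `select_all`, `substantially_nonzero`, and `both_nonzero` are hypothetical. It counts how many multiplications each criterion actually performs.

```python
from typing import Callable, Sequence, Tuple

# A criterion decides, per (input value, weight) pair, whether to multiply.
Criterion = Callable[[float, float], bool]


def select_all(value: float, weight: float) -> bool:
    """No selection: every multiplication is performed."""
    return True


def substantially_nonzero(eps: float) -> Criterion:
    """Skip input values whose magnitude is at most eps."""
    def criterion(value: float, weight: float) -> bool:
        return abs(value) > eps
    return criterion


def both_nonzero(value: float, weight: float) -> bool:
    """Use a static value (the weight) together with the input value."""
    return value != 0 and weight != 0


def dynamic_sum(values: Sequence[float], weights: Sequence[float],
                criterion: Criterion) -> Tuple[float, int]:
    """Accumulate V[i] * W[i] only where the criterion selects the pair."""
    total, performed = 0.0, 0
    for v, w in zip(values, weights):
        if criterion(v, w):
            total += v * w
            performed += 1
    return total, performed


values = [0.0, 2.0, 1e-9, 3.0]
weights = [1.0, 0.5, 4.0, 0.0]
for name, crit in [("select_all", select_all),
                   ("substantially_nonzero", substantially_nonzero(1e-6)),
                   ("both_nonzero", both_nonzero)]:
    print(name, dynamic_sum(values, weights, crit))
```

The `select_all` case performs every multiplication regardless of the data, while the other two criteria change the sequence of operations depending on the values seen at run time.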
- FIG. 4 is a block diagram showing a system 400 for accelerating ANN computation, according to some example embodiments.
- the system may include an arithmetic unit 425, a controller 415, and a selector 420.
- the controller 415 may receive a set {V[i_0], V[i_1], ..., V[i_{X-1}]} of X input values of data 405.
- the controller 415 may optionally receive further input values 406 which are different from the input values 405.
- the further input values 406 can be related to the neuron, the layer, the ANN, the weights, the operation to be carried or any other kind of values.
- the controller 415 may provide, based on the input values 405 and the further values 406, an indication to the selector 420 as to which of the X input values are to be selected in the stream.
- the controller 415 may also provide, to the arithmetic unit 425, an offset, an index, or bit enables of one or multiple selected value(s) in the set {V[i_0], V[i_1], ..., V[i_{X-1}]}.
- FIG. 5 is a block diagram showing a controller 415, according to some example embodiments.
- the controller 415 can compare the input values to reference value(s) (ref).
- the reference value(s) can be included in the further values 406. In some embodiments, the controller 415 may not use the further values 406. In some embodiments, the reference value(s) can be equal to zero.
- the controller 415 may provide the selector 420 with one or more index(es) of the input value to be selected. In other embodiments, the controller 415 can perform a selection of an input value based on different criteria. In some embodiments, one or multiple input values 405 and multiple further values 406 can be used for the selection. In certain
- the input value may be selected if the input value is less than a threshold.
- the selector 420 may receive the set of input values {V[i_0], V[i_1], ..., V[i_{X-1}]} and the indication from the controller 415 as to which of the input values to select.
- the selector 420 may select a value V[i] and provide the selected input value V[i] to the arithmetic unit 425.
- the information of the selected input value may be represented in any form.
- the controller 415 and the selector 420 may be implemented as a single unit configured to perform functionalities of both controller 415 and selector 420.
- the selector 420 can be also configured to select weights 410 based on indications from the controller 415.
- the arithmetic unit 425 can be configured to compute sums, multiplications, accumulations, or other operations.
- the arithmetic unit 425 may receive the selected value V[i] from the selector 420 and the offset or index of the selected value V[i] from the controller 415.
- the arithmetic unit 425 may be further configured to select, based on the offset, a weight W[i] corresponding to value V[i].
- the arithmetic unit 425 may further determine the product V[i] x W[i] and add the product to a corresponding sum. Because the multiplication is performed only for selected values of data and selected weights, the computation of the sum, and hence computation of the ANN, can be accelerated.
- the arithmetic unit 425 can determine products V[j]xW[k], wherein j and k are determined based on the input values 405, further input values 406, and the weights 410 specified by the controller 415. In some embodiments, the arithmetic unit 425 can perform further mathematical operations different from products and sums, independently, prior to, or after performing the products or sums.
- FIG. 6 is a block diagram showing a system 600 for selecting input values to be processed in an ANN computation, according to another example embodiment.
- the system may include a controller 415, an arithmetic unit 425, and a memory 610.
- the memory 610 can be configured to store a set of X input values {V[i_0], V[i_1], ..., V[i_{X-1}]}.
- the controller 415 can be configured to determine an address of an input value to be selected for multiplication in arithmetic unit 425.
- the arithmetic unit 425 may read the selected input value based on the address received from the controller 415.
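The address-based variant of FIG. 6 can be sketched with a plain list standing in for memory 610. The function names and the layout are assumptions for illustration only: input values are assumed to occupy a contiguous region starting at `base`, and the weight for an input is assumed to be found at the same position relative to `base`.

```python
from typing import List, Sequence


def controller_addresses(memory: Sequence[float], base: int, count: int) -> List[int]:
    """Hypothetical controller: return the addresses of the non-zero input
    values stored in memory[base : base + count]."""
    return [addr for addr in range(base, base + count) if memory[addr] != 0]


def arithmetic_unit(memory: Sequence[float], addresses: Sequence[int],
                    weights: Sequence[float], base: int) -> float:
    """Hypothetical arithmetic unit: read each selected input value by its
    address and multiply it by the weight at the same relative position."""
    total = 0.0
    for addr in addresses:
        total += memory[addr] * weights[addr - base]
    return total


memory = [9.0, 9.0, 0.0, 1.0, 0.0, 2.0]   # inputs live at addresses 2..5
weights = [0.5, 2.0, 0.5, 3.0]
addresses = controller_addresses(memory, base=2, count=4)
print(addresses)                                         # [3, 5]
print(arithmetic_unit(memory, addresses, weights, base=2))  # 8.0
```

Here no separate selector moves data. The controller only produces addresses, and unselected values are simply never read.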
- FIG. 7 is a block diagram showing a system 700 for synchronization of results of parallel calculations of neurons in the ANN, according to an example embodiment.
- the system 700 may include arithmetic units 710, a synchronization module 715, and a memory 720.
- the arithmetic units 710 can process input values of different data 705.
- the data 705 can be related to each other or be in part or generally the same input data.
- the data 705 may represent red, green, and blue color components of the color images.
- the data 705 may represent input values from different receptive fields selected by filters. Because the amount of input values selected for processing by different arithmetic units 710 can be different, arithmetic units 710 may finish processing of the input data at different rates.
- results of neuron computations (performed in parallel) within a layer of the ANN can be unaligned.
- data_0 and data_1 of the data 705 may include equal numbers of input values.
- the number and indexes of input values selected from data_0 to compute the i-th neuron output by a first arithmetic unit of the arithmetic units 710 can be different from the number and indexes of input values selected from data_1 to compute the i-th neuron output by a second arithmetic unit of the arithmetic units 710.
- Either 1) the i-th neuron output of the first arithmetic unit can be generated substantially prior to the i-th neuron output of the second arithmetic unit; or 2) the i-th neuron output of the first arithmetic unit can be generated substantially after the i-th neuron output of the second arithmetic unit. Therefore, a first stream of neuron outputs from the first arithmetic unit can be substantially unaligned with a second stream of neuron outputs from the second arithmetic unit even when the first arithmetic unit and the second arithmetic unit are configured to process data_0 and data_1 in parallel.
- the first stream of neuron outputs and the second stream of neuron outputs can also be unaligned due to the difference in the number of neuron outputs from the first arithmetic unit and the second arithmetic unit because the first arithmetic unit and the second arithmetic unit can be configured to generate different numbers of neuron outputs based on the same number of input values.
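A small timing model illustrates why the streams drift apart. It rests on simplifying assumptions that are not from the source: each arithmetic unit performs one multiplication per cycle, zero-valued inputs are skipped at no cost, and the name `completion_times` is hypothetical.

```python
from typing import List, Sequence


def completion_times(input_sets: Sequence[Sequence[float]]) -> List[int]:
    """For one arithmetic unit, return the cycle at which each neuron output
    becomes available, assuming one multiplication per cycle and that
    zero-valued inputs are skipped."""
    times, clock = [], 0
    for inputs in input_sets:
        clock += sum(1 for v in inputs if v != 0)
        times.append(clock)
    return times


# Both units receive the same number of input values per neuron output.
data_0 = [[1, 0, 0, 2], [0, 0, 0, 3], [4, 5, 6, 7]]
data_1 = [[1, 2, 3, 4], [5, 0, 6, 7], [0, 0, 0, 8]]
print(completion_times(data_0))  # [2, 3, 7]
print(completion_times(data_1))  # [4, 7, 8]
```

In this model the first unit delivers its first two outputs before the second unit has delivered one, even though both process equally sized inputs in parallel.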
- the synchronization module 715 can be configured to re-align outputs from arithmetic units 710.
- the synchronization module can receive neuron outputs from arithmetic units 710, one neuron output at a time, and write the neuron outputs to memory storage 720 in proper order.
- the synchronization module 715 may receive partial results of neurons or multiple results of different neurons.
- the neuron outputs may be received by the synchronization module 715 in an unordered time sequence O[0,0], O[0,1], O[1,0], O[0,2], O[1,1], ... .
- the synchronization module 715 can write the neuron outputs in a proper order: O[0,0], O[1,0], ..., O[N,0], O[0,1], O[1,1], ..., O[N,1], O[0,2], O[1,2], ... .
- the neuron outputs can be read in a proper order as inputs for neurons at further layer(s) of the ANN.
- Data 705 may include labels 730 indicating the last input value of the stream for input values.
- the arithmetic units 710 can be configured to send, to the synchronization module 715, an indication 735 that the last part of the input data is being processed. Upon receiving the indication 735, the synchronization module 715 may finish writing neuron output to memory storage 720 and stand by ready for the next series of neuron outputs.
- the indication 730 can be indicative of the first part of new input data being processed rather than the last part of the input data. In another embodiment, the indication may include a number of input values to be processed. The indication can be part of the data or separate information, as shown in FIG. 7 with marks 730 and 735.
- FIG. 8 is a flow chart illustrating a method 800 for accelerating ANN computations, in accordance with some example embodiments.
- the operations may be combined, performed in parallel, or performed in a different order.
- the method 800 may also include additional or fewer operations than those illustrated.
- the method 800 may be performed by system 100 described above with reference to FIG. 1.
- the method 800 may select, by a controller communicatively coupled to a selector and an arithmetic unit and based on a criterion, an input value from a stream of input values of a neuron.
- the controller can be configured to select input values by comparison of the input values to a reference value.
- the reference value can be equal to zero.
- the method 800 may configure, by the controller, the selector to provide the selected input value to the arithmetic unit.
- the method 800 may provide, by the controller, to the arithmetic unit, information for the selected input value.
- the information may include an offset of the selected input values in the stream.
- the method 800 may acquire, by the arithmetic unit and based on the information, a weight from a set of weights.
- the method 800 may perform, by the arithmetic unit, a mathematical operation on the selected input value and the weight to obtain a result, wherein the result is to be used in computing an output of the neuron.
- the arithmetic unit may determine a multiplication product of the selected input value and the weight, and add the multiplication product to a sum.
- FIG. 9 illustrates an example computing system 900 that may be used to implement embodiments described herein.
- the example computing system 900 of FIG. 9 may include one or more processors 910 and memory 920.
- Memory 920 may store, in part, instructions and data for execution by the one or more processors 910.
- Memory 920 can store the executable code when the exemplary computing system 900 is in operation.
- the processor 910 may include internal accelerators like a graphical processing unit, a Field Programmable Gate Array, or similar accelerators that may be suitable for use with embodiments described herein.
- the memory 920 may include internal accelerators like a graphical processing unit, a Field Programmable Gate Array, or similar accelerators that may be suitable for use with embodiments described herein.
- the example computing system 900 of FIG. 9 may further include a mass storage 930, portable storage 940, one or more output devices 950, one or more input devices 960, a network interface 970, and one or more peripheral devices 980.
- the components shown in FIG. 9 are depicted as being connected via a single bus 990.
- the components may be connected through one or more data transport means.
- the one or more processors 910 and memory 920 may be connected via a local microprocessor bus, and the mass storage 930, one or more peripheral devices 980, portable storage 940, and network interface 970 may be connected via one or more input/output buses.
- Mass storage 930 which may be implemented with a magnetic disk drive, an optical disk drive or a solid state drive, is a non-volatile storage device for storing data and instructions for use by a magnetic disk, an optical disk drive or SSD, which in turn may be used by one or more processors 910. Mass storage 930 can store the system software for implementing embodiments described herein for purposes of loading that software into memory 920.
- the mass storage 930 may also include internal accelerators like a graphical processing unit, a Field Programmable Gate Array, or similar
- Portable storage 940 may operate in conjunction with a portable non-volatile storage medium, such as a compact disk (CD) or digital video disc (DVD), to input and output data and code to and from the computing system 900 of FIG. 9.
- the system software for implementing embodiments described herein may be stored on such a portable medium and input to the computing system 900 via the portable storage 940.
- One or more input devices 960 provide a portion of a user interface.
- the one or more input devices 960 may include an alphanumeric keypad, such as a keyboard, for inputting alphanumeric and other information, or a pointing device, such as a mouse, a trackball, a stylus, or cursor direction keys.
- the computing system 900 as shown in FIG. 9 includes one or more output devices 950. Suitable one or more output devices 950 include speakers, printers, network interfaces, and monitors.
- Network interface 970 can be utilized to communicate with external devices, external computing devices, servers, and networked systems via one or more
- Network interface 970 may be a network interface card, such as an Ethernet card, optical transceiver, radio frequency transceiver, or any other type of device that can send and receive information.
- Other examples of such network interfaces may include Bluetooth®, 3G, 4G, and WiFi® radios in mobile computing devices as well as a USB.
- One or more peripheral devices 980 may include any type of computer support device to add additional functionality to the computing system.
- the one or more peripheral devices 980 may include a modem or a router.
- the example computing system 900 of FIG. 9 may also include one or more accelerator devices 985.
- the accelerator devices 985 may include PCIe-form-factor boards or storage-form-factor boards, or any electronic board equipped with a specific electronic component like a Graphical Processing Unit, a Neural Processing Unit, a Multi-CPU component, a Field Programmable Gate Array component, or similar electronic or photonic accelerator components that may be suitable for use with embodiments described herein.
- the components contained in the exemplary computing system 900 of FIG. 9 are those typically found in computing systems that may be suitable for use with embodiments described herein and are intended to represent a broad category of such computer components that are well known in the art.
- the exemplary computing system 900 of FIG. 9 can be a personal computer, hand held computing device, telephone, mobile computing device, workstation, server, minicomputer, mainframe computer, or any other computing device.
- the computer can also include different bus configurations, networked platforms, multi-processor platforms, and so forth.
- Various operating systems (OS) can be used including UNIX, Linux, Windows, Macintosh OS, Palm OS, and other suitable operating systems.
- Some of the above-described functions may be composed of instructions that are stored on storage media (e.g., computer-readable medium).
- the instructions may be retrieved and executed by the processor.
- storage media are memory devices, tapes, disks, and the like.
- the instructions are operational when executed by the processor to direct the processor to operate in accord with the example
- Non-volatile media include, for example, optical or magnetic disks, such as a fixed disk.
- Volatile media include dynamic memory, such as RAM.
- Transmission media include coaxial cables, copper wire, and fiber optics, among others, including the wires that include one embodiment of a bus.
- Transmission media can also take the form of acoustic or light waves, such as those generated during radio frequency and infrared data communications.
- Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, SSD, a CD-read-only memory (ROM) disk, DVD, any other optical medium, any other physical medium with patterns of marks or holes, a RAM, a PROM, an EPROM, an EEPROM, a FLASHEPROM, any other memory chip or cartridge, a carrier wave, or any other medium from which a computer can read.
- Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a CPU for execution.
- a bus carries the data to system RAM, from which a CPU retrieves and executes the instructions.
- the instructions received by system RAM can optionally be stored on a fixed disk either before or after execution by a CPU.
- the instructions or data may not be used by the CPU but be accessed in writing or reading from the other devices without having the CPU directing them.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Neurology (AREA)
- Complex Calculations (AREA)
Abstract
Systems and methods for synchronizing streams of neuron outputs are provided. An example method includes generating, by an arithmetic unit, a stream of neuron outputs, generating, by at least one further arithmetic unit, a further stream of further neuron outputs, receiving, by a synchronization module communicatively coupled to the arithmetic unit and the further arithmetic unit, the stream of neuron outputs from the arithmetic unit and the further stream of further neuron outputs from the further arithmetic unit, wherein the stream of the neuron outputs and the stream of further neuron outputs are unaligned with respect to each other, and reordering, by the synchronization module, the neuron outputs and the further neuron outputs to obtain an ordered sequence.
Description
REALIGNING STREAMS OF NEURON OUTPUTS IN ARTIFICIAL NEURAL NETWORK COMPUTATIONS
TECHNICAL FIELD
[0001] The present disclosure relates generally to data processing and, more particularly, to systems and methods for accelerating artificial neural network computations.
BACKGROUND
[0002] Artificial Neural Networks (ANNs) are simplified and reduced models reproducing the behavior of the human brain. The human brain contains 10-20 billion neurons connected through synapses. Electrical and chemical messages are passed from neuron to neuron based on input information and the neurons' resistance to passing information. In the ANNs, a neuron can be represented by a node performing a simple operation of addition coupled with a saturation function. A synapse can be represented by a connection between two nodes. Each of the connections can be associated with an operation of multiplication by a constant. The ANNs are particularly useful for solving problems that cannot be easily solved by classical computer programs.
[0003] While forms of the ANNs may vary, they all have the same basic elements similar to the human brain. A typical ANN can be organized into layers, and each of the layers may include many neurons sharing similar functionality. The inputs of a layer may come from a previous layer, multiple previous layers, any other layers, or even the layer itself. Major architectures of ANNs include Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM) network, but other architectures of ANN can be developed for specific applications. While some operations have a natural sequence, for example a layer depending on previous layers, most of the operations can be carried out in parallel within the same layer. The ANNs can then be computed in parallel on many different computing elements similar to neurons of the brain. A single ANN may have hundreds of layers. Each of the layers can involve millions of connections. Thus, a single ANN may potentially require billions of simple operations like multiplications and additions.
[0004] Because of the large number of operations and their parallel nature, ANNs can result in a very heavy load for processing units (e.g., CPUs), even ones running at high rates. Sometimes, to overcome limitations of CPUs, graphics processing units (GPUs) can be used to process large ANNs because GPUs have a much higher throughput capacity of operations in comparison to CPUs. Because this approach solves, at least partially, the throughput limitation problem, GPUs appear to be more efficient in the computations of ANNs than the CPUs. However, GPUs are not well suited to the computations of ANNs because the GPUs have been specifically designed to compute graphical images.
[0005] The GPUs may provide a certain level of parallelism in computations. However, the GPUs constrain the computations in long pipes, implying latency and lack of reactivity. To deliver the maximum throughput, very large GPUs can be used, which may involve excessive power consumption, a typical issue of GPUs. Since the GPUs may require more power for the computations of ANNs, the deployment of GPUs can be difficult.
[0006] To summarize, CPUs provide a very generic engine that can execute very few sequences of instructions with a minimum effort in terms of programming, but lack the computing power for ANNs. GPUs are slightly more parallel and require a larger programming effort than CPUs, which can be hidden behind libraries with some performance costs, but are not very well suited to ANNs.
[0007] Field Programmable Gate Arrays (FPGAs) are professional components that can be programmed at the hardware level after they are manufactured. The FPGAs can be configured to perform computations in parallel. Therefore, FPGAs can be well suited to compute ANNs. One of the challenges of FPGAs is the programming, which requires a much larger effort than programming CPUs and GPUs. Adaptation of FPGAs to perform ANN computations can be more challenging than for CPUs and GPUs.
[0008] Most attempts at programming FPGAs to compute ANNs have focused on a specific ANN or a subset of ANNs, have required modifying the ANN structure to fit into a specific limited accelerator, or have provided a basic functionality without solving the problem of computing ANNs on FPGAs globally. The computation scale is typically not taken into account in existing FPGA solutions, much of the research being limited to a single or few computation engines, which could be replicated. The existing FPGA solutions do not solve the problem of massive data movement required at large scale for the actual ANNs involved in real industrial applications. The inputs to be computed with an ANN are typically provided by an artificial intelligence (AI) framework. Those programs are used by the AI community to develop new ANNs or global solutions based on ANNs. FPGAs also lack integration into those software environments.
SUMMARY
[0009] This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
[0010] Provided are computer-implemented systems and methods for accelerating ANN computations.
[0011] According to one example embodiment, a system for accelerating ANN computation is provided. The system may include a controller, a selector communicatively coupled to the controller, and an arithmetic unit communicatively coupled to the controller and the selector. The controller can be configured to dynamically control the selection, based on a criterion, of an input value from a stream of input values of a neuron. The controller can configure the selector to provide, dynamically, the selected input value to the arithmetic unit. The controller can provide, to the arithmetic unit, information for the selected input value. The information may include an offset of the input value in the stream. The offset cannot be statically computed before the computation of the neuron because the input values in the stream can be generated as a result of previous neuron computation. The arithmetic unit can be configured to acquire, based on the offset, a weight from a set of weights. The arithmetic unit can further perform a mathematical operation on the selected input value and the weight to obtain a result to be used in computing an output of the neuron.
[0012] The mathematical operation performed by the arithmetic unit in its simplest form may include a multiplication product. The criterion applied by the controller may include comparison of the input value to a reference value. The comparison may include an equality comparison to zero. The comparison may also include a more-than comparison to a threshold.
[0013] The selector may include a multiplexer. The selector may also include information related to a memory address. Input values not selected from the subset of input values are not provided to the arithmetic unit, therefore, a count of mathematical operations performed by the arithmetic unit on selected input values from the subset of input values is less than a count of mathematical operations required to be performed by the arithmetic unit on all input values from the subset of the input values.
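The selection scheme summarized in paragraphs [0011] to [0013] can be sketched in software as follows. This is a minimal Python illustration, not the disclosed hardware: the function names `select_inputs` and `accumulate`, the use of a plain list as the set of weights, and the default nonzero criterion are assumptions made for the example.

```python
from typing import Callable, Iterable, List, Tuple


def select_inputs(
    stream: Iterable[float],
    criterion: Callable[[float], bool] = lambda v: v != 0.0,
) -> List[Tuple[int, float]]:
    """Controller/selector role (hypothetical): keep (offset, value) pairs
    for the input values that satisfy the criterion."""
    return [(offset, v) for offset, v in enumerate(stream) if criterion(v)]


def accumulate(
    selected: List[Tuple[int, float]], weights: List[float]
) -> Tuple[float, int]:
    """Arithmetic-unit role (hypothetical): acquire each weight by offset
    and sum the products. Returns the sum and the multiplication count."""
    total = 0.0
    ops = 0
    for offset, value in selected:
        total += value * weights[offset]  # weight looked up via the offset
        ops += 1
    return total, ops


values = [0.0, 2.0, 0.0, 0.0, 1.5, 0.0]
weights = [0.1, 0.5, 0.9, 0.3, 2.0, 0.7]

selected = select_inputs(values)            # only nonzero inputs survive
total, ops = accumulate(selected, weights)  # 2 multiplications instead of 6
```

Because the skipped inputs are zero, the sum matches what a dense computation over all six inputs would give, while the arithmetic unit performs fewer operations. Passing a different `criterion`, for example `lambda v: v > 1.8`, models the threshold comparison mentioned in paragraph [0012].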
[0014] According to another example embodiment, a system for realigning streams of neuron outputs is provided. The system may include an arithmetic unit, at least one further arithmetic unit, and a synchronization module communicatively coupled to the arithmetic unit and the further arithmetic unit. The arithmetic unit can generate a stream of neuron outputs and the further arithmetic unit can generate a further stream of further neuron outputs. The synchronization module can receive the stream of neuron outputs from the arithmetic unit and the further stream of further neuron outputs from the further arithmetic unit. The stream of the neuron outputs and the stream of further neuron outputs can be unaligned with respect to each other. The synchronization module can dynamically reorder the neuron outputs and the further neuron outputs to obtain an ordered sequence.
[0015] Reordering may include positioning an i-th neuron output from the stream before an i-th further neuron output from the further stream for i=1, ..., N, wherein N is a number of neuron outputs in the stream.
[0016] The synchronization module can write the ordered sequence to a local or global memory storage. The ordered sequence can be used as further input values for further computations required by the ANN.
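The reordering described in paragraphs [0014] to [0016] can be illustrated with a small software model. The sketch below is only one possible reading: it assumes that every output arrives tagged with its arithmetic unit and its neuron index, that all units produce the same number of outputs, and that a Python list can stand in for the memory storage. The class name `SynchronizationModule` and its fields are hypothetical.

```python
from typing import Dict, List, Tuple


class SynchronizationModule:
    """Buffers out-of-order outputs and writes them in index-major order:
    output i of unit 0, output i of unit 1, ..., then output i+1 of unit 0."""

    def __init__(self, num_units: int) -> None:
        self.num_units = num_units
        self.pending: Dict[Tuple[int, int], float] = {}  # (unit, index) -> value
        self.storage: List[Tuple[int, int, float]] = []  # stand-in for memory
        self.next_unit = 0
        self.next_index = 0

    def receive(self, unit: int, index: int, value: float) -> None:
        """Accept one neuron output, then flush everything now in order."""
        self.pending[(unit, index)] = value
        while (self.next_unit, self.next_index) in self.pending:
            key = (self.next_unit, self.next_index)
            self.storage.append((key[0], key[1], self.pending.pop(key)))
            self.next_unit += 1
            if self.next_unit == self.num_units:
                self.next_unit = 0
                self.next_index += 1


sync = SynchronizationModule(num_units=2)
# Unaligned arrival order: unit 0 runs ahead of unit 1.
arrivals = [(0, 0, 0.1), (0, 1, 0.2), (1, 0, 0.3), (0, 2, 0.4), (1, 1, 0.5)]
for unit, index, value in arrivals:
    sync.receive(unit, index, value)

written_order = [(unit, index) for unit, index, _ in sync.storage]
```

Outputs that arrive early wait in `pending` until every output that precedes them in the ordered sequence has been written, so the storage can later be read sequentially as input for further layers.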
[0017] The neuron outputs can be computed by the arithmetic unit based on first input values. The first input values can be selected from a stream of input values. The further neuron outputs can be computed by the further arithmetic unit based on second input values. The second input values can be selected from a further stream of further input values. A count of the input values in the stream can be equal to a count of further input values in the further stream. First indexes of the first input values in the stream of input values can be different from second indexes of the second input values in the further stream of further input values. The difference in the first indexes and the second indexes may cause either an i-th neuron output in the stream of the neuron outputs to be generated substantially prior to an i-th further neuron output in the further stream of further neuron outputs or the i-th neuron output in the stream to be generated substantially after the i-th further neuron output in the further stream for at least one of i=1, ..., N, wherein N is a number of neuron outputs in the stream.
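To see why input streams of equal length can still yield unaligned output streams, consider a toy timing model. The model assumes, purely for illustration, that an arithmetic unit spends one cycle per selected (nonzero) input value, skips zeros at no cost, and emits one neuron output per fixed-size group of inputs. Neither this cycle cost nor the name `completion_cycles` comes from the disclosure.

```python
from typing import List


def completion_cycles(stream: List[float], inputs_per_neuron: int) -> List[int]:
    """Return the cycle at which each neuron output would be ready.

    Assumption: one cycle per nonzero input; zero inputs cost nothing.
    """
    cycles: List[int] = []
    elapsed = 0
    for start in range(0, len(stream), inputs_per_neuron):
        group = stream[start:start + inputs_per_neuron]
        elapsed += sum(1 for v in group if v != 0.0)
        cycles.append(elapsed)
    return cycles


# Two streams with the same number of input values but different zero positions.
data_0 = [1.0, 0.0, 0.0, 0.0, 2.0, 3.0, 4.0, 5.0]
data_1 = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 6.0]

first = completion_cycles(data_0, inputs_per_neuron=4)
second = completion_cycles(data_1, inputs_per_neuron=4)
```

Under these assumptions the first unit finishes its first output earlier than the second unit but finishes its second output later, which is the kind of misalignment a synchronization module would have to absorb.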
[0018] According to yet another example embodiment, a method for accelerating ANN computation is provided. The method may include dynamically selecting, by a controller communicatively coupled to a selector and an arithmetic unit and based on a criterion, an input value from a stream of input values of a neuron of an ANN. The method may include dynamically configuring, by the controller, the selector to provide the selected input value to the arithmetic unit. The method may also include dynamically providing, by the controller to the arithmetic unit, an offset of the selected input value. The method may include dynamically acquiring, by the arithmetic unit and based on the offset, a weight from a set of weights. The method may further include performing, by the arithmetic unit, a mathematical operation on the selected input value and the weight to obtain a result, wherein the result is to be used to compute an output of the neuron.
[0019] Additional objects, advantages, and novel features will be set forth in part in the detailed description section of this disclosure, which follows, and in part will become apparent to those skilled in the art upon examination of this specification and the accompanying drawings or may be learned by production or operation of the example embodiments. The objects and advantages of the concepts may be realized and attained by means of the methodologies, instrumentalities, and combinations particularly pointed out in the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] Embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and, in which:
[0022] FIG. 1 is a block diagram showing an example system wherein a method for acceleration of ANN computations can be implemented, according to some example embodiments.
[0023] FIG. 2 shows an ANN, neuron, and transfer function, according to an example embodiment.
[0024] FIG. 3 is a flow chart showing training and inference of ANN, according to some example embodiments.
[0025] FIG. 4 is a block diagram showing a system for acceleration of ANN computations, according to some example embodiments.
[0026] FIG. 5 is a block diagram showing a system for selecting input values for processing by ANN computations, according to an example embodiment.
[0027] FIG. 6 is a block diagram showing a system for selecting input data for processing by ANN computations, according to another example embodiment.
[0028] FIG. 7 is a block diagram showing a system for synchronization of outputs of neurons in ANN computations, according to another example embodiment.
[0029] FIG. 8 is a flow chart showing steps of a method for acceleration of ANN computations, according to some example embodiments.
[0030] FIG. 9 shows a computing system that can be used to implement embodiments of the disclosed technology.
DETAILED DESCRIPTION
[0031] The following detailed description includes references to the accompanying drawings, which form a part of the detailed description. The drawings show illustrations in accordance with exemplary embodiments. These exemplary embodiments, which are also referred to herein as "examples," are described in enough detail to enable those skilled in the art to practice the present subject matter. The embodiments can be combined, other embodiments can be utilized, or structural, logical, and electrical changes can be made without departing from the scope of what is claimed. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope is defined by the appended claims and their equivalents.
[0032] For purposes of this document, the terms "or" and "and" shall mean "and/or" unless stated otherwise or clearly intended otherwise by the context of their use. The term "a" shall mean "one or more" unless stated otherwise or where the use of "one or more" is clearly inappropriate. The terms "comprise," "comprising," "include," and "including" are interchangeable and not intended to be limiting. For example, the term "including" shall be interpreted to mean "including, but not limited to."
[0033] Embodiments of this disclosure are concerned with methods and systems for acceleration of ANN computations. Embodiments of the present disclosure may facilitate selection of input values for processing by neurons of an ANN in order to avoid unnecessary mathematical operations in the computation of neuron outputs and, thereby, accelerate computations of the ANN. In one example embodiment, the input values equal to zero are not processed by arithmetic units configured to compute the neurons of the ANN. The selection of input values can also be based on another criterion. The sequence of operations performed by the arithmetic units may depend dynamically on the stream of input values and the criterion used for selection. For example, a select-all (no selection) criterion would result in an identical sequence of operations for all inputs, wherein essentially all operations of the neurons (and the ANN) are performed. In this case, the computation of the ANN will be similar to ANN computations in existing solutions, which do not expose dynamic behavior.
[0034] While some embodiments of the present disclosure are described herein in reference to operations of FPGAs, the present technology may also be practiced with application-specific integrated circuits (ASICs), programmable logic devices, transistor-based circuits, or various combinations thereof. The methods described herein can also be implemented by hardware modules, software modules, or combinations of both. The methods can also be embodied in computer-readable instructions stored on computer-readable media.
[0035] The term "module" shall be construed to mean a hardware device, software, or a combination of both. For example, a hardware-based module can use one or more microprocessors, FPGAs, application-specific integrated circuits (ASICs), programmable logic devices, transistor-based circuits, or various combinations thereof. Software-based modules can constitute computer programs, computer program procedures, computer program functions, and the like. In addition, a module of a system can be implemented by a computer or server, or by multiple computers or servers interconnected into a network. Alternatively, module may also refer to a subpart of a computer system, a hardware device, an integrated circuit, or a computer program.
[0036] Technical effects of certain embodiments of the present disclosure can include configuring integrated circuits, FPGAs, or computer systems to perform ANN computations without execution of redundant and unnecessary mathematical operations, thereby accelerating the ANN computations. Further technical effects of some embodiments of the present disclosure can facilitate configuration of integrated circuits, FPGAs, or computer systems to dynamically qualify data on which mathematical operations are to be performed in the ANN computations. Yet further technical effects of embodiments of the present disclosure include configuration of integrated circuits, FPGAs, or computer systems to dynamically align results of neuron computations performed in parallel by multiple processing units.
[0037] Referring now to the drawings, exemplary embodiments are described. The drawings are schematic illustrations of idealized example embodiments. Thus, the example embodiments discussed herein should not be construed as limited to the particular illustrations presented herein, rather these example embodiments can include deviations and differ from the illustrations presented herein.
[0038] FIG. 1 is a block diagram showing an example system 100, wherein a method for accelerating ANN computation can be implemented, according to some example embodiments. The system 100 can be part of a computing system, such as a personal computer, a server, a cloud-based computing resource, and the like. The system 100 may include one or more FPGA boards 105 and a chipset 135 including at least one CPU. The chipset 135 can be communicatively connected to the FPGA boards 105 via a communication interface. The communication interface may include a Peripheral Component Interconnect Express (PCIE) standard 130. The communication interface may also include an Ethernet connection 131.
[0039] The FPGA board 105 may include an FPGA 115, a volatile memory 110, and a non-volatile memory 120. The volatile memory 110 may include a double data rate synchronous dynamic random-access memory (DDR SDRAM), High Bandwidth Memory (HBM), or any other type of memory. The volatile memory 110 may include the host memory. The non-volatile memory 120 may include Electrically Erasable Programmable Read-Only Memory (EEPROM), a solid-state drive (SSD), a flash memory, and so forth.
[0040] The FPGA 115 can include blocks. The blocks may include a set of elementary nodes (also referred to as gates) performing basic hardware operations, such as Boolean operations. The blocks may further include registers retaining bit information, one or more memory storages of different sizes, and one or more digital signal processors (DSPs) to perform arithmetic computations, for example, additions and multiplications. Programming of FPGA 115 may include configuring each of the blocks to have an expected behavior and connecting the blocks by routing information between the blocks. Programming of FPGA 115 can be carried out using a result from a compiler taking as input a schematic description, a gate-level description, hardware languages like Verilog, System Verilog, or Very High Speed Integrated Circuit Hardware Description Language (VHDL), or any combination thereof.
[0041] The non-volatile memory 120 may be configured to store instructions in the form of a bit file 125 to be executed by the FPGA 115. The FPGA 115 can be configured by the instructions to perform one or more floating point operations, including multiplication and addition, to calculate a sum of products that can be used in neural network computations.
[0042] The volatile memory 110 may be configured to store weights W[i] for neurons of one or more ANNs, input values V[i] to be processed for the ANNs, and results of ANN computations, including any intermediate results of computations of layers of the ANNs.
[0043] FIG. 2 shows ANN 210, neuron 220, and transfer function 230, according to some example embodiments. The ANN 210 may include one or more input layers 240, one or more hidden layers 250, and one or more output layers 260. Each of the input layers, hidden layers, and output layers may include one or more (artificial) neurons 220. The number of neurons can be different for different layers.
[0044] Each of neurons 220 may represent a calculation of a mathematical function
$$O = F\left(\sum_{i} V[i] \times W[i]\right)$$
[0045] wherein V[i] are neuron input values, W[i] are weights assigned to the input values at the neuron, and F(X) is a transfer function. Typically, the transfer function 230 F(X) is selected to be zero for X < 0 and have a limit of zero as X approaches zero. For example, the transfer function F(X) can be in the form of a sigmoid. The result of calculation of a neuron propagates as an input value of further neurons in the ANN. The further neurons can belong to either a next layer, a previous layer or the same layer.
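The neuron calculation described above can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the function names `sigmoid` and `neuron_output` are hypothetical, and the sigmoid is used only because the text names it as one possible form of F(X).

```python
import math


def sigmoid(x: float) -> float:
    """One possible transfer function F(X); the text mentions a sigmoid as an example."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(values, weights, transfer=sigmoid):
    """Compute F(sum_i V[i] * W[i]) for a single neuron.

    `values` plays the role of V[i] and `weights` the role of W[i].
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    x = sum(v * w for v, w in zip(values, weights))
    return transfer(x)


if __name__ == "__main__":
    # Two input values with their weights; the result feeds further neurons.
    print(neuron_output([1.0, 2.0], [0.5, -0.25]))
```

Passing a different callable as `transfer` swaps the transfer function without changing the sum of products.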
[0046] It should be noted that while the ANN 210 illustrated in FIG. 2 can be referred to as a feedforward neural network, embodiments of the present disclosure can be also used in computations of convolution neural networks, recurrent neural networks, long short-term memory networks, and other types of ANNs.
[0047] FIG. 3 is a flow chart showing training 310 and inference 325 of an ANN, according to some example embodiments. The training 310 (also known as learning) is a process of teaching ANN 305 to output a proper result based on a given set of training data 315. The process of training may include determining weights 320 of neurons of the ANN 305 based on training data 315. The training data 315 may include samples. Each of the samples may be represented as a pair of input values and an expected output. The training data 315 may include hundreds to millions of samples. While the training 310 is required to be performed only once, it may require a significant amount of computations and take a considerable time. The ANNs can be configured to solve different tasks including, for example, image recognition, speech recognition, handwriting recognition, machine translation, social network filtering, video games, medical diagnosis, and so forth.
[0048] The inference 325 is a process of computation of an ANN. The inference 325 uses the trained ANN weights 320 and new data 330 including new sets of input values. For each new set of input values, the computation of the ANN provides a new output which answers the problem that the ANN is supposed to solve. For example, an ANN can be trained to recognize various animals in images. Correspondingly, the ANN can be trained on millions of images of animals. Submitting a new image to the ANN would provide the information for animals in the new image (this process being known as image tagging). While the inference for each image takes fewer computations than training, the number of inferences can be large because new images can be received from billions of sources.
[0049] The inference 325 includes multiple computations of sum of products:
$$\sum_{i} V[i] \times W[i]$$
[0050] wherein the V[i] are new input values and W[i] are weights associated with neurons of ANN. Some previous approaches for performing inference include inspection of the weights W[i] and replacing some of the weights W[i] with zero values if a value of the weight is relatively small when compared to other weights of the ANN. In FIG. 3, this process is shown as pruning 335. The pruning 335 generates new weights 340 that then can be used in inference 325 instead of the weights 320. An advantage of these approaches is that replacing the weights with zero values may allow decreasing the number of computations of the ANN, since multiplications by zero can be avoided in computations. A disadvantage of these approaches is that the ANN can become less accurate in producing a correct output due to a lack of correspondence between the new weights 340 and the training data 315 used in training of the ANN. Another disadvantage of these approaches is that the pruning of weights is not based on new input values and allows avoiding only operations with weights equal to zero.
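The pruning approach described above can be sketched as follows. This is an illustrative assumption rather than the method of any cited approach: the parameter `relative_threshold` and the rule "zero a weight whose magnitude is below a fraction of the largest magnitude" are hypothetical stand-ins for "relatively small when compared to other weights".

```python
def prune_weights(weights, relative_threshold=0.1):
    """Return new weights with relatively small entries replaced by zero.

    Hypothetical rule: a weight is zeroed when its magnitude is below
    `relative_threshold` times the largest magnitude in `weights`.
    """
    largest = max((abs(w) for w in weights), default=0.0)
    cutoff = relative_threshold * largest
    return [0.0 if abs(w) < cutoff else w for w in weights]


def sum_of_products_skipping_zero_weights(values, weights):
    """Sum V[i] * W[i], skipping multiplications whose weight is zero.

    Returns the sum and the number of multiplications actually performed.
    """
    total = 0.0
    multiplications = 0
    for v, w in zip(values, weights):
        if w == 0.0:
            continue  # multiplication by a pruned weight is avoided
        total += v * w
        multiplications += 1
    return total, multiplications


if __name__ == "__main__":
    new_weights = prune_weights([1.0, 0.05, -0.5, 0.01])
    print(new_weights)
    print(sum_of_products_skipping_zero_weights([1.0, 1.0, 1.0, 1.0], new_weights))
```

Note that the skipped operations are fixed once the weights are pruned; they do not depend on the input values, which is the limitation the text points out.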
[0051] In contrast to previous approaches, in some embodiments of the present disclosure, the weights 320 may remain unchanged in inference 325, while multiplication by zero can be avoided by inspecting input values V[i]. Multiplications V[i] x W[i] are not carried out if a predetermined criterion is satisfied with respect to input value V[i]. For example, multiplication V[i] x W[i] can be skipped if the input value V[i] is substantially zero. In some other embodiments of the present disclosure, the criterion used to select the operations to be done may be different from a comparison to zero, thereby allowing other operations to be avoided dynamically based on the input values V[i] and other values, for example, static values including weights. Thus, in contrast to previous approaches, embodiments of the present disclosure allow dynamic selection of operations to be performed.
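A minimal Python sketch of the input-based skipping described above is given below. The function name, the `skip` predicate, and the tolerance `eps` are assumptions introduced for illustration; the disclosure only states that a multiplication is not carried out when a predetermined criterion on V[i] is satisfied.

```python
def sum_of_products_skipping_inputs(values, weights, skip=lambda v: v == 0):
    """Sum V[i] * W[i], skipping every term whose input value satisfies `skip`.

    The weights are left unchanged; the decision is made per input value.
    Returns the sum and the number of multiplications actually performed.
    """
    total = 0.0
    performed = 0
    for v, w in zip(values, weights):
        if skip(v):
            continue  # criterion satisfied: the multiplication is not carried out
        total += v * w
        performed += 1
    return total, performed


if __name__ == "__main__":
    values = [0.0, 2.0, 0.0, 3.0]
    weights = [5.0, 0.5, 7.0, 1.0]
    print(sum_of_products_skipping_inputs(values, weights))

    # A different criterion: treat values within a hypothetical tolerance as zero.
    eps = 1e-3
    print(sum_of_products_skipping_inputs([1e-6, 2.0], [4.0, 0.5],
                                          skip=lambda v: abs(v) <= eps))
```

Because the criterion is a parameter, the same loop covers both the comparison to zero and other criteria evaluated on the input values at inference time.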
[0052] FIG. 4 is a block diagram showing a system 400 for accelerating ANN computation, according to some example embodiments. The system may include an arithmetic unit 425, a controller 415, and a selector 420.
[0053] The controller 415 may receive a set {V[i_0], V[i_1], ..., V[i_{X-1}]} of X input values of data 405. The controller 415 may optionally receive further input values 406 which are different from the input values 405. The further input values 406 can be related to the neuron, the layer, the ANN, the weights, the operation to be carried out, or any other kind of values. The controller 415 may provide, based on the input values 405 and the further values 406, an indication to the selector 420 as to which of the X input values are to be selected in the stream. The controller 415 may also provide, to the arithmetic unit 425, an offset or an index or bit enables of one or multiple selected value(s) in the set {V[i_0], V[i_1], ..., V[i_{X-1}]}.
[0054] FIG. 5 is a block diagram showing a controller 415, according to some example embodiments. Upon receiving the set {V[i_0], V[i_1], ..., V[i_{X-1}]} of X input values of the data, the controller 415 can compare the input values to reference value(s) (ref). The reference value(s) can be included in the further values 406. In some embodiments, the controller 415 may not use the further values 406. In some embodiments, the reference value(s) can be equal to zero. Based on the result of the comparison, the controller 415 may provide the selector 420 with one or more index(es) of the input value to be selected. In other embodiments, the controller 415 can perform a selection of an input value based on different criteria. In some embodiments, one or multiple input values 405 and multiple further values 406 can be used for the selection. In certain embodiments, other selection operations could be done involving one or multiple X input values 405 or one or multiple further values 406. For example, the input value may be selected if the input value is less than a threshold.
[0055] Referring back to FIG. 4, the selector 420 may receive the set of input values {V[i_0], V[i_1], ..., V[i_{X-1}]} and the indication from the controller 415 as to which of the input values to select. The selector 420 may select a value V[i] and provide the selected input value V[i] to the arithmetic unit 425. The information of the selected input value may be represented in any form. In some embodiments, the controller 415 and the selector 420 may be carried out as a single unit configured to perform functionalities of both controller 415 and selector 420. In further embodiments, the selector 420 can be also configured to select weights 410 based on indications from the controller 415.
[0056] The arithmetic unit 425 can be configured to compute sums, multiplications, accumulations, or other operations. The arithmetic unit 425 may receive the selected value V[i] from the selector 420 and the offset or index of the selected value V[i] from the controller 415. The arithmetic unit 425 may be further configured to select, based on the offset, a weight W[i] corresponding to value V[i]. The arithmetic unit 425 may further determine the product V[i] x W[i] and add the product to the corresponding sum. Because the multiplication is performed only for selected values of data and selected weights, the computation of the sum, and hence computation of the ANN, can be accelerated. In some embodiments, the arithmetic unit 425 can determine products V[j] x W[k], wherein j and k are determined based on the input values 405, further input values 406, and the weights 410 specified by the controller 415. In some embodiments, the arithmetic unit 425 can perform further mathematical operations different from products and sums, independently, prior to, or after performing the products or sums.
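The division of work between the controller 415, the selector 420, and the arithmetic unit 425 can be modeled in software roughly as follows. This is a sketch under stated assumptions, not the hardware design: the three Python functions are hypothetical stand-ins for the units, and the criterion is the comparison to a reference value mentioned in the text.

```python
def controller(values, reference=0.0):
    """Compare each input value to a reference value and return the offsets
    of the input values to be selected (those that differ from the reference)."""
    return [offset for offset, v in enumerate(values) if v != reference]


def selector(values, offsets):
    """Provide the input values indicated by the controller."""
    return [values[offset] for offset in offsets]


def arithmetic_unit(selected_values, offsets, weights):
    """Use each offset to pick the matching weight W[i], multiply it by the
    selected value V[i], and accumulate the products into a sum."""
    total = 0.0
    for value, offset in zip(selected_values, offsets):
        total += value * weights[offset]
    return total


if __name__ == "__main__":
    values = [0.0, 1.5, 0.0, 2.0]
    weights = [9.0, 2.0, 9.0, 0.5]
    offsets = controller(values)            # indication sent to selector and arithmetic unit
    selected = selector(values, offsets)    # only the selected input values
    print(offsets, selected, arithmetic_unit(selected, offsets, weights))
```

Only the selected values reach `arithmetic_unit`, so the number of multiplications equals the number of offsets produced by `controller` rather than the length of the input stream.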
[0057] FIG. 6 is a block diagram showing a system 600 for selecting input values to be processed in an ANN computation, according to another example embodiment. The system may include a controller 415, an arithmetic unit 425, and a memory 610. The memory 610 can be configured to store a set of X input values {V[i_0], V[i_1], ..., V[i_{X-1}]}. The controller 415 can be configured to determine an address of an input value to be selected for multiplication in arithmetic unit 425. The arithmetic unit 425 may read the selected input value based on the address received from the controller 415.
[0058] FIG. 7 is a block diagram showing a system 700 for synchronization of results of parallel calculations of neurons in the ANN, according to an example embodiment. The system 700 may include arithmetic units 710, a synchronization module 715, and a memory 720.
[0059] The arithmetic units 710 can process input values of different data 705. The data 705 can be related to each other or be in part or generally the same input data. For example, in the computation of the ANN trained to recognize objects in color images, the data 705 may represent red, green, and blue color components of the color images. In another example, in computations of the ANN the data 705 may represent input values from different receptive fields selected by filters. Because the amount of input values selected for processing by different arithmetic units 710 can be different, arithmetic units 710 may finish processing of the input data at different rates. Therefore, results of neuron computations (performed in parallel) within a layer of the ANN can be unaligned.
[0060] For example, data0 and data1 of the data 705 may include equal numbers of input values. However, the number and indexes of input values selected from data0 to compute the i-th neuron output by a first arithmetic unit of the arithmetic units 710 can be different from the number and indexes of input values selected from data1 to compute the i-th neuron output by a second arithmetic unit of the arithmetic units 710. Due to the differences in the numbers of selected input values and the indexes of the selected input values, either 1) the i-th neuron output of the first arithmetic unit can be generated substantially prior to the i-th neuron output of the second arithmetic unit; or 2) the i-th neuron output of the first arithmetic unit can be generated substantially after the i-th neuron output of the second arithmetic unit. Therefore, a first stream of neuron outputs from the first arithmetic unit can be substantially unaligned with a second stream of neuron outputs from the second arithmetic unit even when the first arithmetic unit and the second arithmetic unit are configured to process the data0 and data1 in parallel. The first stream of neuron outputs and the second stream of neuron outputs can also be unaligned due to the difference in the number of neuron outputs from the first arithmetic unit and the second arithmetic unit because the first arithmetic unit and the second arithmetic unit can be configured to generate different numbers of neuron outputs based on the same number of input values.
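The source of the misalignment described above can be illustrated with a simple timing model in Python. The model is an assumption made for illustration only: each arithmetic unit is taken to spend one cycle per selected (non-skipped) input value, and the function name `completion_times` is hypothetical.

```python
from itertools import accumulate


def completion_times(neuron_inputs, skip=lambda v: v == 0):
    """Return the cycle at which each neuron output of one arithmetic unit is ready.

    `neuron_inputs` holds one list of input values per neuron output.
    Assumed cost model: one cycle per input value that is not skipped.
    """
    costs = [sum(1 for v in inputs if not skip(v)) for inputs in neuron_inputs]
    return list(accumulate(costs))


if __name__ == "__main__":
    # data0 and data1 contain equal numbers of input values per neuron output,
    # but different numbers of them are selected for processing.
    data0 = [[1, 0, 0, 2], [0, 0, 0, 3]]
    data1 = [[1, 2, 3, 4], [0, 5, 0, 0]]
    print(completion_times(data0))  # first arithmetic unit
    print(completion_times(data1))  # second arithmetic unit
```

Under this model the i-th outputs of the two units become ready at different cycles even though both units start together and receive the same number of input values, which is the situation the synchronization module has to handle.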
[0061] The synchronization module 715 can be configured to re-align outputs from arithmetic units 710. The synchronization module can receive neuron outputs from arithmetic units 710, one neuron output at a time, and write the neuron outputs to memory storage 720 in proper order. In some embodiments, the synchronization module 715 may receive partial results of neurons or multiple results of different neurons. In the example of FIG. 7, the synchronization module 715 may receive neuron outputs O[0,i] (i=1,...,last) from the first of the arithmetic units, neuron outputs O[1,i] (i=1,...,last) from the second of the arithmetic units, and so on. The neuron outputs may be received by the synchronization module 715 in an unordered time sequence O[0,0], O[0,1], O[1,0], O[0,2], O[1,1], ... . The synchronization module 715 can write the neuron outputs in a proper order: O[0,0], O[1,0], ..., O[N,0], O[0,1], O[1,1], ..., O[N,1], O[0,2], O[1,2], ... . Thus, the neuron outputs can be read in a proper order as inputs for neurons at further layer(s) of the ANN. Data 705 may include labels 730 indicating the last input value of the stream of input values. The arithmetic units 710 can be configured to send, to the synchronization module 715, an indication 735 that the last part of the input data is being processed. Upon receiving the indication 735, the synchronization module 715 may finish writing neuron outputs to memory storage 720 and stand by ready for the next series of neuron outputs. In some embodiments, the indication 730 can be indicative of the first part of new input data being processed rather than the last part of the input data. In another embodiment, the indication may include a number of input values to be processed. The indication can be part of the data or separate information, as shown in FIG. 7 with marks 730 and 735.
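The reordering performed by the synchronization module 715 can be sketched in Python as a small reorder buffer. This is one possible software model, not the disclosed circuit: the class name, the `pending` dictionary, and the assumption that every arithmetic unit produces the same number of neuron outputs are all introduced here for illustration.

```python
class SynchronizationModule:
    """Write neuron outputs O[unit, i] to memory in the order
    O[0,0], O[1,0], ..., O[N,0], O[0,1], O[1,1], ... regardless of arrival order.

    Assumes each arithmetic unit produces the same number of outputs.
    """

    def __init__(self, num_units):
        self.num_units = num_units
        self.pending = {}    # (i, unit) -> value, held until its turn comes
        self.next_slot = 0   # position in memory of the next output to write
        self.memory = []     # stands in for memory storage 720

    def receive(self, unit, i, value):
        """Accept one neuron output and write out every output that is now in order."""
        self.pending[(i, unit)] = value
        while True:
            # Slot k holds output i = k // num_units of unit k % num_units.
            key = divmod(self.next_slot, self.num_units)
            if key not in self.pending:
                break
            self.memory.append(self.pending.pop(key))
            self.next_slot += 1


if __name__ == "__main__":
    sync = SynchronizationModule(num_units=2)
    # Unordered arrival sequence taken from the example in the text.
    for unit, i in [(0, 0), (0, 1), (1, 0), (0, 2), (1, 1), (1, 2)]:
        sync.receive(unit, i, f"O[{unit},{i}]")
    print(sync.memory)
```

Outputs that arrive early wait in `pending`, so the buffer grows with the amount of misalignment between the streams; handling of the end-of-data indication 735 is left out of this sketch.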
[0062] FIG. 8 is a flow chart illustrating a method 800 for accelerating ANN computations, in accordance with some example embodiments. In some embodiments, the operations may be combined, performed in parallel, or performed in a different order. The method 800 may also include additional or fewer operations than those illustrated. The method 800 may be performed by system 100 described above with reference to FIG. 1.
[0063] In block 802, the method 800 may select, by a controller communicatively coupled to a selector and an arithmetic unit and based on a criterion, an input value from a stream of input values of a neuron. The controller can be configured to select input values by comparison of the input values to a reference value. The reference value can be equal to zero. In block 804, the method 800 may configure, by the controller, the selector to provide the selected input value to the arithmetic unit.
[0064] In block 806, the method 800 may provide, by the controller, to the arithmetic unit, information for the selected input value. The information may include an offset of the selected input value in the stream. In block 808, the method 800 may acquire, by the arithmetic unit and based on the information, a weight from a set of weights. In block 810, the method 800 may perform, by the arithmetic unit, a mathematical operation on the selected input value and the weight to obtain a result, wherein the result is to be used in computing an output of the neuron. For example, the arithmetic unit may determine a multiplication product of the selected input value and the weight, and add the multiplication product to a sum. A count of the input values in the stream may be greater than a count of mathematical operations performed by the arithmetic unit, wherein the operations are performed on the input values selected from the stream.
[0065] FIG. 9 illustrates an example computing system 900 that may be used to implement embodiments described herein. The example computing system 900 of FIG. 9 may include one or more processors 910 and memory 920. Memory 920 may store, in part, instructions and data for execution by the one or more processors 910. Memory 920 can store the executable code when the exemplary computing system 900 is in operation. The processor 910 may include internal accelerators like a graphical processing unit, a Field Programmable Gate Array, or similar accelerators that may be suitable for use with embodiments described herein. The memory 920 may include internal accelerators like a graphical processing unit, a Field Programmable Gate Array, or similar accelerators that may be suitable for use with embodiments described herein. The example computing system 900 of FIG. 9 may further include a mass storage 930, portable storage 940, one or more output devices 950, one or more input devices 960, a network interface 970, and one or more peripheral devices 980.
[0066] The components shown in FIG. 9 are depicted as being connected via a single bus 990. The components may be connected through one or more data transport means. The one or more processors 910 and memory 920 may be connected via a local microprocessor bus, and the mass storage 930, one or more peripheral devices 980, portable storage 940, and network interface 970 may be connected via one or more input/output buses.
[0067] Mass storage 930, which may be implemented with a magnetic disk drive, an optical disk drive or a solid state drive, is a non-volatile storage device for storing data and instructions for use by a magnetic disk, an optical disk drive or SSD, which in turn may be used by one or more processors 910. Mass storage 930 can store the system software for implementing embodiments described herein for purposes of loading that software into memory 920. The mass storage 930 may also include internal accelerators like a graphical processing unit, a Field Programmable Gate Array, or similar accelerators that may be suitable for use with embodiments described herein.
[0068] Portable storage 940 may operate in conjunction with a portable non-volatile storage medium, such as a compact disk (CD) or digital video disc (DVD), to input and output data and code to and from the computing system 900 of FIG. 9. The system software for implementing embodiments described herein may be stored on such a portable medium and input to the computing system 900 via the portable storage 940.
[0069] One or more input devices 960 provide a portion of a user interface. The one or more input devices 960 may include an alphanumeric keypad, such as a keyboard, for inputting alphanumeric and other information, or a pointing device, such as a mouse, a trackball, a stylus, or cursor direction keys. Additionally, the computing system 900 as shown in FIG. 9 includes one or more output devices 950. Suitable one or more output devices 950 include speakers, printers, network interfaces, and monitors.
[0070] Network interface 970 can be utilized to communicate with external devices, external computing devices, servers, and networked systems via one or more communications networks such as one or more wired, wireless, or optical networks including, for example, the Internet, intranet, LAN, WAN, cellular phone networks (e.g., Global System for Mobile communications network, packet switching communications network, circuit switching communications network), Bluetooth radio, and an IEEE 802.11-based radio frequency network, among others. Network interface 970 may be a network interface card, such as an Ethernet card, optical transceiver, radio frequency transceiver, or any other type of device that can send and receive information. Other examples of such network interfaces may include Bluetooth®, 3G, 4G, and WiFi® radios in mobile computing devices as well as a USB.
[0071] One or more peripheral devices 980 may include any type of computer support device to add additional functionality to the computing system. The one or more peripheral devices 980 may include a modem or a router.
[0072] The example computing system 900 of FIG. 9 may also include one or more accelerator devices 985. The accelerator devices 985 may include PCIe-form-factor boards or storage-form-factor boards, or any electronic board equipped with a specific electronic component like a Graphical Processing Unit, a Neural Processing Unit, a Multi-CPU component, a Field Programmable Gate Array component, or similar electronic or photonic accelerator components, that may be suitable for use with embodiments described herein.
[0073] The components contained in the exemplary computing system 900 of FIG. 9 are those typically found in computing systems that may be suitable for use with embodiments described herein and are intended to represent a broad category of such computer components that are well known in the art. Thus, the exemplary computing system 900 of FIG. 9 can be a personal computer, hand held computing device, telephone, mobile computing device, workstation, server, minicomputer, mainframe computer, or any other computing device. The computer can also include different bus configurations, networked platforms, multi-processor platforms, and so forth. Various operating systems (OS) can be used including UNIX, Linux, Windows, Macintosh OS, Palm OS, and other suitable operating systems.
[0074] Some of the above-described functions may be composed of instructions that are stored on storage media (e.g., computer-readable medium). The instructions may be retrieved and executed by the processor. Some examples of storage media are memory devices, tapes, disks, and the like. The instructions are operational when executed by the processor to direct the processor to operate in accord with the example embodiments. Those skilled in the art are familiar with instructions, processor(s), and storage media.
[0075] It is noteworthy that any hardware platform suitable for performing the processing described herein is suitable for use with the example embodiments. The terms "computer-readable storage medium" and "computer-readable storage media" as used herein refer to any medium or media that participate in providing instructions to a CPU for execution. Such media can take many forms, including, but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as a fixed disk. Volatile media include dynamic memory, such as RAM. Transmission media include coaxial cables, copper wire, and fiber optics, among others, including the wires that include one embodiment of a bus. Transmission media can also take the form of acoustic or light waves, such as those generated during radio frequency and infrared data communications. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, SSD, a CD-read-only memory (ROM) disk, DVD, any other optical medium, any other physical medium with patterns of marks or holes, a RAM, a PROM, an EPROM, an EEPROM, a FLASHEPROM, any other memory chip or cartridge, a carrier wave, or any other medium from which a computer can read.
[0076] Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a CPU for execution. A bus carries the data to system RAM, from which a CPU retrieves and executes the instructions. The instructions received by system RAM can optionally be stored on a fixed disk either before or after execution by a CPU. The instructions or data may not be used by the CPU but be accessed in writing or reading from the other devices without having the CPU directing them.
[0077] Thus, systems and methods for accelerating ANN computations are described. Although embodiments have been described with reference to specific exemplary embodiments, it will be evident that various modifications and changes can be made to these exemplary embodiments without departing from the broader spirit and scope of the present application. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
Claims
1. A system for synchronizing streams of neuron outputs, the system comprising:
an arithmetic unit;
at least one further arithmetic unit; and
a synchronization module communicatively coupled to the arithmetic unit and the further arithmetic unit, wherein:
the arithmetic unit is configured to generate a stream of neuron outputs; and
the at least one further arithmetic unit is configured to generate a further stream of further neuron outputs; and
the synchronization module is configured to:
receive the stream of neuron outputs from the arithmetic unit and the further stream of further neuron outputs from the at least one further arithmetic unit, the stream of the neuron outputs and the stream of further neuron outputs being unaligned with respect to each other; and
reorder the neuron outputs and the further neuron outputs to obtain an ordered sequence.
2. The system of claim 1, wherein the reordering includes positioning an i-th neuron output from the stream before an i-th further neuron output from the further stream for i=1, ..., N and wherein N is a number of neuron outputs in the stream.
3. The system of claim 1, wherein the synchronization module is further configured to write the ordered sequence to a memory storage and wherein the ordered sequence is to be used as further input values for further neurons.
4. The system of claim 1, wherein:
the neuron outputs are computed by the arithmetic unit based on first input values, the first input values being selected from a stream of input values; and
the further neuron outputs are computed by the at least one further arithmetic unit based on second input values, the second input values being selected from the further stream of further input values; and wherein:
a count of the input values in the stream is equal to a count of further input values in the further stream; and
first indexes of the first input values in the stream of input values are different from second indexes of the second input values in the further stream of further input values.
5. The system of claim 4, wherein the difference in the first indexes and the second indexes causes at least one of the following:
an i-th neuron output from the stream being generated prior to an i-th further neuron output from the further stream; and
the i-th neuron output from the stream being generated after the i-th further neuron output from the further stream for at least one of i=1, ..., N, wherein N is a number of neuron outputs in the stream.
6. A method for synchronizing streams of neuron outputs, the method comprising:
generating, by an arithmetic unit, a stream of neuron outputs;
generating, by at least one further arithmetic unit, a further stream of further neuron outputs;
receiving, by a synchronization module communicatively coupled to the arithmetic unit and the further arithmetic unit, the stream of neuron outputs from the arithmetic unit and the further stream of further neuron outputs from the at least one further arithmetic unit, wherein the stream of the neuron outputs and the stream of further neuron outputs are unaligned with respect to each other; and
reordering, by the synchronization module, the neuron outputs and the further neuron outputs to obtain an ordered sequence.
7. The method of claim 6, wherein the reordering includes positioning an i-th neuron output from the stream before an i-th further neuron output from the further stream for i=1, ..., N and wherein N is a number of neuron outputs in the stream.
8. The method of claim 6, further comprising writing, by the synchronization module, the ordered sequence to a memory storage and wherein the ordered sequence is to be used as further input values for further neurons.
9. The method of claim 6, wherein:
the neuron outputs are computed by the arithmetic unit based on first input values, the first input values being selected from a stream of input values; and
the further neuron outputs are computed by the at least one further arithmetic unit based on second input values, the second input values being selected from the further stream of further input values; and wherein:
a count of the input values in the stream is equal to a count of further input values in the further stream; and
first indexes of the first input values in the stream of input values are different from second indexes of the second input values in the further stream of further input values.
10. The method of claim 9, wherein the difference in the first indexes and the second indexes causes at least one of the following:
an i-th neuron output from the stream being generated prior to an i-th further neuron output from the further stream; and
the i-th neuron output from the stream being generated after the i-th further neuron output from the further stream for at least one of i=1, ..., N, wherein N is a number of neuron outputs in the stream.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/215,685 US10769527B2 (en) | 2018-12-11 | 2018-12-11 | Accelerating artificial neural network computations by skipping input values |
PCT/IB2018/059878 WO2020121023A1 (en) | 2018-12-11 | 2018-12-11 | Accelerating artificial neural network computations by skipping input values |
PCT/IB2019/060631 WO2020121203A1 (en) | 2018-12-11 | 2019-12-10 | Realigning streams of neuron outputs in artificial neural network computations |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3895073A1 (en) | 2021-10-20 |
Family
ID=69158146
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP19835488.8A Pending EP3895072A1 (en) | 2018-12-11 | 2019-12-10 | Realigning streams of neuron outputs in artificial neural network computations |
EP19835489.6A Pending EP3895073A1 (en) | 2018-12-11 | 2019-12-10 | Realigning streams of neuron outputs in artificial neural network computations |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP19835488.8A Pending EP3895072A1 (en) | 2018-12-11 | 2019-12-10 | Realigning streams of neuron outputs in artificial neural network computations |
Country Status (2)
Country | Link |
---|---|
EP (2) | EP3895072A1 (en) |
WO (2) | WO2020121202A1 (en) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160358069A1 (en) * | 2015-06-03 | 2016-12-08 | Samsung Electronics Co., Ltd. | Neural network suppression |
CN109328361B (en) * | 2016-06-14 | 2020-03-27 | 多伦多大学管理委员会 | Accelerator for deep neural network |
GB2560600B (en) * | 2017-11-06 | 2020-03-04 | Imagination Tech Ltd | Nueral Network Hardware |
-
2019
- 2019-12-10 EP EP19835488.8A patent/EP3895072A1/en active Pending
- 2019-12-10 WO PCT/IB2019/060630 patent/WO2020121202A1/en unknown
- 2019-12-10 WO PCT/IB2019/060631 patent/WO2020121203A1/en unknown
- 2019-12-10 EP EP19835489.6A patent/EP3895073A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
EP3895072A1 (en) | 2021-10-20 |
WO2020121202A1 (en) | 2020-06-18 |
WO2020121203A1 (en) | 2020-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11625583B2 (en) | Quality monitoring and hidden quantization in artificial neural network computations | |
US20200226458A1 (en) | Optimizing artificial neural network computations based on automatic determination of a batch size | |
US10990525B2 (en) | Caching data in artificial neural network computations | |
US20200311511A1 (en) | Accelerating neuron computations in artificial neural networks by skipping bits | |
EP3924891A1 (en) | Quality monitoring and hidden quantization in artificial neural network computations | |
US11068784B2 (en) | Generic quantization of artificial neural networks | |
US11494624B2 (en) | Accelerating neuron computations in artificial neural networks with dual sparsity | |
US10769527B2 (en) | Accelerating artificial neural network computations by skipping input values | |
US11568255B2 (en) | Fine tuning of trained artificial neural network | |
US11126912B2 (en) | Realigning streams of neuron outputs in artificial neural network computations | |
EP3948685A1 (en) | Accelerating neuron computations in artificial neural networks by skipping bits | |
WO2020121030A1 (en) | Caching data in artificial neural network computations | |
EP3895073A1 (en) | Realigning streams of neuron outputs in artificial neural network computations | |
US20220222519A1 (en) | Optimizing operations in artificial neural network | |
EP3895071A1 (en) | Accelerating artificial neural network computations by skipping input values | |
US11645510B2 (en) | Accelerating neuron computations in artificial neural networks by selecting input data | |
WO2022153078A1 (en) | Optimizing operations in artificial neural network | |
US20210117800A1 (en) | Multiple locally stored artificial neural network computations | |
WO2022053851A1 (en) | Fine tuning of trained artificial neural network | |
WO2020144493A1 (en) | Optimizing artificial neural network computations based on automatic determination of a batch size | |
WO2020208396A1 (en) | Accelerating neuron computations in artificial neural networks by selecting input data | |
EP4049187A1 (en) | Multiple locally stored artificial neural network computations | |
EP3915055A1 (en) | Generic quantization of artificial neural networks | |
WO2021209789A1 (en) | Modifying structure of artificial neural networks by collocating parameters | |
WO2020152571A1 (en) | Generic quantization of artificial neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: UNKNOWN |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| | PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| | 17P | Request for examination filed | Effective date: 20210711 |
| | AK | Designated contracting states | Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| | DAV | Request for validation of the european patent (deleted) | |
| | DAX | Request for extension of the european patent (deleted) | |
| | 17Q | First examination report despatched | Effective date: 20240619 |