WO2022028666A1 - Using non-uniform weight distribution to increase efficiency of fixed-point neural network inference - Google Patents

Using non-uniform weight distribution to increase efficiency of fixed-point neural network inference

Info

Publication number
WO2022028666A1
Authority
WO
WIPO (PCT)
Prior art keywords
values
value
low
neural network
weight
Prior art date
Application number
PCT/EP2020/071784
Other languages
French (fr)
Inventor
Michael Jacob
Moshe Shahar
Hu Liu
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2020/071784 priority Critical patent/WO2022028666A1/en
Priority to CN202080104544.6A priority patent/CN116249990A/en
Publication of WO2022028666A1 publication Critical patent/WO2022028666A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • Some embodiments described in the present disclosure relate to a computerized apparatus executing a neural network and, more specifically, but not exclusively, to a computerized apparatus executing a neural network having fixed-point weight values.
  • The term neural network is commonly used to describe a computer system inspired by the human brain and nervous system.
  • a neural network usually involves a large amount of processing objects operating in parallel and arranged and connected in layers (or tiers).
  • The term “deep” in Deep Neural Networks (DNN) refers to an amount of layers in such a neural network.
  • the term “inference” refers to applying parameters and calculations from a trained neural network model to infer one or more output values in response to one or more input values.
  • a typical computation in a neural network layer involves summing a plurality of products between a layer input value, also known as an activation value, and an associated weight value and mapping the resulting sum to a layer output value.
  • a neural network has a plurality of weight values used in such computations.
  • the plurality of weight values is a plurality of fixed-point values.
  • Fixed-point representation of a value is a method of representing an approximation of a real number value, where there is an identified amount of digits after a radix point.
  • In decimal notation, the radix point is known as a decimal point.
  • In binary notation, the radix point may be referred to as a “binary point”; for example, the number “one quarter” may be represented as 0.01 in base 2.
  • a fixed-point value represented by an identified amount of binary bits, denoted by n, may have one of 2^n values.
  • a fixed-point value may be split into two parts, a high-part value consisting of an identified amount of most-significant bits thereof, and a low-part value consisting of all least significant bits of the fixed-point value not members of the high-part value.
  • a 16-bit binary value may be split into a high-part value having 5 bits and a low-part value having 11 bits.
  • the 16-bit binary value may be split into equal sized high-part and low-part values, each having 8 bits.
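  • As an illustration of such a split, the following is a minimal Python sketch (assuming unsigned values; the function name is illustrative, not part of the disclosure):

```python
def split_fixed_point(value: int, low_bits: int) -> tuple[int, int]:
    """Split an unsigned fixed-point value into (high-part, low-part).

    The low-part holds the low_bits least significant bits and the high-part
    holds all remaining most significant bits.
    """
    low = value & ((1 << low_bits) - 1)
    high = value >> low_bits
    return high, low

# A 16-bit value split into a 5-bit high-part and an 11-bit low-part,
# and the same value split into two equal 8-bit parts.
assert split_fixed_point(0b0000100000000111, low_bits=11) == (0b00001, 0b00000000111)
assert split_fixed_point(0b0000100000000111, low_bits=8) == (0b00001000, 0b00000111)
```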
  • When a plurality of fixed-point values has a uniform distribution, any one of the n bits may be non-zero for at least some of the plurality of fixed-point values.
  • However, there exist DNNs where the DNN’s plurality of weight values has a non-uniform distribution, having a small variance around an identified value and a long tail.
  • Some examples of a non-uniform distribution are a normal distribution, a Gaussian distribution, a Laplace distribution and a Cauchy distribution.
  • Some embodiments described in the present disclosure use a non-uniform distribution of a plurality of weight values of a neural network to increase efficiency of computation executed by the neural network.
  • In such embodiments, the amount of multiplications executed by the neural network is reduced by using one multiplier to multiply an activation value by two or more low-part values of two or more of the plurality of weight values, and by multiplying the activation value only by non-zero high-part values of the two or more weight values.
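  • A minimal Python sketch of this idea is shown below (the 8-bit activation width, the 4/4 weight split and the function name are assumptions for illustration, not part of the disclosure): when the field reserved for each low-part product is at least as wide as the product of the activation and one low-part, a single multiplication yields both partial products exactly.

```python
ACT_BITS = 8                    # assumed activation width
LOW_BITS = 4                    # assumed low-part width (8-bit weights split 4/4)
FIELD = ACT_BITS + LOW_BITS     # width reserved for one activation * low-part product

def multiply_packed_low_parts(act: int, low_a: int, low_b: int) -> tuple[int, int]:
    """Multiply one activation by two packed low-parts using a single multiplication."""
    combined = (low_a << FIELD) | low_b     # combined low-part value
    product = act * combined                # one multiplier covers both weight values
    return product >> FIELD, product & ((1 << FIELD) - 1)   # (act*low_a, act*low_b)

assert multiply_packed_low_parts(200, 7, 13) == (200 * 7, 200 * 13)
```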
  • an apparatus for configuring a neural network comprises a processing unit configured for: receiving a first set and a second set of weight values of a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; producing a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and configuring the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set.
  • Concurrently applying the second set of multipliers to the set of high-part values and applying the first set of multipliers to the set of combined low-part values facilitates reducing cost of operation of the neural network, by reducing an amount of cycles required to compute an output value compared to a nonconcurrent computation and thus reducing power consumption, and in addition facilitates reducing cost of implementation of the neural network, by reducing an amount of multipliers compared to providing a multiplier for each of the plurality of weight values, without adversely affecting accuracy of the output value.
  • a method for configuring a neural network comprises: receiving a first set and a second set of weight values from a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; producing a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and configuring the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set.
  • an apparatus for executing a neural network comprises a processing unit configured for: configuring the neural network by: receiving a first set and a second set of weight values from a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; producing a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and configuring the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set.
  • a software program product for configuring a neural network comprises: a non-transitory computer readable storage medium; first program instructions for receiving a first set and a second set of weight values from a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; second program instructions for producing a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and third program instructions for configuring the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set.
  • a computer program comprises program instructions which, when executed by a processor, cause the processor to: receive a first set and a second set of weight values from a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; produce a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and configure the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set.
  • each of the set of combined low-part values is associated with one of a set of activation values.
  • applying the first set of multipliers to the set of combined low-part values comprises each of the first set of multipliers multiplying one of the set of combined low-part values with the respective activation value associated therewith.
  • applying the second set of multipliers to the set of high-part values comprises each of the second set of multipliers multiplying one of the set of high-part values by the respective activation value associated with the respective low-part value associated with the high-part value.
  • computing the at least one output value comprises computing the at least one output value using the first set of intermediate values and the second set of intermediate values.
  • each of the set of high-part values is not equal to zero.
  • each respective activation value multiplied by a high-part value of the set of high part values is not equal to zero.
  • Applying a multiplier of the second set of multipliers only to a high-part value that does not equal zero and additionally or alternatively only to an activation value that does not equal zero facilitates reducing an amount of multipliers in the second set of multipliers compared to an implementation where a multiplier is applied also to a zero value, thus reducing cost of implementation of the neural network without impacting accuracy of an output value inferred by the neural network.
  • configuring the neural network to compute at least one output value further comprises: computing another second set of intermediate values by at least some multipliers, selected from one or more of the first set of multipliers and the second set of multipliers, multiplying one of a set of other high-part values of at least some other high-part values selected from one or more of the first set and the second set, the other high-part value associated with another low-part value where the other low-part value and the other high-part value both originated from another common weight value of the first set and the second set, by the respective activation value associated with the respective other low-part value associated with the other high-part value.
  • computing the at least one output value comprises computing the at least one output value further using the other second set of intermediate values.
  • the second set of multipliers has an identified amount of multipliers.
  • the identified amount of multipliers is 32.
  • a set of non-zero high-parts, produced by selecting a complete set of high-part values of all of the first set and the second set, comprises more high-part values than the identified amount of multipliers, and the set of other high-part values comprises a plurality of other high-part values of the complete set of high-part values not members of the set of high-part values.
  • receiving the first set comprises: receiving a first sequence of low-part values of the first set; receiving a first sequence of bits, each associated with one of the first sequence of low-part values in order and having a value of 1 when a respective high-part value associated with the low-part value is not equal to zero, otherwise having a value of 0; and receiving a first sequence of high-part values, each associated with a non-zero bit of the first sequence of bits, in order; and wherein receiving the second set comprises: receiving a second sequence of low-part values of the second set; receiving a second sequence of bits, each associated with one other of the second sequence of low-part values in order and having a value of 1 when another respective high-part value associated with the other low-part value is not equal to zero, otherwise having a value of 0; and receiving a second sequence of high-part values, each associated with another non-zero bit of the second sequence of bits, in order.
  • Using a sequence of bits to associate each of a sequence of high-part values with one of a sequence of low-part values reduces an amount of memory required to create such an association compared to some other methods, for example using an ordinal number or providing a high-part value for each of the low-part values.
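  • The following sketch illustrates one possible form of such an encoding (the 4-bit split and the function names are assumptions for illustration): each set is stored as a sequence of low-part values, a sequence of presence bits, and a shorter sequence containing only the non-zero high-part values, in order.

```python
LOW_BITS = 4
LOW_MASK = (1 << LOW_BITS) - 1

def encode_weight_set(weights):
    """Encode weights as (low-parts, presence bits, non-zero high-parts in order)."""
    lows = [w & LOW_MASK for w in weights]
    highs = [w >> LOW_BITS for w in weights]
    bits = [1 if h != 0 else 0 for h in highs]
    nonzero_highs = [h for h in highs if h != 0]
    return lows, bits, nonzero_highs

def decode_weight_set(lows, bits, nonzero_highs):
    """Reconstruct the original weight values from the encoded form."""
    highs = iter(nonzero_highs)
    return [((next(highs) if b else 0) << LOW_BITS) | lo for lo, b in zip(lows, bits)]

weights = [0x03, 0x2A, 0x07, 0x91]
assert decode_weight_set(*encode_weight_set(weights)) == weights
```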
  • each weight value comprises a weight amount of bits, each low-part value comprises a low-part amount of bits, and each low-part value is a least significant part of the weight value.
  • the low-part amount of bits is half of the weight amount of bits.
  • the weight amount of bits is selected from the group of bit amounts consisting of: 4, 8, 16, 32, and 64.
  • the neural network comprises a plurality of layers, each having a plurality of layer weight values of the plurality of weight values of the neural network.
  • the first set and the second set are selected from the plurality of layer weight values of one of the plurality of layers. Selecting the first set and the second set from the plurality of weight values of one of the plurality of layers allows considering a distribution of the plurality of weight values of the layer in selecting the first set and the second set, thus improving a reduction in an amount of cycles required for computation of an output value of the neural network.
  • a first pair of sets comprises the first set and the second set.
  • a second pair of sets comprises another first set of weight values of the plurality of weight values and another second set of weight values of the plurality of weight values.
  • the processing unit is further configured for: receiving the second pair of sets; producing another set of combined low-part values, each produced by combining respective other low-part values of two other weight values, one selected from the other first set and yet another selected from the other second set; and configuring the neural network to compute the at least one output value by further concurrently computing: another first set of intermediate values, by applying another first set of multipliers of the neural network to the other set of combined low-part values; and another second set of intermediate values, by applying another second set of multipliers of the neural network to another set of high-part values of at least some of the other first set and the other second set, each other high-part value of the other set of high-part values associated with another low-part value where the other low-part value and the other high-part value both originated from another common weight value of the other first set and the other second set.
  • the first set of multipliers is different from the other first set of multipliers and the second set of multipliers is different from the other second set of multipliers.
  • Using another first set of multipliers different from the first set of multipliers and another second set of multipliers different from the second set of multipliers facilitates increasing throughput of the neural network by computing the other first set of intermediate values and the other second set of intermediate values concurrently to computing the first set of intermediate values and the second set of intermediate values.
  • the first set and the second set are produced by receiving another set of weight values of the plurality of weight values, and splitting the other set of weight values into the first set and the second set, such that an amount of weight values of the first set is equal to an amount of weight values of the second set.
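  • A trivial sketch of such a split is shown below (splitting into the first half and the second half is only an assumed pairing rule; any split yielding two equal-sized sets fits the description above):

```python
def split_into_two_sets(weights):
    """Split a received set of weights into a first set and a second set of equal size."""
    half = len(weights) // 2
    return weights[:half], weights[half:]

first_set, second_set = split_into_two_sets([3, 40, 7, 130, 12, 5, 66, 9])
assert len(first_set) == len(second_set) == 4
```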
  • the plurality of weight values has a non-uniform distribution with a variance less than an identified variance threshold.
  • FIG. 1 is a schematic diagram representing an exemplary product of two numbers in binary representation.
  • FIG. 2 is a schematic block diagram representing part of an exemplary neural network, according to some embodiments.
  • FIG. 3 is a schematic block diagram of an exemplary apparatus, according to some embodiments.
  • FIG. 4 is a flowchart schematically representing an optional flow of operations for configuring a neural network, according to some embodiments.
  • FIG. 5 is a flowchart schematically representing another optional flow of operations for configuring a neural network, according to some embodiments.
  • FIG. 6 is a flowchart schematically representing an optional flow of operations for executing a neural network, according to some embodiments.
  • As used herein, the term “a non-uniform distribution around a value” is used to mean “a non-uniform distribution with a small variance around a value”.
  • Similarly, the term “a non-uniform plurality of values around an identified value” is used to mean a plurality of values having a non-uniform distribution with a small variance around the identified value.
  • For example, “a non-uniform plurality of values around 0” is used to mean “a plurality of values having a non-uniform distribution with a small variance around 0”.
  • an offset value may be added to each of the plurality of weight values to map the plurality of weight values to another non-uniform plurality of weight values around 0.
  • the offset value may be negative or non-negative.
  • a typical deep neural network comprises millions of parameters and may require millions of arithmetic operations, requiring computation and digital memory resources exceeding capabilities of many devices, for example mobile devices, some embedded devices and some custom hardware devices.
  • an amount of computation cycles required for inference impacts both an amount of time required for inference, and thus impacts throughput of the neural network, and an amount of power consumed by the neural network, and thus cost of operation thereof.
  • One way to reduce cost of production of a neural network is by reducing an amount of physical computation resources of the neural network, some examples being an amount of computation elements of the neural network and an amount of memory used by the neural network.
  • reducing the amount of physical computation resources of the neural network reduces the area of a semiconductor component comprising the integrated circuit.
  • Some neural networks reduce an amount of computation resources by reducing an amount of bits used to store at least some of the plurality of weights of the neural network. This practice is known as quantization of the neural network. Quantization may also include reducing an amount of bits used to store at least some of the activation values of the neural network. However, quantizing a neural network may reduce accuracy of an output of the neural network. In quantization, an original plurality of weight values, each represented by an original amount of bits, may be mapped to a plurality of quantized weight values, each represented by a reduced amount of bits, where the reduced amount of bits is less than the original amount of bits.
  • Some neural networks, instead of concurrently applying a layer’s computation resources to all weight values of the layer, use the layer’s computation resources to compute in a plurality of batches (iterations).
  • the layer’s computation resources are applied to some of the layer’s plurality of weights and an output of the layer is computed using a plurality of batch results.
  • An amount of time required to compute an output of such a neural network is increased according to an amount of batches executed, reducing throughput and possibly increasing an amount of power required to execute the neural network.
  • a product of multiplying by zero is zero.
  • Thus, when a weight value has a high-part value equaling zero, a product computed using such a weight value also has at least some high-part that is equal to zero.
  • Reference is now made to FIG. 1, showing a schematic diagram 100 representing an exemplary product of two numbers in binary representation.
  • In this example, number N1 equals 3 and has a high-part HP1 that equals zero.
  • Similarly, number N2 equals 3 and has a high-part HP2 that equals zero.
  • Product P1 equals 9 and is a result of multiplying number N1 by number N2.
  • Product P1 has a high-part HP3 that equals zero.
  • a high-part does not necessarily include all consecutive most significant bits that are equal zero.
  • both number N1 and number N2 have other bits, consecutive to high-part HP1 and high-part HP2 respectively, which are zero; however, these bits are excluded from the respective high-parts.
  • the plurality of weight values is a plurality of fixed-point values.
  • the plurality of weight values is a plurality of fixed-point quantized values, produced by quantizing a plurality of floating-point weight values of a neural network.
  • a first set of multipliers of the neural network is applied to a set of combined low-part values, each produced by combining low-part values of two or more weight values of the plurality of weight values, while concurrently a second set of multipliers of the neural network is applied to a set of nonzero high-part values.
  • the second set of multipliers has fewer multipliers than the first set of multipliers.
  • the set of combined low-part values is produced by combining respective low-part values of two weight values, one selected from a first set of weight values of a plurality of weight values of the neural network, and another selected from a second set of weight values of the plurality of weight values.
  • the set of non-zero high-part values is at least part of a set of high-part values of at least some of the first set and the second set, such that each high-part value of the set of high-part values is associated with a low-part value where the low-part value and the high-part value originated from a common weight value of the first set and the second set.
  • one or more intermediate values computed by applying the first set of multipliers and the second set of multipliers are used to compute one or more output values of the neural network. Refraining from applying a multiplier to a high-part that is equal to zero allows reducing an amount of multipliers of the neural network, thus reducing cost of production and cost of operation of the neural network. Applying one multiplier to a combined low-part value facilitates using fewer bits to represent non-zero parts of two or more weight values while preserving full accuracy of the neural network. In addition, applying one multiplier to a combined low-part value facilitates reducing an amount of computation cycles required to compute two or more products of two or more low-part values, thus reducing cost of operation of the neural network by reducing power consumption.
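  • The following Python sketch ties these steps together for two output values that share the same activations (the bit widths, the variable names and the pairing of the two sets are illustrative assumptions, not the claimed implementation); the result is bit-exact with the straightforward dot products.

```python
ACT_BITS, LOW_BITS = 8, 4          # assumed widths: 8-bit activations, 4/4 weight split
FIELD = ACT_BITS + LOW_BITS        # field wide enough for one activation * low-part product
LOW_MASK = (1 << LOW_BITS) - 1

def paired_dot_products(acts, first_set, second_set):
    """Compute two dot products, packing paired low-parts into one multiplication per
    activation and multiplying only non-zero high-parts by non-zero activations."""
    y1 = y2 = 0
    for a, wa, wb in zip(acts, first_set, second_set):
        # first set of multipliers: one multiplication covers both low-parts
        p = a * (((wa & LOW_MASK) << FIELD) | (wb & LOW_MASK))
        y1 += p >> FIELD                    # a * low-part of wa
        y2 += p & ((1 << FIELD) - 1)        # a * low-part of wb
        # second set of multipliers: applied only to non-zero high-parts
        if a and (wa >> LOW_BITS):
            y1 += (a * (wa >> LOW_BITS)) << LOW_BITS
        if a and (wb >> LOW_BITS):
            y2 += (a * (wb >> LOW_BITS)) << LOW_BITS
    return y1, y2

acts, w1, w2 = [5, 0, 9, 200], [3, 40, 7, 130], [12, 5, 66, 9]
assert paired_dot_products(acts, w1, w2) == (
    sum(a * w for a, w in zip(acts, w1)),
    sum(a * w for a, w in zip(acts, w2)),
)
```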
  • the method described above is not quantization of the neural network’s plurality of weights. According to some embodiments of the present disclosure, full accuracy of the neural network is preserved as there is no loss of significant values. Multiplication by zero yields zero, thus refraining from applying a multiplier to a high-part that equals zero does not lose a significant value. In such embodiments, when the second set of multipliers has fewer multipliers than the first set of multipliers, the amount of computing resources of the neural network is reduced without impacting accuracy of an output value inferred by the neural network. In addition, applying fewer multipliers reduces power consumption of the neural network.
  • applying a multiplier to a high-part value comprises multiplying the high-part value by an associated activation value.
  • the second set of multipliers is applied to the set of non-zero high-part values subject to each of the respective associated activation values being non-zero. Refraining from multiplying the high-part value by the associated activation value when the associated activation value is zero facilitates reducing an amount of computation cycles used by the neural network, reducing power consumption of the neural network.
  • Concurrently applying the first set of multipliers to the set of combined low-part values and applying the second set of multipliers to at least some of the set of high-part values facilitates computing an output value using fewer physical computation resources, and thus reducing cost of production and cost of implementation of the neural network, without reducing throughput of the neural network.
  • the first set and the second set of weight values have an amount of weight values having a non-zero high-part which exceeds an amount of multipliers in the second set of multipliers.
  • at least some multipliers of the first set of multipliers and the second set of multipliers are applied to a set of other high-part values to compute one or more other intermediate results.
  • the one or more other intermediate results are further used to compute the one or more output values of the neural network. Applying the at least some multipliers to the set of other high-part values allows preserving accuracy of an output of the neural network without increasing an amount of physical computation resources thereof.
  • one or more of the second set of multipliers are applied to one or more combined high-part values, each produced by combining two or more of the set of high-part values.
  • Applying one or more of the second set of multipliers to one or more combined high-part values facilitates applying the second set of multipliers to all of the at least some of the set of high-part values, allowing computing the one or more output values without applying the at least some multipliers to the set of other high-part values, thus reducing computation time, and therefore increasing throughput and reducing power consumption of executing the neural network.
  • Embodiments may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the embodiments.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of embodiments may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of embodiments.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Reference is now made to FIG. 2, showing a schematic block diagram representing part 200 of an exemplary neural network, according to some embodiments.
  • In such embodiments, a first set 210 of weight values comprises weight values 210-1 through 210-N, and a second set 220 of weight values comprises weight values 220-1 through 220-N.
  • first set 210 and second set 220 comprise weight values of a plurality of weight values of a neural network.
  • each weight value of first set 210 and second set 220 has a low-part and a high-part.
  • weight values 210-1, 210-2 and 210-N each may have low-part value 211-1, 211-2 and 211-N respectively and high-part value 212-1, 212-2 and 212-N respectively.
  • Similarly, weight values 220-1, 220-2 and 220-N each may have low-part value 221-1, 221-2 and 221-N respectively and high-part value 222-1, 222-2 and 222-N respectively.
  • each low-part value of a weight value is a least significant part of the weight value.
  • each weight value of the plurality of weight values comprises a weight amount of bits, some examples being 8 bits and 32 bits. Other examples of a weight amount of bits include 4, 16, and 64.
  • each low-part value comprises a low-part amount of bits, less than the weight amount of bits. Optionally, the low-part amount of bits is half of the weight amount of bits.
  • first set 210 and second set 220 are used to produce set of combined low-part values 230.
  • combined low-part value 230-1 may be produced by combining low-part value 211-1 of weight value 210-1 of first set 210 and low-part value 221-1 of weight value 220-1 of second set 220.
  • combined low-part value 230-2 may be produced by combining low-part value 211-2 of weight value 210-2 of first set 210 and low-part value 221-2 of weight value 220-2 of second set 220.
  • combined low-part value 230-N may be produced by combining low-part value 211-N of weight value 210-N of first set 210 and low-part value 221-N of weight value 220-N of second set 220.
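  • A short sketch of this part of FIG. 2 is given below (NumPy usage, the 4/4 split and the example weight values are assumptions for illustration): it produces the combined low-part values 230 from the paired weights and collects only the non-zero high-parts, each recording which activation it is associated with and which set it came from.

```python
import numpy as np

ACT_BITS, LOW_BITS = 8, 4
FIELD = ACT_BITS + LOW_BITS
LOW_MASK = (1 << LOW_BITS) - 1

first_set = np.array([3, 40, 7, 130])    # illustrative weight values 210-1 .. 210-N
second_set = np.array([12, 5, 66, 9])    # illustrative weight values 220-1 .. 220-N

# Set of combined low-part values (230-1 .. 230-N), one per paired activation.
combined_low_parts = ((first_set & LOW_MASK) << FIELD) | (second_set & LOW_MASK)

# High-part values routed to the second set of multipliers: only the non-zero ones,
# stored as (activation index, high-part value, originating set).
high_parts = [(i, int(w) >> LOW_BITS, which)
              for which, weights in ((1, first_set), (2, second_set))
              for i, w in enumerate(weights)
              if int(w) >> LOW_BITS]
assert high_parts == [(1, 2, 1), (3, 8, 1), (2, 4, 2)]
```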
  • each of combined low-part values 230 is associated with one of set of activation values 240.
  • combined low-part value 230-1 may be associated with activation value 240-1 of set of activation values 240.
  • combined low-part value 230-2 may be associated with activation value 240-2 of set of activation values 240 and combined low-part value 230-N may be associated with activation value 240-N of set of activation values 240.
  • first set of multipliers 260 is applied to combined low-part values 230, optionally to multiply each of combined low-part values 230 by the activation value of set of activation values 240 associated therewith.
  • multiplier 260-1 may be applied to multiply combined low-part value 230-1 by activation value 240-1.
  • multiplier 260-2 may be applied to multiply combined low-part value 230-2 by activation value 240-2 and multiplier 260-N may be applied to multiply combined low-part value 230-N by activation value 240-N.
  • a set of high-part values 250 comprises a plurality of high-part values of at least some of the first set and the second set, for example high-part value 212-1 of weight value 210-1 of first set 210, high-part value 212-N of weight value 210-N of first set 210 and high-part value 222-N of weight value 220-N of second set 220.
  • set of high-part values 250 comprises all respective high-part values of all weight values of first set 210 and second set 220.
  • each of set of high-part values 250 is not equal to zero.
  • a second set of multipliers 270, comprising multiplier 270-1 through multiplier 270-M, is applied to set of high-part values 250, optionally to multiply each of set of high-part values 250 by the activation value of set of activation values 240 associated with a low-part value associated therewith.
  • high-part value 212-1 is associated with low-part value 211-1 both having originated from weight value 210-1.
  • Low-part value 211-1 is associated in this example with activation value 240-1, thus in this example high-part value 212-1 is associated with activation value 240-1.
  • both high-part value 212-N and high-part value 222-N are associated with activation value 240-N.
  • activation value 240-N is provided to multiplier 270-2, optionally for multiplication by high-part value 212-N, and is provided to multiplier 270-M, optionally for multiplication by high-part value 222-N.
  • second multiplier 270-1 is applied to high-part value 212-1 and activation value 240-1.
  • some high-part values, for example high-part value 212-2, high-part value 222-1 and high-part value 222-2, are not members of set of high-part values 250 and are not provided to second set of multipliers 270.
  • in one example, high-part value 212-2, high-part value 222-1 and high-part value 222-2 each equal zero.
  • in another example, high-part value 212-2 is non-zero but the associated activation value 240-2 equals zero, and thus high-part value 212-2 is not provided to one of second set of multipliers 270.
  • first set of multipliers 260 computes a first set of intermediate values.
  • second set of multipliers 270 computes a second set of intermediate values.
  • one or more output values are computed using the first set of intermediate values and the second set of intermediate values, optionally using one or more adders 280.
  • each of the first set of intermediate values has a respective low-part and a respective high-part.
  • one of the first set of intermediate values may stem from multiplication of combined low-part value 230-1 (produced by combining low-part value 211-1 and low-part value 221-1) by respective activation value 240-1.
  • one or more respective low-part values of the first set of intermediate values are provided to a first adder of one or more adders 280, and one or more respective high-part values of the first set of intermediate values is provided to a second adder of the one or more adders 280.
  • a respective low-part of an output of multiplier 260-1 may be provided to the first adder of adders 280, and a respective high-part of the output of multiplier 260-1 may be provided to a second adder of adders 280.
  • another respective low-part of an output of multiplier 260-N may be provided to the first adder of adders 280, and another respective high-part of the output of multiplier 260-N may be provided to the second adder of adders 280.
  • a neural network comprising part 200 is configured using the following exemplary apparatus.
  • Reference is now made to FIG. 3, showing a schematic block diagram of an exemplary apparatus 300, according to some embodiments.
  • the apparatus is a single device.
  • the apparatus is a system of two or more devices, e.g. computers, that are configured to interact so as to achieve the functionality described herein.
  • apparatus 300 comprises processing unit 301 connected to storage 302.
  • Processing unit 301 may be any kind of programmable or non-programmable circuitry that is configured to carry out the operations described herein.
  • Processing unit 301 may comprise hardware as well as software.
  • processing unit 301 may comprise one or more processors and a transitory or non-transitory memory that carries a program which causes the processing unit to perform the respective operations when the program is executed by the one or more processors.
  • Storage 302 may be a transitory or non-transitory memory.
  • storage 302 carries the program.
  • storage 302 is a non-volatile digital storage, some examples being a hard disk drive, a solid state drive, a network storage, and a storage network.
  • processing unit 301 retrieves a plurality of weight values of a neural network from storage 302.
  • processing unit 301 is connected to a digital communication network interface 303.
  • processing unit 301 is connected to storage 302 via digital communication network interface 303.
  • Reference is now made also to FIG. 4. To configure the neural network, in some embodiments apparatus 300 implements the following optional method 400.
  • In 401, processing unit 301 receives first set 210 and second set 220 of a plurality of weight values of a neural network.
  • the neural network is executed by processing unit 301.
  • the neural network is executed by another processing unit (not shown), connected to processing unit 301.
  • the plurality of weight values has a non-uniform distribution with a variance less than an identified variance threshold.
  • the variance threshold is identified such that at least an identified part of the plurality of weight values has a respective high-part that equals zero; for example, at least half of the plurality of weight values may have a respective high-part that equals zero.
  • the variance threshold is identified such that at least a third of the plurality of weight values have a respective high-part that is zero.
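  • For instance, a simple check of whether a layer's weights meet such a criterion might look like the sketch below (the Laplace-distributed example weights, the 4-bit split and the use of magnitudes for signed values are illustrative assumptions only):

```python
import numpy as np

LOW_BITS = 4

def zero_high_part_fraction(weights: np.ndarray) -> float:
    """Fraction of weights whose high-part equals zero (computed on magnitudes)."""
    return float(np.mean((np.abs(weights) >> LOW_BITS) == 0))

# Hypothetical 8-bit quantized weights concentrated around zero with a long tail.
rng = np.random.default_rng(0)
weights = np.clip(np.rint(rng.laplace(0.0, 3.0, size=10_000)), -127, 127).astype(np.int64)

if zero_high_part_fraction(weights) >= 0.5:
    print("at least half of the weights have a zero high-part")
```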
  • Reference is now made also to FIG. 5. Optionally, receiving first set 210 in 401 comprises processing unit 301 receiving in 501 a first sequence of low-part values of first set 210, for example low-part values 211-1 through 211-N.
  • processing unit 301 receives a first sequence of bits, each associated with one low-part value of the first sequence of low-part values in order, and having a value of 1 when a respective high-part value associated with the low-part value is not equal to zero, otherwise having a value of 0.
  • in the example of FIG. 2, the first sequence of bits has a value of 1 in places 1 and N.
  • processing unit 301 receives a first sequence of high-part values, each associated with a non-zero bit of the first sequence of bits, in order.
  • in the example of FIG. 2, the first sequence of high-part values comprises high-part value 212-1 followed by high-part value 212-N, where high-part value 212-2 is not a member of the first sequence of high-part values.
  • high-part value 212-N comes after high-part value 212-1 in the first sequence of high-part values, though not necessarily immediately following high-part value 212-1.
  • set of high-part values 250 is a combination of the first sequence of high-part values and the second sequence of high-part values.
  • receiving second set 220 comprises processing unit 301 receiving in 511 a second sequence of low-part values of second set 220, for example low-part values 221-1 through 221-N.
  • processing unit 301 receives a second sequence of bits, each associated with one other low-part value of the second sequence of low-part values in order and having a value of 1 when another respective high-part value associated with the other low-part value is not equal to zero, otherwise having a value of 0.
  • processing unit 301 receives a second sequence of high-part values, each associated with another non-zero bit of the second sequence of bits, in order.
  • first set 210 and second set 220 are produced by processing unit 301 receiving another set of weight values of the plurality of weight values and processing unit 301 splitting the other set of weight values into first set 210 and second set 220 such that an amount of weight values of first set 210, denoted by N, is equal to an amount of weight values of second set 220.
  • In 402, processing unit 301 produces set of combined low-part values 230.
  • each of set of combined low-part values 230 is produced by combining respective low-part values of two weight values, one selected from first set 210 and another selected from second set 220.
  • alternatively, each of set of combined low-part values 230 is produced by combining respective low-part values of more than two weight values, for example one selected from first set 210, another selected from second set 220 and yet another selected from a third set of weight values, not shown.
  • In 403, processing unit 301 optionally configures the neural network to compute one or more output values.
  • the neural network comprises a plurality of layers, each having a plurality of layer weight values of the plurality of weight values of the neural network.
  • first set 210 and second set 220 are selected from the plurality of layer weight values of one of the plurality of layers.
  • the one or more output values are one or more output values of the layer.
  • computing the one or more output values is by concurrently computing, in 411, a first set of intermediate values and, in 412, a second set of intermediate values.
  • computing the one or more output values in 403 comprises computing the one or more output values in 420 using the first set of intermediate values and the second set of intermediate values.
  • each of set of combined low-part values 230 is associated with one of set of activation values 240.
  • processing unit 301 configures the neural network to compute in 411 the first set of intermediate values by applying first set of multipliers 260 to set of combined low- part values 230.
  • applying first set of multipliers 260 to set of combined low-part values 230 comprises each of first set of multipliers 260 multiplying one of set of combined low-part values 230 with the respective activation value associated therewith.
  • processing unit 301 configures the neural network to compute in 412 the second set of intermediate values by applying second set of multipliers 270 to set of high-part values 250.
  • set of high-part values 250 is a set of high-part values of at least some of first set 210 and second set 220.
  • each high-part value of set of high-part values 250 is associated with a low-part value where the low-part value and the high-part value originated from a common weight value of the first set and the second set.
  • applying second set of multipliers 270 to set of high-part values 250 comprises each of second set of multipliers 270 multiplying one of set of high-part values 250 by the respective activation value associated with the respective low-part value associated with the high-part value.
  • each respective activation value multiplied by one of set of high-part values 250 is not equal to zero.
  • applying second set of multipliers 270 to set of high-part values 250 comprises at least one of second set of multipliers 270 multiplying an activation value by a combined high-part value, produced by combining two of set of high-part values 250 each associated with the activation value.
  • a combined high-part value is produced by combining more than two high-part values.
  • the neural network comprises additional multipliers not in either of first set of multipliers 260 and second set of multipliers 270.
  • second set of multipliers 270 has an identified amount of multipliers, denoted by M, for example 32 multipliers.
  • a set of non-zero high-parts, produced by selecting a complete set of high-part values of all of first set 210 and second set 220, comprises more than M high-part values.
  • set of high-part values 250 comprises M high-part values.
  • a set of other high-part values comprises a plurality of other high-part values of the complete set of high-part values not members of set of high-part values 250, for example one or more of high-part value 212-2, high-part value 222-1 and high-part value 222-2.
  • processing unit 301 computes another second set of intermediate values.
  • processing unit 301 computes the other second set of intermediate values using at least some multipliers selected from first set of multipliers 260 and second set of multipliers 270.
  • each of the at least some multipliers is applied to one of a set of other high-part values of at least some other high-part values selected from one or more of first set 210 and second set 220, for example high-part value 212-2.
  • the other high-part value is associated with another low-part value, for example low-part value 211-2, where the other low-part value and the other high-part value both originated from another common value of first set 210 and second set 220, for example weight value 210-2.
  • applying one of the at least some multipliers to one of the set of other high-part values comprises multiplying the other high-part value by the respective activation value, for example 240-2, associated with the respective other low-part value, for example low-part value 211-2 associated with the other high-part value 212-2.
  • processing unit 301 optionally computes the one or more output values further using the other second set of intermediate values.
  • the plurality of weight values comprises a plurality of pairs of sets, and the method described above is repeated for other pairs using other sets of multipliers of the neural network, optionally concurrently.
  • a first pair of sets comprises first set 210 and second set 220.
  • another pair of sets comprises another first set of weight values of the plurality of weight values and another second set of weight values of the plurality of weight values.
  • 401, 402, 403, 411, 412 and 420 are repeated using the other pair of sets and another first set of multipliers of the neural network and another second set of multipliers of the neural network.
  • the first set of multipliers is different from the other first set of multipliers.
  • the second set of multipliers is different from the other second set of multipliers.
  • 401, 402, 403, 411, 412 and 420 are repeated using the other pair of sets and the first set of multipliers of the neural network and the second set of multipliers of the neural network.
  • method 400 is applied to each of one or more layers of the neural network.
  • a computer program comprises program instructions which, when executed by processing unit 301, cause apparatus 300 to implement method 400.
  • apparatus 300 executes the neural network. In some such embodiments, apparatus 300 implements the following optional method, for example as shown in FIG. 6.
  • processing unit 301 configures the neural network, optionally by executing method 400, optionally to compute one or more output values.
  • Optionally, processing unit 301 receives an input value and, in 620, computes the one or more output values in response to the input value.
  • It is expected that during the life of a patent maturing from this application many relevant neural networks will be developed, and the scope of the term neural network is intended to include all such new technologies a priori.
  • The term “consisting essentially of” means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • the singular form “a”, “an” and “the” include plural references unless the context clearly dictates otherwise.
  • the term “a compound” or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • The description of embodiments in a range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of embodiments. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
  • Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range.
  • the phrases “ranging/ranges between” a first indicated number and a second indicated number and “ranging/ranges from” a first indicated number “to” a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Feedback Control In General (AREA)

Abstract

An apparatus and method for configuring a neural network, comprising a processing unit configured for: combining respective low-part values of two weight values, one selected from a first set of weight values and another selected from a second set of weight values of a plurality of non-uniformly distributed weight values of the neural network; and concurrently applying a first set of multipliers of the neural network to the set of combined low-part values and applying a second set of multipliers of the neural network to a set of non-zero high-part values, each high-part value associated with a low-part value where the low-part value and the high-part value both originated from a common weight value.

Description

USING NON-UNIFORM WEIGHT DISTRIBUTION TO INCREASE EFFICIENCY OF FIXED-POINT NEURAL NETWORK INFERENCE
BACKGROUND
Some embodiments described in the present disclosure relate to a computerized apparatus executing a neural network and, more specifically, but not exclusively, to a computerized apparatus executing a neural network having fixed-point weight values.
The term neural network is commonly used to describe a computer system inspired by the human brain and nervous system. A neural network usually involves a large amount of processing objects operating in parallel and arranged and connected in layers (or tiers). The term “deep” in Deep Neural Networks (DNN) refers to an amount of layers in such a neural network. The term “inference” refers to applying parameters and calculations from a trained neural network model to infer one or more output values in response to one or more input values. A typical computation in a neural network layer involves summing a plurality of products between a layer input value, also known as an activation value, and an associated weight value and mapping the resulting sum to a layer output value. A neural network has a plurality of weight values used in such computations. In some neural networks the plurality of weight values is a plurality of fixed-point values. Fixed-point representation of a value is a method of representing an approximation of a real number value, where there is an identified amount of digits after a radix point. In decimal notation, the radix point is known as a decimal point. In binary notation, the radix point may be referred to as a “binary point”. In binary notation, the number “one quarter” may be represented as 0.01 in base 2.
A fixed-point value represented by an identified amount of binary bits, denoted by n, may have one of 2^n values. A fixed-point value may be split into two parts, a high-part value consisting of an identified amount of most-significant bits thereof, and a low-part value consisting of all least significant bits of the fixed-point value not members of the high-part value. For example, a 16-bit binary value may be split into a high-part value having 5 bits and a low-part value having 11 bits. In another example, the 16-bit binary value may be split into equal sized high-part and low-part values, each having 8 bits.
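By way of non-limiting illustration only, the following Python sketch (the function name, the 8/8 split and the example value are assumptions introduced here, not part of this disclosure) shows one way such a split may be computed:

```python
# Minimal sketch: split an n-bit fixed-point value into a k-bit high-part
# value and an (n - k)-bit low-part value.
def split_value(value: int, n_bits: int = 16, k_high: int = 8):
    low_bits = n_bits - k_high
    low_part = value & ((1 << low_bits) - 1)                # least significant bits
    high_part = (value >> low_bits) & ((1 << k_high) - 1)   # most significant bits
    return high_part, low_part

high, low = split_value(0x01A7)        # 16-bit value split into 8-bit parts
assert (high << 8) | low == 0x01A7     # the two parts reconstruct the value
```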
When a plurality of fixed-point values has a uniform distribution, any one of the n bits may be non-zero for at least some of the plurality of fixed-point values. However, there exist DNNs where the DNN’s plurality of weight values has a non-uniform distribution, having a small variance around an identified value and a long tail. Some examples of a non-uniform distribution are a normal distribution, a Gaussian distribution, a Laplace distribution and a Cauchy distribution. When the small variance is around the value of 0 and each of the plurality of weight values is represented by a fixed-point value, there is an identified amount of most significant bits, denoted by k, for which most of the plurality of weight values have a high-part value equal zero.
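As a purely illustrative sketch (the Laplace distribution, its scale, the 8-bit representation and the 4-bit high part are assumptions chosen here for illustration), the following shows how a narrow distribution around 0 leaves the high part of most weight values equal to zero:

```python
import numpy as np

# Draw weights from a narrow distribution around 0 and round to 8-bit
# fixed point; then count how many have a zero 4-bit high part.
rng = np.random.default_rng(0)
weights = np.clip(np.round(rng.laplace(scale=4.0, size=10_000)), -127, 127).astype(np.int8)

k = 4                                          # assumed high-part width
magnitudes = np.abs(weights.astype(np.int16))  # widen before abs to avoid int8 overflow
high_parts = magnitudes >> (8 - k)
print(float(np.mean(high_parts == 0)))         # expected to be roughly 0.98 for this scale
```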
SUMMARY
Some embodiments described in the present disclosure use a non-uniform distribution of a plurality of weight values of a neural network to increase efficiency of computation executed by the neural network. In such embodiments, an amount of multiplications executed by the neural network is mitigated by using one multiplier to multiply an activation value by two or more low-part values of two or more of the plurality of weight values, and multiplying the activation value by only non-zero high-part values of the two or more weight values.
The foregoing and other objectives are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect of the disclosure, an apparatus for configuring a neural network comprises a processing unit configured for: receiving a first set and a second set of weight values of a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; producing a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and configuring the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set. Concurrently applying the second set of multipliers to the set of high-part values and applying the first set of multipliers to the set of combined low-part values facilitates reducing cost of operation of the neural network by reducing an amount of cycles required to compute an output value compared to a nonconcurrent computation and thus reducing power consumption and in addition reducing cost of implementation of the neural network by reducing an amount of multipliers compared to providing a multiplier for each of the plurality of weight values, without adversely affecting accuracy of the output value.
According to a second aspect of the disclosure, a method for configuring a neural network comprises: receiving a first set and a second set of weight values from a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; producing a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and configuring the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set.
According to a third aspect of the disclosure, an apparatus for executing a neural network comprises a processing unit configured for: configuring the neural network by: receiving a first set and a second set of weight values from a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; producing a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and configuring the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set; receiving an input value; and computing the one or more output values in response to the input value.
According to a fourth aspect of the disclosure, a software program product for configuring a neural network comprises: a non-transitory computer readable storage medium; first program instructions for receiving a first set and a second set of weight values from a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; second program instructions for producing a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and third program instructions for configuring the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set. Optionally, the first, second and third program instructions are executed by at least one computerized processor from the non-transitory computer readable storage medium.
According to a fifth aspect of the disclosure, a computer program comprises program instructions which, when executed by a processor, cause the processor to: receive a first set and a second set of weight values from a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; produce a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and configure the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set.
In an implementation form of the first and second aspects, each of the set of combined low-part values is associated with one of a set of activation values. Optionally, applying the first set of multipliers to the set of combined low-part values comprises each of the first set of multipliers multiplying one of the set of combined low-part values with the respective activation value associated therewith. Optionally, applying the second set of multipliers to the set of high-part values comprises each of the second set of multipliers multiplying one of the set of high-part values by the respective activation value associated with the respective low-part value associated with the high-part value. Optionally, computing the at least one output value comprises computing the at least one output value using the first set of intermediate values and the second set of intermediate values. Optionally, each of the set of high-part values is not equal to zero. Optionally, each respective activation value multiplied by a high-part value of the set of high-part values is not equal to zero. Applying a multiplier of the second set of multipliers only to a high-part that does not equal zero and additionally or alternatively only to an activation value that does not equal zero facilitates reducing an amount of multipliers in the second set of multipliers compared to an implementation where a multiplier is applied also to a zero value, thus reducing cost of implementation of the neural network without impacting accuracy of an output value inferred by the neural network.
In another implementation form of the first and second aspects, configuring the neural network to compute at least one output value further comprises: computing another second set of intermediate values by at least some multipliers, selected from one or more of the first set of multipliers and the second set of multipliers, multiplying one of a set of other high-part values of at least some other high-part values selected from one or more of the first set and the second set, the other high-part value associated with another low-part value where the other low-part value and the other high-part value both originated from another common weight value of the first set and the second set, by the respective activation value associated with the respective other low-part value associated with the other high-part value. Optionally, computing the at least one output value comprises computing the at least one output value further using the other second set of intermediate values. Optionally, the second set of multipliers has an identified amount of multipliers. Optionally, the identified amount of multipliers is 32. Optionally, a set of non-zero high-parts, produced by selecting a complete set of high-part values of all of the first set and the second set, comprises more high-part values than the identified amount of multipliers, and the set of other high-part values comprises a plurality of other high-part values of the complete set of high-part values not members of the set of high-part values. Optionally, receiving the first set comprises: receiving a first sequence of low-part values of the first set; receiving a first sequence of bits, each associated with one of the first sequence of low-part values in order and having a value of 1 when a respective high-part value associated with the low-part value is not equal to zero, otherwise having a value of 0; and receiving a first sequence of high-part values, each associated with a non-zero bit of the first sequence of bits, in order; and wherein receiving the second set comprises: receiving a second sequence of low-part values of the second set; receiving a second sequence of bits, each associated with one other of the second sequence of low-part values in order and having a value of 1 when another respective high-part value associated with the other low-part value is not equal to zero, otherwise having a value of 0; and receiving a second sequence of high-part values, each associated with another non-zero bit of the second sequence of bits, in order. Using a sequence of bits to associate each of a sequence of high-part values with one of a sequence of low-part values reduces an amount of memory required to create such an association compared to some other methods, for example using an ordinal number or providing a high-part value for each of the low-part values.
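A minimal sketch of how such sequences might be produced, with assumed names and an assumed 8-bit low part and 8-bit high part (neither the names nor the bit widths are taken from this disclosure), is:

```python
# Encode a set of weights as: all low parts, one bit per weight marking a
# non-zero high part, and only the non-zero high parts, in order.
def encode_set(weights, n_bits=16, k_high=8):
    low_bits = n_bits - k_high
    lows, bits, highs = [], [], []
    for w in weights:
        low = w & ((1 << low_bits) - 1)
        high = w >> low_bits
        lows.append(low)
        bits.append(1 if high != 0 else 0)
        if high != 0:
            highs.append(high)
    return lows, bits, highs

lows, bits, highs = encode_set([0x0007, 0x0103, 0x0002, 0x0210])
# lows == [0x07, 0x03, 0x02, 0x10]; bits == [0, 1, 0, 1]; highs == [0x01, 0x02]
```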
In a further implementation form of the first and second aspects, each weight value comprises a weight amount of bits, each low-part value comprises a low-part amount of bits, and each low-part value is a least significant part of the weight value. Optionally, the low-part amount of bits is half of the weight amount of bits. Optionally, the weight amount of bits is selected from the group of bit amounts consisting of: 4, 8, 16, 32, and 64.
In a further implementation form of the first and second aspects, the neural network comprises a plurality of layers, each having a plurality of layer weight values of the plurality of weight values of the neural network. Optionally, the first set and the second set are selected from the plurality of layer weight values of one of the plurality of layers. Selecting the first set and the second set from the plurality of weight values of one of the plurality of layers allows considering a distribution of the plurality of weight values of the layer in selecting the first set and the second set, thus improving a reduction in an amount of cycles required for computation of an output value of the neural network.
In a further implementation form of the first and second aspects, a first pair of sets comprises the first set and the second set. Optionally, a second pair of sets comprises another first set of weight values of the plurality of weight values and another second set of weight values of the plurality of weight values. Optionally, the processing unit is further configured for: receiving the second pair of sets; producing another set of combined low-part values, each produced by combining respective other low-part values of two other weight values, one selected from the other first set and yet another selected from the other second set; and configuring the neural network to compute the at least one output value by further concurrently computing: another first set of intermediate values, by applying another first set of multipliers of the neural network to the other set of combined low-part values; and another second set of intermediate values, by applying another second set of multipliers of the neural network to another set of high-part values of at least some of the other first set and the other second set, each other high-part value of the other set of high-part values associated with another low-part value where the other low-part value and the other high-part value both originated from another common weight value of the other first set and the other second set. Optionally, the first set of multipliers is different from the other first set of multipliers and the second set of multipliers is different from the other second set of multipliers. Using another first set of multipliers different from the first set of multipliers and another second set of multipliers different from the second set of multipliers facilitates increasing throughput of the neural network by computing the other first set of intermediate values and the other second set of intermediate values concurrently to computing the first set of intermediate values and the second set of intermediate values.
In a further implementation form of the first and second aspects, the first set and the second set are produced by receiving another set of weight values of the plurality of weight values, and splitting the other set of weight values into the first set and the second set, such that an amount of weight values of the first set is equal to an amount of weight values of the second set.
In a further implementation form of the first and second aspects, the plurality of weight values has a non-uniform distribution with a variance less than an identified variance threshold.
Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments pertain. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
Some embodiments are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments may be practiced.
In the drawings:
FIG. 1 is a schematic diagram representing an exemplary product of two numbers in binary representation;
FIG. 2 is a schematic block diagram representing part of an exemplary neural network according to some embodiments;
FIG. 3 is a schematic block diagram of an exemplary apparatus, according to some embodiments;
FIG. 4 is a flowchart schematically representing an optional flow of operations for configuring a neural network, according to some embodiments;
FIG. 5 is a flowchart schematically representing another optional flow of operations for configuring a neural network, according to some embodiments; and
FIG. 6 is a flowchart schematically representing an optional flow of operations for executing a neural network, according to some embodiments.
DETAILED DESCRIPTION
For brevity, the term “a non-uniform distribution around a value” is used to mean “a non-uniform distribution with a small variance around a value”, and the term “non-uniform plurality of values around an identified value” is used to mean a plurality of values having a non-uniform distribution with a small variance around the identified value. Thus, “a non-uniform plurality of values around 0” is used to mean “a plurality of values having a non-uniform distribution with a small variance around 0”.
It should be noted that the following description focuses on embodiments comprising a neural network having a non-uniform plurality of weight values around 0. However, the scope of the apparatus and methods described herein is not limited to such embodiments. In other possible embodiments, where the plurality of weight values is a non-uniform plurality of weight values around another value, an offset value may be added to each of the plurality of weight values to map the plurality of weight values to another non-uniform plurality of weight values around 0. The offset value may be negative or non-negative.
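For instance, under the assumption of integer weights clustered around 128 (the values and the offset below are illustrative only), such a mapping might be sketched as:

```python
# Recentre weights around 0 by adding an offset before splitting into parts.
weights = [130, 127, 129, 125, 128]
offset = -128                              # assumed offset; may be negative or non-negative
centred = [w + offset for w in weights]    # -> [2, -1, 1, -3, 0]
```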
A typical deep neural network comprises millions of parameters and may require millions of arithmetic operations, requiring computation and digital memory resources exceeding capabilities of many devices, for example mobile devices, some embedded devices and some custom hardware devices. In addition, an amount of computation cycles required for inference impacts both an amount of time required for inference, and thus impacts throughput of the neural network, and an amount of power consumed by the neural network, and thus cost of operation thereof.
One way to reduce cost of production of a neural network is by reducing an amount of physical computation resources of the neural network, some examples being an amount of computation elements of the neural network and an amount of memory used by the neural network. For example, when the neural network is implemented as an integrated circuit, reducing the amount of physical computation resources of the neural network reduces the area of a semiconductor component comprising the integrated circuit.
Some neural networks reduce an amount of computation resources by reducing an amount of bits used to store at least some of the plurality of weights of the neural network. This practice is known as quantization of the neural network. Quantization may also include reducing an amount of bits used to store at least some of the activation values of the neural network. However, quantizing a neural network may reduce accuracy of an output of the neural network. In quantization, an original plurality of weight values, each represented by an original amount of bits, may be mapped to a plurality of quantized weight values, each represented by a reduced amount of bits, where the reduced amount of bits is less than the original amount of bits. As fewer bits are used to represent each of the quantized weight values there are fewer possible values for a quantized weight value than for a weight value, and more than one of the original plurality of weight values may be mapped to a common quantized weight value. As a result, accuracy of a neural network using the plurality of quantized weight values may be reduced compared to using the original plurality of weight values.
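As a hedged illustration of the quantization described in this paragraph (which is a pre-existing approach, not the method of this disclosure; the scale choice and values are assumptions), the following sketch shows two distinct original weights collapsing onto one quantized value:

```python
import numpy as np

# Map 32-bit floating-point weights to 8-bit integers using a single scale.
original = np.array([0.0312, 0.0305, -0.1200, 0.4999], dtype=np.float32)
scale = np.max(np.abs(original)) / 127.0
quantized = np.clip(np.round(original / scale), -128, 127).astype(np.int8)
# original[0] and original[1] both quantize to the same int8 value here,
# which is the source of the accuracy loss mentioned above.
```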
Some neural networks, instead of concurrently applying a layer’s computation resources to all weight values of a layer, use the layer’s computation resources to compute in a plurality of batches (iterations). In such solutions, in each batch the layer’s computation resources are applied to some of the layer’s plurality of weights and an output of the layer is computed using a plurality of batch results. An amount of time required to compute an output of such a neural network is increased according to an amount of batches executed, reducing throughput and possibly increasing an amount of power required to execute the neural network.
A product of multiplying by zero is zero. When a weight value has a high-part value equaling zero, a product computed using such a weight value also has at least some high-part that is equal to zero. Reference is now made to FIG. 1, showing a schematic diagram 100 representing an exemplary product of two numbers in binary representation. In this example, number N1 equals 3 and has a high-part HP1 that equals zero, and number N2 equals 3 and has a high-part HP2 that equals zero. In this example, product P1 equals 9 and is a result of multiplying number N1 by number N2. In this example, product P1 has a high-part HP3 that equals zero. It should be noted that a high-part does not necessarily include all consecutive most significant bits that are equal zero. For example, in diagram 100 both number N1 and number N2 have other bits, consecutive to high-part HP1 and high-part HP2 respectively, which are zero, however they are excluded from the respective high-part.
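This observation may be checked directly; the following sketch reproduces the FIG. 1 example under an assumed 8-bit representation of the operands (the bit widths are assumptions for illustration):

```python
# N1 = 3 and N2 = 3 both have a zero high part; so does their product 9.
n1, n2 = 3, 3
p = n1 * n2   # 9
print(format(n1, "08b"), format(n2, "08b"), format(p, "016b"))
assert (p >> 8) == 0   # the upper half of the 16-bit product is zero
```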
When a neural network’s plurality of weight values has a non-uniform distribution around 0, there may be an amount of most-significant bits k for which the plurality of weight values each have a high-part equal zero and thus a plurality of products computed using the plurality of weight values each have a high-part equal zero. Optionally, the plurality of weight values is a plurality of fixed-point values. Optionally, the plurality of weight values is a plurality of fixed-point quantized values, produced by quantizing a plurality of floating-point weight values of a neural network.
Some embodiments described herewithin propose using this property to reduce an amount of explicit multiplications by zero and thus reduce an amount of computing resources of the neural network. To do so, in some embodiments described herewithin, a first set of multipliers of the neural network is applied to a set of combined low-part values, each produced by combining low-part values of two or more weight values of the plurality of weight values, while concurrently a second set of multipliers of the neural network is applied to a set of non-zero high-part values. Optionally, the second set of multipliers has fewer multipliers than the first set of multipliers. Optionally, the set of combined low-part values is produced by combining respective low-part values of two weight values, one selected from a first set of weight values of a plurality of weight values of the neural network, and another selected from a second set of weight values of the plurality of weight values. Optionally, the set of non-zero high-part values is at least part of a set of high-part values of at least some of the first set and the second set, such that each high-part value of the set of high-part values is associated with a low-part value where the low-part value and the high-part value originated from a common weight value of the first set and the second set. Optionally, one or more intermediate values computed by applying the first set of multipliers and the second set of multipliers are used to compute one or more output values of the neural network. Refraining from applying a multiplier to a high-part that is equal to zero allows reducing an amount of multipliers of the neural network, thus reducing cost of production and cost of operation of the neural network. Applying one multiplier to a combined low-part value facilitates using fewer bits to represent non-zero parts of two or more weight values while preserving full accuracy of the neural network. In addition, applying one multiplier to a combined low-part value facilitates reducing an amount of computation cycles required to compute two or more products of two or more low-part values, thus reducing cost of operation of the neural network by reducing power consumption. It is important to emphasize that the method described above is not quantization of the neural network’s plurality of weights. According to some embodiments of the present disclosure, full accuracy of the neural network is preserved as there is no loss of significant values. Multiplication by zero yields zero, thus refraining from applying a multiplier to a high-part that equals zero does not lose a significant value. In such embodiments, when the second set of multipliers has fewer multipliers than the first set of multipliers the amount of computing resources of the neural network is reduced without impacting accuracy of an output value inferred by the neural network. In addition, applying fewer multipliers reduces power consumption of the neural network.
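One possible software illustration of the general idea, assuming 8-bit low parts and 8-bit activations (the bit widths, the 16-bit packing stride and the function name are assumptions chosen for this sketch, not the claimed hardware), is the following: two low parts are packed into one word with guard bits so that a single multiplication by the activation yields both partial products, while a separate multiplier would handle only the non-zero high parts.

```python
# Pack two 8-bit low parts with a 16-bit stride; each partial product of an
# 8-bit low part and an 8-bit activation fits in 16 bits, so the two products
# remain separable in the result of one multiplication.
def packed_low_multiply(low_a, low_b, activation):
    packed = low_a | (low_b << 16)
    product = packed * activation
    return product & 0xFFFF, (product >> 16) & 0xFFFF

pa, pb = packed_low_multiply(0xA7, 0x13, 0x5C)
assert pa == 0xA7 * 0x5C and pb == 0x13 * 0x5C
```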
Optionally, applying a multiplier to a high-part value comprises multiplying the high-part value by an associated activation value. Optionally, the second set of multipliers is applied to the set of non-zero high-part values subject to each of the respective associated activation values being non-zero. Refraining from multiplying the high-part value by the associated activation value when the associated activation value is zero facilitates reducing an amount of computation cycles used by the neural network, reducing power consumption of the neural network.
Concurrently applying the first set of multipliers to the set of combined low-part values and applying the second set of multipliers to at least some of the set of high-part values facilitates computing an output value using fewer physical computation resources, and thus reducing cost of production and cost of implementation of the neural network, without reducing throughput of the neural network.
It may be that the first set and the second set of weight values have an amount of weight values having a non-zero high-part which exceeds an amount of multipliers in the second set of multipliers. Optionally, at least some multipliers of the first set of multipliers and the second set of multipliers are applied to a set of other high-part values to compute one or more other intermediate results. Optionally, the one or more other intermediate results are further used to compute the one or more output values of the neural network. Applying the at least some multipliers to the set of other high-part values allows preserving accuracy of an output of the neural network without increasing an amount of physical computation resources thereof.
Optionally, one or more of the second set of multipliers are applied to one or more combined high-part values, each produced by combining two or more of the set of high-part values. Applying one or more of the second set of multipliers to one or more combined high-part values facilitates applying the second set of multipliers to all of the at least some of the set of high-part values, allowing computing the one or more output values without applying the at least some multipliers to the set of other high-part values, thus reducing computation time, and therefore increasing throughput and reducing power consumption of executing the neural network.
Before explaining at least one embodiment in detail, it is to be understood that embodiments are not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. Implementations described herein are capable of other embodiments or of being practiced or carried out in various ways.
Embodiments may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the embodiments.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of embodiments may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of embodiments.
Aspects of embodiments are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference is now made to FIG. 2, showing a schematic block diagram representing part of an exemplary neural network 200 according to some embodiments. In such embodiments, a first set 210 of weight values comprises weight values 210-1 through 210-N, and a second set 220 of weight values comprises weight values 220-1 through 220-N. Optionally, first set 210 and second set 220 comprise weight values of a plurality of weight values of a neural network. Optionally, each weight value of first set 210 and second set 220 has a low-part and a high-part. For example, weight values 210-1, 210-2 and 210-N each may have low-part value 211-1, 211-2 and 211-N respectively and high-part value 212-1, 212-2 and 212-N respectively. Similarly, in this example, weight values 220-1, 220-2 and 220-N each may have low-part value 221-1, 221-2 and 221-N respectively and high-part value 222-1, 222-2 and 222-N respectively. Optionally, each low-part value of a weight value is a least significant part of the weight value.
Optionally, each weight value of the plurality of weight values comprises a weight amount of bits, some examples being 8 bits and 32 bits. Other examples of a weight amount of bits include 4, 16, and 64. Optionally, each low-part value comprises a low-part amount of bits, less than the weight amount of bits. Optionally, the low-part amount of bits is half of the weight amount of bits.
Optionally, first set 210 and second set 220 are used to produce set of combined low-part values 230. For example, combined low-part value 230-1 may be produced by combining low-part value 211-1 of weight value 210-1 of first set 210 and low-part value 221-1 of weight value 220-1 of second set 220. Similarly, combined low-part value 230-2 may be produced by combining low-part value 211-2 of weight value 210-2 of first set 210 and low-part value 221-2 of weight value 220-2 of second set 220. Similarly, combined low-part value 230-N may be produced by combining low-part value 211-N of weight value 210-N of first set 210 and low-part value 221-N of weight value 220-N of second set 220.
Optionally, each of combined low-part values 230 is associated with one of set of activation values 240. For example, combined low-part value 230-1 may be associated with activation value 240-1 of set of activation values 240. Similarly, combined low-part value 230-2 may be associated with activation value 240-2 of set of activation values 240 and combined low-part value 230-N may be associated with activation value 240-N of set of activation values 240.
Optionally, first set of multipliers 260, comprising multiplier 260-1 through multiplier 260-N, is applied to combined low-part values 230, optionally to multiply each of combined low-part values 230 by the activation value of set of activation values 240 associated therewith. For example, multiplier 260-1 may be applied to multiply combined low-part value 230-1 by activation value 240-1. Similarly, multiplier 260-2 may be applied to multiply combined low-part value 230-2 by activation value 240-2 and multiplier 260-N may be applied to multiply combined low-part value 230-N by activation value 240-N.
Optionally, a set of high-part values 250 comprises a plurality of high-part values of at least some of the first set and the second set, for example high-part value 212-1 of weight value 210-1 of first set 210, high-part value 212-N of weight value 210-N of first set 210 and high-part value 222-N of weight value 220-N of second set 220. Optionally, set of high-part values 250 comprises all respective high-part values of all weight values of first set 210 and second set 220. Optionally, each of set of high-part values 250 is not equal to zero.
Optionally, a second set of multipliers 270, comprising multiplier 270-1 through multiplier 270-M, is applied to set of high-part values 250, optionally to multiply each of set of high-part values 250 by the activation value of set of activation values 240 associated with a low-part value associated therewith. In this example, high-part value 212-1 is associated with low-part value 211-1 both having originated from weight value 210-1. Low-part value 211-1 is associated in this example with activation value 240-1, thus in this example high-part value 212-1 is associated with activation value 240-1. Similarly, in this example both high-part value 212-N and high-part value 222-N are associated with activation value 240-N.
It should be noted that one activation value may be provided to more than one of second set of multipliers 270. In this example, activation value 240-N is provided to multiplier 270-2, optionally for multiplication by high-part value 212-N, and is provided to multiplier 270-M, optionally for multiplication by high-part value 222-N. Optionally, second multiplier 270-1 is applied to high-part value 212-1 and activation value 240-1.
Optionally, some high-part values, for example high-part value 212-2, high-part value 222-1 and high-part value 222-2 are not members of set of high-part values 250 and are not provided to second set of multipliers 270. Optionally, high-part value 212-2, high-part value 222-1 and high-part value 222-2 are each equal zero. Optionally, high-part value 212-2 is non-zero and associated activation value 240-2 is equal zero, and thus not provided to one of second set of multipliers 270.
Optionally, first set of multipliers 260 computes a first set of intermediate values. Optionally second set of multipliers 270 computes a second set of intermediate values. Optionally, one or more output values are computed using the first set of intermediate values and the second set of intermediate values, optionally using one or more adders 280.
Optionally, each of the first set of intermediate values has a respective low-part and a respective high-part. For example, one of the first set of intermediate values may stem from multiplication of combined low-part value 230-1 (produced by combining low-part value 211-1 and low-part value 221-1) by respective activation value 240-1. Optionally, one or more respective low-part values of the first set of intermediate values are provided to a first adder of one or more adders 280, and one or more respective high-part values of the first set of intermediate values are provided to a second adder of the one or more adders 280. For example, a respective low-part of an output of multiplier 260-1 may be provided to the first adder of adders 280, and a respective high-part of the output of multiplier 260-1 may be provided to a second adder of adders 280. Similarly, another respective low-part of an output of multiplier 260-N may be provided to the first adder of adders 280, and another respective high-part of the output of multiplier 260-N may be provided to the second adder of adders 280.
In some embodiments a neural network comprising part 200 is configured using the following exemplary apparatus. Reference is now made also to FIG. 3, showing a schematic block diagram of an exemplary apparatus 300, according to some embodiments. In one embodiment, the apparatus is a single device. In another embodiment, the apparatus is a system of two or more devices, e.g. computers, that are configured to interact so as to achieve the functionality described herein. Optionally, apparatus 300 comprises processing unit 301 connected to storage 302. Processing unit 301 may be any kind of programmable or non-programmable circuitry that is configured to carry out the operations described herein. Processing unit 301 may comprise hardware as well as software. For example, processing unit 301 may comprise one or more processors and a transitory or non-transitory memory that carries a program which causes the processing unit to perform the respective operations when the program is executed by the one or more processors. Storage 302 may be a transitory or non-transitory memory. Optionally, storage 302 carries the program. Optionally, storage 302 is a non-volatile digital storage, some examples being a hard disk drive, a solid state drive, a network storage, and a storage network. Optionally, processing unit 301 retrieves a plurality of weight values of a neural network from storage 302. Optionally, processing unit 301 is connected to a digital communication network interface 303. Optionally, processing unit 301 is connected to storage 302 via digital communication network interface 303.
To configure a neural network, in some embodiments apparatus 300 implements the following optional method.
Reference is now made also to FIG. 4, showing a flowchart schematically representing an optional flow of operations 400 for configuring a neural network, according to some embodiments. In such embodiments, in 401 processing unit 301 receives first set 210 and second set 220 of a plurality of weight values of a neural network. Optionally, the neural network is executed by processing unit 301. Optionally, the neural network is executed by another processing unit (not shown), connected to processing unit 301.
Optionally, the plurality of weight values has a non-uniform distribution with a variance less than an identified variance threshold. Optionally, the variance threshold is identified such that at least an identified part of the plurality of weight values have a respective high-part that is equal zero, for example, at least half of the plurality of weight values may have a respective high-part that is equal zero. In another example, the variance threshold is identified such that at least a third of the plurality of weight values have a respective high-part that is zero.
Reference is now made also to FIG. 5, showing a flowchart schematically representing another optional flow of operations 500 for configuring a neural network, according to some embodiments. In such embodiments, receiving first set 210 in 401 comprises processing unit 301 receiving in 501 a first sequence of low-part values of first set 210, for example low-part values 211-1 through 211-N. Optionally, in 502 processing unit 301 receives a first sequence of bits, each associated with one low-part value of the first sequence of low-part values in order, and having a value of 1 when a respective high-part value associated with the low-part value is not equal to zero, otherwise having a value of 0. For example, when high-part value 212-1 and high-part value 212-N are not equal zero, the first sequence of bits has a value of 1 in places 1 and N. When high-part value 212-2 is equal to zero, the first sequence of bits has a value of 0 in place 2. Optionally, in 503 processing unit 301 receives a first sequence of high-part values, each associated with a non-zero bit of the first sequence of bits, in order. Continuing the above example, the first sequence of high-part values comprises high-part value 212-1 and high-part value 212-N thereafter, where high-part value 212-2 is not a member of the first sequence of high-part values. It should be noted that high-part value 212-N is after high-part value 212-1 in the first sequence of high-part values, however not necessarily immediately following high-part value 212-1. Optionally, set of high-part values 250 is a combination of the first sequence of high-part values and the second sequence of high-part values.
Optionally, receiving second set 220 comprises processing unit 301 receiving in 511 a second sequence of low-part values of second set 220, for example low-part values 221-1 through 221-N. Optionally, in 512 processing unit 301 receives a second sequence of bits, each associated with one other low-part value of the second sequence of low-part values in order and having a value of 1 when another respective high-part value associated with the other low-part value is not equal to zero, otherwise having a value of 0. Optionally, in 513, processing unit 301 receives a second sequence of high-part values, each associated with another non-zero bit of the second sequence of bits, in order.
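A decoding counterpart of the encoding sketch given earlier for the bit-sequence association (again with assumed names and an assumed 8-bit low part) might look as follows; the bit sequence identifies which low parts have an associated non-zero high part, and the received high parts are consumed in order:

```python
# Reassemble weights from a sequence of low parts, a bit per low part, and
# the non-zero high parts in order.
def decode_set(lows, bits, highs, low_bits=8):
    weights, remaining_highs = [], iter(highs)
    for low, bit in zip(lows, bits):
        high = next(remaining_highs) if bit else 0
        weights.append((high << low_bits) | low)
    return weights

assert decode_set([0x07, 0x03, 0x02, 0x10], [0, 1, 0, 1], [0x01, 0x02]) == \
    [0x0007, 0x0103, 0x0002, 0x0210]
```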
Reference is now made again to FIG. 4.
Optionally, first set 210 and second set 220 are produced by processing unit 301 receiving another set of weight values of the plurality of weight values and processing unit 301 splitting the other set of weight values into first set 210 and second set 220 such that an amount of weight values of first set 210, denoted by N, is equal to an amount of weight values of second set 220.
Optionally, in 402 processing unit 301 produces set of combined low-part values 230. Optionally, each of set of combined low-part values 230 is produced by combining respective low-part values of two weight values, one selected from first set 210 and another selected from second set 220. Optionally, each of set of combined low-part values 230 is produced by combining respective low-part values of more than two weight values, for example one selected from first set 210, another selected from second set 220 and yet another selected from a third set of low-part values not shown.
In 403, processing unit 301 optionally configures the neural network to compute one or more output values. Optionally, the neural network comprises a plurality of layers, each having a plurality of layer weight values of the plurality of weight values of the neural network. Optionally, first set 210 and second set 220 are selected from the plurality of layer weight values of one of the plurality of layers. Optionally, the one or more output values are one or more output values of the layer.
Optionally, computing the one or more output values is by concurrently in 411 computing a first set of intermediate values and in 412 computing a second set of intermediate values. Optionally, computing the one or more output values in 403 comprises computing the one or more output values in 420 using the first set of intermediate values and the second set of intermediate values.
Optionally, each of set of combined low-part values 230 is associated with one of set of activation values 240.
Optionally, processing unit 301 configures the neural network to compute in 411 the first set of intermediate values by applying first set of multipliers 260 to set of combined low-part values 230. Optionally, applying first set of multipliers 260 to set of combined low-part values 230 comprises each of first set of multipliers 260 multiplying one of set of combined low-part values 230 with the respective activation value associated therewith.
Optionally, processing unit 301 configures the neural network to compute in 412 the second set of intermediate values by applying second set of multipliers 270 to set of high-part values 250. Optionally, set of high-part values 250 is a set of high-part values of at least some of first set 210 and second set 220. Optionally, each high-part value of set of high-part values 250 is associated with a low-part value where the low-part value and the high-part value originated from a common weight value of the first set and the second set. Optionally, applying second set of multipliers 270 to set of high-part values 250 comprises each of second set of multipliers 270 multiplying one of set of high-part values 250 by the respective activation value associated with the respective low-part value associated with the high-part value. Optionally, each respective activation value multiplied by one of set of high-part values 250 is not equal to zero. Optionally, applying second set of multipliers 270 to set of high-part values 250 comprises at least one of second set of multipliers 270 multiplying an activation value by a combined high-part value, produced by combining two of set of high-part values 250 each associated with the activation value. Optionally, a combined high-part value is produced by combining more than two high-part values.
Optionally, the neural network comprises additional multipliers not in either of first set of multipliers 260 and second set of multipliers 270.
Optionally, second set of multipliers 270 has an identified amount of multipliers, denoted by M, for example 32 multipliers. Optionally, a set of non-zero high-parts, produced by selecting a complete set of high-part values of all of first set 210 and second set 220, comprises more than M high-part values. Optionally, set of high-part values 250 comprises M high-part values. Optionally, a set of other high-part values comprises a plurality of other high-part values of the complete set of high-part values not members of set of high-part values 250, for example one or more of high-part value 212-2, high-part value 222-1 and high-part value 222-2.
Optionally, in 430, processing unit 301 computes another second set of intermediate values. Optionally, processing unit 301 computes the other second set of intermediate values using at least some multipliers selected from first set of multipliers 260 and second set of multipliers 270. Optionally, each of the at least some multipliers is applied to one of a set of other high-part values of at least some other high-part values selected from one or more of first set 210 and second set 220, for example high-part value 212-2. Optionally, the other high-part value is associated with another low-part value, for example low-part value 211-2, where the other low-part value and the other high-part value both originated from another common weight value of first set 210 and second set 220, for example weight value 210-2. Optionally, applying one of the at least some multipliers to one of the set of other high-part values comprises multiplying the other high-part value by the respective activation value, for example 240-2, associated with the respective other low-part value, for example low-part value 211-2 associated with the other high-part value 212-2. In 420, processing unit 301 optionally computes the one or more output values further using the other second set of intermediate values.
Optionally, the plurality of weight values comprises a plurality of pairs of sets, and the method described above is repeated for other pairs using other sets of multipliers of the neural network, optionally concurrently. Optionally, a first pair of sets comprises first set 210 and second set 220. Optionally, another pair of sets comprises another first set of weight values of the plurality of weight values and another second set of weight values of the plurality of weight values.
Optionally, 401, 402, 403, 411, 412 and 420 are repeated using the other pair of sets and another first set of multipliers of the neural network and another second set of multipliers of the neural network. Optionally, the first set of multipliers is different from the other first set of multipliers. Optionally, the second set of multipliers is different from the other second set of multipliers. Optionally, 401, 402, 403, 411, 412 and 420 are repeated using the other pair of sets and the first set of multipliers of the neural network and the second set of multipliers of the neural network.
Optionally, method 400 is applied to each of one or more layers of the neural network.
In some embodiments, a computer program comprises program instructions which, when executed by processing unit 301, cause apparatus 300 to implement method 400.
In some embodiments, apparatus 300 executes the neural network. In some such embodiments, apparatus 300 implements the following optional method.
Reference is now made also to FIG. 6, showing a flowchart schematically representing an optional flow of operations 600 for executing a neural network, according to some embodiments. In such embodiments, in 601 processing unit 301 configures the neural network, optionally by executing method 400, optionally to compute one or more output values. In 610 processing unit 301 optionally receives an input value and in 620 processing unit 301 optionally computes the one or more output values in response to the input value.
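By way of illustration only, flow of operations 600 may be summarized by the following sketch, in which the configuration step stands in for method 400 and the layer computation is reduced to a placeholder; all names and the placeholder computation are assumptions made for exposition and do not reproduce the low-part and high-part passes described above.

```python
# Illustrative sketch only: configure once (601), then receive an input (610)
# and compute the output values (620).
def configure_network(plurality_of_weights):
    # Stands in for method 400: split weights, combine low parts, plan the passes.
    return {"weights": plurality_of_weights}

def compute_outputs(configuration, input_values):
    # Placeholder: a plain dot product, in lieu of the configured concurrent passes.
    return [sum(w * x for w, x in zip(configuration["weights"], input_values))]

configuration = configure_network([3, 5, 7])          # 601
outputs = compute_outputs(configuration, [1, 2, 4])   # 610, 620
assert outputs == [3 * 1 + 5 * 2 + 7 * 4]
```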
The descriptions of the various embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant neural networks will be developed and the scope of the term neural network is intended to include all such new technologies a priori.
As used herein the term “about” refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", “having” and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of' and "consisting essentially of'.
The phrase "consisting essentially of means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method. As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. Any particular embodiment may include a plurality of “optional” features unless such features conflict.
Throughout this application, various embodiments may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of embodiments. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of embodiments, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of embodiments, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
Although embodiments have been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.
All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to embodiments. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims

1. An apparatus for configuring a neural network, comprising a processing unit configured for: receiving a first set and a second set of weight values of a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; producing a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and configuring the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set.
2. The apparatus of claim 1, wherein each of the set of combined low-part values is associated with one of a set of activation values; wherein applying the first set of multipliers to the set of combined low-part values comprises each of the first set of multipliers multiplying one of the set of combined low-part values with the respective activation value associated therewith; wherein applying the second set of multipliers to the set of high-part values comprises each of the second set of multipliers multiplying one of the set of high-part values by the respective activation value associated with the respective low-part value associated with the high-part value; and wherein computing the at least one output value comprises computing the at least one output value using the first set of intermediate values and the second set of intermediate values.
3. The apparatus of claim 2, wherein each of the set of high-part values is not equal to zero.
4. The apparatus of any of claims 2 and 3, wherein each respective activation value multiplied by a high-part value of the set of high-part values is not equal to zero.
5. The apparatus of any of claims 2-4, wherein configuring the neural network to compute at least one output value further comprises: computing another second set of intermediate values by at least some multipliers, selected from one or more of the first set of multipliers and the second set of multipliers, multiplying one of a set of other high-part values of at least some other high-part values selected from one or more of the first set and the second set, the other high-part value associated with another low-part value where the other low-part value and the other high-part value both originated from another common weight value of the first set and the second set, by the respective activation value associated with the respective other low-part value associated with the other high-part value; and wherein computing the at least one output value comprises computing the at least one output value further using the other second set of intermediate values.
6. The apparatus of claim 5, wherein the second set of multipliers has an identified amount of multipliers; wherein a set of non-zero high-parts, produced by selecting a complete set of high-part values of all of the first set and the second set, comprises more high-part values than the identified amount of multipliers; and wherein the set of other high-part values comprises a plurality of other high-part values of the complete set of high-part values not members of the set of high-part values.
7. The apparatus of claim 6, wherein receiving the first set comprises: receiving a first sequence of low-part values of the first set; receiving a first sequence of bits, each associated with one of the first sequence of low-part values in order and having a value of 1 when a respective high-part value associated with the low-part value is not equal to zero, otherwise having a value of 0; and receiving a first sequence of high-part values, each associated with a non-zero bit of the first sequence of bits, in order; and wherein receiving the second set comprises: receiving a second sequence of low-part values of the second set; receiving a second sequence of bits, each associated with one other of the second sequence of low-part values in order and having a value of 1 when another respective high-part value associated with the other low-part value is not equal to zero, otherwise having a value of 0; and receiving a second sequence of high-part values, each associated with another non-zero bit of the second sequence of bits, in order.
8. The apparatus of any of claims 6 and 7, wherein the identified amount of multipliers is 32.
9. The apparatus of any of claims 1-8, wherein: each weight value comprises a weight amount of bits; each low-part value comprises a low-part amount of bits; and each low-part value is a least significant part of the weight value.
10. The apparatus of claim 9, wherein the low-part amount of bits is half of the weight amount of bits.
11. The apparatus of any of claims 9 and 10, wherein the weight amount of bits is selected from the group of bit amounts consisting of: 4, 8, 16, 32, and 64.
12. The apparatus of any of claims 1-11, wherein the neural network comprises a plurality of layers, each having a plurality of layer weight values of the plurality of weight values of the neural network; and wherein the first set and the second set are selected from the plurality of layer weight values of one of the plurality of layers.
13. The apparatus of any of claims 1-12, wherein a first pair of sets comprises the first set and the second set; wherein a second pair of sets comprises another first set of weight values of the plurality of weight values and another second set of weight values of the plurality of weight values; wherein the processing unit is further configured for: receiving the second pair of sets; producing another set of combined low-part values, each produced by combining respective other low-part values of two other weight values, one selected from the other first set and yet another selected from the other second set; and configuring the neural network to compute the at least one output value by further concurrently computing: another first set of intermediate values, by applying another first set of multipliers of the neural network to the other set of combined low-part values; and another second set of intermediate values, by applying another second set of multipliers of the neural network to another set of high-part values of at least some of the other first set and the other second set, each other high-part value of the other set of high-part values associated with another low-part value where the other low-part value and the other high-part value both originated from another common weight value of the other first set and the other second set; and wherein the first set of multipliers is different from the other first set of multipliers and the second set of multipliers is different from the other second set of multipliers.
14. The apparatus of any of claims 1-13, wherein the first set and the second set are produced by: receiving another set of weight values of the plurality of weight values; and splitting the other set of weight values into the first set and the second set, such that an amount of weight values of the first set is equal to an amount of weight values of the second set.
15. The apparatus of any of claims 1-14, wherein the plurality of weight values has a non-uniform distribution with a variance less than an identified variance threshold.
16. A method for configuring a neural network, comprising: receiving a first set and a second set of weight values from a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value;
producing a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and configuring the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set.
17. An apparatus for executing a neural network, comprising a processing unit configured for: configuring the neural network by: receiving a first set and a second set of weight values from a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; producing a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and configuring the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set; receiving an input value; and computing the at least one output value in response to the input value.
18. A software program product for configuring a neural network, comprising: a non-transitory computer readable storage medium; first program instructions for receiving a first set and a second set of weight values from a plurality of weight values of the neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; second program instructions for producing a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and third program instructions for configuring the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set; wherein the first, second and third program instructions are executed by at least one computerized processor from the non-transitory computer readable storage medium.
19. A computer program comprising program instructions which, when executed by a processor, cause the processor to: receive a first set and a second set of weight values from a plurality of weight values of a neural network, where each weight value of the first set and the second set has a low-part value and a high-part value; produce a set of combined low-part values, each produced by combining respective low-part values of two weight values, one selected from the first set and another selected from the second set; and configure the neural network to compute at least one output value by concurrently computing: a first set of intermediate values, by applying a first set of multipliers of the neural network to the set of combined low-part values; and
a second set of intermediate values, by applying a second set of multipliers of the neural network to a set of high-part values of at least some of the first set and the second set, each high-part value of the set of high-part values associated with a low-part value where the low-part value and the high-part value both originated from a common weight value of the first set and the second set.