WO2020201791A1 - Trainable threshold for ternarized neural networks - Google Patents

Trainable threshold for ternarized neural networks

Info

Publication number
WO2020201791A1
Authority
WO
WIPO (PCT)
Prior art keywords
cnn
threshold
ternarization
processor
computing device
Prior art date
Application number
PCT/IB2019/000367
Other languages
French (fr)
Inventor
Andrey ANUFRIEV
Original Assignee
Intel Corporation
Priority date
Filing date
Publication date
Application filed by Intel Corporation filed Critical Intel Corporation
Priority to PCT/IB2019/000367 priority Critical patent/WO2020201791A1/en
Publication of WO2020201791A1 publication Critical patent/WO2020201791A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • Embodiments described herein relate to the field of neural networks. More specifically, the embodiments relate to methods and apparatuses for training thresholds for network ternarization.
  • Neural networks are tools for solving complex problems across a wide range of domains such as computer vision, image recognition, speech processing, natural language processing, language translation, and autonomous vehicles.
  • A NN with multiple layers between the input layer and the output layer may be referred to as a Deep Neural Network (DNN). Due to the number of layers in a DNN, execution of DNNs often requires significant amounts of processing and memory. The processing and memory requirements of DNNs often mean that execution of DNNs on edge or Internet-of-things (IoT) devices is impractical.
  • Figure 1 illustrates a computing system
  • Figure 2 illustrates an inference environment
  • Figure 3 illustrates a logic flow
  • Figure 4 illustrates a logic flow
  • Figure 5 illustrates a storage medium
  • Figure 6 illustrates a system
  • Embodiments disclosed herein provide a ternarized DNN, that is, a DNN where the weight space has been converted from a full precision weight space to a discrete weight space using a network ternarization threshold (or a ternary threshold).
  • a network ternarization threshold is trained simultaneously with network training. Accordingly, the full precision weights are converted to discrete weights (or ternary weights) based on the ternary threshold that has been trained simultaneously with the network weights.
  • a CNN can be ternarized as disclosed herein to speed up network inference, reduce network size, and/or reduce computational requirements for the network.
  • the reduction in network size due to ternarization can facilitate inference on IoT devices, particularly those devices with an architecture that supports convolutions with ternary weights (e.g., FPGA, ASICs, deep net accelerators, or the like).
  • the present disclosure can provide ternarized CNNs where the multiplication in the convolution operation can be replaced with addition, due to the ternarized weight space.
  • the present disclosure provides a significant advantage over conventional network ternarization techniques, in that the present disclosure does not require manual selection of ternarization parameters. Furthermore, the present disclosure does not require pre-existing assumptions regarding weight distribution in the network, which may often be incorrect.
  • embodiments disclosed herein provide for training ternary thresholds simultaneously with network training, or fine-tuning.
  • original float convolutional weights are divided into three groups during training.
  • the equation shown below, which is an approximation of the sum of two Heaviside step functions, could be used to represent the conversion of full precision weights to ternary weights:
  • FIG. 1 illustrates an embodiment of a computing system 100.
  • the computing system 100 is representative of any number and type of computing systems, such as a server, workstation, laptop, a virtualized computing system, a cloud computing system, an edge computing system, or the like.
  • the computing system 100 may be a server arranged to train a DNN, such as a CNN.
  • Computing system 100 can include processor 110, memory 120, input/output (I/O) components 130, and interface 140, among other components not depicted.
  • the processor 110 may include circuitry or processor logic, such as, for example, any of a variety of commercial processors.
  • the processor 110 may include multiple processors, a multi-threaded processor, a multi-core processor (whether the multiple cores coexist on the same or separate dies), and/or a multi-processor architecture of some other variety by which multiple physically separate processors are in some way linked.
  • the processor 110 may include graphics processing portions and may include dedicated memory, multiple-threaded processing and/or some other parallel processing capability.
  • the memory 120 may include logic, a portion of which includes arrays of integrated circuits, forming non-volatile memory to persistently store data or a combination of non-volatile memory and volatile memory. It is to be appreciated that the memory 120 may be based on any of a variety of technologies. In particular, the arrays of integrated circuits included in memory 120 may be arranged to form one or more types of memory, such as, for example, dynamic random access memory (DRAM), NAND memory, NOR memory, or the like.
  • the I/O component(s) 130 may include one or more components to provide input to or to provide output from the server 100.
  • the I/O component(s) 130 may be a keyboard (hardware, virtual, etc.), mouse, joystick, microphone, track pad, button, touch layers of a display, haptic feedback device, camera, speaker, or the like.
  • Interface 140 may include logic and/or features to support a communication interface.
  • the interface 140 may include one or more interfaces that operate according to various communication protocols or standards to communicate over direct or network communication links. Direct communications may occur via use of communication protocols or standards described in one or more industry standards (including progenies and variants).
  • the interface 140 may facilitate communication over a bus, such as, for example, peripheral component interconnect express (PCIe), non-volatile memory express (NVMe), universal serial bus (USB), system management bus (SMBus), SAS (e.g., serial attached small computer system interface (SCSI)) interfaces, serial AT attachment (SATA) interfaces, or the like.
  • Memory 120 stores instructions 122, a CNN model 121 comprising full precision network weight set 124 and ternary weight set 128. Furthermore, memory 120 stores ternarization threshold 126 and training/testing data set 150.
  • Processor 110 in executing instructions 122, can train full precision weight set 124 and ternarization threshold 126, simultaneously.
  • processor 110 in executing instructions 122, can apply a training algorithm or methodology to CNN model 121 based on testing/training data set 150 to “train” full precision weight set 124.
  • CNN model 121, and full precision weight set 124 can be trained using any of a variety of network training algorithms, such as, for example, back propagation, or the like. An example of a training algorithm is given below with reference to FIG. 3.
  • Processor 110, in executing instructions 122, can “train” ternarization threshold 126 simultaneously with training full precision weight set 124.
  • a neural network includes two processing phases, a training phase and an inference phase.
  • a deep learning expert will typically architect the network, establish the number of layers in the neural network, the operation performed by each layer, and the connectivity between layers.
  • Many layers have parameters, typically weights, that determine the exact computation performed by the layer.
  • the objective of the training process is to learn the weights, usually via a stochastic gradient descent-based excursion through the space of weights.
  • the training phase generates an output feature map, also referred to as an activation tensor.
  • An activation tensor may be generated for each convolutional layer of a CNN model (e.g., CNN model 121).
  • the output feature map of a given convolutional layer may be the input to the next convolutional layer.
  • inference based on the trained neural network typically employs a forward-propagation calculation for input data to generate output data.
  • In TWNs, once training is complete, the full precision weight set generated during training is ternarized to generate a ternary weight set.
  • processor 110 in executing instructions 122, can generate ternarized weight set 128 from full precision weight set 124 and ternarization threshold 126. Ternarized weight set 128 can be used to generate inferences on resource-constrained devices, such as, for example, edge computing devices.
  • Ternary weight networks (TWNs) are neural networks with weights constrained to +1, 0, and −1. The aim of network ternarization is to minimize the function ‖W − α·W_t‖, where W is the trained full precision weight set of a layer of the network, W_t ∈ {−1, 0, 1}, and α ∈ ℝ.
  • As described herein, ternarization threshold 126 is trained simultaneously with full precision weights 124. This is described in greater detail below, for example, with respect to FIG. 3. Accordingly, a TWN, comprising ternarized weights 128 generated from full precision weights 124 and ternarization threshold 126, could be generated without significant manual input from a user, as conventionally required.
  • CNN model 121 may provide cascaded stages for face detection, character recognition, speech recognition, or the like.
  • training full precision weight set 124 for CNN model 121 may be based on a training/testing dataset 150 (e.g., images of faces, handwriting, printed information, etc.) that is in the form of tensor data.
  • a tensor is a geometric object that describes linear relations between geometric vectors, scalars, and other tensors.
  • An organized multidimensional array of numerical values, or tensor data may represent a tensor.
  • the training may produce the full precision weight set 124.
  • the full precision weight set 124 may specify features that are characteristic of numerals and/or each letter in the English alphabet.
  • the full precision weight set 124 can be temarized as described above.
  • ternary weight set 128 can be generated based on full precision weight set 124 and ternarization threshold 126.
  • a TWN corresponding to CNN model 121 and ternarized weight set 128 may receive images as input and perform desired processing on the input images.
  • the input images may depict handwriting, and the TWN may identify numerals and/or letters of the English alphabet included in the handwriting.
  • FIG. 2 illustrates an example inference environment 200 including computing system 100 coupled to an edge computing device 201.
  • edge computing device 201 can be arranged to receive CNN model 121 including ternarized weight set 128 (e.g., the TWN model 221 representative of CNN model 121).
  • edge computing device 201 can be arranged to generate an inference (e.g., execute TWN model 221).
  • Edge computing device 201 may be any computing device arranged as an “edge” type, such as a gateway, a network attached processing device, an IoT device, or the like.
  • Edge computing device 201 can include processor 210, memory 220, interface 240, and sensor 260, among other components not depicted.
  • the processor 210 may include circuity or processor logic, such as, for example, any of a variety of commercial processors.
  • the processor 210 may include multiple processors, a multi-threaded processor, a multi-core processor (whether the multiple cores coexist on the same or separate dies), and/or a multi-processor architecture of some other variety by which multiple physically separate processors are in some way linked.
  • the processor 210 may include graphics processing portions and may include dedicated memory, multiple-threaded processing and/or some other parallel processing capability.
  • processor 210 may be a custom or specific processor circuit arranged to execute TWN model 221.
  • processor 210 can be a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or a neural accelerator circuit, arranged to support convolution operations with ternary weights.
  • the memory 220 may include logic, a portion of which includes arrays of integrated circuits, forming non-volatile memory to persistently store data or a combination of non-volatile memory and volatile memory. It is to be appreciated that the memory 220 may be based on any of a variety of technologies. In particular, the arrays of integrated circuits included in memory 220 may be arranged to form one or more types of memory, such as, for example, dynamic random access memory (DRAM), NAND memory, NOR memory, or the like.
  • Interface 240 may include circuitry and/or logic to support a communication interface.
  • the interface 240 may include one or more interfaces that operate according to various communication protocols or standards to communicate over direct or network communication links. Direct communications may occur via use of communication protocols or standards described in one or more industry standards (including progenies and variants).
  • the interface 240 may facilitate communication over a bus, such as, for example, peripheral component interconnect express (PCIe), non-volatile memory express (NVMe), universal serial bus (USB), system management bus (SMBus), SAS (e.g., serial attached small computer system interface (SCSI)) interfaces, serial AT attachment (SATA) interfaces, or the like.
  • Sensor 260 may include circuitry and/or logic to support collections of sensor data 262.
  • sensor 260 could be a camera, a microphone, a gyroscope, a global positioning sensor, a biometric sensor, a temperature sensor, or the like. Examples are not limited in this context.
  • Memory 220 stores instructions 222, a TWN model 221 comprising ternary weight set 128. Furthermore, memory 220 stores sensor data 262 and TWN model inference 264. Processor 210, in executing instructions 222, can receive TWN 221 from computing system 100. For example, edge computing device 201 can be communicatively coupled to computing system 100, via interface 240, network 299, and interface 140. Processor 210, in executing instructions 222 can receive an information element comprising indications of TWN model 221 architecture and ternary weight set 128.
  • Processor 210 in executing instructions 222 can receive sensor data 262 from sensor 260. Furthermore, processor 210, in executing instructions 222 can generate TWN model inference 264 from TWN model 221 and sensor data 262.
  • FIG. 3 illustrates an embodiment of a logic flow 300.
  • the logic flow 300 may be representative of some or all the operations executed by one or more embodiments described herein.
  • the computing system 100 (or components thereof) may perform the operations in logic flow 300 to train full precision weights for a CNN while simultaneously training a ternary threshold to generate a TWN from the fully trained CNN (a simplified sketch of this combined training loop is given after the examples at the end of this section).
  • logic flow 300 is described with reference to training a DNN having the structure of a convolutional neural network (CNN).
  • Logic flow 300 can begin at block 310.
  • processor 110 in executing instructions 122, can initialize ternary thresholds 126.
  • processor 110 in executing instructions 122, can derive CNN model output via a forward pass through the network.
  • processor 110 in executing instructions 122 can compute the forward pass through selected convolutional layers (e.g., interior layers, or the like) according to the following equation, where W are the trainable original weights of a pretrained full precision CNN, α > 0, Δ_neg > 0, Δ_pos > 0, and H is the Heaviside function from Equation 1 above.
  • processor 110 in executing instructions 122, can derive network loss (L).
  • processor 110 in executing instructions 122 can derive network loss based on any of a variety of loss functions, such as, mean squared error (MSE), cross entropy loss, average binary cross entropy loss, L1 loss for a position regressor, or the like.
  • processor 110 in executing instructions 122, can update full precision weight set 124.
  • processor 110 in executing instructions 122 can apply a backpropagation algorithm to CNN model 121 to update full precision weight set 124.
  • processor 110 in executing instructions 122, can update the slope in all layers that are to be ternarized.
  • processor 110 in executing instructions 122 can update the slope using the following equation, where t is the number of iterations and C is a precomputed constant.
  • processor 110 in executing instructions 122, can determine whether a minimum error has been reached (e.g., based on the loss function, or the like). For example, processor 110, in executing instructions 122 can determine whether a minimum has been reached and whether to continue training (e.g., of full precision weight set 124 and ternary threshold 126) based on whether the minimum has been reached.
  • logic flow 300 can return to block 320 to continue training as described or can continue to block 380.
  • processor 110 in executing instructions 122, can generate ternary weight set 128 from full precision weight set 124 and ternary threshold 126.
  • FIG. 4 illustrates an embodiment of a logic flow 400.
  • the logic flow 400 may be representative of some or all the operations executed by one or more embodiments described herein.
  • the computing system 100 (or components thereof) may perform the operations in logic flow 400 to replace full precision weight set 124 with ternary weight set 128 (e.g., at block 380 of logic flow 300, or the like).
  • Logic flow 400 can begin at block 410.
  • processor 110 in executing instructions 122, can retrieve a modified convolutional layer (layer i).
  • In some examples, only some of the convolutional layers (e.g., interior layers, etc.) are modified to use ternary weights, while other layers (e.g., input and output layers) are not modified.
  • processor 110 in executing instructions 122 can retrieve full precision weights from full precision weight set 124 corresponding to a first one of the modified convolutional layers of CNN model 121.
  • processor 110 in executing instructions 122, can replace full precision weights with ternary weights for the retrieved convolutional layer.
  • processor 110 in executing instructions 122 can replace full precision weights with ternary weights based on the following equation.
  • processor 110 in executing instructions 122, can determine whether there are more modified convolutional layers to replace full precision weights with ternary weights. From block 430, logic flow 400 can return to block 410 to continue replacing full precision weights with ternary weights as described or can end.
  • FIG. 5 illustrates an embodiment of a storage medium 500.
  • Storage medium 500 may comprise any non-transitory computer-readable storage medium or machine-readable storage medium, such as an optical, magnetic or semiconductor storage medium.
  • storage medium 500 may comprise an article of manufacture.
  • storage medium 500 may store computer-executable instructions, such as computer-executable instructions to implement one or more of logic flows or operations described herein, such as with respect to 300 and/or 400 of FIGS. 3-4.
  • Examples of a computer- readable storage medium or machine-readable storage medium may include any tangible media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth.
  • Examples of computer-executable instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, object-oriented code, visual code, and the like. The embodiments are not limited in this context.
  • FIG. 6 illustrates an embodiment of a system 3000.
  • the system 3000 is a computer system with multiple processor cores such as a distributed computing system, supercomputer, high-performance computing system, computing cluster, mainframe computer, mini-computer, client-server system, personal computer (PC), workstation, server, portable computer, laptop computer, tablet computer, handheld device such as a personal digital assistant (PDA), or other device for processing, displaying, or transmitting information.
  • Similar embodiments may comprise, e.g., entertainment devices such as a portable music player or a portable video player, a smart phone or other cellular phone, a telephone, a digital video camera, a digital still camera, an external storage device, or the like. Further embodiments implement larger scale server configurations.
  • the system 3000 may have a single processor with one core or more than one processor.
  • the term “processor” refers to a processor with a single core or a processor package with multiple processor cores.
  • the computing system 3000 is representative of the computing system 100 and/or the edge computing device 201. More generally, the computing system 3000 is configured to implement all logic, systems, logic flows, methods, apparatuses, and functionality described herein with reference to FIGS. 1-5.
  • a component can be, but is not limited to being, a process running on a processor, a processor, a hard disk drive, multiple storage drives (of optical and/or magnetic storage medium), an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a server and the server can be a component.
  • One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers. Further, components may be communicatively coupled to each other by various types of communications media to coordinate operations. The coordination may involve the unidirectional or bi-directional exchange of information. For instance, the components may communicate information in the form of signals communicated over the communications media. The information can be implemented as signals allocated to various signal lines. In such allocations, each message is a signal. Further embodiments, however, may alternatively employ data messages. Such data messages may be sent across various connections. Exemplary connections include parallel interfaces, serial interfaces, and bus interfaces.
  • system 3000 comprises a motherboard 3005 for mounting platform components.
  • the motherboard 3005 is a point-to-point interconnect platform that includes a first processor 3010 and a second processor 3030 coupled via a point-to-point interconnect 3056 such as an Ultra Path Interconnect (UPI).
  • the system 3000 may be of another bus architecture, such as a multi-drop bus.
  • each of processors 3010 and 3030 may be processor packages with multiple processor cores including processor core(s) 3020 and 3040, respectively.
  • the system 3000 is an example of a two- socket (2S) platform, other embodiments may include more than two sockets or one socket.
  • some embodiments may include a four-socket (4S) platform or an eight-socket (8S) platform.
  • Each socket is a mount for a processor and may have a socket identifier.
  • platform refers to the motherboard with certain components mounted such as the processors 3010 and the chipset 3060. Some platforms may include additional components and some platforms may only include sockets to mount the processors and/or the chipset.
  • the processors 3010, 3030 can be any of various commercially available processors, including without limitation an Intel® Celeron®, Core®, Core (2) Duo®, Itanium®, Pentium®, Xeon®, and XScale® processors; AMD® Athlon®, Duron® and Opteron® processors; ARM® application, embedded and secure processors; IBM® and Motorola® DragonBall® and PowerPC® processors; IBM and Sony® Cell processors; and similar processors. Dual microprocessors, multi-core processors, and other multi-processor architectures may also be employed as the processors 3010, 3030.
  • the first processor 3010 includes an integrated memory controller (IMC) 3014 and point- to-point (P-P) interfaces 3018 and 3052.
  • the second processor 3030 includes an IMC 3034 and P-P interfaces 3038 and 3054.
  • the IMC's 3014 and 3034 couple the processors 3010 and 3030, respectively, to respective memories, a memory 3012 and a memory 3032.
  • the memories 3012 and 3032 may be portions of the main memory (e.g., a dynamic random-access memory (DRAM)) for the platform such as double data rate type 3 (DDR3) or type 4 (DDR4) synchronous DRAM (SDRAM).
  • the memories 3012 and 3032 locally attach to the respective processors 3010 and 3030.
  • the main memory may couple with the processors via a bus and shared memory hub.
  • the processors 3010 and 3030 comprise caches coupled with each of the processor core(s) 3020 and 3040, respectively.
  • the processor cores 3020, 3040 may include memory management logic circuitry (not pictured) which may represent circuitry configured to implement the functionality of the logic flows 300 and/or 400, or may represent a combination of the circuitry within a processor and a medium to store all or part of the functionality of these logic flows in memory such as cache, the memory 3012, buffers, registers, or storage medium 800 attached to the processors 3010 and/or 3030 via a chipset 3060.
  • the functionality of these logic flows may also reside in whole or in part in memory such as the memory 3012 and/or a cache of the processor.
  • these logic flows may also reside in whole or in part as circuitry within the processor 3010 and may perform operations, e.g., within registers or buffers such as the registers 3016 within the processors 3010, 3030, or within an instruction pipeline of the processors 3010, 3030. Further still, the functionality of these logic flows may be integrated into a processor of the hardware accelerator 106 for performing training and ternarization of a CNN (e.g., CNN model 121, etc.).
  • More than one of the processors 3010 and 3030 may comprise the functionality of these logic flows, such as the processor 3030 and/or a processor within the hardware accelerator 106 coupled with the chipset 3060 via an interface (I/F) 3066.
  • the I/F 3066 may be, for example, a Peripheral Component Interconnect-enhanced (PCI-e).
  • the first processor 3010 couples to a chipset 3060 via P-P interconnects 3052 and 3062 and the second processor 3030 couples to a chipset 3060 via P-P interconnects 3054 and 3064.
  • Direct Media Interfaces (DMIs) 3057 and 3058 may couple the P-P interconnects 3052 and 3062 and the P-P interconnects 3054 and 3064, respectively.
  • the DMI may be a high-speed interconnect that facilitates, e.g., eight Giga Transfers per second (GT/s) such as DMI 3.0.
  • the processors 3010 and 3030 may interconnect via a bus.
  • the chipset 3060 may comprise a controller hub such as a platform controller hub (PCH).
  • the chipset 3060 may include a system clock to perform clocking functions and include interfaces for an I/O bus such as a universal serial bus (USB), peripheral component interconnects (PCIs), serial peripheral interconnects (SPIs), integrated interconnects (I2Cs), or the like.
  • the chipset 3060 may comprise more than one controller hub such as a chipset with a memory controller hub, a graphics controller hub, and an input/output (I/O) controller hub.
  • the chipset 3060 couples with a trusted platform module (TPM) 3072 and the UEFI, BIOS, Flash component 3074 via an interface (I/F) 3070.
  • the TPM 3072 is a dedicated microcontroller designed to secure hardware by integrating cryptographic keys into devices.
  • the UEFI, BIOS, Flash component 3074 may provide pre-boot code.
  • chipset 3060 includes an I/F 3066 to couple chipset 3060 with a high-performance graphics engine, graphics card 3065.
  • the system 3000 may include a flexible display interface (FDI) between the processors 3010 and 3030 and the chipset 3060.
  • the FDI interconnects a graphics processor core in a processor with the chipset 3060.
  • Various I/O devices 3092 couple to the bus 3081, along with a bus bridge 3080 which couples the bus 3081 to a second bus 3091 and an I/F 3068 that connects the bus 3081 with the chipset 3060.
  • the second bus 3091 may be a low pin count (LPC) bus.
  • Various devices may couple to the second bus 3091 including, for example, a keyboard 3082, a mouse 3084, communication devices 3086 and the storage medium 700 that may store computer executable code as previously described herein.
  • an audio I/O 3090 may couple to second bus 3091.
  • Many of the I/O devices 3092, communication devices 3086, and the storage medium 800 may reside on the motherboard 3005 while the keyboard 3082 and the mouse 3084 may be add-on peripherals. In other embodiments, some or all the I/O devices 3092, communication devices 3086, and the storage medium 800 are add-on peripherals and do not reside on the motherboard 3005.
  • One or more aspects of at least one example may be implemented by representative instructions stored on at least one machine-readable medium which represents various logic within the processor, which when read by a machine, computing device or system causes the machine, computing device or system to fabricate logic to perform the techniques described herein.
  • Such representations known as “IP cores” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that make the logic or processor.
  • hardware elements may include devices, components, processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), memory units, logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth.
  • software elements may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an example is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints, as desired for a given implementation.
  • a computer-readable medium may include a non-transitory storage medium to store logic.
  • the non-transitory storage medium may include one or more types of computer-readable storage media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth.
  • the logic may include various software elements, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, API, instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof.
  • a computer-readable medium may include a non-transitory storage medium to store or maintain instructions that when executed by a machine, computing device or system, cause the machine, computing device or system to perform methods and/or operations in accordance with the described examples.
  • the instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like.
  • the instructions may be implemented according to a predefined computer language, manner or syntax, for instructing a machine, computing device or system to perform a certain function.
  • the instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
  • Some embodiments may be described using the terms “coupled” and “connected” along with their derivatives. These terms are not necessarily intended as synonyms for each other. For example, descriptions using the terms “connected” and/or “coupled” may indicate that two or more elements are in direct physical or electrical contact with each other. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, yet still co-operate or interact with each other.
  • a data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus.
  • the memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code must be retrieved from bulk storage during execution.
  • the term “code” covers a broad range of software components and constructs, including applications, drivers, processes, routines, methods, modules, firmware, microcode, and subprograms. Thus, the term “code” may be used to refer to any collection of instructions which, when executed by a processing system, perform a desired operation or operations.
  • Circuitry is hardware and may refer to one or more circuits. Each circuit may perform a particular function.
  • a circuit of the circuitry may comprise discrete electrical components interconnected with one or more conductors, an integrated circuit, a chip package, a chip set, memory, or the like.
  • Integrated circuits include circuits created on a substrate such as a silicon wafer and may comprise components. And integrated circuits, processor packages, chip packages, and chipsets may comprise one or more processors.
  • Processors may receive signals such as instructions and/or data at the input(s) and process the signals to generate the at least one output. While executing code, the code changes the physical states and characteristics of transistors that make up a processor pipeline. The physical states of the transistors translate into logical bits of ones and zeros stored in registers within the processor. The processor can transfer the physical states of the transistors into registers and transfer the physical states of the transistors to another storage medium.
  • a processor may comprise circuits to perform one or more sub-functions implemented to perform the overall function of the processor.
  • One example of a processor is a state machine or an application-specific integrated circuit (ASIC) that includes at least one input and at least one output.
  • a state machine may manipulate the at least one input to generate the at least one output by performing a predetermined series of serial and/or parallel manipulations or transformations on the at least one input.
  • the logic as described above may be part of the design for an integrated circuit chip.
  • the chip design is created in a graphical computer programming language and stored in a computer storage medium or data storage medium (such as a disk, tape, physical hard drive, or virtual hard drive such as in a storage access network). If the designer does not fabricate chips or the photolithographic masks used to fabricate chips, the designer transmits the resulting design by physical means (e.g., by providing a copy of the storage medium storing the design) or electronically (e.g., through the Internet) to such entities, directly or indirectly. The stored design is then converted into the appropriate format (e.g., GDSII) for the fabrication.
  • the resulting integrated circuit chips can be distributed by the fabricator in raw wafer form (that is, as a single wafer that has multiple unpackaged chips), as a bare die, or in a packaged form.
  • the chip is mounted in a single chip package (such as a plastic carrier, with leads that are affixed to a motherboard or other higher level carrier) or in a multichip package (such as a ceramic carrier that has either or both surface interconnections or buried interconnections).
  • the chip is then integrated with other chips, discrete circuit elements, and/or other signal processing devices as part of either (a) an intermediate product, such as a processor board, a server platform, or a motherboard, or (b) an end product.
  • Example 1 An apparatus, comprising: a processor; and a memory storing instructions, which when executed by the processor cause the processor to: implement at least one training epoch for a convolutional neural network (CNN), each of the at least one training epochs comprising: updating full precision weights for the CNN; updating a ternarization threshold for the CNN; and replace the full precision weights for at least one layer of the CNN with ternary weights based in part on the ternarization threshold.
  • Example 2 The apparatus of example 1, the memory storing instructions, which when executed by the processor cause the processor to initialize the ternarization threshold.
  • Example 3 The apparatus of example 2, the memory storing instructions, which when executed by the processor cause the processor to initialize a positive ternarization threshold and a negative ternarization threshold based in part on the following equation, where Δ_pos is the positive ternarization threshold, Δ_neg is the negative ternarization threshold, and W is the weight space of the CNN:
  • Example 4 The apparatus of examples 1, 2, or 3, wherein each of the at least one epochs comprises: deriving output from the CNN based in part on a forward pass calculation through the CNN; deriving an error of the output based in part on a loss function; and updating the full precision weights for the CNN based in part on the error and a backward pass through the CNN.
  • Example 5 The apparatus of example 4, the memory storing instructions, which when executed by the processor cause the processor to update the ternarization threshold based in part on the following equations, where W is a weight space of the CNN, L is the loss function, Δ is the ternarization threshold, slope is the slope of a layer of the CNN, and t is the one of the at least one epochs:
  • Example 7 The apparatus of examples 4, 5, or 6, wherein the loss function is a mean squared error function.
  • Example 8 The apparatus of example 4, 5, 6, or 7, wherein the backward pass through the CNN is based in part on a backpropagation algorithm.
  • Example 9 A non-transitory computer-readable storage medium comprising instructions that when executed by a computing device, cause the computing device to: update full precision weights for the CNN; update a ternarization threshold for the CNN; and replace the full precision weights for at least one layer of the CNN with ternary weights based in part on the ternarization threshold.
  • Example 10 The non-transitory computer-readable storage medium of example 9, comprising instructions that when executed by the computing device, cause the computing device to initialize the ternarization threshold.
  • Example 12 The non-transitory computer-readable storage medium of examples 9, 10, or 11, wherein each of the at least one epochs comprises: deriving output from the CNN based in part on a forward pass calculation through the CNN; deriving an error of the output based in part on a loss function; and updating the full precision weights for the CNN based in part on the error and a backward pass through the CNN.
  • Example 15 The non-transitory computer-readable storage medium of examples 12, 13, or 14, wherein the loss function is a mean squared error function.
  • Example 16 The non-transitory computer-readable storage medium of examples 12, 13, 14, or 15, wherein the backward pass through the CNN is based in part on a backpropagation algorithm.
  • Example 17 An edge computing device: circuitry; and memory coupled to the circuitry, the memory comprising: a ternary weight network (TWN) based in part on a convolutional neural network (CNN) wherein full precision weights of the CNN were trained simultaneously with a ternarization threshold with which ternary weights of the TWN were generated; and instructions, which when executed by the circuitry cause the circuitry to generate an inference based in part on the TWN.
  • Example 18 The edge computing device of example 17, wherein the circuitry is an application specific integrated circuit or a neural accelerator.
  • Example 19 The edge computing device of examples 17 or 18, the memory storing instructions, which when executed by the circuitry cause the circuitry to: receive input data; and generate the inference based in part on providing the input data as input to the TWN.
  • Example 20 The edge computing device of examples 17, 18, or 19, the memory storing instructions, which when executed by the circuitry cause the circuitry to receive, from a computing device, the TWN.
  • Example 21 A system comprising: a computing system, comprising: a processor; and memory storing instructions, which when executed by the processor cause the processor to: implement at least one training epoch for a convolutional neural network (CNN), each of the at least one training epochs comprising: updating full precision weights for the CNN; updating a ternarization threshold for the CNN; and generate a ternary weight network (TWN) based in part on replacing, based in part on the ternarization threshold, the full precision weights for at least one layer of the CNN with ternary weights; and an edge computing device coupled to the computing system, the edge computing device comprising: circuitry; and edge memory coupled to the circuitry, the edge memory storing instructions, which when executed by the circuitry cause the circuitry to: receive the TWN from the computing system; and generate an inference based in part on the TWN.
  • Example 22 The system of example 21, the memory storing instructions, which when executed by the processor cause the processor to initialize the ternarization threshold.
  • Example 24 The system of examples 21, 22, or 23, wherein each of the at least one epochs comprises: deriving output from the CNN based in part on a forward pass calculation through the CNN; deriving an error of the output based in part on a loss function; and updating the full precision weights for the CNN based in part on the error and a backward pass through the CNN.
  • Example 25 The system of example 24, the memory storing instructions, which when executed by the processor cause the processor to update the ternarization threshold based in part on the following equations, where W is a weight space of the CNN, L is the loss function, Δ is the ternarization threshold, slope is the slope of a layer of the CNN, and t is the one of the at least one epochs:
  • Example 27 The system of examples 24, 25, or 26, wherein the loss function is a mean squared error function.
  • Example 28 The system of examples 24, 25, 26 or 27, wherein the backward pass through the CNN is based in part on a backpropagation algorithm.
  • Example 29 The system of examples 21, 22, 23, 24, 25, 26, 27, or 28, wherein the circuitry is an application specific integrated circuit or a neural accelerator.
  • Example 30 The system of examples 21, 22, 23, 24, 25, 26, 27, 28, or 29, the edge memory storing instructions, which when executed by the circuitry cause the circuitry to: receive input data; and generate the inference based in part on providing the input data as input to the TWN.
  • Example 31 A non-transitory computer-readable storage medium comprising instructions that when executed by an edge computing device, cause the edge computing device to: generate an inference based in part on a ternary weight network (TWN), the TWN based in part on a convolutional neural network (CNN) wherein full precision weights of the CNN were trained simultaneously with a ternarization threshold with which ternary weights of the TWN were generated.
  • Example 32 The non-transitory computer-readable storage medium of example 31, wherein the circuitry is an application specific integrated circuit or a neural accelerator.
  • Example 33 The non-transitory computer-readable storage medium of examples 31 or 32, comprising instructions that when executed by the edge computing device, cause the edge computing device to: receive input data; and generate the inference based in part on providing the input data as input to the TWN.
  • Example 34 The non-transitory computer-readable storage medium of examples 31, 32, or 33, comprising instructions that when executed by the edge computing device, cause the edge computing device to receive, from a computing device, the TWN.
  • Example 35 A method comprising: generating an inference based in part on a ternary weight network (TWN), the TWN based in part on a convolutional neural network (CNN) wherein full precision weights of the CNN were trained simultaneously with a ternarization threshold with which ternary weights of the TWN were generated.
  • Example 36 The method of example 35, wherein the circuitry is an application specific integrated circuit or a neural accelerator.
  • Example 37 The method of examples 35 or 36, comprising: receiving input data; and generating the inference based in part on providing the input data as input to the TWN.
  • Example 38 A method, comprising: updating full precision weights for the CNN; updating a ternarization threshold for the CNN; and replacing the full precision weights for at least one layer of the CNN with ternary weights based in part on the ternarization threshold.
  • Example 39 The method of example 38, comprising initializing the ternarization threshold.
  • Example 41 The method of examples 38, 39, or 40, wherein each of the at least one epochs comprises: deriving output from the CNN based in part on a forward pass calculation through the CNN; deriving an error of the output based in part on a loss function; and updating the full precision weights for the CNN based in part on the error and a backward pass through the CNN.
  • Example 42 The method of example 41, comprising updating the ternarization threshold based in part on the following equations, where W is a weight space of the CNN, L is the loss function, Δ is the ternarization threshold, slope is the slope of a layer of the CNN, and t is the one of the at least one epochs:
  • Example 43 The method of examples 41 or 42, comprising computing the forward pass through the CNN based in part on the following equations, where W is a weight space of the CNN:
  • Example 44 The method of examples 41, 42, or 43, wherein the loss function is a mean squared error function.
  • Example 45 The method of examples 41, 42, 43, or 44, wherein the backward pass through the CNN is based in part on a backpropagation algorithm.
  • Example 46 An apparatus, comprising means arranged to implement the function of any one of examples 35 to 45.
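
The sketch referenced above in the discussion of logic flow 300 is given here. It is a simplified, self-contained toy version of the combined training loop (logic flow 300) and weight replacement (logic flow 400) for a single layer fit with a mean squared error loss. The sigmoid-based soft ternarization, the slope schedule slope = 1 + C*t, the learning rate, and every function and variable name are assumptions introduced only for illustration; the published application gives the actual equations as images that are not reproduced in this text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -60.0, 60.0)))  # clipped for numerical stability

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def soft_ternarize(w, dp, dn, slope, alpha=1.0):
    # Differentiable stand-in for alpha * (H(w - dp) - H(-w - dn)).
    a = slope * (w - dp)
    b = slope * (-w - dn)
    return alpha * (sigmoid(a) - sigmoid(b)), a, b

# Toy regression whose targets are produced by a hidden ternary weight vector.
d, n = 6, 512
w_true = rng.choice([-1.0, 0.0, 1.0], size=d)
X = rng.normal(size=(n, d))
y = X @ w_true

# Block 310: initialize the ternarization thresholds; weights start small.
w = rng.normal(scale=0.3, size=d)
dp, dn = 0.2, 0.2                  # positive / negative ternarization thresholds
alpha, lr, C = 1.0, 0.05, 0.05     # C: assumed constant of the slope schedule

for t in range(1, 301):
    slope = 1.0 + C * t            # assumed schedule: slope grows with iteration t
    s, a, b = soft_ternarize(w, dp, dn, slope, alpha)   # forward pass
    y_hat = X @ s
    grad_yhat = 2.0 * (y_hat - y) / n                   # gradient of the MSE loss
    grad_s = X.T @ grad_yhat
    # Backward pass: the loss gradient reaches the weights AND both thresholds.
    grad_w = grad_s * alpha * slope * (dsigmoid(a) + dsigmoid(b))
    grad_dp = np.sum(grad_s * (-alpha * slope * dsigmoid(a)))
    grad_dn = np.sum(grad_s * (alpha * slope * dsigmoid(b)))
    w -= lr * grad_w
    dp = max(dp - lr * grad_dp, 1e-3)   # keep thresholds positive
    dn = max(dn - lr * grad_dn, 1e-3)

# Block 380 / logic flow 400: replace full precision weights with ternary weights.
w_ternary = np.where(w > dp, 1.0, np.where(w < -dn, -1.0, 0.0))
print("learned thresholds:", float(dp), float(dn))
print("ternary weights:   ", w_ternary)
print("generating weights:", w_true)
```

In a real CNN the same gradients would be obtained automatically by an automatic-differentiation framework; the analytic derivatives are written out here only to keep the sketch dependency-free.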

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure is generally directed to ternary weight networks (TWNs) and provides processes and systems arranged to train a convolutional neural network (CNN) simultaneously with a ternarization threshold. The ternarization threshold can be used to replace full precision weights in the trained CNN with ternary weights to form a TWN based on the trained CNN.

Description

TRAINABLE THRESHOLD FOR TERNARIZED NEURAL NETWORKS
TECHNICAL FIELD
Embodiments described herein relate to the field of neural networks. More specifically, the embodiments relate to methods and apparatuses for training thresholds for network ternarization.
BACKGROUND
Neural networks (NNs) are tools for solving complex problems across a wide range of domains such as computer vision, image recognition, speech processing, natural language processing, language translation, and autonomous vehicles. A NN with multiple layers between the input layer and the output layer may be referred to as a Deep Neural Network (DNN). Due to the number of layers in a DNN, execution of DNNs often requires significant amounts of processing and memory. The processing and memory requirements of DNNs often mean that execution of DNNs on edge or Internet-of-things (IoT) devices is impractical.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates a computing system.
Figure 2 illustrates an inference environment.
Figure 3 illustrates a logic flow.
Figure 4 illustrates a logic flow.
Figure 5 illustrates a storage medium.
Figure 6 illustrates a system.
DETAILED DESCRIPTION
Embodiments disclosed herein provide a ternarized DNN, that is, a DNN where the weight space has been converted from a full precision weight space to a discrete weight space using a network ternarization threshold (or a ternary threshold). As provided herein, the network ternarization threshold is trained simultaneously with network training. Accordingly, the full precision weights are converted to discrete weights (or ternary weights) based on the ternary threshold that has been trained simultaneously with the network weights.
One type of DNN is a convolutional neural network (CNN). A CNN can be ternarized as disclosed herein to speed up network inference, reduce network size, and/or reduce computational requirements for the network. In some examples, the reduction in network size due to ternarization can facilitate inference on IoT devices, particularly those devices with an architecture that supports convolutions with ternary weights (e.g., FPGAs, ASICs, deep net accelerators, or the like). As a specific example, the present disclosure can provide ternarized CNNs where the multiplication in the convolution operation can be replaced with addition, due to the ternarized weight space.
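To make the multiplication-free convolution concrete, the sketch below shows a dot product (the inner loop of a convolution) with ternary weights. It is an illustration under assumptions rather than code from the disclosure; the name ternary_dot and the scale factor alpha are hypothetical.

```python
import numpy as np

def ternary_dot(x, w_t, alpha=1.0):
    # With weights restricted to {-1, 0, +1}, a dot product reduces to adding
    # the inputs whose weight is +1, subtracting those whose weight is -1,
    # and applying a single scale factor alpha at the end.
    pos = x[w_t == 1].sum()
    neg = x[w_t == -1].sum()
    return alpha * (pos - neg)

x = np.array([0.5, -1.2, 3.0, 0.7])      # one patch of input activations
w_t = np.array([1, 0, -1, 1])            # ternary filter weights
print(ternary_dot(x, w_t, alpha=0.8))    # 0.8 * (0.5 + 0.7 - 3.0) = -1.44
```

Hardware that supports ternary convolutions (e.g., an FPGA or a deep net accelerator) can exploit the same structure, since no per-weight multiplier is required.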
It is noted that the present disclosure provides a significant advantage over conventional network ternarization techniques, in that the present disclosure does not require manual selection of ternarization parameters. Furthermore, the present disclosure does not require pre-existing assumptions regarding weight distribution in the network, which may often be incorrect.
Generally, embodiments disclosed herein provide for training ternary thresholds simultaneously with network training, or fine-tuning. In some examples, original float convolutional weights are divided into three groups during training. As a specific example, the equation shown below, which is an approximation of the sum of two Heaviside step functions, could be used to represent the conversion of full precision weights to ternary weights:
[Equation 1 appears here as an image in the published application; it expresses the approximation of the sum of two Heaviside step functions, with the ternary thresholds for negative and positive weights as parameters.]
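Because the equation itself is reproduced only as an image, its exact form is not shown here. As a hedged illustration of the general idea, a sum (or difference) of two Heaviside steps can be approximated with a pair of sigmoids whose sharpness is set by a slope parameter; the function names and the particular sigmoid-based form below are assumptions, not the claimed formula.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -60.0, 60.0)))  # clipped for stability

def soft_ternarize(w, delta_pos, delta_neg, slope, alpha=1.0):
    # Smooth, differentiable stand-in for
    # alpha * (H(w - delta_pos) - H(-w - delta_neg)),
    # which tends toward {-alpha, 0, +alpha} as slope grows.
    return alpha * (sigmoid(slope * (w - delta_pos)) - sigmoid(slope * (-w - delta_neg)))

def hard_ternarize(w, delta_pos, delta_neg):
    # Limit of the soft form: weights strictly in {-1, 0, +1}.
    return np.where(w > delta_pos, 1.0, np.where(w < -delta_neg, -1.0, 0.0))

w = np.array([-0.8, -0.05, 0.02, 0.4])
print(soft_ternarize(w, 0.1, 0.1, slope=50.0))  # roughly  [-1, 0, 0, 1]
print(hard_ternarize(w, 0.1, 0.1))              # exactly  [-1, 0, 0, 1]
```

Because the soft form is differentiable in delta_pos and delta_neg, the network loss can propagate gradients into the thresholds during backpropagation, which is what allows them to be trained alongside the weights.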
With general reference to notations and nomenclature used herein, one or more portions of the detailed description which follows may be presented in terms of program procedures executed on a computer or network of computers. These procedural descriptions and
representations are used by those skilled in the art to most effectively convey the substance of their work to others skilled in the art. A procedure is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. These operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical, magnetic, or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It proves convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be noted, however, that these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to those quantities.
Further, these manipulations are often referred to in terms, such as adding or comparing, which are commonly associated with mental operations performed by a human operator.
However, no such capability of a human operator is necessary, or desirable in most cases, in any of the operations described herein that form part of one or more embodiments. Rather, these operations are machine operations. Useful machines for performing operations of various embodiments include general purpose digital computers as selectively activated or configured by a computer program stored within that is written in accordance with the teachings herein, and/or include apparatus specially constructed for the required purpose. Various embodiments also relate to apparatus or systems for performing these operations. These apparatuses may be specially constructed for the required purpose or may include a general-purpose computer. The required structure for a variety of these machines will be apparent from the description given.
Reference is now made to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding thereof. It may be evident, however, that the novel embodiments can be practiced without these specific details. In other instances, well known structures and devices are shown in block diagram form to facilitate a description thereof. The intention is to cover all modifications, equivalents, and alternatives within the scope of the claims.
FIG. 1 illustrates an embodiment of a computing system 100. The computing system 100 is representative of any number and type of computing systems, such as a server, workstation, laptop, a virtualized computing system, a cloud computing system, an edge computing system, or the like. For example, the computing system 100 may be a server arranged to train a DNN, such as a CNN. Computing system 100 can include processor 110, memory 120, input/output (I/O) components 130, and interface 140, among other components not depicted.
With some examples, the processor 110 may include circuitry or processor logic, such as, for example, any of a variety of commercial processors. In some examples, the processor 110 may include multiple processors, a multi-threaded processor, a multi-core processor (whether the multiple cores coexist on the same or separate dies), and/or a multi-processor architecture of some other variety by which multiple physically separate processors are in some way linked. Additionally, in some examples, the processor 110 may include graphics processing portions and may include dedicated memory, multiple-threaded processing and/or some other parallel processing capability.
The memory 120 may include logic, a portion of which includes arrays of integrated circuits, forming non-volatile memory to persistently store data or a combination of non-volatile memory and volatile memory. It is to be appreciated that the memory 120 may be based on any of a variety of technologies. In particular, the arrays of integrated circuits included in memory 120 may be arranged to form one or more types of memory, such as, for example, dynamic random access memory (DRAM), NAND memory, NOR memory, or the like.
The I/O component(s) 130 may include one or more components to provide input to or to provide output from the computing system 100. For example, the I/O component(s) 130 may be a keyboard (hardware, virtual, etc.), mouse, joystick, microphone, track pad, button, touch layers of a display, haptic feedback device, camera, speaker, or the like.
Interface 140 may include logic and/or features to support a communication interface. For example, the interface 140 may include one or more interfaces that operate according to various communication protocols or standards to communicate over direct or network communication links. Direct communications may occur via use of communication protocols or standards described in one or more industry standards (including progenies and variants). For example, the interface 140 may facilitate communication over a bus, such as, for example, peripheral component interconnect express (PCIe), non-volatile memory express (NVMe), universal serial bus (USB), system management bus (SMBus), SAS (e.g., serial attached small computer system interface (SCSI)) interfaces, serial AT attachment (SATA) interfaces, or the like.
Memory 120 stores instructions 122 and a CNN model 121 comprising full precision weight set 124 and ternary weight set 128. Furthermore, memory 120 stores ternarization threshold 126 and training/testing data set 150. Processor 110, in executing instructions 122, can train full precision weight set 124 and ternarization threshold 126 simultaneously. For example, processor 110, in executing instructions 122, can apply a training algorithm or methodology to CNN model 121 based on training/testing data set 150 to “train” full precision weight set 124. In general, CNN model 121 and full precision weight set 124 can be trained using any of a variety of network training algorithms, such as, for example, back propagation, or the like. An example training algorithm is described below with reference to FIG. 3. Processor 110, in executing instructions 122, can “train” ternarization threshold 126 simultaneously with training full precision weight set 124.
Generally, a neural network includes two processing phases, a training phase and an inference phase. During training, a deep learning expert will typically architect the network, establishing the number of layers in the neural network, the operation performed by each layer, and the connectivity between layers. Many layers have parameters, typically weights, that determine the exact computation performed by the layer. The objective of the training process is to learn the weights, usually via a stochastic gradient descent-based excursion through the space of weights. The training phase generates an output feature map, also referred to as an activation tensor. An activation tensor may be generated for each convolutional layer of a CNN model (e.g., CNN model 121). The output feature map of a given convolutional layer may be the input to the next convolutional layer. Once the training process is complete, inference based on the trained neural network typically employs a forward-propagation calculation for input data to generate output data. However, in the case of TWNs, once training is complete, the full precision weight set generated during training is ternarized to generate a ternary weight set.
Additionally, processor 110, in executing instructions 122, can generate ternarized weight set 128 from full precision weight set 124 and ternarization threshold 126. Ternarized weight set 128 can be used to generate inferences on resource constrained devices, such as, for example, edge computing devices. In general, ternary weight networks (TWNs) are neural networks with weights constrained to +1, 0, and −1. The aim of network ternarization is to minimize the function ||W − α * Wt||, where W is the trained full precision weight set of a layer of the network, Wt ∈ {−1, 0, +1}, and α ∈ ℝ:
Wt,i = +1 if Wi > Δ; 0 if |Wi| ≤ Δ; −1 if Wi < −Δ
α = (1 / |IΔ|) * Σ(i∈IΔ) |Wi|, where IΔ = {i : |Wi| > Δ}
As described, conventionally, finding the solution to Equation 1 requires a lot of time and is often not effective due to the need to compute Equation (2) for a wide range of ternarization thresholds (Δs). Computing system 100, however, is arranged to train ternarization threshold 126 simultaneously with full precision weights 124. This is described in greater detail below, for example, with respect to FIG. 3. Accordingly, a TWN, comprising ternarized weights 128 generated from full precision weights 124 and ternarization threshold 126, could be generated without significant manual input from a user, as conventionally required.
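As a hedged illustration of this ternarization, the NumPy sketch below (with names chosen here for clarity) quantizes a trained weight tensor to {−1, 0, +1} using a single threshold Δ and computes the scale α as the mean absolute value of the weights whose magnitude exceeds Δ.

import numpy as np

def ternarize(w, delta):
    # Map weights to {-1, 0, +1} using threshold delta.
    w_t = np.zeros_like(w)
    w_t[w > delta] = 1.0
    w_t[w < -delta] = -1.0
    # Scale alpha: mean |w_i| over I_delta = {i : |w_i| > delta}.
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return w_t, alpha

w = np.array([0.9, -0.05, 0.4, -0.7, 0.02])
w_t, alpha = ternarize(w, delta=0.1)
print(w_t, alpha)  # [ 1.  0.  1. -1.  0.] and the mean of 0.9, 0.4, 0.7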
In general, CNN model 121 may provide cascaded stages for face detection, character recognition, speech recognition, or the like. Accordingly, training full precision weight set 124 for CNN model 121 may be based on a training/testing dataset 150 (e.g., images of faces, handwriting, printed information, etc.) that is in the form of tensor data. A tensor is a geometric object that describes linear relations between geometric vectors, scalars, and other tensors. An organized multidimensional array of numerical values, or tensor data, may represent a tensor. The training may produce the full precision weight set 124. For example, the full precision weight set 124 may specify features that are characteristic of numerals and/or each letter in the English alphabet. The full precision weight set 124 can be ternarized as described above. For example, ternary weight set 128 can be generated based on full precision weight set 124 and ternarization threshold 126. During the inference phase, a TWN corresponding to CNN model 121 and ternarized weight set 128 may receive images as input and perform desired processing on the input images. For example, the input images may depict handwriting, and the TWN may identify numerals and/or letters of the English alphabet included in the handwriting.
FIG. 2 illustrates an example inference environment 200 including computing system 100 coupled to an edge computing device 201. In general, edge computing device 201 can be arranged to receive CNN model 121 including ternarized weight set 128 (e.g., the TWN model 221 representative of CNN model 121). Furthermore, edge computing device 201 can be arranged to generate an inference (e.g., execute TWN model 221).
Edge computing device 201 may be any computing device arranged as an “edge” type, such as a gateway, a network attached processing device, an IoT device, or the like. Edge computing device 201 can include processor 210, memory 220, interface 240, and sensor 260, among other components not depicted.
With some examples, the processor 210 may include circuitry or processor logic, such as, for example, any of a variety of commercial processors. In some examples, the processor 210 may include multiple processors, a multi-threaded processor, a multi-core processor (whether the multiple cores coexist on the same or separate dies), and/or a multi-processor architecture of some other variety by which multiple physically separate processors are in some way linked. Additionally, in some examples, the processor 210 may include graphics processing portions and may include dedicated memory, multiple-threaded processing and/or some other parallel processing capability. Furthermore, with some examples, processor 210 may be a custom or specific processor circuit arranged to execute TWN model 221. For example, processor 210 can be a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or a neural accelerator circuit arranged to support convolution operations with ternary weights.
The memory 220 may include logic, a portion of which includes arrays of integrated circuits, forming non-volatile memory to persistently store data or a combination of non-volatile memory and volatile memory. It is to be appreciated that the memory 220 may be based on any of a variety of technologies. In particular, the arrays of integrated circuits included in memory 220 may be arranged to form one or more types of memory, such as, for example, dynamic random access memory (DRAM), NAND memory, NOR memory, or the like.
Interface 240 may include circuitry and/or logic to support a communication interface. For example, the interface 240 may include one or more interfaces that operate according to various communication protocols or standards to communicate over direct or network communication links. Direct communications may occur via use of communication protocols or standards described in one or more industry standards (including progenies and variants). For example, the interface 240 may facilitate communication over a bus, such as, for example, peripheral component interconnect express (PCIe), non-volatile memory express (NVMe), universal serial bus (USB), system management bus (SMBus), SAS (e.g., serial attached small computer system interface (SCSI)) interfaces, serial AT attachment (SATA) interfaces, or the like.
Sensor 260 may include circuitry and/or logic to support collection of sensor data 262.
For example, sensor 260 could be a camera, a microphone, a gyroscope, a global positioning sensor, a biometric sensor, a temperature sensor, or the like. Examples are not limited in this context.
Memory 220 stores instructions 222 and a TWN model 221 comprising ternary weight set 128. Furthermore, memory 220 stores sensor data 262 and TWN model inference 264. Processor 210, in executing instructions 222, can receive TWN model 221 from computing system 100. For example, edge computing device 201 can be communicatively coupled to computing system 100, via interface 240, network 299, and interface 140. Processor 210, in executing instructions 222, can receive an information element comprising indications of the TWN model 221 architecture and ternary weight set 128.
Processor 210, in executing instructions 222, can receive sensor data 262 from sensor 260. Furthermore, processor 210, in executing instructions 222, can generate TWN model inference 264 from TWN model 221 and sensor data 262.
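As a schematic illustration of this flow, the Python sketch below (using NumPy) runs a small fully connected ternary network over a flattened sensor reading; the layer structure, function names, and ReLU activation are assumptions made here for illustration rather than details of the present disclosure.

import numpy as np

def ternary_layer(x, w_t, alpha):
    # Fully connected ternary layer: add inputs for +1 weights, subtract
    # inputs for -1 weights, scale once by alpha, then apply ReLU.
    pos = (w_t == 1).astype(float) @ x
    neg = (w_t == -1).astype(float) @ x
    return np.maximum(alpha * (pos - neg), 0.0)

def run_inference(layers, sensor_data):
    # layers: list of (ternary weight matrix, alpha) pairs received from
    # the training system; sensor_data: flattened input vector.
    activation = sensor_data
    for w_t, alpha in layers:
        activation = ternary_layer(activation, w_t, alpha)
    return activation

layers = [(np.array([[1, -1, 0], [0, 1, 1]]), 0.5),
          (np.array([[1, -1]]), 0.8)]
print(run_inference(layers, np.array([0.3, 0.9, -0.2])))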
FIG. 3 illustrates an embodiment of a logic flow 300. The logic flow 300 may be representative of some or all the operations executed by one or more embodiments described herein. For example, the computing system 100 (or components thereof) may perform the operations in logic flow 300 to train full precision weights for a CNN while simultaneously training a ternarization threshold with which to generate a TWN from the fully trained CNN.
It is noted that logic flow 300 is described with reference to training a DNN having the structure of a convolutional neural network (CNN). However, the present disclosure could be extended to apply to other types of DNNs. Examples are not limited in this context.
Logic flow 300 can begin at block 310. At block 310 “initialize ternary thresholds” processor 110, in executing instructions 122, can initialize ternary thresholds 126. With some examples, processor 110, in executing instructions 122, can initialize ternary thresholds 126 as Δpos, Δneg = 0.5 * E(|W|), where E is the expectation (average) function, and α can be derived according to Equation 3 detailed above.
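For example, this initialization could be sketched as follows (Python with NumPy; names chosen here), taking Δpos and Δneg as half the mean absolute value of the layer's pretrained full precision weights.

import numpy as np

def init_thresholds(w):
    # Delta_pos = Delta_neg = 0.5 * E(|W|), with E taken as the mean.
    delta = 0.5 * np.mean(np.abs(w))
    return delta, delta  # (delta_pos, delta_neg)

w = np.random.randn(64, 3, 3, 3)  # example pretrained convolutional weights
print(init_thresholds(w))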
Continuing to block 320 “forward pass through network to compute output” processor 110, in executing instructions 122, can derive CNN model output via a forward pass through the network. With some examples, processor 110, in executing instructions 122, can compute the forward pass through selected convolutional layers (e.g., interior layers, or the like) according to the following equation, where W are the trainable original weights of a pretrained full precision CNN, α > 0, Δneg > 0, Δpos > 0, and H is the Heaviside function from Equation 1 above:
Wconv = α * (σ(W, Δneg, slope) − 1 + σ(W, −Δpos, slope)), where lim slope→+∞ σ(W, Δ, slope) = H(W + Δ)
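A hedged NumPy sketch of this forward-pass weight transformation is shown below; it assumes the logistic form of σ used earlier, treats α, Δneg, Δpos, and slope as per-layer scalars, and uses names chosen here rather than in the present disclosure.

import numpy as np

def sigma(w, delta, slope):
    # Smooth approximation of the Heaviside step H(w + delta).
    return 1.0 / (1.0 + np.exp(-slope * (w + delta)))

def forward_weights(w, alpha, delta_neg, delta_pos, slope):
    # Wconv = alpha * (sigma(W, delta_neg, slope) - 1 + sigma(W, -delta_pos, slope));
    # these soft-ternarized weights stand in for the full precision weights
    # during the forward pass of the selected layers.
    return alpha * (sigma(w, delta_neg, slope) - 1.0 + sigma(w, -delta_pos, slope))

w = np.array([-0.8, -0.1, 0.05, 0.6])
print(np.round(forward_weights(w, alpha=0.7, delta_neg=0.3, delta_pos=0.3, slope=40.0), 3))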
Continuing to block 330 “derive network loss” processor 110, in executing instructions 122, can derive network loss (L). With some examples, processor 110, in executing instructions 122, can derive network loss based on any of a variety of loss functions, such as mean squared error (MSE), cross entropy loss, average binary cross entropy loss, L1 loss for a position regressor, or the like.
Continuing to block 340 “backward pass through network to update full precision weight set” processor 110, in executing instructions 122, can update full precision weight set 124. With some examples, processor 110, in executing instructions 122, can apply a backpropagation algorithm to CNN model 121 to update full precision weight set 124.
Continuing to block 350 “update ternary threshold” processor 110, in executing instructions 122, can update ternary threshold 126 (Δ) using the following equation, where L is the loss function (e.g., from block 330, or the like), s = σ(W, Δ, slope), and γ is the learning rate:
Δt = Δt−1 − γ * (∂L/∂Δ), where ∂L/∂Δ = (∂L/∂Wconv) * (∂Wconv/∂Δ) and ∂σ/∂Δ = slope * s * (1 − s)
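As a rough illustration, the sketch below (Python with NumPy, names chosen here) applies one gradient step to a single threshold Δ, assuming the upstream gradient ∂L/∂Wconv has already been obtained from backpropagation and that ∂Wconv/∂Δ = α * slope * s * (1 − s) follows from the σ-based forward pass above.

import numpy as np

def sigma(w, delta, slope):
    return 1.0 / (1.0 + np.exp(-slope * (w + delta)))

def update_threshold(delta, w, grad_wconv, alpha, slope, lr):
    # Chain rule through the smooth step: dWconv/ddelta = alpha * slope * s * (1 - s),
    # with s = sigma(W, delta, slope); sum contributions over the weights,
    # then take one gradient-descent step with learning rate lr (gamma).
    s = sigma(w, delta, slope)
    dL_ddelta = np.sum(grad_wconv * alpha * slope * s * (1.0 - s))
    return delta - lr * dL_ddelta

w = np.array([-0.4, 0.1, 0.5])
grad = np.array([0.02, -0.01, 0.03])  # example dL/dWconv from backpropagation
print(update_threshold(0.2, w, grad, alpha=0.7, slope=20.0, lr=1e-3))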
Continuing to block 360 “increase the slope in all ternary layers” processor 110, in executing instructions 122, can update the slope in all layers that are to be ternarized. With some examples, processor 110, in executing instructions 122, can update the slope using the following equation, where t is the number of iterations and C is a precomputed constant:
slope = C * t
Continuing to block 370 “minimum error reached” processor 110, in executing instructions 122, can determine whether a minimum error has been reached (e.g., based on the loss function, or the like). For example, processor 110, in executing instructions 122, can determine whether a minimum has been reached and whether to continue training (e.g., of full precision weight set 124 and ternary threshold 126) based on whether the minimum has been reached.
From block 370, logic flow 300 can return to block 320 to continue training as described or can continue to block 380. At block 380 “replace full precision weight set with ternary weight set” processor 110, in executing instructions 122, can generate ternary weight set 128 from full precision weight set 124 and ternary threshold 126.
FIG. 4 illustrates an embodiment of a logic flow 400. The logic flow 400 may be representative of some or all the operations executed by one or more embodiments described herein. For example, the computing system 100 (or components thereof) may perform the operations in logic flow 400 to replace full precision weight set 124 with ternary weight set 128 (e.g., at block 380 of logic flow 300, or the like).
Logic flow 400 can begin at block 410. At block 410 “retrieve modified convolutional layer i” processor 110, in executing instructions 122, can retrieve a modified convolutional layer (layer i). As detailed above in conjunction with logic flow 300 and FIG. 3, some of the convolutional layers (e.g., interior layers, etc.) may be modified for ternarization as detailed herein while other layers (e.g., input and output layers) may not be modified. Accordingly, at block 410, processor 110, in executing instructions 122, can retrieve full precision weights from full precision weight set 124 corresponding to a first one of the modified convolutional layers of CNN model 121.
Continuing to block 420 “replace full precision weights with ternary weights” processor 110, in executing instructions 122, can replace full precision weights with ternary weights for the retrieved convolutional layer. With some examples, processor 110, in executing instructions 122, can replace full precision weights with ternary weights based on the following equation.
Wternary = α * (H(W + Δneg) − 1 + H(W − Δpos)), i.e., +α if W > Δpos; 0 if −Δneg ≤ W ≤ Δpos; −α if W < −Δneg
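A hedged sketch of this replacement is given below, assuming the final weights of a modified layer take values in {−α, 0, +α} according to the trained positive and negative thresholds; the names are chosen here for illustration.

import numpy as np

def to_ternary(w, alpha, delta_neg, delta_pos):
    # Replace full precision weights: +alpha above delta_pos, -alpha below
    # -delta_neg, and zero in between.
    w_t = np.zeros_like(w)
    w_t[w > delta_pos] = alpha
    w_t[w < -delta_neg] = -alpha
    return w_t

layer_w = np.array([[0.9, -0.05], [-0.7, 0.2]])
print(to_ternary(layer_w, alpha=0.8, delta_neg=0.3, delta_pos=0.3))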
Continuing to block 430 “more modified convolutional layers?” processor 110, in executing instructions 122, can determine whether there are more modified convolutional layers for which to replace full precision weights with ternary weights. From block 430, logic flow 400 can return to block 410 to continue replacing full precision weights with ternary weights as described or can end.
FIG. 5 illustrates an embodiment of a storage medium 500. Storage medium 500 may comprise any non-transitory computer-readable storage medium or machine-readable storage medium, such as an optical, magnetic or semiconductor storage medium. In various
embodiments, storage medium 500 may comprise an article of manufacture. In some embodiments, storage medium 500 may store computer-executable instructions, such as computer-executable instructions to implement one or more of the logic flows or operations described herein, such as logic flows 300 and/or 400 of FIGS. 3-4. Examples of a computer-readable storage medium or machine-readable storage medium may include any tangible media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. Examples of computer-executable instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, object-oriented code, visual code, and the like. The embodiments are not limited in this context.
FIG. 6 illustrates an embodiment of a system 3000. The system 3000 is a computer system with multiple processor cores such as a distributed computing system, supercomputer, high-performance computing system, computing cluster, mainframe computer, mini-computer, client-server system, personal computer (PC), workstation, server, portable computer, laptop computer, tablet computer, handheld device such as a personal digital assistant (PDA), or other device for processing, displaying, or transmitting information. Similar embodiments may comprise, e.g., entertainment devices such as a portable music player or a portable video player, a smart phone or other cellular phone, a telephone, a digital video camera, a digital still camera, an external storage device, or the like. Further embodiments implement larger scale server configurations. In other embodiments, the system 3000 may have a single processor with one core or more than one processor. Note that the term“processor” refers to a processor with a single core or a processor package with multiple processor cores. In at least one embodiment, the computing system 3000 is representative of the computing system 100 and/or the edge computing device 201. More generally, the computing system 3000 is configured to implement all logic, systems, logic flows, methods, apparatuses, and functionality described herein with reference to FIGS. 1-5.
As used in this application, the terms“system” and“component” and“module” are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution, examples of which are provided by the exemplary system 3000. For example, a component can be, but is not limited to being, a process running on a processor, a processor, a hard disk drive, multiple storage drives (of optical and/or magnetic storage medium), an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers. Further, components may be communicatively coupled to each other by various types of communications media to coordinate operations. The coordination may involve the unidirectional or bi-directional exchange of information. For instance, the components may communicate information in the form of signals communicated over the communications media. The information can be implemented as signals allocated to various signal lines. In such allocations, each message is a signal. Further embodiments, however, may alternatively employ data messages. Such data messages may be sent across various connections. Exemplary connections include parallel interfaces, serial interfaces, and bus interfaces.
As shown in this figure, system 3000 comprises a motherboard 3005 for mounting platform components. The motherboard 3005 is a point-to-point interconnect platform that includes a first processor 3010 and a second processor 3030 coupled via a point-to-point interconnect 3056 such as an Ultra Path Interconnect (UPI). In other embodiments, the system 3000 may be of another bus architecture, such as a multi-drop bus. Furthermore, each of processors 3010 and 3030 may be processor packages with multiple processor cores including processor core(s) 3020 and 3040, respectively. While the system 3000 is an example of a two- socket (2S) platform, other embodiments may include more than two sockets or one socket. For example, some embodiments may include a four-socket (4S) platform or an eight-socket (8S) platform. Each socket is a mount for a processor and may have a socket identifier. Note that the term platform refers to the motherboard with certain components mounted such as the processors 3010 and the chipset 3060. Some platforms may include additional components and some platforms may only include sockets to mount the processors and/or the chipset.
The processors 3010, 3030 can be any of various commercially available processors, including without limitation Intel® Celeron®, Core®, Core (2) Duo®, Itanium®, Pentium®, Xeon®, and XScale® processors; AMD® Athlon®, Duron® and Opteron® processors; ARM® application, embedded and secure processors; IBM® and Motorola® DragonBall® and PowerPC® processors; IBM and Sony® Cell processors; and similar processors. Dual microprocessors, multi-core processors, and other multi-processor architectures may also be employed as the processors 3010, 3030.
The first processor 3010 includes an integrated memory controller (IMC) 3014 and point- to-point (P-P) interfaces 3018 and 3052. Similarly, the second processor 3030 includes an IMC 3034 and P-P interfaces 3038 and 3054. The IMC's 3014 and 3034 couple the processors 3010 and 3030, respectively, to respective memories, a memory 3012 and a memory 3032. The memories 3012 and 3032 may be portions of the main memory (e.g., a dynamic random-access memory (DRAM)) for the platform such as double data rate type 3 (DDR3) or type 4 (DDR4) synchronous DRAM (SDRAM). In the present embodiment, the memories 3012 and 3032 locally attach to the respective processors 3010 and 3030. In other embodiments, the main memory may couple with the processors via a bus and shared memory hub.
The processors 3010 and 3030 comprise caches coupled with each of the processor core(s) 3020 and 3040, respectively. The processor cores 3020, 3040 may include memory management logic circuitry (not pictured) which may represent circuitry configured to implement the functionality of the logic flows 300 and/or 400, or may represent a combination of the circuitry within a processor and a medium to store all or part of the functionality of these logic flows in memory such as cache, the memory 3012, buffers, registers, or storage medium 800 attached to the processors 3010 and/or 3030 via a chipset 3060. The functionality of these logic flows may also reside in whole or in part in memory such as the memory 3012 and/or a cache of the processor. Furthermore, the functionality of these logic flows may also reside in whole or in part as circuitry within the processor 3010 and may perform operations, e.g., within registers or buffers such as the registers 3016 within the processors 3010, 3030, or within an instruction pipeline of the processors 3010, 3030. Further still, the functionality of these logic flows may be integrated into a processor of the hardware accelerator 106 for performing training and ternarization of a CNN (e.g., CNN model 121, etc.).
More than one of the processors 3010 and 3030 may comprise the functionality of these logic flows, such as the processor 3030 and/or a processor within the hardware accelerator 106 coupled with the chipset 3060 via an interface (I/F) 3066. The I/F 3066 may be, for example, a Peripheral Component Interconnect-enhanced (PCI-e).
The first processor 3010 couples to a chipset 3060 via P-P interconnects 3052 and 3062 and the second processor 3030 couples to a chipset 3060 via P-P interconnects 3054 and 3064. Direct Media Interfaces (DMIs) 3057 and 3058 may couple the P-P interconnects 3052 and 3062 and the P-P interconnects 3054 and 3064, respectively. The DMI may be a high-speed interconnect that facilitates, e.g., eight Giga Transfers per second (GT/s) such as DMI 3.0. In other embodiments, the processors 3010 and 3030 may interconnect via a bus.
The chipset 3060 may comprise a controller hub such as a platform controller hub (PCH). The chipset 3060 may include a system clock to perform clocking functions and include interfaces for an I/O bus such as a universal serial bus (USB), peripheral component
interconnects (PCIs), serial peripheral interconnects (SPIs), integrated interconnects (I2Cs), and the like, to facilitate connection of peripheral devices on the platform. In other embodiments, the chipset 3060 may comprise more than one controller hub such as a chipset with a memory controller hub, a graphics controller hub, and an input/output (I/O) controller hub.
In the present embodiment, the chipset 3060 couples with a trusted platform module (TPM) 3072 and the UEFI, BIOS, Flash component 3074 via an interface (I/F) 3070. The TPM 3072 is a dedicated microcontroller designed to secure hardware by integrating cryptographic keys into devices. The UEFI, BIOS, Flash component 3074 may provide pre-boot code.
Furthermore, chipset 3060 includes an I/F 3066 to couple chipset 3060 with a high- performance graphics engine, graphics card 3065. In other embodiments, the system 3000 may include a flexible display interface (FDI) between the processors 3010 and 3030 and the chipset 3060. The FDI interconnects a graphics processor core in a processor with the chipset 3060.
Various I/O devices 3092 couple to the bus 3081, along with a bus bridge 3080 which couples the bus 3081 to a second bus 3091 and an I/F 3068 that connects the bus 3081 with the chipset 3060. In one embodiment, the second bus 3091 may be a low pin count (LPC) bus. Various devices may couple to the second bus 3091 including, for example, a keyboard 3082, a mouse 3084, communication devices 3086 and the storage medium 700 that may store computer executable code as previously described herein. Furthermore, an audio I/O 3090 may couple to second bus 3091. Many of the I/O devices 3092, communication devices 3086, and the storage medium 800 may reside on the motherboard 3005 while the keyboard 3082 and the mouse 3084 may be add-on peripherals. In other embodiments, some or all the I/O devices 3092, communication devices 3086, and the storage medium 800 are add-on peripherals and do not reside on the motherboard 3005.
One or more aspects of at least one example may be implemented by representative instructions stored on at least one machine-readable medium which represents various logic within the processor, which when read by a machine, computing device or system causes the machine, computing device or system to fabricate logic to perform the techniques described herein. Such representations, known as“IP cores” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that make the logic or processor.
Various examples may be implemented using hardware elements, software elements, or a combination of both. In some examples, hardware elements may include devices, components, processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), memory units, logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. In some examples, software elements may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an example is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints, as desired for a given implementation.
Some examples may include an article of manufacture or at least one computer-readable medium. A computer-readable medium may include a non-transitory storage medium to store logic. In some examples, the non-transitory storage medium may include one or more types of computer-readable storage media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. In some examples, the logic may include various software elements, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, API, instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof.
According to some examples, a computer-readable medium may include a non-transitory storage medium to store or maintain instructions that when executed by a machine, computing device or system, cause the machine, computing device or system to perform methods and/or operations in accordance with the described examples. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. The instructions may be implemented according to a predefined computer language, manner or syntax, for instructing a machine, computing device or system to perform a certain function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
Some examples may be described using the expression“in one example” or“an example” along with their derivatives. These terms mean that a particular feature, structure, or
characteristic described in connection with the example is included in at least one example. The appearances of the phrase“in one example” in various places in the specification are not necessarily all referring to the same example.
Some examples may be described using the expression "coupled" and "connected" along with their derivatives. These terms are not necessarily intended as synonyms for each other. For example, descriptions using the terms“connected” and/or“coupled” may indicate that two or more elements are in direct physical or electrical contact with each other. The term "coupled,” however, may also mean that two or more elements are not in direct contact with each other, yet still co-operate or interact with each other.
In addition, in the foregoing Detailed Description, various features are grouped together in a single example for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed examples require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed example. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate example. In the appended claims, the terms "including" and "in which" are used as the plain-English equivalents of the respective terms "comprising" and "wherein," respectively.
Moreover, the terms "first," "second," "third," and so forth, are used merely as labels, and are not intended to impose numerical requirements on their objects.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code must be retrieved from bulk storage during execution. The term“code” covers a broad range of software components and constructs, including applications, drivers, processes, routines, methods, modules, firmware, microcode, and subprograms. Thus, the term“code” may be used to refer to any collection of instructions which, when executed by a processing system, perform a desired operation or operations.
Logic circuitry, devices, and interfaces herein described may perform functions implemented in hardware and implemented with code executed on one or more processors. Logic circuitry refers to the hardware or the hardware and code that implements one or more logical functions. Circuitry is hardware and may refer to one or more circuits. Each circuit may perform a particular function. A circuit of the circuitry may comprise discrete electrical components interconnected with one or more conductors, an integrated circuit, a chip package, a chip set, memory, or the like. Integrated circuits include circuits created on a substrate such as a silicon wafer and may comprise components. And integrated circuits, processor packages, chip packages, and chipsets may comprise one or more processors.
Processors may receive signals such as instructions and/or data at the input(s) and process the signals to generate the at least one output. While executing code, the code changes the physical states and characteristics of transistors that make up a processor pipeline. The physical states of the transistors translate into logical bits of ones and zeros stored in registers within the processor. The processor can transfer the physical states of the transistors into registers and transfer the physical states of the transistors to another storage medium.
A processor may comprise circuits to perform one or more sub-functions implemented to perform the overall function of the processor. One example of a processor is a state machine or an application-specific integrated circuit (ASIC) that includes at least one input and at least one output. A state machine may manipulate the at least one input to generate the at least one output by performing a predetermined series of serial and/or parallel manipulations or transformations on the at least one input.
The logic as described above may be part of the design for an integrated circuit chip. The chip design is created in a graphical computer programming language and stored in a computer storage medium or data storage medium (such as a disk, tape, physical hard drive, or virtual hard drive such as in a storage access network). If the designer does not fabricate chips or the photolithographic masks used to fabricate chips, the designer transmits the resulting design by physical means (e.g., by providing a copy of the storage medium storing the design) or electronically (e.g., through the Internet) to such entities, directly or indirectly. The stored design is then converted into the appropriate format (e.g., GDSII) for the fabrication.
The resulting integrated circuit chips can be distributed by the fabricator in raw wafer form (that is, as a single wafer that has multiple unpackaged chips), as a bare die, or in a packaged form. In the latter case, the chip is mounted in a single chip package (such as a plastic carrier, with leads that are affixed to a motherboard or other higher level carrier) or in a multichip package (such as a ceramic carrier that has either or both surface interconnections or buried interconnections). In any case, the chip is then integrated with other chips, discrete circuit elements, and/or other signal processing devices as part of either (a) an intermediate product, such as a processor board, a server platform, or a motherboard, or (b) an end product.
The following examples pertain to further embodiments, from which numerous
permutations and configurations will be apparent.
Example 1. An apparatus, comprising: a processor; and a memory storing instructions, which when executed by the processor cause the processor to: implement at least one training epoch for a convolutional neural network (CNN), each of the at least one training epochs comprising: updating full precision weights for the CNN; updating a ternarization threshold for the CNN; and replace the full precision weights for at least one layer of the CNN with ternary weights based in part on the ternarization threshold.
Example 2. The apparatus of example 1, the memory storing instructions, which when executed by the processor cause the processor to initialize the ternarization threshold.
Example 3. The apparatus of example 2, the memory storing instructions, which when executed by the processor cause the processor to initialize a positive ternarization threshold and a negative ternarization threshold based in part on the following equation, where Δpos is the positive ternarization threshold, Δneg is the negative ternarization threshold, and W is the weight space of the CNN: Δpos, Δneg = 0.5 * E(|W|).
Example 4. The apparatus of examples 1, 2, or 3, wherein each of the at least one epochs comprises: deriving output from the CNN based in part on a forward pass calculation through the CNN; deriving an error of the output based in part on a loss function; and updating the full precision weights for the CNN based in part on the error and a backward pass through the CNN.
Example 5. The apparatus of example 4, the memory storing instructions, which when executed by the processor cause the processor to update the ternarization threshold based in part on the following equations, where W is a weight space of the CNN, L is the loss function, Δ is the ternarization threshold, γ is a learning rate, slope is the slope of a layer of the CNN, and t is the one of the at least one epochs: Δt = Δt−1 − γ * (∂L/∂Δ); ∂σ/∂Δ = slope * s * (1 − s); and s = σ(W, Δ, slope).
Example 6. The apparatus of examples 4 or 5, the memory storing instructions, which when executed by the processor cause the processor to compute the forward pass through the CNN based in part on the following equations, where W is a weight space of the CNN and Δ is the ternarization threshold: Wconv = α * (σ(W, Δneg, slope) − 1 + σ(W, −Δpos, slope)); lim slope→+∞ σ(W, Δ, slope) = H(W + Δ); and ∂σ/∂Δ = slope * s * (1 − s).
Example 7. The apparatus of examples 4, 5, or 6, wherein the loss function is a mean squared error function.
Example 8. The apparatus of examples 4, 5, 6, or 7, wherein the backward pass through the CNN is based in part on a backpropagation algorithm.
Example 9. A non-transitory computer-readable storage medium comprising instructions that when executed by a computing device, cause the computing device to: update full precision weights for a convolutional neural network (CNN); update a ternarization threshold for the CNN; and replace the full precision weights for at least one layer of the CNN with ternary weights based in part on the ternarization threshold.
Example 10. The non-transitory computer-readable storage medium of example 9, comprising instructions that when executed by the computing device, cause the computing device to initialize the ternarization threshold.
Example 11. The non-transitory computer-readable storage medium of examples 9 or 10, comprising instructions that when executed by the computing device, cause the computing device to initialize a positive ternarization threshold and a negative ternarization threshold based in part on the following equation, where Δpos is the positive ternarization threshold, Δneg is the negative ternarization threshold, and W is the weight space of the CNN: Δpos, Δneg = 0.5 * E(|W|).
Example 12. The non-transitory computer-readable storage medium of examples 9, 10, or 11, wherein each of the at least one epochs comprises: deriving output from the CNN based in part on a forward pass calculation through the CNN; deriving an error of the output based in part on a loss function; and updating the full precision weights for the CNN based in part on the error and a backward pass through the CNN.
Example 13. The non-transitory computer-readable storage medium of example 12, comprising instructions that when executed by the computing device, cause the computing device to update the ternarization threshold based in part on the following equations, where W is a weight space of the CNN, L is the loss function, Δ is the ternarization threshold, γ is a learning rate, slope is the slope of a layer of the CNN, and t is the one of the at least one epochs: Δt = Δt−1 − γ * (∂L/∂Δ); ∂σ/∂Δ = slope * s * (1 − s); and s = σ(W, Δ, slope).
Example 14. The non-transitory computer-readable storage medium of examples 12 or 13, comprising instructions that when executed by the computing device, cause the computing device to compute the forward pass through the CNN based in part on the following equations, where W is a weight space of the CNN and Δ is the ternarization threshold: Wconv = α * (σ(W, Δneg, slope) − 1 + σ(W, −Δpos, slope)); lim slope→+∞ σ(W, Δ, slope) = H(W + Δ); and ∂σ/∂Δ = slope * s * (1 − s).
Example 15. The non-transitory computer-readable storage medium of examples 12, 13, or 14, wherein the loss function is a mean squared error function.
Example 16. The non-transitory computer-readable storage medium of examples 12, 13,
14, or 15, wherein the backward pass through the CNN is based in part on a backpropagation algorithm.
Example 17. An edge computing device, comprising: circuitry; and memory coupled to the circuitry, the memory comprising: a ternary weight network (TWN) based in part on a convolutional neural network (CNN), wherein full precision weights of the CNN were trained simultaneously with a ternarization threshold with which ternary weights of the TWN were generated; and instructions, which when executed by the circuitry cause the circuitry to generate an inference based in part on the TWN.
Example 18. The edge computing device of example 17, wherein the circuitry is an application specific integrated circuit or a neural accelerator.
Example 19. The edge computing device of examples 17 or 18, the memory storing instructions, which when executed by the circuitry cause the circuitry to: receive input data; and generate the inference based in part on providing the input data as input to the TWN.
Example 20. The edge computing device of examples 17, 18, or 19, the memory storing instructions, which when executed by the circuitry cause the circuitry to receive, from a computing device, the TWN.
Example 21. A system comprising: a computing system, comprising: a processor; and memory storing instructions, which when executed by the processor cause the processor to: implement at least one training epoch for a convolutional neural network (CNN), each of the at least one training epochs comprising: updating full precision weights for the CNN; updating a ternarization threshold for the CNN; and generate a ternary weight network (TWN) based in part on replacing, based in part on the ternarization threshold, the full precision weights for at least one layer of the CNN with ternary weights; and an edge computing device coupled to the computing system, the edge computing device comprising: circuitry; and edge memory coupled to the circuitry, the edge memory storing instructions, which when executed by the circuitry cause the circuitry to: receive the TWN from the computing system; and generate an inference based in part on the TWN.
Example 22. The system of example 21, the memory storing instructions, which when executed by the processor cause the processor to initialize the ternarization threshold.
Example 23. The system of example 22, the memory storing instructions, which when executed by the processor cause the processor to initialize a positive ternarization threshold and a negative ternarization threshold based in part on the following equation, where Δpos is the positive ternarization threshold, Δneg is the negative ternarization threshold, and W is the weight space of the CNN: Δpos, Δneg = 0.5 * E(|W|).
Example 24. The system of examples 21, 22, or 23, wherein each of the at least one epochs comprises: deriving output from the CNN based in part on a forward pass calculation through the CNN; deriving an error of the output based in part on a loss function; and updating the full precision weights for the CNN based in part on the error and a backward pass through the CNN.
Example 25. The system of example 24, the memory storing instructions, which when executed by the processor cause the processor to update the ternarization threshold based in part on the following equations, where W is a weight space of the CNN, L is the loss function, Δ is the ternarization threshold, γ is a learning rate, slope is the slope of a layer of the CNN, and t is the one of the at least one epochs: Δt = Δt−1 − γ * (∂L/∂Δ); ∂σ/∂Δ = slope * s * (1 − s); and s = σ(W, Δ, slope).
Example 26. The system of examples 24 or 25, the memory storing instructions, which when executed by the processor cause the processor to compute the forward pass through the CNN based in part on the following equations, where W is a weight space of the CNN and Δ is the ternarization threshold: Wconv = α * (σ(W, Δneg, slope) − 1 + σ(W, −Δpos, slope)); lim slope→+∞ σ(W, Δ, slope) = H(W + Δ); and ∂σ/∂Δ = slope * s * (1 − s).
Example 27. The system of examples 24, 25, or 26, wherein the loss function is a mean squared error function.
Example 28. The system of examples 24, 25, 26 or 27, wherein the backward pass through the CNN is based in part on a backpropagation algorithm.
Example 29. The system of examples 21, 22, 23, 24, 25, 26, 27, or 28, wherein the circuitry is an application specific integrated circuit or a neural accelerator.
Example 30. The system of examples 21, 22, 23, 24, 25, 26, 27, 28, or 29, the edge memory storing instructions, which when executed by the circuitry cause the circuitry to: receive input data; and generate the inference based in part on providing the input data as input to the TWN.
Example 31. A non-transitory computer-readable storage medium comprising instructions that when executed by an edge computing device, cause the edge computing device to: generate an inference based in part on a ternary weight network (TWN), the TWN based in part on a convolutional neural network (CNN) wherein full precision weights of the CNN were trained simultaneously with a ternarization threshold with which ternary weights of the TWN were generated.
Example 32. The non-transitory computer-readable storage medium of example 31, wherein the edge computing device comprises an application specific integrated circuit or a neural accelerator.
Example 33. The non-transitory computer-readable storage medium of examples 31 or 32, comprising instructions that when executed by the edge computing device, cause the edge computing device to: receive input data; and generate the inference based in part on providing the input data as input to the TWN.
Example 34. The non-transitory computer-readable storage medium of examples 31, 32, or 33, comprising instructions that when executed by the edge computing device, cause the edge computing device to receive, from a computing device, the TWN.
Example 35. A method comprising: generating an inference based in part on a ternary weight network (TWN), the TWN based in part on a convolutional neural network (CNN) wherein full precision weights of the CNN were trained simultaneously with a ternarization threshold with which ternary weights of the TWN were generated.
Example 36. The method of example 35, wherein the inference is generated by circuitry comprising an application specific integrated circuit or a neural accelerator.
Example 37. The method of examples 35 or 36, comprising: receiving input data; and generating the inference based in part on providing the input data as input to the TWN.
Example 38. A method, comprising: updating full precision weights for a convolutional neural network (CNN); updating a ternarization threshold for the CNN; and replacing the full precision weights for at least one layer of the CNN with ternary weights based in part on the ternarization threshold.
Example 39. The method of example 38, comprising initializing the ternarization threshold.
Example 40. The method of examples 38 or 39, comprising initializing a positive ternarization threshold and a negative ternarization threshold based in part on the following equation, where Δpos is the positive ternarization threshold, Δneg is the negative ternarization threshold, and W is the weight space of the CNN: Δpos, Δneg = 0.5 * E(|W|).
Example 41. The method of examples 38, 39, or 40, wherein each of the at least one epochs comprises: deriving output from the CNN based in part on a forward pass calculation through the CNN; deriving an error of the output based in part on a loss function; and updating the full precision weights for the CNN based in part on the error and a backward pass through the CNN.
Example 42. The method of example 41, comprising updating the ternarization threshold based in part on the following equations, where W is a weight space of the CNN, L is the loss function, Δ is the ternarization threshold, γ is a learning rate, slope is the slope of a layer of the CNN, and t is the one of the at least one epochs: Δt = Δt−1 − γ * (∂L/∂Δ); ∂σ/∂Δ = slope * s * (1 − s); and s = σ(W, Δ, slope).
Example 43. The method of examples 41 or 42, comprising computing the forward pass through the CNN based in part on the following equations, where W is a weight space of the CNN and Δ is the ternarization threshold: Wconv = α * (σ(W, Δneg, slope) − 1 + σ(W, −Δpos, slope)); lim slope→+∞ σ(W, Δ, slope) = H(W + Δ); and ∂σ/∂Δ = slope * s * (1 − s).
Example 44. The method of examples 41, 42, or 43, wherein the loss function is a mean squared error function.
Example 45. The method of examples 41, 42, 43, or 44, wherein the backward pass through the CNN is based in part on a backpropagation algorithm.
Example 46. An apparatus, comprising means arranged to implement the function of any one of examples 35 to 45.

Claims

CLAIMS What is claimed is:
1. An apparatus, comprising:
a processor; and
a memory storing instructions, which when executed by the processor cause the processor to:
implement at least one training epoch for a convolutional neural network (CNN), each of the at least one training epochs comprising:
updating full precision weights for the CNN;
updating a ternarization threshold for the CNN; and replace the full precision weights for at least one layer of the CNN with ternary weights based in part on the ternarization threshold.
2. The apparatus of claim 1, the memory storing instructions, which when executed by the processor cause the processor to initialize the ternarization threshold.
3. The apparatus of claim 2, the memory storing instructions, which when executed by the processor cause the processor to initialize a positive ternarization threshold and a negative ternarization threshold based in part on the following equation, where Δpos is the positive ternarization threshold, Δneg is the negative ternarization threshold, and W is the weight space of the CNN: Δpos, Δneg = 0.5 * E(|W|).
4. The apparatus of claim 1, wherein each of the at least one epochs comprises:
deriving output from the CNN based in part on a forward pass calculation through the
CNN;
deriving an error of the output based in part on a loss function; and
updating the full precision weights for the CNN based in part on the error and a backward pass through the CNN.
5. The apparatus of claim 4, the memory storing instructions, which when executed by the processor cause the processor to update the ternarization threshold based in part on the following equations, where W is a weight space of the CNN, L is the loss function, Δ is the ternarization threshold, γ is a learning rate, slope is the slope of a layer of the CNN, and t is the one of the at least one epochs:
Δt = Δt−1 − γ * (∂L/∂Δ);
∂σ/∂Δ = slope * s * (1 − s); and
s = σ(W, Δ, slope).
6. The apparatus of claim 4, the memory storing instructions, which when executed by the processor cause the processor to compute the forward pass through the CNN based in part on the following equations, where W is a weight space of the CNN and Δ is the ternarization threshold:
Wconv = α * (σ(W, Δneg, slope) − 1 + σ(W, −Δpos, slope));
lim slope→+∞ σ(W, Δ, slope) = H(W + Δ); and
∂σ/∂Δ = slope * s * (1 − s).
7. The apparatus of claim 4, wherein the loss function is a mean squared error function.
8. The apparatus of claim 4, wherein the backward pass through the CNN is based in part on a backpropagation algorithm.
9. A non-transitory computer-readable storage medium comprising instructions that when executed by a computing device, cause the computing device to:
update full precision weights for a convolutional neural network (CNN);
update a ternarization threshold for the CNN; and
replace the full precision weights for at least one layer of the CNN with ternary weights based in part on the ternarization threshold.
10. The non-transitory computer-readable storage medium of claim 9, comprising instructions that when executed by the computing device, cause the computing device to initialize the ternarization threshold.
11. The non-transitory computer-readable storage medium of claim 10, comprising instructions that when executed by the computing device, cause the computing device to initialize a positive ternarization threshold and a negative ternarization threshold based in part on the following equation, where Δpos is the positive ternarization threshold, Δneg is the negative ternarization threshold, and W is the weight space of the CNN: Δpos, Δneg = 0.5 * E(|W|).
12. The non-transitory computer-readable storage medium of claim 9, wherein each of the at least one epochs comprises:
deriving output from the CNN based in part on a forward pass calculation through the CNN;
deriving an error of the output based in part on a loss function; and
updating the full precision weights for the CNN based in part on the error and a backward pass through the CNN.
13. The non-transitory computer-readable storage medium of claim 12, comprising instructions that when executed by the computing device, cause the computing device to update the ternarization threshold based in part on the following equations, where W is a weight space of the CNN, L is the loss function, Δ is the ternarization threshold, slope is the slope of a layer of the CNN, and t is the one of the at least one epochs:
∂L/∂Δ = ∂L/∂Wconv * ∂Wconv/∂Δ, where ∂s/∂Δ = slope * s * (1 − s);
Δ_(t+1) = Δ_t − ∂L/∂Δ_t; and
s = σ(W, Δ, slope).
14. The non-transitory computer-readable storage medium of claim 12, comprising instructions that when executed by the computing device, cause the computing device to compute the forward pass through the CNN based in part on the following equations, where W is a weight space of the CNN and Δ is the ternarization threshold:
Wconv = α * (σ(W, Δneg, slope) − 1 + σ(W, Δpos, slope));
lim_(slope→+∞) σ(W, Δ, slope) = H(W + Δ); and
∂s/∂W = slope * s * (1 − s).
15. The non-transitory computer-readable storage medium of claim 12, wherein the loss function is a mean squared error function.
16. The non-transitory computer-readable storage medium of claim 12, wherein the backward pass through the CNN is based in part on a backpropagation algorithm.
17. An edge computing device, comprising:
circuitry; and
memory coupled to the circuitry, the memory comprising:
a ternary weight network (TWN) based in part on a convolutional neural network (CNN), wherein full precision weights of the CNN were trained simultaneously with a ternarization threshold with which ternary weights of the TWN were generated; and
instructions, which when executed by the circuitry cause the circuitry to generate an inference based in part on the TWN.
18. The edge computing device of claim 17, wherein the circuitry is an application specific integrated circuit or a neural accelerator.
19. The edge computing device of claim 17, the memory storing instructions, which when executed by the circuitry cause the circuitry to:
receive input data; and
generate the inference based in part on providing the input data as input to the TWN.
20. The edge computing device of claim 17, the memory storing instructions, which when executed by the circuitry cause the circuitry to receive, from a computing device, the TWN.
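As a non-limiting illustration of the edge-device claims above, the following NumPy sketch shows how a device could generate an inference from a received ternary weight network. The use of dense (rather than convolutional) layers, the layer shapes, the per-layer scale α, and all function and variable names are assumptions made for the example; the matrix products with 0/1 masks merely stand in for the add/subtract-only accumulation a ternary accelerator would perform.

import numpy as np

def ternary_dense(x, W_t, alpha, bias):
    # Dense layer with ternary weights: add inputs where w = +1, subtract where w = -1,
    # then scale by the per-layer alpha.
    pos = (W_t > 0).astype(x.dtype)
    neg = (W_t < 0).astype(x.dtype)
    return alpha * (x @ pos - x @ neg) + bias

def infer(x, layers):
    # Run received input data through the ternary layers and return class scores.
    h = x
    for W_t, alpha, bias in layers:
        h = np.maximum(ternary_dense(h, W_t, alpha, bias), 0.0)  # ReLU between layers
    return h

# Toy usage: a two-layer ternary model received from a training device.
rng = np.random.default_rng(1)
layers = [
    (rng.integers(-1, 2, size=(16, 32)).astype(np.int8), 0.7, np.zeros(32)),
    (rng.integers(-1, 2, size=(32, 10)).astype(np.int8), 0.5, np.zeros(10)),
]
scores = infer(rng.normal(size=(1, 16)), layers)
prediction = int(np.argmax(scores))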