CN117677958A - Neural processing unit including a programmed activation function execution unit

Info

Publication number: CN117677958A
Application number: CN202280047322.4A
Authority: CN (China)
Other languages: Chinese (zh)
Legal status: Pending
Prior art keywords: activation function, segment, programmable, value, unit
Inventors: 金錄元, 朴仁洙, 金浩承, 田振
Current/Original Assignee: Tipu Aikesi Co ltd
Application filed by Tipu Aikesi Co ltd
Priority claimed from KR1020220165012A and PCT/KR2022/019376
Classification: Stored Programmes (AREA)
Abstract

An activation function programming method is provided according to an embodiment of the present disclosure. The method comprises the following steps: generating segment data for segmenting the activation function; segmenting the activation function into a plurality of segments by using the generated segment data; and approximating at least one segment among the plurality of segments as a programmable segment.

Description

Neural processing unit including a programmed activation function execution unit
Technical Field
The present disclosure relates to a Neural Processing Unit (NPU) including a programmable activation function execution unit.
Background
Humans are equipped with intelligence that can perform recognition, classification, reasoning, prediction, and control/decision-making. Artificial Intelligence (AI) refers to the imitation of human intelligence.
The human brain is composed of many nerve cells called neurons. Each neuron is connected to hundreds to thousands of other neurons through connections called synapses. A model that mimics human intelligence by modeling the operating principle of biological neurons and the connections between them is known as an Artificial Neural Network (ANN) model. In other words, an ANN is a system in which nodes mimicking neurons are connected in a layered structure.
An ANN-specific processor developed to accelerate the computation of ANNs is a Neural Processing Unit (NPU).
Disclosure of Invention
Technical problem
ANNs are classified into "single-layer neural networks" and "multi-layer neural networks" according to the number of layers. A typical multi-layer neural network consists of an input layer, a hidden layer, and an output layer. The input layer is the layer that receives input values, and the number of its nodes is the same as the number of input variables. The hidden layer is located between the input layer and the output layer; it receives signals from the input layer, extracts features, and passes them to the output layer. The output layer receives signals from the hidden layer and outputs them to the outside.
As signals are transmitted between neurons in the human brain, the transmission intensity of the signals varies. By mimicking this, the transmission strength variation of the signal transmitted between the layers, i.e. the activation, is determined by an activation function in the ANN.
The inference accuracy of the ANN may vary depending on the nature of the activation function implemented in the NPU. That is, the performance and efficiency of the ANN is determined based on the hardware implementation characteristics of the NPU's activation function processing circuitry. In addition, artificial neural networks that handle complex mathematical activation functions may be handled by hardware accelerators. When an ANN-specific processor is implemented in hardware, the ANN-specific processor may require a significant chip area (i.e., a large number of logic gates). Furthermore, these chips can have significant power consumption.
To achieve higher levels of artificial intelligence, Deep Neural Networks (DNNs) with an increased number of hidden layers have been disclosed. The activation function of a DNN is used to determine the transfer strength of the calculated values to which weights and biases are applied. DNNs are being developed in various structures.
For example, a Convolutional Neural Network (CNN), which is a well-known example of a DNN, easily extracts features of an input value (i.e., video or image) and identifies patterns in the extracted features. A CNN may be configured to perform convolution operations, activation function operations, pooling operations, and the like in a particular order.
For example, in each layer of DNN, the input values and parameters (i.e., weights or kernels) may be a matrix made up of multiple channels. The input values and parameters may be processed in the NPU by convolution or matrix multiplication. The calculation value is generated after processing the calculation in each layer. An activation function may be applied to these calculated values.
For example, the transformer is a DNN based on the attention mechanism. The transformer uses a large number of matrix multiplication operations. The transformer may obtain attention values, attention (Q, K, V), by using input values and parameters such as the query (Q), the key (K), and the value (V). The transformer may process various inference operations based on these attention values. Transformers tend to exhibit better inference performance than CNNs.
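For reference, the scaled dot-product attention commonly used in transformers (a formulation from the general literature, not one defined in this disclosure) computes attention (Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V, where d_k is the dimension of the key vectors.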
The above-described neural network may be referred to as DNN. Meanwhile, the activation function may be selectively applied to an operation value of a specific layer among a plurality of layers of the DNN.
An activation function may be configured to include an X-axis value corresponding to an input value of the activation function (i.e., an operational value of a particular layer) and a Y-axis value corresponding to an activation value of the activation function. The activation function serves to convert a mathematical linear combination of input values into various types of linear or nonlinear combinations. Accordingly, the DNN may be designed to perform various inference functions by applying the appropriate activation function to the operational values of a particular layer.
Most complex problems to be solved by a DNN exhibit nonlinearity. To handle this, most activation functions are nonlinear functions.
The performance and efficiency of the DNN model processed in hardware may vary according to the nonlinearity of the activation function applied to at least one DNN model processed by the NPU.
The activation function may increase or decrease the inference accuracy by emphasizing more features of a particular region of the input value of the activation function and less features of other regions of the input value of the activation function.
The nonlinearity of at least some of the various activation functions may involve logarithmic operations, exponential operations, and the like. In terms of digital logic design, it is very complex to implement activation functions including logarithmic and exponential operations in hardware. For example, for logarithmic and exponential operations, the configuration of the hardware operators becomes very complex. Accordingly, the inventors of the present disclosure recognized that the power consumption of the hardware may increase and the calculation processing speed may slow down.
In the case of an NPU, a separate activation function processing module may need to be designed for each activation function. Furthermore, a hardwired processor may process only predefined activation functions using corresponding hardwired dedicated activation function processing logic units. In this regard, the inventors of the present disclosure recognized a disadvantage: the number of gates in a hardwired processor increases rapidly with the computational complexity of the activation function.
Without hardware modification, the hardwired processor cannot independently handle the new activation function. The activation functions that cannot be handled by the hardwired processor must be calculated using separate software. For example, the hardwired processor may be an Application Specific Integrated Circuit (ASIC) dedicated to artificial intelligence. That is, the hardwired processor may be an NPU.
Various methods have been proposed to handle various types of activation functions in hardwired processors. For example, conventionally, an activation function has been handled by a method using a lookup table (LUT), a method using a nonlinear approximation equation, a method using polynomial approximation, or the like.
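As a point of reference for the LUT method mentioned above, the following is a minimal sketch of how such a table is typically pre-sampled and queried. The fixed grid, table size, sigmoid target, and all names are illustrative assumptions, not details taken from this disclosure.

```python
import numpy as np

# Conventional LUT approach (sketch): the activation function is pre-sampled
# on a fixed grid, and each input is served by its nearest table entry.
GRID = np.linspace(-8.0, 8.0, 256)       # fixed input range of the table
TABLE = 1.0 / (1.0 + np.exp(-GRID))      # pre-sampled sigmoid values

def lut_activation(x):
    # Inputs outside the grid saturate at the table edges.
    step = GRID[1] - GRID[0]
    idx = np.rint((np.clip(x, GRID[0], GRID[-1]) - GRID[0]) / step).astype(int)
    return TABLE[idx]
```

Note that the table resolution directly trades memory for accuracy, which is one of the costs attributed here to the conventional methods.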
However, the inventors of the present disclosure have recognized that conventional approaches to approximating an activation function, in which the activation function is processed in hardware using polynomial approximation or the like, require extensive computations from a processor to improve inference accuracy.
Accordingly, the inventors of the present disclosure have recognized that there is a need to improve the problem of deterioration in inference accuracy of a DNN model to which a conventional activation function approximation technique is applied, the problem of an increase in the number of gates in an activation function processing unit of a processor, and the problem of an increase in power consumption of the processor.
Furthermore, the inventors of the present disclosure have recognized that there is a need for a programming method that can approximate any activation function, as well as a hardware design for driving the activation function, in order for a processor to independently process: 1) activation functions that are not included in predetermined data such as a lookup table and thus cannot be processed by a processor applying the conventional activation function processing method; 2) new activation functions; and/or 3) activation functions obtained by partially modifying conventional activation functions.
Furthermore, the inventors of the present disclosure have recognized a need to design an NPU that is capable of driving an approximation algorithm optimized for the characteristics of the activation function.
Furthermore, the inventors of the present disclosure have recognized that if hardware optimized for such a programming method is provided, the activation function can be programmed efficiently and flexibly in hardware.
Furthermore, each region may be set based on the shape of the activation function to be programmed, and the approximation parameters may be programmed for each set region. The inventors of the present disclosure have recognized that by considering the characteristics of each region of the activation function, the activation function can be programmed efficiently and with low approximation errors.
Furthermore, the inventors of the present disclosure have recognized that a Programmable Activation Function (PAF) may be provided in a hardwired processor that includes a Programmed Activation Function Execution (PAFE) unit.
It is therefore an object of the present disclosure to provide a method that is relatively superior to conventional approximation methods and that is capable of programming nonlinear activation functions in hardware with various hardware options.
Further, it is an object of the present disclosure to provide a method of approximating a nonlinear activation function in a more customized manner by considering characteristics of the activation function itself, approximation errors, hardware option information, and the like.
Further, it is an object of the present disclosure to provide a hardwired processor including a PAFE unit.
Further, it is an object of the present disclosure to provide a hardwired processor including a PAFE unit configured to process at least one programmed activation function.
However, the objects of the present disclosure are not limited to those described above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.
Technical Solution
The detailed description of other examples is included in the detailed description and the accompanying drawings.
Technical effects
In accordance with the present disclosure, the NPU may receive programming parameters of the activation function and process the activation function.
By using segment data, various nonlinear activation functions, particularly newly proposed activation functions or known activation functions with partial modifications, can be programmed to be processable in hardware in accordance with the present disclosure.
Further, according to the present disclosure, when approximating various nonlinear activation functions, segment data reflecting the characteristics of the activation function itself, approximation errors, hardware option information, and the like may be used. Thus, the nonlinear activation function can be programmed in a more customized manner while ensuring high performance and high efficiency of the DNN.
Further, according to the present disclosure, when approximating various nonlinear activation functions, it is possible to minimize the approximation error while minimizing hardware cost by using segment data reflecting the characteristics of the activation function itself, approximation errors, hardware option information, and the like.
Furthermore, each segment of the activation function may be programmed with various algorithms in accordance with the present disclosure. The NPU may provide a hardware option for an algorithm capable of processing each segment of the programmed activation function.
Further, in accordance with the present disclosure, a hardwired processor including PAFE units may be implemented. Thus, the processor can handle any activation function by changing only the programmable parameters without requiring hardware changes.
Further, in accordance with the present disclosure, a hardwired processor may be implemented that includes a PAFE unit configured to process at least one programmed activation function. Thus, the processor may utilize the PAFE unit to process different activation functions simultaneously or sequentially without hardware changes.
Effects according to the present disclosure are not limited to the contents of the above examples, and further various effects are included in the present disclosure.
Drawings
Fig. 1 is a schematic conceptual diagram illustrating an apparatus for performing a method of programming an activation function according to an example of the present disclosure.
Fig. 2 is a schematic flow chart illustrating a method of programming an activation function according to an example of the present disclosure.
Fig. 3 is a graph illustrating a process of programming an activation function by a method of programming an activation function according to an example of the present disclosure.
Fig. 4 is a graph illustrating various cases of segmenting an activation function into a plurality of segments by a programming method of the activation function according to an example of the present disclosure.
Fig. 5 is a graph illustrating an example of segmenting an activation function into a linear section and a nonlinear section using slope change data among segment data in an activation function programming method according to an example of the present disclosure.
Fig. 6 is a graph illustrating an example of segmenting an activation function into a substantially linear section and a nonlinear section using slope change data among segment data in an activation function programming method according to an example of the present disclosure.
Fig. 7 is a graph illustrating another example of segmenting an activation function into a substantially linear section and a nonlinear section using slope change data among segment data in an activation function programming method according to an example of the present disclosure.
Fig. 8 is a graph illustrating another example of segmenting an activation function into a substantially linear section and a nonlinear section using slope change data among segment data in an activation function programming method according to an example of the present disclosure.
Fig. 9 is a graph illustrating an example of converting one segment into one programmable segment using an error value in an activation function programming method according to an example of the present disclosure.
Fig. 10 is a graph illustrating an example of converting one segment into one programmable segment using a maximum error value in an activation function programming method according to an example of the present disclosure.
Fig. 11 is a graph showing an example of converting one segment into one programmable segment using an integrated value of an error value in an activation function programming method according to an example of the present disclosure.
Fig. 12 is a graph illustrating an example of approximating a segment to an optimal programmable segment using machine learning in an activation function programming method according to an example of the present disclosure.
Fig. 13 is a graph illustrating an example of segmenting an activation function using an accumulated value of a second derivative of the activation function and a threshold value in an activation function programming method according to an example of the present disclosure.
Fig. 14 and 15 are graphs showing the ELU activation function and the hardswish activation function.
Fig. 16 is a conceptual diagram illustrating a PAFE unit configured to process a programmed activation function according to an example of the present disclosure.
Fig. 17 is a conceptual diagram illustrating a PAFE unit of an NPU of an apparatus for processing a programmed activation function according to an example of the present disclosure.
Fig. 18 is a conceptual diagram illustrating an NPU of an apparatus for processing a programmed activation function according to another example of the present disclosure.
Fig. 19 is a conceptual diagram illustrating a PAFE unit of an NPU of an apparatus for processing a programmed activation function according to another example of the present disclosure.
Fig. 20 is a conceptual diagram illustrating an NPU of an apparatus for processing a programmed activation function according to another example of the present disclosure.
Fig. 21 is a conceptual diagram illustrating an NPU of an apparatus for processing a programmed activation function according to another example of the present disclosure.
Fig. 22 is a conceptual diagram illustrating a PAFE unit configured to process a programmed activation function according to another example of the present disclosure.
Fig. 23 is a conceptual diagram illustrating a PAFE unit of an NPU of an apparatus for processing a programmed activation function according to another example of the present disclosure.
Fig. 24 is a graph illustrating an example in which an apparatus for processing a programmed activation function approximates a sigmoid activation function as a programmable activation function according to another example of the present disclosure.
Fig. 25 is a conceptual diagram illustrating a PAFE unit of an NPU of an apparatus for processing a programmed activation function according to another example of the present disclosure.
Detailed Description
Specific structural or step-by-step descriptions of examples according to the concepts of the present disclosure, as disclosed in this specification or application, are provided merely to explain those examples.
Examples of the concepts according to the present disclosure may be embodied in various forms and should not be construed as being limited to the examples described in this specification or application.
Various modifications may be applied to embodiments of the concepts according to the present disclosure. The present disclosure may take a variety of forms. Accordingly, specific examples are shown in the drawings and described in detail in this disclosure. However, this is not intended to limit examples of the concepts in accordance with the present disclosure to the particular disclosed forms. Accordingly, it should be understood that all changes, equivalents, or alternatives falling within the spirit and scope of the present disclosure are included in the present disclosure.
Terms such as first and/or second may be used to describe various components. However, the present disclosure should not be limited by the above terms.
These terms are only used for distinguishing one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the claims directed to the concepts according to the present disclosure.
When an element is referred to as being "connected to" or "in contact with" another element, it should be understood that it may be directly connected to or in contact with the other element, but other elements may be disposed therebetween. On the other hand, when an element is referred to as being "directly connected to" or "in direct contact with" another element, it should be understood that no other elements are present therebetween.
Other expressions describing the relationship between elements, such as "between" and "immediately between" or "adjacent to" and "directly adjacent to," should be interpreted similarly.
In the present disclosure, expressions such as "A or B", "at least one of A and/or B", or "one or more of A and/or B" may include all possible combinations thereof. For example, "A or B", "at least one of A and B", or "at least one of A or B" may mean (1) including at least one A, (2) including at least one B, or (3) including both at least one A and at least one B.
As used herein, expressions such as "first," "second," "first or second," may modify various elements regardless of order and/or importance. The expression is used merely to distinguish one element from another element and is not intended to limit the element. For example, the first user equipment and the second user equipment may represent different user devices, regardless of order or importance. For example, a first element could be termed a second element, and, similarly, a second element could be renamed to a first element, without departing from the scope of the claims described in this disclosure.
The terminology used in the present disclosure is used only for describing particular examples and is not intended to limit the scope of other examples.
Unless the context clearly indicates otherwise, singular expressions may include plural expressions. The terms used herein, including technical or scientific terms, may have the same meanings as commonly understood by those of ordinary skill in the art described in this document.
Among the terms used in this disclosure, terms defined in general dictionaries may be interpreted as having meanings identical or similar to their contextual meanings in the related art, and should not be interpreted in an idealized or overly formal sense unless expressly so defined herein. In some cases, even terms defined in the present disclosure cannot be construed as excluding examples of the present disclosure.
The terminology used herein is for the purpose of describing particular examples only and is not intended to be limiting of the disclosure.
Unless the context clearly indicates otherwise, singular expressions include plural expressions. In this specification, terms such as "comprising" or "having" are intended to indicate the presence of the stated features, numbers, steps, operations, components, parts or combinations thereof. Thus, it should be understood that the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof is not precluded.
Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Terms such as terms defined in a general dictionary should be interpreted as having meanings consistent with the meanings in the context of the related art. No interpretation should be construed in an idealized or overly formal sense unless expressly so defined in this disclosure.
Each feature in the various examples of the present disclosure may be combined partially or completely or with each other. As will be fully appreciated by those skilled in the art, the various examples of the present disclosure are capable of being linked and driven differently in technology. Each example of the present disclosure may be implemented independently of each other or may be implemented together in an association relationship.
In describing examples, descriptions of technical content that is well known in the technical field to which the present disclosure pertains and is not directly related to the present disclosure may be omitted. This omission of unnecessary description is intended to convey the gist of the present disclosure more clearly without obscuring it.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 illustrates an apparatus for performing an activation function programming method according to an example of the present disclosure.
Referring to fig. 1, an apparatus A for performing an activation function programming method may include a neural processing unit (NPU) 1000 and an activation function conversion programming unit 3000. Here, apparatus A may refer to a system. Apparatus A may also include a processor 2000, a main memory 4000, an image sensor 5000, and a decoder 6000. Thus, apparatus A may be configured to perform various artificial neural network inference functions.
Each element included in apparatus A may send and receive data by communicating over the bus 7000.
Here, the NPU 1000, the processor 2000, the main memory 4000, the image sensor 5000, and the decoder 6000 may be configured as electronic circuits. The activation function conversion programming unit 3000 may be a computer program, software, firmware, application program, or executable code stored in a recording medium. However, the present disclosure is not limited thereto.
The activation function conversion programming unit 3000 may be a computer program configured to execute instructions for converting an activation function into a PAF expressed as a programmable parameter. The activation function conversion programming unit 3000 may be stored in a computer-readable recording medium. The computer readable recording medium may include ROM, RAM, SSD, HDD, CD-ROM, flash memory, magnetic tape, floppy disk, optical data storage device, etc.
The NPU 1000 is a processor separate from the processor 2000 that is dedicated to the operation of a Deep Neural Network (DNN). In particular, the NPU 1000 may include operators dedicated to convolution and matrix multiplication, which occupy most of the computational load of the DNN. The NPU 1000 and the processor 2000 may be semiconductor chips that include electronic circuitry.
The NPU 1000 may include a controller 100, a Direct Memory Access (DMA) 200, a memory 300, at least one processing element 400, and a programmed activation function execution (PAFE) unit 500. Hereinafter, the programmed activation function execution unit 500 is referred to as the PAFE unit.
The controller 100 may be electrically coupled to the DMA 200, the memory 300, the at least one processing element 400, and the PAFE unit 500. The controller 100 may be configured to control operations related to DNN operations in the NPU 1000.
However, the present disclosure is not so limited, and at least one processing element 400 may be modified and implemented as an array of processing elements (e.g., a systolic array).
The DMA 200 is configured to allow the NPU 1000 to directly access the main memory 4000 external to the NPU 1000 to perform read/write operations. The NPU 1000 may read various data related to DNNs from the main memory 4000 through the DMA 200. The DMA 200 may be configured to perform tasks such as setting, generating, and controlling the addresses of the internal memory 300.
The memory 300 may be a memory provided in an on-chip area of the NPU 1000, and may be a memory for caching or storing data processed in the on-chip area. The memory 300 may read and store data required for calculation of the artificial neural network model from the main memory 4000. The memory 300 may include one of memories such as ROM, SRAM, DRAM, resistive RAM, magnetoresistive RAM, phase change RAM, ferroelectric RAM, flash memory, and HBM. The memory 300 may be comprised of at least one memory cell. Memory 300 may be configured as a homogeneous memory unit or as a heterogeneous memory unit.
The at least one processing element 400 may be configured to process operations on the input data of the DNN and the corresponding parameters (e.g., weights, kernels, queries (Q), keys (K), values (V), etc.). The at least one processing element 400 may include a multiply-accumulate (MAC) operator and/or an Arithmetic Logic Unit (ALU) operator.
PAFE unit 500 is configured to receive data (i.e., programmable parameters) for a Programmable Activation Function (PAF) converted from an activation function.
For ease of illustration, the programmable activation function will be referred to as PAF.
The programmable parameter may be data generated by the activation function conversion programming unit 3000. The programmable parameters may be configured to have a form compatible with the circuitry of the PAFE unit 500 of the NPU 1000. The programmable parameters may be configured to implement at least one PAF. That is, PAFE unit 500 may be configured to receive programmable parameters corresponding to at least one PAF generated by activation function conversion programming unit 3000. In detail, the PAF programmed by the activation function conversion programming unit 3000 may include at least one programmable segment. That is, the programmable parameter may implement at least one programmable segment.
The NPU 1000 may perform DNN operations by receiving data for the PAF related to the activation function. PAFE unit 500 may generate an activation value (e.g., an activation map) by applying the PAF generated by activation function conversion programming unit 3000 to a calculated value (e.g., a feature map) output from at least one processing element 400. The PAFE unit uses at least one programmable parameter generated corresponding to the at least one PAF. Thus, PAFE unit 500 enables NPU 1000 to handle various activation functions, particularly newly proposed or known but partially modified activation functions.
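A rough software sketch of this behavior follows, assuming a purely piecewise-linear PAF: the programmable parameters are modeled as per-segment gradient/offset arrays plus interior segment boundaries, and the PAF is applied elementwise to a feature map. The names and array layout are illustrative assumptions, not the disclosed circuit.

```python
import numpy as np

# Sketch of a PAFE-style evaluation, assuming every programmable segment is
# linear. `bounds` holds the n-1 interior segment boundaries on the x-axis;
# `grads` and `offsets` hold one (a, b) pair per segment, so segment i
# computes grads[i] * x + offsets[i].
def apply_paf(feature_map, bounds, grads, offsets):
    x = np.asarray(feature_map, dtype=np.float32)
    seg = np.searchsorted(bounds, x)      # which segment each value falls in
    return grads[seg] * x + offsets[seg]

# Example: a four-segment piecewise-linear approximation of a sigmoid-like
# curve (saturate, rise, rise, saturate); the numbers are illustrative only.
bounds  = np.array([-4.0, 0.0, 4.0])
grads   = np.array([0.0, 0.125, 0.125, 0.0], dtype=np.float32)
offsets = np.array([0.0, 0.5,   0.5,   1.0], dtype=np.float32)
activation_map = apply_paf([[-6.0, -1.0], [2.0, 7.0]], bounds, grads, offsets)
```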
The PAFE unit 500 may be pipelined with the at least one processing element 400. According to this configuration, the values calculated by the at least one processing element 400 can be input through the pipeline. Accordingly, the pipelined PAFE unit 500 may be configured to receive operational values from the at least one processing element 400 and output activation values to which the PAF is applied. In this case, bottlenecks that may occur between the at least one processing element 400 and the PAFE unit 500 may be minimized or substantially eliminated. However, examples of the present disclosure are not limited to pipeline structures, and the PAFE unit may be implemented by being incorporated into the at least one processing element 400.
The activation function conversion programming unit 3000 may be operated by the processor 2000, but is not limited thereto. The processor 2000 may be an arithmetic device such as a Central Processing Unit (CPU) or an Application Processor (AP) capable of performing the activation function programming method disclosed in the present disclosure.
The activation function conversion programming unit 3000 may be stored in a computer-readable recording medium. The activation function conversion programming unit 3000 may be implemented as firmware or software included in hardware. A separate computing system and operating system may be provided to drive the activation function conversion programming unit 3000. The activation function conversion programming unit 3000 may be a program for operating the NPU 1000 including the PAFE unit 500. The activation function conversion programming unit 3000 may be configured to perform the activation function programming method. The activation function conversion programming unit 3000 may be executed by the processor 2000 or by a processor external to apparatus A. The activation function conversion programming unit 3000 may be configured separately from a compiler configured to compile DNNs in apparatus A. Alternatively, the activation function conversion programming unit 3000 may be integrated with such a compiler.
The activation function conversion programming unit 3000 may be configured to program at least one activation function. Activation function conversion programming unit 3000 may be configured to provide a programmable parameter corresponding to at least one PAF to PAFE unit 500.
The activation function conversion programming unit 3000 may be configured to receive activation function information included in the DNN to be processed by the NPU 1000. The activation function conversion programming unit 3000 may obtain information about all activation functions to be processed by the NPU 1000 based on the provided information about at least one activation function. Thus, the activation function conversion programming unit 3000 may program at least one activation function required for the DNN to be processed by the NPU 1000.
In various examples, the activation function conversion programming unit 3000 may generate segment data for segmenting the activation function, segment the activation function into a plurality of segments using the generated segment data, and approximate at least one segment among the plurality of segments as a programmable segment. When the values of the programmable parameters are determined, the approximation level of the programmable segment is also determined. The activation function conversion programming unit 3000 may determine the number and width of the plurality of segments based on the segment data.
The activation function conversion programming unit 3000 may be configured to analyze characteristics of the activation function. For example, the activation function conversion programming unit 3000 may be configured to analyze a gradient change of the activation function. Slope change data for an activation function may refer to various data from which slope changes for the activation function may be determined.
The activation function conversion programming unit 3000 may analyze characteristics of the activation function based on the slope change data. Specifically, the approximation error tends to increase in regions where the slope of the activation function changes more severely, whereas in regions where the slope does not change, the approximation error may be zero. Accordingly, the activation function conversion programming unit 3000 may be configured to approximate the activation function under optimal conditions by analyzing the slope change data.
For example, the slope change data of the activation function may be differential data of the activation function. The slope change data may include at least one of a slope change value, a first derivative value, a second derivative value, a third derivative value, and the like.
For example, the activation function conversion programming unit 3000 may determine a linear section and a nonlinear section of the PAF based on slope change data of the activation function.
In some examples, activation function conversion programming unit 3000 may determine a section having a substantially insignificant gradient change among the nonlinear sections of the PAF as a substantially linear section.
The activation function conversion programming unit 3000 may convert at least one segment into a programmable segment approximated by a specific equation.
For example, the activation function conversion programming unit 3000 may convert a specific segment of an activation function into a programmable segment approximated by a linear function.
Specifically, the activation function conversion programming unit 3000 may convert at least one segment into a programmable segment approximated with a specific gradient and a specific offset value. The activation function conversion programming unit 3000 may also convert at least one segment among the plurality of segments into a programmable segment using a specific nonlinear approximation equation. The activation function conversion programming unit 3000 may determine the gradient and offset for approximating at least one segment as a programmable segment corresponding to a linear function.
The activation function conversion programming unit 3000 may search for a minimum error value while varying the gradient value and the offset value of the programmable segment. Alternatively, the activation function conversion programming unit 3000 may search for the minimum error value by evaluating a cost function.
The activation function conversion programming unit 3000 may calculate an error value between at least one segment of the activation function to be transformed and at least one candidate segment having a candidate gradient and a candidate offset. The activation function conversion programming unit 3000 may determine at least one candidate segment as a programmable segment based on the calculated error value. The activation function conversion programming unit 3000 may search for at least one minimum error value between a segment of the activation function and each of the corresponding programmable segments. The activation function conversion programming unit 3000 may determine the programmable parameters of the programmable segment based on at least one searched minimum error value. Here, the determined error value may be a minimum error value. When the activation function conversion programming unit 3000 determines the programmable parameter based on the minimum error value, degradation of the inference accuracy of the DNN can be suppressed or minimized.
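A brute-force version of this candidate search can be sketched as follows, under stated assumptions: the target activation is densely sampled over one segment, candidate gradients are swept around the chord slope, and for each candidate gradient the offset minimizing the maximum absolute error is the midrange of the residual, which makes the inner search closed-form. The function names and numeric choices are illustrative, not taken from the disclosure.

```python
import numpy as np

# Illustrative minimum-error search for one segment [x0, x1] of a target
# activation f. For a fixed gradient a, the offset b that minimizes the
# maximum absolute error is the midrange of the residual ys - a*xs.
def fit_segment(f, x0, x1, n_candidates=201):
    xs = np.linspace(x0, x1, 256)
    ys = f(xs)
    chord = (ys[-1] - ys[0]) / (x1 - x0)      # chord slope as a seed
    best = (np.inf, 0.0, 0.0)                 # (max error, gradient, offset)
    for a in np.linspace(chord - 0.5, chord + 0.5, n_candidates):
        r = ys - a * xs                       # residual left for the offset
        b = 0.5 * (r.max() + r.min())         # minimax-optimal offset for this a
        err = 0.5 * (r.max() - r.min())       # resulting maximum error
        if err < best[0]:
            best = (err, a, b)
    return best

# Example: fit the segment [0, 4] of a sigmoid.
err, a, b = fit_segment(lambda x: 1.0 / (1.0 + np.exp(-x)), 0.0, 4.0)
```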
However, examples of the present disclosure are not limited to the minimum error value, and the programmable parameters may be determined differently according to the relative priorities of the amount of computation, the amount of power consumption, and the approximation error value.
Furthermore, the activation function conversion programming unit 3000 may measure the approximation error value of a programmable segment obtained by converting a specific segment using a specific approximation function. For example, the activation function conversion programming unit 3000 may measure a first error value by approximating a specific segment as a programmable segment in the form of a linear function. Further, the activation function conversion programming unit 3000 may measure a second error value by approximating the specific segment as a programmable segment in the form of a quadratic function. The activation function conversion programming unit 3000 may compare the first error value and the second error value and select the approximation function having the relatively smaller error value for the programmable segment. Through the above-described procedure, the activation function conversion programming unit 3000 can select an activation function for an artificial neural network operation and convert it into a PAF.
That is, when determining the approximation function of the programmable segment, the format of the programmable parameters may also be determined. For example, if a particular segment is approximated as a programmable segment of a linear function, the corresponding programmable parameters may include gradient and offset values. For example, if a particular segment is approximated by a programmable segment of a quadratic function, the corresponding programmable parameter may include coefficients of the quadratic term. An approximation function for each programmable segment may be selectively determined. That is, the approximation functions of the first programmable segment and the second programmable segment may be the same or different from each other.
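A hypothetical record for such per-segment parameters might look as follows; the field names are assumptions, but the content mirrors the text: every programmable segment carries a gradient and an offset, and a quadratic segment additionally carries the coefficient of its quadratic term.

```python
from dataclasses import dataclass

# Hypothetical per-segment parameter record; names are illustrative.
@dataclass
class ProgrammableSegment:
    x_start: float           # segment start on the x-axis
    x_end: float             # segment end on the x-axis
    kind: str                # approximation function, e.g. "linear" or "quadratic"
    gradient: float          # linear coefficient a
    offset: float            # constant coefficient b
    quad_coeff: float = 0.0  # quadratic coefficient, used when kind == "quadratic"

    def evaluate(self, x: float) -> float:
        y = self.gradient * x + self.offset
        if self.kind == "quadratic":
            y += self.quad_coeff * x * x
        return y
```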
The criteria for determining the characteristics of the approximation function for each programmable segment may be determined based on any of the computational effort, power consumption, and approximation error values of PAFE unit 500.
For example, the criteria used to determine the characteristics of the approximation function of a programmable segment may vary depending on the relative priorities of the amount of computation, the amount of power consumption, and the approximation error value. The priorities may be set in the activation function conversion programming unit 3000. In other words, the activation function conversion programming unit 3000 may search for programmable parameters that implement an approximation function of a programmable segment so as to achieve specific performance goals among high-speed operation, low power consumption, and suppression of inference accuracy degradation. However, examples of the present disclosure are not limited to a particular approximation criterion.
Main memory 4000 may store data required for computation of an artificial neural network model. Main memory 4000 may include one of memories such as ROM, SRAM, DRAM, resistive RAM, magnetoresistive RAM, phase change RAM, ferroelectric RAM, flash memory, and HBM. The main memory 4000 may be composed of at least one memory cell. Main memory 4000 may be configured as either homogeneous memory units or heterogeneous memory units.
The image sensor 5000 generates image or video data from light entering through a lens. The NPU 1000 may use image or video data as input data for the DNN processed in the NPU 1000.
The decoder 6000 decodes an encoded bitstream, and the decoded data may be used as an input of the DNN.
The bitstream may be a bitstream encoded to perform at least one task.
Tasks that may be included in the bitstream may include object detection, object segmentation, image/video reconstruction, image/video enhancement, object tracking, event recognition, event prediction, anomaly detection, density estimation, event search, measurement, and the like.
The bitstream may include a plurality of encoded operand values capable of handling a plurality of tasks.
The output data of the decoder 6000 may be an image, a video, the calculated values of a specific layer of a DNN, or the like.
Hereinafter, the activation function programming method will be described in detail with reference to fig. 2 to 4.
FIG. 2 illustrates an activation function programming method in accordance with an example of the present disclosure.
Referring to fig. 2, the activation function programming method includes a step S200 of generating segment data for segmenting an activation function, a step S210 of segmenting the activation function into a plurality of segments using the generated segment data, and a step S220 of approximating at least one of the plurality of segments to a programmable segment.
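Taken together, the three steps can be composed as in the following sketch, assuming every segment is approximated linearly. Here fit_segment is the candidate-search sketch given earlier, and generate_segment_data is a placeholder elaborated in a later sketch; neither is a disclosed component.

```python
# Composition of steps S200-S220, assuming all-linear programmable segments.
# generate_segment_data and fit_segment refer to the illustrative sketches
# elsewhere in this description, not to disclosed components.
def program_activation_function(f, x_min, x_max):
    boundaries = generate_segment_data(f, x_min, x_max)         # S200
    segments = list(zip(boundaries[:-1], boundaries[1:]))       # S210
    params = [fit_segment(f, x0, x1)[1:] for (x0, x1) in segments]  # S220: (a, b)
    return segments, params
```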
In step S200, segment data is generated. The segment data is data generated for segmenting the activation function into a plurality of segments. The segment data will be described later.
In step S210, the activation function is segmented into a plurality of segments using the generated segment data. In this disclosure, the term "segment" means a portion of an activation function divided into a plurality of segments, and is to be distinguished from "candidate segment" or "programmable segment", which are terms related to the approximation of the activation function.
In various examples, step S210 may include a step of determining the number and width of the plurality of segments based on the segment data. In step S210, the number of segments into which the activation function to be converted is divided and the width of each segment may be determined using the segment data. At least one of the plurality of segments may have the same width as or a different width from the other segments.
In the present disclosure, a fragment of a plurality of fragments may be expressed as coordinates of a start point and an end point along an x-axis. Further, it should be understood that when determining the number of the plurality of segments and the width of each segment, the coordinates of the segments in the plurality of segments may be obtained using the number and the width of the plurality of segments.
In step S220, at least one segment among the plurality of segments is approximated as a programmable segment. The programmable segments may be programmed according to the hardware configuration of PAFE unit 500. That is, activation function conversion programming unit 3000 may be configured to program an activation function to be processed in NPU 1000 based on the hardware configuration of PAFE unit 500.
For example, PAFE unit 500 may be configured with hardware configured to calculate each segment with a particular gradient and a particular offset. Activation function conversion programming unit 3000 may be configured to receive configuration information for PAFE unit 500.
In this case, the activation function conversion programming unit 3000 may program a segment of the corresponding activation function, whether it has the form of a linear function or of a quadratic or higher-order function, with a slope and an offset. For example, the programmable segments may be approximated by a linear function according to certain criteria. In this case, the activation function conversion programming unit 3000 may generate a programmable segment expressed in the form of "(gradient a) × (input value x) + (offset b)". The specific gradient and the specific offset may be the programmable parameters. In the case of determining a programmable segment to be approximated with a linear function, step S220 may include a step of approximating the selected segment with a specific gradient and a specific offset value.
In some examples, steps S210 and S220 may be performed substantially as one step. This is because the step of segmenting the activation function and the step of generating programmable parameters for the corresponding programmable segments can be performed simultaneously. To elaborate, in some examples, steps S210 and S220 may be modified into a single step of segmenting the activation function into a plurality of segments using the generated segment data and approximating at least one of the plurality of segments as a programmable segment.
Fig. 3 illustrates a process in which an activation function is approximated by an activation function programming method according to an example of the present disclosure.
The activation function shown in (a) of fig. 3 may be segmented into a plurality of segments s1, s2, s3, and s4 using segment data, as shown in (b) of fig. 3. The plurality of segments s1, s2, s3, and s4 are approximated as programmable segments a1·x+b1, a2·x+b2, a3·x+b3, and a4·x+b4, as shown in (c) of fig. 3. Here, an example in which the activation function conversion programming unit 3000 generates programmable parameters such that all programmable segments correspond to linear functions is described.
Each programmable segment includes a corresponding programmable parameter. In fig. 3 (c), all of the plurality of segments are approximated as programmable segments in the form of linear functions. However, in various examples, some of the plurality of fragments may be approximated with other types of programmable fragments. For example, the activation function conversion programming unit 3000 may program each programmable segment in the form of a linear function, a quadratic function, a cubic function, a logarithmic function, or the like.
For example, only the segments s1, s3, and s4 may be approximated as programmable segments, while the segment s2 may be approximated using other methods available in the device that is to process the activation function. Specifically, if a predetermined and stored lookup table, a nonlinear approximation equation, or the like is available in hardware for the section of the segment s2, the predetermined and stored lookup table may be used to approximate the segment s2.
In other words, the activation function conversion programming unit 3000 may be configured to independently program each of the segments s1, s2, s3, and s4. To this end, the activation function conversion programming unit 3000 receives the hardware configuration information of the PAFE unit 500. The activation function conversion programming unit 3000 may be configured to independently determine an approximation method for each of the segments s1, s2, s3, and s4 based on the hardware configuration information of the PAFE unit 500.
For example, PAFE unit 500 may be configured to include circuitry that supports linear function operations. In this case, the activation function conversion programming unit 3000 may program each of the segments s1, s2, s3, and s4 in the form of a linear function.
For example, PAFE unit 500 may be configured to include circuitry that supports both linear and quadratic function operations. In this case, the activation function conversion programming unit 3000 may program each of the segments s1, s2, s3, and s4 in the form of a linear function or a quadratic function.
For example, PAFE unit 500 may be configured to include circuitry that supports linear, quadratic, and logarithmic function operations. In this case, the activation function conversion programming unit 3000 may selectively program each of the segments s1, s2, s3, and s4 in the form of a linear function, a quadratic function, or a logarithmic function.
For example, PAFE unit 500 may be configured to include circuitry that supports linear, quadratic, logarithmic, and exponential function operations. In this case, the activation function conversion programming unit 3000 may selectively program each of the segments s1, s2, s3, and s4 in the form of a linear function, a quadratic function, a logarithmic function, or an exponential function.
For example, if PAFE unit 500 is configured to include circuitry configured to support at least one particular function operation, activation function conversion programming unit 3000 may program each of segments s1, s2, s3, and s4 in the form of a corresponding particular function.
For example, PAFE unit 500 may be configured to include at least one of linear function computation circuitry, quadratic function computation circuitry, cubic function computation circuitry, logarithmic function computation circuitry, exponential function computation circuitry, or similar function computation circuitry designed as hardware.
For example, the activation function conversion programming unit 3000 may program the same activation function in different ways.
For example, the activation function conversion programming unit 3000 may program a specific activation function as a linear function only.
For example, the activation function conversion programming unit 3000 may program a specific activation function as a quadratic function only.
For example, the activation function conversion programming unit 3000 may program a specific activation function as only a cubic function.
For example, the activation function conversion programming unit 3000 may program a specific activation function as only a logarithmic function.
For example, the activation function conversion programming unit 3000 may program a specific activation function as only an exponential function.
For example, the activation function conversion programming unit 3000 may program each of a plurality of segments of a specific activation function as a corresponding approximation function.
For example, activation function conversion programming unit 3000 may program multiple fragments of a particular activation function as a set of approximation functions having different functions.
FIG. 4 illustrates various cases of segmenting an activation function into multiple segments by an activation function programming method according to an example of the present disclosure.
Referring to fig. 4 (a), the PAF may be segmented into four segments having a uniform width.
Referring to fig. 4 (b), the PAF may be segmented into four segments having different widths.
Referring to fig. 4 (c), the PAF may be segmented into four segments having different widths.
Referring to fig. 4 (d), the PAF may be segmented into six segments having different widths.
The number of segments and the width of each segment may be determined using the segment data.
The activation function conversion programming unit 3000 may be configured to divide the activation function into a plurality of segments having different widths by analyzing the nonlinearity of the activation function. However, the present disclosure is not limited thereto.
The activation function conversion programming unit 3000 may be configured to analyze the nonlinearity of the activation function such that each of the plurality of segments is segmented with an optimal width. However, the present disclosure is not limited thereto.
In this disclosure, the activation function may be implemented in various forms including feature sections. When the activation function is segmented into a plurality of segments, the number and width of the plurality of segments may be differently determined according to various shapes of the activation function.
For example, various activation functions such as the swish function, the Mish function, the sigmoid function, the hyperbolic tangent (tanh) function, the SELU function, the Gaussian Error Linear Unit (GELU) function, the Softplus function, the ReLU function, the Leaky ReLU function, the Maxout function, the ELU function, and the like may have various shapes divided into a plurality of feature sections including (substantially) linear sections and/or nonlinear sections. Thus, when approximating the nonlinear activation functions to be processed in hardware, the feature sections are considered for segmentation. That is, if the number and width of segments are determined taking into account the (substantially) linear sections and the nonlinear sections, the activation function can be approximated more efficiently in response to the features of each activation function.
Thus, in the method of approximating an activation function according to the present disclosure, the concept of segment data is presented to segment the activation function in view of these feature sections of the activation function. The segment data may include discontinuity information of the activation function, derivative data, information about the hardware in which the activation function is processed, and the like, and may also include data derived by processing them.
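As one illustrative reading of how derivative data could be turned into segment data, the sketch below accumulates the magnitude of the second derivative along the x-axis and cuts a new boundary whenever the accumulation crosses a threshold, so strongly curved regions receive narrower segments (this echoes the accumulated-second-derivative criterion illustrated for fig. 13). The threshold and sampling density are assumptions.

```python
import numpy as np

# Sketch of segment-data generation from numerical derivatives: boundaries
# are cut where accumulated |second derivative| exceeds a threshold.
def generate_segment_data(f, x_min, x_max, n=1024, threshold=0.25):
    xs = np.linspace(x_min, x_max, n)
    d1 = np.gradient(f(xs), xs)            # first derivative (slope)
    d2 = np.abs(np.gradient(d1, xs))       # |second derivative| (slope change)
    boundaries, acc = [x_min], 0.0
    for i in range(1, n):
        acc += d2[i] * (xs[i] - xs[i - 1]) # accumulated slope change
        if acc >= threshold:
            boundaries.append(xs[i])       # cut a segment boundary here
            acc = 0.0
    boundaries.append(x_max)
    return np.unique(boundaries)
```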
Hereinafter, a detailed process of segmenting the activation function into a plurality of segments using discontinuity information among segment data will be described with reference to fig. 5 to 7.
Fig. 5 illustrates an example of segmenting an activation function into linear or nonlinear segments by using slope change data of segment data of an activation function programming method according to an example of the present disclosure.
The gradient change point of the activation function may mean a point at which the gradient of the activation function changes. For example, the activation function conversion programming unit 3000 may be configured to generate slope change data (e.g., differential data) for analyzing gradient change points of the activation function. However, the slope change data of the present disclosure is not limited to differential data and may include similar data.
Slope change data according to examples of the present disclosure may include the n-th order derivative values of the activation function, such as the first, second, and third derivative values. Here, the slope change data may indicate the gradient change rate and the gradient change points associated with the activation function.
The process of searching for gradient change points will be described below with reference to fig. 5.
Among the differential data of the activation function f (x) for fig. 5 (a), the first derivative f' (x) is shown in fig. 5 (b). Furthermore, among the differential data of the activation function f (x) for fig. 5 (a), the second derivative f "(x) is shown in fig. 5 (c).
For example, the activation function conversion programming unit 3000 may be configured to extract the start point and the end point of a section whose first derivative value does not change. As shown in fig. 5 (b), the activation function conversion programming unit 3000 generates slope change data corresponding to the first derivative value. The activation function conversion programming unit 3000 then recognizes that the first derivative value does not change within each of the sections w2 and w3, although the two sections have different first derivative values. Thus, the activation function conversion programming unit 3000 may determine each of the sections w2 and w3 to be a linear section. That is, the slope change data corresponding to the first derivative value does not change within a linear section. However, since the first derivative value differs between the sections w2 and w3, the slope change data corresponding to the first derivative value has discontinuity points d1 and d2 at the boundaries of the sections w2 and w3. That is, since the slope change data corresponding to the first derivative value is discontinuous at the boundary of each of the sections w2 and w3, the boundary of each of the sections w2 and w3 may correspond to a gradient change point.
In this case, the activation function conversion programming unit 3000 may convert the linear section into a programmable parameter in the form of a corresponding linear function. Thus, the linear section of the activation function to be programmed can be segmented into linear functions with a specific slope and a specific offset. The first derivative of the linear section may be a constant value. In other words, even if a linear function is used to approximate a linear section, the approximation error value may be zero. Thus, activation function conversion programming unit 3000 may determine that there is substantially no approximation error in each of sections w2 and w 3. That is, when activation function conversion programming unit 3000 approximates each of sections w2 and w3 with a linear function, the amount of computation and power consumption of PAFE unit 500 is minimized, and the approximation error value may also be zero.
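As an illustration of how a linear section can be detected from the slope change data, the following is a minimal numerical sketch, not the patent's implementation; it assumes a sampled activation function, and the function name and tolerance are illustrative assumptions.

```python
import numpy as np

def linear_sections(x, fx, tol=1e-3):
    """Return (start, end) index pairs where f'(x) is nearly constant."""
    d1 = np.gradient(fx, x)  # slope change data: first derivative values
    sections, start = [], 0
    for i in range(1, len(d1)):
        # a jump in f'(x) marks a gradient change point (a discontinuity)
        if abs(d1[i] - d1[i - 1]) > tol:
            if i - start > 2:
                sections.append((start, i))
            start = i
    if len(d1) - start > 2:
        sections.append((start, len(d1)))
    return sections

# Example: a piecewise-linear function with one gradient change point at x = 0
x = np.linspace(-4.0, 4.0, 801)
fx = np.where(x < 0, 0.1 * x, x)  # Leaky-ReLU-like shape
print(linear_sections(x, fx))     # two long linear runs on either side of the kink
```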
The activation function conversion programming unit 3000 may be configured to determine a section in which the first derivative of the activation function is not constant as a section of a quadratic or higher-order function (nonlinear function), that is, a curved section.
In the present disclosure, the term "linear section" in relation to differential data means a section in which the first derivative of the activation function is a constant (including zero), or a section in which the activation function is expressed as a linear function, and the term "nonlinear section" may mean a section in which the first derivative of the activation function is not constant. However, the determination of a linear section in examples of the present disclosure is not made by the differential value alone. That is, the activation function conversion programming unit 3000 may be configured to determine or classify linear sections in various ways upon receiving the activation function.
The activation function conversion programming unit 3000 may be configured to preferentially determine whether a linear section exists. The activation function conversion programming unit 3000 may be configured to convert a linear section into a programmable parameter in the form of a linear function and convert the remaining nonlinear section into a programmable parameter in the form of a specific function.
In detail, the differential data described in the examples of the present disclosure is just one mathematical calculation method for calculating the slope of the activation function. Thus, the present disclosure is not limited to differential values, and substantially similar slope calculation methods may be utilized.
The search for gradient change points is not limited to the above method, and the activation function conversion programming unit 3000 may be configured to determine a point as a gradient change point when the change of the first derivative of the activation function along the x-axis becomes greater than a specific threshold.
Then, the activation function conversion programming unit 3000 may be configured to extract the start point and the end point of a section in which the second derivative value does not change. As shown in fig. 5 (c), the activation function conversion programming unit 3000 generates slope change data corresponding to the second derivative. The activation function conversion programming unit 3000 then recognizes that the second derivative value does not change within each of the sections w1-1 and w1-2, although the two sections have different second derivative values. Since the second derivative value differs between the sections w1-1 and w1-2, the slope change data corresponding to the second derivative has a discontinuity point d3 at the boundary between the sections w1-1 and w1-2. That is, since the slope change data corresponding to the second derivative is discontinuous at the boundary between the sections w1-1 and w1-2, that boundary may correspond to a gradient change point.
In this case, the activation function conversion programming unit 3000 may convert the nonlinear section into programmable parameters in the form of a corresponding quadratic function. Thus, the nonlinear section of the activation function to be programmed can be approximated by a quadratic function comprising a quadratic coefficient, a linear coefficient, and a constant. The second derivative of such a nonlinear section may be a constant value. In other words, even if a quadratic function is used to approximate the nonlinear section, the approximation error value may be zero. Thus, the activation function conversion programming unit 3000 may determine that there is substantially no approximation error in each of the sections w1-1 and w1-2. That is, when the activation function conversion programming unit 3000 approximates each of the sections w1-1 and w1-2 with a quadratic function, the amount of computation and the power consumption of the PAFE unit 500 are minimized, and the approximation error value may also be zero.
However, examples of the present disclosure are not limited thereto, and the sections w1-1 and w1-2 may be approximated with linear functions. In this case, the approximate error value may be increased, but the power consumption of the NPU 1000 may be reduced by reducing the computational effort of the PAFE unit 500 of the NPU 1000. That is, the activation function conversion programming unit 3000 may differently determine the programmable parameters according to different priorities among the calculated amount, the power consumption amount, and the approximate error value.
The second derivative of the activation function may be indicative of the rate of change of the slope of the activation function. Since the section where the second derivative of the activation function is relatively large is a section where the rate of change of the slope is large, the section of the activation function corresponding to the section has a large slope change, so that there is a significant increase or decrease. In contrast, since the section where the second derivative of the activation function is relatively small is a section where the rate of change of the slope is small, the section of the activation function corresponding to the section has a small slope change, so that there is a small increase or decrease.
In particular, the section where the second derivative of the activation function is less than or equal to the specific threshold is a section where the rate of change of the slope is very small.
Thus, the activation function conversion programming unit 3000 may be configured to determine the activation function of the section as a substantially linear function section in which the slope hardly changes.
For example, the activation function conversion programming unit 3000 may be configured to determine a section in which the second derivative of the activation function is less than or equal to the threshold value as a "substantially linear section". The threshold value of the second derivative of the activation function will be described later.
The differentiation order at which the derivative of the activation function becomes zero or a constant may indicate the degree of change in the slope of the activation function. Specifically, since the gradient of a function generally changes more rapidly as the degree of its highest-order term increases, a section in which the activation function has a highest-order term of higher degree is a section with steep slope changes, and it can be distinguished from other sections and segmented into a larger number of segments.
The degree of the highest-order term of the activation function in a specific section may be determined by the differentiation order at which the derivative becomes zero or a constant in that section.

For example, for an activation function whose highest-order term is of third degree in a specific section, the third derivative of the activation function becomes a constant (proportional to the coefficient of the highest-order term) and the fourth derivative becomes zero in that section. Accordingly, an activation function whose third derivative is a nonzero constant, or whose fourth derivative is zero, in a specific section may be determined to have a highest-order term of third degree in that section.
In various examples, a section of the activation function having a third or higher order of the highest order term may be segmented to have a greater number of segments than other sections. For example, the number of fragments may be determined as the maximum number of segmentable fragments for the corresponding section in the hardware in which the activation function is to be processed.
Slope change data (i.e., first derivative f' (x)) may be used to identify gradient change points of the activation function. Using slope change data (i.e., first derivative f' (x)), the activation function f (x) may be segmented into three segments (w 1, w2, w 3) comprising two linear segments (w 2, w 3).
That is, the activation function conversion programming unit 3000 may determine and segment the linear sections w2 and w3 and the nonlinear section w1 using the slope change data of the activation function f(x) to be programmed.
That is, the activation function f(x) may be segmented according to sections in which the first derivative f'(x) is a nonzero constant, zero, or a curve (nonlinear function). In other words, the activation function f(x) may be segmented at points where the activation function f(x) is not differentiable or at points where the first derivative f'(x) is discontinuous.
Although the result of segmentation into three sections is shown in (b) of fig. 5, this is for the sake of brief description of the process of segmentation into linear sections and nonlinear sections, and thus, it should be understood that the activation function f (x) may be segmented into four or more sections, that is, at least four sections using the fragment data.
For example, according to an activation function programming method according to an example of the present disclosure, the nonlinear section w1 may be further segmented into a plurality of sections using the segment data. The activation function may be segmented into a greater number of segments and approximated by additionally segmenting the section w1, so that the approximation error may be reduced. In this disclosure, the term "approximation error" refers to the difference between a particular segment of the activation function and the programmable segment that approximates that particular segment.
Fig. 6 illustrates an example of segmenting an activation function into a substantially linear section and a nonlinear section using slope change data among segment data in an activation function programming method according to an example of the present disclosure.
The absolute value of the second derivative f "(x) of the derivative data of the activation function f (x) of fig. 6 (a) is shown in fig. 6 (b). The activation function conversion programming unit 3000 may be configured to determine the substantially linear section by setting a specific threshold to the second derivative f "(x). Referring to (b) of fig. 6, when the maximum Max of the absolute value of the second derivative f "(x) of the activation function f (x) is 0.5, the threshold Th may be set to 0.05, which is 10% of the maximum Max. In other words, it may be determined such that the activation function has a linear characteristic as the second derivative f "(x) becomes smaller, and has a nonlinear characteristic as the second derivative f" (x) becomes larger.
That is, the threshold Th may be determined as a relative ratio of the maximum Max of the absolute value of the second derivative f "(x) of the activation function f (x). The threshold Th of the substantially linear section may be determined based on whether an error occurring when approximating the nonlinear section to a linear section is acceptable. For example, the threshold value of the substantially linear section may be determined from the level of error value of each segment that determines the degree of degradation of the inference accuracy of the DNN to which the PAF is applied.
In other words, as the threshold of the substantially linear section increases, the segment of the linear section can be programmed more widely. Meanwhile, as the width of the segments increases, the number of segments may decrease. That is, the total number and width of fragments of the PAF may vary depending on the threshold of the substantially linear section.
The search for the substantially linear section may be performed after the search for the linear section. However, the present disclosure is not limited to the order of linear section search and substantially linear section search.
In the example of fig. 6 (b), the relative ratio may be determined to be 10%. The present disclosure is not limited thereto and may be determined to be 5% of the maximum value Max according to the allowable error of DNN. With differential data, that is to say with the second derivative f "(x), the activation function f (x) can be segmented into a section w1 and a section w3 in which the second derivative f" (x) is smaller than the threshold Th of the substantially linear section, and a section w2 in which the second derivative f "(x) is greater than or equal to the threshold Th of the substantially linear section. In the activation function f (x), slope change data may be used to determine and segment the substantially linear segments w1 and w3 and the nonlinear segment w2. When determining the first to third sections w1, w2 and w3, the first to third segments s1, s2 and s3 may be programmed into programmable segments using corresponding programmable parameters.
In fig. 6 (b), the result of segmentation into three segments s1, s2 and s3 corresponding to the three segments w1, w2 and w3 is shown for the sake of briefly explaining the segmentation into a substantially linear segment and a nonlinear segment. That is, it should be understood that the activation function f (x) may be segmented into four or more sections, i.e., at least four segments, using the segment data.
For example, according to an activation function programming method according to an example of the present disclosure, the nonlinear section w2 may be further segmented into a plurality of sections using segment data. The approximation error may be reduced by additional segmentation of the nonlinear section w 2.
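The threshold test described above can be sketched as follows. This is a hedged illustration assuming a sampled activation function, with Th set as a relative ratio (here 10%) of the maximum of the absolute second derivative; the function and variable names are assumptions.

```python
import numpy as np

def classify_sections(x, fx, ratio=0.10):
    """Split x into substantially linear / nonlinear runs via |f''(x)| < Th."""
    d2 = np.abs(np.gradient(np.gradient(fx, x), x))  # |f''(x)|
    th = ratio * d2.max()  # Th as a relative ratio of Max, e.g. 10%
    mask = d2 < th         # True where the section is substantially linear
    runs, start = [], 0
    for i in range(1, len(mask)):
        if mask[i] != mask[i - 1]:
            runs.append((bool(mask[start]), start, i))
            start = i
    runs.append((bool(mask[start]), start, len(mask)))
    return runs

x = np.linspace(-6.0, 6.0, 1201)
fx = 1.0 / (1.0 + np.exp(-x))  # sigmoid as the activation function f(x)
for is_lin, i0, i1 in classify_sections(x, fx):
    kind = "substantially linear" if is_lin else "nonlinear"
    print(f"{kind}: x in [{x[i0]:.2f}, {x[i1 - 1]:.2f}]")
```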
Fig. 7 illustrates another example of segmenting an activation function into a substantially linear section and a nonlinear section using slope change data among segment data in an activation function programming method according to an example of the present disclosure.
Referring to fig. 7, in the activation function f(x), the nonlinear section may be determined based on the threshold Th of the substantially linear section of the fragment data, that is, based on the absolute value of the second derivative f"(x). That is, a section equal to or greater than the threshold Th of the substantially linear section may be determined as a nonlinear section. Specifically, referring to (b) of fig. 7, the activation function conversion programming unit 3000 may segment the activation function f(x) into a substantially linear section and a nonlinear section using differential data, that is, using the second derivative f"(x). Further, as an example, the activation function conversion programming unit 3000 may segment the nonlinear section of the activation function f(x) into segments s2 and s3 corresponding to the two sections w2 and w3.
That is, the activation function conversion programming unit 3000 may classify the substantially linear sections w1 and w4 and the nonlinear sections w2 and w3 using the slope change data of the activation function f(x), and then segment the nonlinear sections w2 and w3.
The activation function conversion programming unit 3000 may be configured to search for the best programmable parameter corresponding to each segment in various ways. For example, the activation function conversion programming unit 3000 may search for an optimal programmable parameter capable of achieving a specific performance between high-speed operation, low power consumption, and suppression of degradation of inference accuracy.
In fig. 7 (b), the segments s1, s2, s3, and s4 corresponding to the four sections w1, w2, w3, and w4 are shown; however, this is for the sake of brief description of the segmentation into substantially linear sections and nonlinear sections. Thus, it should be appreciated that the activation function f(x) may be segmented into five or more sections, that is, at least five segments, using the segment data.
For example, according to an activation function programming method according to an example of the present disclosure, the nonlinear sections w2 and w3 may be further segmented into a plurality of sections using the segment data. Specifically, the nonlinear sections w2 and w3 may be segmented based on the maximum Max of the second derivative f"(x). That is, the region in which the second derivative f"(x) rises from the threshold Th of the substantially linear section to the maximum Max is segmented into the section w2, and the region in which the second derivative f"(x) falls from the maximum Max back to the threshold Th of the substantially linear section is segmented into the section w3.
The approximation error can be further reduced when performing additional segmentation in the non-linear sections w2 and w 3.
Fig. 8 illustrates another example of segmenting an activation function into nonlinear sections by using slope change data among segment data in an activation function programming method according to an example of the present disclosure.
Referring to fig. 8, in the activation function f (x), the nonlinear section may be determined based on a threshold Th of a substantially linear section of the fragment data, that is, based on an absolute value of a second derivative value f "(x). That is, a region equal to or greater than the threshold Th of the substantially linear section may be determined as the nonlinear section. Specifically, referring to (b) of fig. 8, the activation function conversion programming unit 3000 may segment the activation function f (x) into a substantially linear section and a nonlinear section using differential data, that is, using the second derivative f "(x). Furthermore, the activation function conversion programming unit 3000 may segment the nonlinear section of the activation function f (x) into segments s2, s3, and s4 corresponding to the three sections w2, w3, and w4, for example.
The activation function conversion programming unit 3000 may classify the substantially linear sections w1 and w5 and the nonlinear sections w2, w3, and w4 using the slope change data of the activation function f(x), and then segment the nonlinear sections w2, w3, and w4.
However, examples of the present disclosure are not limited to the use of substantially linear sections, and a substantially linear section may instead be segmented as a nonlinear section. That is, the step of determining the substantially linear section may not be performed in some cases.
The activation function conversion programming unit 3000 may be configured to search for the best programmable parameter corresponding to each segment in various ways. For example, the activation function conversion programming unit 3000 may search for an optimal programmable parameter capable of achieving a specific performance between high-speed operation, low power consumption, and suppression of degradation of inference accuracy.
In fig. 8 (b), the segments s1, s2, s3, s4, and s5 corresponding to the five sections w1, w2, w3, w4, and w5 are shown; however, this is for the sake of brief description of the process of segmentation into substantially linear sections and nonlinear sections. Thus, it should be appreciated that the activation function f(x) may be segmented into six or more sections, that is, at least six segments, using the segment data. However, examples of the present disclosure are not limited to the use of substantially linear sections, and a substantially linear section may instead be segmented as a nonlinear section.
For example, according to an activation function programming method according to an example of the present disclosure, the nonlinear sections w2, w3, and w4 may be further segmented into a plurality of sections using segment data.
Specifically, the nonlinear sections w2, w3, and w4 are segmented based on the integral value ∫f"(x)dx of the second derivative f"(x). In other words, the activation function conversion programming unit 3000 may segment the nonlinear section based on the integrated value of the slope change data.

The approximate error value between the PAF and the activation function may increase when the integral value ∫f"(x)dx of the second derivative f"(x) is higher. That is, when the integral value ∫f"(x)dx is high, an error may occur, resulting in deterioration of the inference accuracy. On the other hand, as the permitted integral value ∫f"(x)dx increases, the width of a segment can be widened. Conversely, the smaller the integral value ∫f"(x)dx, the narrower the width of a segment.

Thus, the activation function conversion programming unit 3000 may set a specific integral value ∫f"(x)dx of the second derivative as the integral threshold of the segment approximation error. For example, the activation function conversion programming unit 3000 may integrate the second derivative f"(x) from the end of the section w1. Thus, the segment w2 may extend from the end of the section w1 until the accumulated integral reaches the preset integral threshold of the segment approximation error.
More specifically, in the section w2, the integral of the second derivative, ∫w2 f"(x)dx, may be segmented as s2 so as to correspond to the integral threshold of the segment approximation error. Likewise, in the section w3, the integral ∫w3 f"(x)dx may be segmented as s3, and in the section w4, the integral ∫w4 f"(x)dx may be segmented as s4, each corresponding to the integral threshold of the segment approximation error.

That is, the integral value of the second derivative f"(x) in the section w2, the integral value in the section w3, and the integral value in the section w4 may each be the same value as the integral threshold of the segment approximation error.
However, the integral threshold of the segment approximation error may be affected by hardware data including at least one of the number of comparators of the PAFE unit 500 of the NPU 1000, the number of gates of the circuitry used to implement the PAFE unit 500, and the type of arithmetic circuitry implemented (linear function circuitry, quadratic function circuitry, cubic function circuitry, exponential circuitry, logarithmic circuitry, anti-logarithmic circuitry, etc.). That is, the activation function conversion programming unit 3000 may be configured to determine the integral threshold of the segment approximation error in consideration of the hardware data.
That is, the smaller the integral threshold of the segment approximation error, the closer the PAF can be to the activation function. In other words, as the integral threshold of the segment approximation error decreases, the number of programmable segments increases, so the approximation error value of the PAF can be further reduced.
However, since the number of programmable segments is limited by hardware data, there is a limit to reducing the integral threshold of segment approximation errors. That is, the lowest limit of the integral threshold of the segment approximation error may be determined from the hardware data.
When additional segmentation is performed in the nonlinear sections w2, w3, and w4 described above, the approximation error can be further reduced. However, examples of the present disclosure are not limited to the use of substantially linear sections, and a substantially linear section may instead be segmented as a nonlinear section. That is, the step of determining the substantially linear section may not be performed in some cases.
As shown in fig. 5 to 8, by segmenting the activation function using the slope change data, the activation function conversion programming unit 3000 may determine a linear section from the activation function before approximating the activation function. When the activation function conversion programming unit 3000 segments the activation function using the slope change data, it may determine a nonlinear section from the activation function before approximating the activation function. When the activation function conversion programming unit 3000 segments the activation function using the slope change data, it may determine a substantially linear section from the activation function before approximating the activation function.
Segments belonging to a linear section or a substantially linear section may be approximated as programmable segments expressed in the form "(slope a) × (input value x) + (offset b)".

In this case, a segment in a linear or substantially linear section has the form of a linear or substantially linear function with a substantially constant slope. Thus, when the activation function is compared with a programmable segment expressed by a slope and an offset, the programmed segment has no approximation error, or the error can be minimized.
Thus, if the activation function is programmed with slope change data, the amount of calculation and power consumption of the linear section or the substantially linear section can be greatly reduced.
Thus, the activation function programmed with linear or substantially linear segments according to examples of the present disclosure is efficient and approximation errors are minimized, thus may provide improvements in the operating speed of DNNs processed in the NPU 1000, minimization of degradation of inference accuracy, and reduction of power consumption of the NPU 1000.
In various examples, step S210 may further include the steps of: the linear section of the activation function is determined based on slope change data of the activation function.
In various examples, step S210 may further include the steps of: the non-linear section of the activation function is determined based on slope change data of the activation function.
In various examples, step S210 may further include the steps of: a substantially linear section of the activation function is determined based on slope change data of the activation function.
In various examples, step S210 may further include the steps of: the linear and nonlinear sections of the activation function are determined based on slope change data of the activation function.
In various examples, step S210 may further include the steps of: the substantially linear and nonlinear sections of the activation function are determined based on slope change data of the activation function.
In various examples, step S210 may further include the steps of: the linear, substantially linear, and nonlinear sections of the activation function are determined based on differential data of the activation function.
However, examples of the present disclosure are not limited to differential data of the activation function, and various mathematical analyses capable of analyzing slope changes and linearity of the activation function may also be performed.
In various examples, the fragment data may include information of hardware on which the activation function is processed. In an activation function programming method according to an example of the present disclosure, hardware information may be used to segment an activation function. The hardware data may include at least one of a number of comparators of the PAFE unit 500 of the NPU 1000, a number of gates of a circuit for implementing the PAFE unit 500, and a type of an arithmetic circuit implemented (linear function circuit, quadratic function circuit, cubic function circuit, exponential circuit, logarithmic circuit, and anti-logarithmic circuit).
For example, the number of fragments used to segment the activation function may be limited based on the number of comparators of the PAFE unit 500 of the NPU 1000. Thus, the activation function may be segmented into the maximum number of segments that may be processed by the NPU 1000 to be processed or the number of segments corresponding to the allocated resources of the NPU 1000. Thus, the activation function conversion programming unit 3000 may program the activation function using predetermined hardware resources more efficiently or in a more customized manner.
In various examples, step S220 may further include the steps of: at least one of the plurality of segments is approximated as a programmable segment based on the gradient change point.

In various examples, step S220 may further include the steps of: at least one of the plurality of segments is approximated as a programmable segment based on the error value.
In this disclosure, the term "error value" or "approximate error value" refers to the difference between a particular segment of the activation function and the programmable segment that approximates that particular segment. The approximate error values may include average, minimum, maximum, and accumulated values. In other words, the activation function conversion programming unit 3000 may be configured to calculate an average error value, a minimum error value, a maximum error value, an accumulated error value, etc. between a specific segment and the approximated programmable segment. The accumulated error value may be a value obtained by integrating the error value between the particular segment and the approximated programmable segment.
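The following sketch illustrates the four error metrics named above for one sampled segment; the function name is an assumption, and the accumulated value is computed with a simple trapezoidal rule.

```python
import numpy as np

def error_values(x, fx, sx):
    """Error metrics between a segment of f(x) and its programmable segment S(x)."""
    err = np.abs(fx - sx)  # |f(x) - S(x)| at each sample
    accumulated = float(((err[1:] + err[:-1]) * np.diff(x) / 2.0).sum())  # trapezoid
    return {
        "average": float(err.mean()),
        "minimum": float(err.min()),
        "maximum": float(err.max()),
        "accumulated": accumulated,
    }

x = np.linspace(0.0, 1.0, 201)
fx = x ** 2                 # a nonlinear segment of the activation function
sx = 1.0 * x + 0.0          # candidate line through its start and end points
print(error_values(x, fx, sx))
```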
Regarding the error value, various activation functions may be divided into a plurality of characteristic sections including (substantially) linear sections and/or nonlinear sections, and if such characteristic sections are segmented into segments of the same width, the error values of the segments differ significantly. Thus, in an activation function programming method according to an example of the present disclosure, in order to reduce the approximation error, at least one feature of the characteristic sections may be taken into account when approximating a segment as a programmable segment.
In various examples, step S220 may further include the steps of: the error value is calculated by comparing the gradient and offset of the programmable segment with the corresponding segment of the activation function.
In various examples, step S220 may further include the steps of: a programmable parameter for converting at least one segment of the activation function into a programmable segment is determined. In other words, step S220 may further include the steps of: searching for the best programmable parameters for converting at least one segment of the activation function into a programmable segment. Here, when the programmable segment is a linear function, the programmable parameter may include a gradient and an offset corresponding to the linear function. Here, when the programmable segment is a quadratic function, the programmable parameter may include coefficients of quadratic terms corresponding to the quadratic function. The coefficients of the quadratic function may include quadratic coefficients, linear coefficients and constants. The approximate function of the programmable parameter may be determined in consideration of performance such as high-speed operation, low power consumption, and suppression of degradation of inference accuracy. For example, as the formulation of the approximation function becomes more complex, the calculation speed may decrease and the power consumption may increase. As the approximation error decreases, the degradation of the inference accuracy may be reduced.
In various examples, step S220 may further include the steps of: an error value between at least one segment of the activation function and at least one candidate segment having a (temporary) gradient and a (temporary) offset is calculated. As the number of candidate segments increases, the likelihood of searching for the best programmable parameter value increases, but the search time may increase.
In various examples, step S220 may include the steps of: parameters of at least one candidate segment are determined as programmable parameters of a programmable segment based on the calculated error value.
Thus, the activation function conversion programming unit 3000 may provide programming activation function data to the NPU 1000. Here, the program activation function data may include at least one program activation function. Here, the program-activated function data may include a programmable parameter corresponding to each programmable segment of the at least one program-activated function.
Hereinafter, a process of approximating at least one segment among the plurality of segments to a programmable segment based on the error value will be described in detail with reference to fig. 9 to 11.
In programming the activation function, a step may occur at the boundary between the programmable segments. In an activation function programming method according to an example of the present disclosure, approximation errors may be greatly reduced by generating predetermined steps between programmable segments or at the start and/or end of one programmable segment.
Thus, in the present disclosure, the error value may be significantly reduced by allowing a step between programmable segments in segmenting the activation function into a plurality of segments using segment data and approximating at least one segment among the plurality of segments to a programmable segment based on the error value.
Referring to FIG. 9, a plurality of candidate segments Sc1, Sc2, and Sc3 for a segment s of a nonlinear activation function are shown.
In examples of the present disclosure, the term "candidate segment" refers to a function that can be changed to a programmable segment expressed by a "programmable parameter" using an activation function programming method.
For example, when a programmable segment is expressed as a linear function, the programmable segment may be expressed as "(gradient a) × (input value x) + (offset b)". Here, the programmable parameters include a gradient a and an offset b.
For example, when the programmable segment is expressed as a quadratic function, the programmable segment may be expressed as "(quadratic coefficient a) × (input value x)² + (linear coefficient b) × (input value x) + (constant c)". Here, the programmable parameters include the quadratic coefficient a, the linear coefficient b, and the constant c.
Thus, the programmable parameters may be configured to have a form capable of expressing both first and second order functions. However, the present disclosure is not limited to the format of the programmable parameters.
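One possible layout of such programmable parameters, able to express both first- and second-order segments along with segment boundary values, is sketched below; the class and field names are assumptions, not the patent's parameter format.

```python
from dataclasses import dataclass

@dataclass
class ProgrammableSegment:
    x_start: float   # segment boundary values
    x_end: float
    a: float = 0.0   # quadratic coefficient (zero for a purely linear segment)
    b: float = 0.0   # gradient / linear coefficient
    c: float = 0.0   # offset / constant

    def __call__(self, x: float) -> float:
        # evaluates a*x^2 + b*x + c; with a == 0 this reduces to b*x + c
        return (self.a * x + self.b) * x + self.c

# a linear segment with gradient 0.1 and offset 0 on [-6, 0)
seg = ProgrammableSegment(x_start=-6.0, x_end=0.0, b=0.1)
print(seg(-2.0))  # -0.2
```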
Hereinafter, a linear function will be described as an example. The candidate segments may be in the form of a linear function corresponding to a programmable segment segmented using segment data. Candidate segments for a segment may be determined by a linear function that passes through the start and end points of a segment.
For example, a candidate segment of a segment may be a linear function having an adjusted offset while having the same gradient as a linear function passing through the start and end points of the segment.
For example, a candidate segment of a segment may be a linear function having an adjusted offset while having a different gradient than a linear function passing through the start and end points of one segment.
For example, a candidate segment for a segment may be determined to be one of the tangents to the segment.
In fig. 9, in order to briefly describe the process of determining a programmable segment among a plurality of candidate segments, three candidate segments are shown. The first candidate segment Sc1 is a linear function passing through the start and end points of the segment s. The second candidate segment Sc2 and the third candidate segment Sc3 are linear functions whose offsets are adjusted while sharing the gradient of the first candidate segment Sc1, and the third candidate segment Sc3 has an offset such that Sc3 is tangential to the segment s. The candidate segments shown in fig. 9 serve only to briefly describe segments that may be approximated as programmable segments, and the gradients and/or offsets of actual candidate segments may be adjusted in various ways to reduce the error value.
In various examples, at least one segment among the plurality of segments may be approximated as a programmable segment by searching for the error value Δy. At this time, the activation function conversion programming unit 3000 may determine the width of each of the plurality of segments as a uniform width. Subsequently, the activation function conversion programming unit 3000 may approximate at least one segment among the plurality of segments to a programmable segment by searching for an error value Δy of the at least one segment. However, the present disclosure is not limited thereto.
Fig. 10 illustrates an example of approximating a segment to a programmable segment by searching for a maximum error value max (Δy) that is the maximum value among error values Δy in an activation function programming method according to an example of the present disclosure.
Fig. 10 (a) shows the segments s1 and s2 segmenting the activation function f(x), a first candidate segment Sc1(x) corresponding to the first segment s1, and a second candidate segment Sc2(x) corresponding to the second segment s2. In fig. 10 (a), for each of the candidate segments Sc1(x) and Sc2(x), the best programmable parameters (i.e., gradient and offset) expressing the linear function passing through the start and end points of each of the segments s1 and s2 are searched.

As in the example shown in fig. 10 (a), the activation function conversion programming unit 3000 calculates the error value Δy between the second segment s2 and the second candidate segment Sc2(x), that is, the absolute value |f(x) − Sc2(x)|. The activation function conversion programming unit 3000 may calculate the maximum error value max(Δy), which is the maximum among the error values Δy. In order to reduce the maximum error value max(Δy) of the second segment s2, as shown in fig. 10 (b), the candidate segment obtained by shifting Sc2(x) in the y-axis direction by max(Δy)/2, that is, by adjusting the offset by half of the maximum error value, may be determined as the second programmable segment Sp2(x) that approximates the second segment s2.
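A minimal sketch of the fig. 10 procedure under stated assumptions: the chord through the segment's start and end points is taken as the candidate, and its offset is then shifted by max(Δy)/2 so that the worst-case error is halved. Function and variable names are illustrative.

```python
import numpy as np

def approximate_max_error(x, fx):
    """Approximate one sampled segment with a line, halving the maximum error."""
    a = (fx[-1] - fx[0]) / (x[-1] - x[0])  # gradient of the chord Sc2(x)
    b = fx[0] - a * x[0]                   # its offset
    dy = fx - (a * x + b)                  # signed error f(x) - Sc2(x)
    # fig. 10 (b): shift the offset by half the maximum error so that the
    # positive and negative deviations are balanced
    b += (dy.max() + dy.min()) / 2.0
    max_err = float(np.abs(fx - (a * x + b)).max())
    return a, b, max_err

x = np.linspace(0.0, 1.0, 201)
print(approximate_max_error(x, x ** 2))  # gradient 1.0, offset -0.125, max error 0.125
```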
As shown in fig. 10 (b), when the first programmable segment Sp1(x) obtained by approximating the first segment s1 is drawn together, a step may occur between the first programmable segment Sp1(x) and the second programmable segment Sp2(x).

In fig. 10 (b), such a step at the boundary of adjacent programmable segments on the y-axis may be intentionally introduced during the process of approximating the second segment s2 of the activation function f(x) to a programmable segment based on the error value |f(x) − Sc2(x)|. That is, when a specific programmable segment is approximated so as to reduce the maximum error value within that segment, a step may be generated at the boundary between adjacent programmable segments.
In other words, each programmable segment can be approximated independently of the other.
In other words, as the approximation error value of the PAF increases, the degradation of the inference accuracy of the NPU 1000 using the PAF may increase. Conversely, as the approximation error value of the PAF decreases, the degradation of the inference accuracy of the NPU 1000 using the PAF may decrease.
In various examples, at least one segment among the plurality of segments may be approximated as a programmable segment using the integral value of the error, ∫[Sc(x) − f(x)]dx. The activation function conversion programming unit 3000 may be configured to integrate or accumulate the approximate error value of each segment.
In more detail, the first programmable segment Sp1(x) and the second programmable segment Sp2(x) may be programmed in different ways. That is, each programmable segment may be programmed by selecting a method such as a linear function, a quadratic function, a logarithmic function, an exponential function, or the like, respectively. Thus, each programmable segment may be programmed with the same function or with a different function.
Fig. 11 shows an example of approximating one segment to a programmable segment using the integral value of the error, ∫[Sc(x) − f(x)]dx, in an activation function programming method according to an example of the present disclosure.
Fig. 11 (a) shows the segments s1 and s2 segmenting the activation function f(x), a first candidate segment Sc1(x) corresponding to the first segment s1, and a second candidate segment Sc2(x) corresponding to the second segment s2. In fig. 11 (a), for the candidate segments Sc1(x) and Sc2(x), the best programmable parameters (i.e., gradient and offset) expressing a linear function through the start and end points of each of the segments s1 and s2 are searched. In practice, the offset of the second candidate segment Sc2(x) may be adjusted while keeping the same gradient as the linear function passing through the start and end points of the second segment s2. Alternatively, the offset may be adjusted while having a gradient different from that of the linear function passing through the start and end points of the second segment s2.
Referring to fig. 10 and 11, the first segment s1 includes a start point x0 and an end point x1. Here, the start point x0 and the end point x1 may refer to segment boundary values.
Referring to fig. 10 and 11, the second segment s2 includes a start point x1 and an end point x2. Here, the start point x1 and the end point x2 may refer to segment boundary values.
For example, the first segment s1 may be set from the start point x0 to less than the end point x1.
For example, the second segment s2 may be set from the start point x1 to less than the end point x2.
The programmable parameters may be configured to include segment boundary values.
As shown in fig. 11 (a), the activation function conversion programming unit 3000 calculates the integral value ∫[Sc2(x) − f(x)]dx between the second segment s2 and the candidate segment Sc2(x) as an approximate error value, and searches among the candidate segments for the candidate whose integral value has the smallest absolute value. As shown in fig. 11 (b), in order to reduce the error value, the candidate segment having the minimum absolute value of the integral, that is, min|∫[Sc2(x) − f(x)]dx|, may be determined as the second programmable segment Sp2(x).
As shown in fig. 11 (b), when the first programmable segment Sp1(x) approximating the first segment s1 is drawn together, a predetermined step may occur on the y-axis between the first programmable segment Sp1(x) and the second programmable segment Sp2(x). In fig. 11 (b), such a step may occur in the process of approximating the second segment s2 of the activation function f(x) to the second programmable segment Sp2(x) based on the approximate error value. Even if a step exists, however, if the approximate error value of each programmable segment is minimized, the degradation of the inference accuracy of the NPU 1000 using the PAF can be reduced.
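A minimal sketch of the fig. 11 criterion under stated assumptions: with the chord's gradient fixed, the offset is chosen so that the signed integral of the error over the segment is (approximately) zero, which minimizes its absolute value. Uniformly spaced samples are assumed, and all names are illustrative.

```python
import numpy as np

def approximate_integral_error(x, fx):
    """Choose the offset whose signed integral of error is (nearly) zero."""
    a = (fx[-1] - fx[0]) / (x[-1] - x[0])  # gradient through the end points
    b = float(np.mean(fx - a * x))          # offset zeroing the mean error
    residual = (a * x + b) - fx             # Sp2(x) - f(x)
    signed = float(((residual[1:] + residual[:-1]) * np.diff(x) / 2.0).sum())
    return a, b, signed

x = np.linspace(0.0, 1.0, 201)
print(approximate_integral_error(x, x ** 2))  # gradient 1.0, offset ~ -1/6, integral ~ 0
```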
In various examples, step S220 may further include the steps of: the minimum approximate error value between the programmable segment and the corresponding segment of the activation function is searched. The approximate error value may be at least one of an average error value, a minimum error value, a maximum error value, and an accumulated error value.
For example, step S220 may further include the steps of: at least one minimum error value between at least one programmable segment and a corresponding segment of at least one activation function is searched.
For example, step S220 may further include the steps of: the slope and offset of the programmable segment are determined based on the at least one minimum error value searched.
For example, step S220 may include the steps of: at least one segment is approximated as a programmable segment based on the determined gradient and offset.
In various examples, step S220 may further include the steps of: a programmable segment is determined through machine learning using a loss function.
FIG. 12 illustrates an example of approximating a segment to an optimally programmable segment using machine learning in an activation function programming method in accordance with examples of the present disclosure.
Referring to fig. 12, the activation function conversion programming unit 3000 may set a candidate segment Sc(x) of the activation function f(x) as the initial value of the loss function. The activation function conversion programming unit 3000 may determine, through machine learning, the candidate segment having the minimum value of the loss function as the best programmable segment Sop(x). Thus, optimized programmable parameters can be explored.
For an optimized parameter search, learning may be performed repeatedly. One learning iteration may represent one epoch. As the number of learning iterations increases, the error value may decrease. Too few training epochs may result in underfitting, and too many may result in overfitting.
As the loss function, a Mean Square Error (MSE), a Root Mean Square Error (RMSE), or the like may be used, but the present disclosure is not limited thereto. In the present disclosure, the candidate segment used as the initial value of the loss function may be, for example, a linear, quadratic, or cubic function approximating the section corresponding to the segment obtained using the segment data. However, examples according to the present disclosure are not limited to the above functions. That is, the loss function may be used after the activation function f(x) is segmented into a plurality of segments using the segment data.
Thus, machine learning using the loss function may be performed after taking into account the characteristics of the activation function, such as its plurality of characteristic sections including (substantially) linear sections and/or nonlinear sections, and the approximation error. Accordingly, the amount of computation and the search time of the optimized programmable parameter search can be reduced, and the degradation of the inference accuracy of the NPU 1000 due to the use of the PAF can be minimized.
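A minimal machine-learning sketch of the parameter search described above, assuming a linear programmable segment: the candidate (the chord) serves as the initial value, and the gradient and offset are updated by gradient descent on an MSE loss, one update per epoch. The learning rate and epoch count are illustrative assumptions.

```python
import numpy as np

def fit_segment_mse(x, fx, epochs=500, lr=0.5):
    """Gradient-descent fit of a linear programmable segment to f(x)."""
    a = (fx[-1] - fx[0]) / (x[-1] - x[0])  # initial candidate: the chord
    b = fx[0] - a * x[0]
    for _ in range(epochs):                # one update per epoch
        err = (a * x + b) - fx             # residual of the current candidate
        grad_a = 2.0 * float(np.mean(err * x))  # dMSE/da
        grad_b = 2.0 * float(np.mean(err))      # dMSE/db
        a -= lr * grad_a
        b -= lr * grad_b
    mse = float(np.mean(((a * x + b) - fx) ** 2))
    return a, b, mse

x = np.linspace(0.0, 1.0, 201)
print(fit_segment_mse(x, x ** 2))  # converges toward gradient 1.0, offset -1/6
```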
Further, according to the examples of the present disclosure, an effect of reducing the number of unnecessary fragments can be provided. That is, according to examples of the present disclosure, the number of fragments may also be minimized. In other words, if the sum of the approximate error values of two adjacent programmable segments is less than the preset threshold, the two programmable segments may be integrated into one programmable segment.
In various examples, step S210 may further include the steps of: the activation function is segmented into a plurality of segments using an integral (accumulated value) of the second derivative of the activation function. Here, the accumulated value of the second derivative may be used as the fragment data.
For example, step S210 may further include the steps of: an accumulated value of the second derivative of the activation function is calculated.
For example, step S210 may further include the steps of: the activation function is segmented into a plurality of segments based on an integral threshold of the segment approximation error (i.e., a threshold of the accumulated second derivative).
Furthermore, the activation function programming method according to the present disclosure may include the following steps: when the number of segments determined by segmenting the activation function using the accumulated value of the second derivative is greater or less than a target number, adjusting the threshold of the accumulated value of the second derivative, and re-segmenting the activation function into a different number of segments based on the adjusted threshold. Specifically, the threshold may be adjusted such that: (1) the threshold is increased when the determined number of segments is greater than the target number, and (2) the threshold is decreased when the determined number of segments is less than the target number.
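The re-segmentation rule can be sketched as follows; the helper that cuts a segment each time the accumulated |f″(x)| reaches the threshold, and the multiplicative adjustment factor, are illustrative assumptions rather than the patent's procedure.

```python
import numpy as np

def segment_by_accumulated_d2(x, fx, e_th):
    """Cut a new segment each time the accumulated |f''(x)| reaches e_th."""
    d2 = np.abs(np.gradient(np.gradient(fx, x), x))
    acc, boundaries = 0.0, [x[0]]
    for i in range(1, len(x)):
        acc += 0.5 * (d2[i] + d2[i - 1]) * (x[i] - x[i - 1])  # trapezoid step
        if acc >= e_th:
            boundaries.append(x[i])
            acc = 0.0
    if boundaries[-1] != x[-1]:
        boundaries.append(x[-1])
    return boundaries

def resegment_to_target(x, fx, target, e_th=0.1, factor=1.1, max_iter=100):
    """Raise e_th when too many segments result, lower it when too few."""
    for _ in range(max_iter):
        n = len(segment_by_accumulated_d2(x, fx, e_th)) - 1
        if n == target:
            break
        e_th = e_th * factor if n > target else e_th / factor
    return e_th, segment_by_accumulated_d2(x, fx, e_th)

x = np.linspace(-6.0, 6.0, 2401)
fx = 1.0 / (1.0 + np.exp(-x))  # sigmoid
e_th, bounds = resegment_to_target(x, fx, target=6)
print(e_th, np.round(bounds, 2))
```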
In various examples, the activation function conversion programming unit 3000 may segment the activation function into a plurality of segments based on a threshold value of the accumulated value of the second derivative. In this case, the activation function conversion programming unit 3000 may segment all sections of the activation function based on the threshold value of the accumulated value of the second derivative, or segment a part of the sections of the activation function based on the threshold value of the accumulated value of the second derivative. Specifically, the activation function conversion programming unit 3000 may determine some sections of the activation function as non-linear sections instead of (substantially) linear sections, and may segment only a partial section that is the non-linear section based on a threshold value of the accumulated value of the second derivative. The activation function conversion programming unit 3000 may segment the remaining sections that are not nonlinear sections by the activation function programming method described in various examples of the present disclosure.
Fig. 13 illustrates an example of segmenting an activation function using an integral threshold of a segment approximation error of the activation function in an activation function programming method according to an example of the present disclosure.
Referring to fig. 13, the activation function f(x) may be segmented using the accumulated value of the second derivative of the activation function f(x), that is, ∫f"(x)dx. The point of the x-axis minimum (min) of the activation function f(x) may be determined as the starting point, or the point of the x-axis maximum (max) may be determined as the starting point. However, the present disclosure is not limited thereto, and the starting point may also be a specific point.
For example, the PAF may be programmed to include a plurality of segment boundary values x1, x2, x3, x4, and x5.
The PAF may be programmed to also include, for example, a minimum value (min) and a maximum value (max). The minimum value (min) and the maximum value (max) may be utilized in implementing clipping (clipping) for improving the programming efficiency of an activation function according to examples of the present disclosure. A value less than or equal to the minimum value may be output as the minimum value. A value greater than or equal to the maximum value may be output as the maximum value.
Starting from the starting point, the activation function f(x) is segmented at each point where the accumulated value of the second derivative of f(x) reaches the threshold ETh (i.e., the integral threshold of the segment approximation error).
For example, the activation function conversion programming unit 3000 may determine the width w1 when the accumulated value of the second derivative first reaches ETh, the width w2 when it reaches ETh again, and likewise the widths w3, w4, w5, and w6 at each subsequent point where the accumulated value reaches ETh. In detail, a different ETh value may also be set for each segment. That is, a plurality of ETh values, e.g., ETh1 and ETh2, may be set according to circumstances.
Furthermore, the programmable activation function used in the artificial neural network operation may be configured to process only a limited range of input values. For example, the minimum value (min) of the x-axis, which is the input value of the programmable activation function, may be negative six, and the maximum value (max) may be six. According to the above configuration, there is an effect that the data size of the program-activated function can be reduced. However, the present disclosure is not limited thereto.
Referring to fig. 13, since the accumulated value of the second derivative of the activation function is the rate of change of the slope of the activation function, it is determined such that: (1) In the activation function f (x), the widths w2, w3, and w4 of the segments corresponding to the sections having a relatively large gradient change rate are determined to be relatively narrow, and (2) in the activation function f (x), the widths w1 and w6 of the segments including the portion as a linear function of the change rate without a slope are determined to be relatively wide.
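To illustrate how a programmed activation function might be evaluated from its segment boundary values and per-segment parameters, including the min/max clipping described above, the following hedged sketch is provided; the data layout and names are assumptions, not the PAFE unit's actual format.

```python
import numpy as np

def paf(x, boundaries, params, x_min=-6.0, x_max=6.0):
    """boundaries: ascending segment boundary values (e.g. x1..x5);
    params: one (a, b, c) tuple per segment, len(params) == len(boundaries) + 1."""
    x = np.clip(x, x_min, x_max)          # clipping: inputs outside [min, max] saturate
    idx = np.searchsorted(boundaries, x)  # select a segment (comparators in hardware)
    a, b, c = (np.asarray(p)[idx] for p in zip(*params))
    return (a * x + b) * x + c            # per-segment a*x^2 + b*x + c

# Two-segment toy PAF realizing a ReLU: 0 for x < 0, x for x >= 0
y = paf(np.array([-2.0, 3.0, 10.0]), [0.0], [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
print(y)  # [0. 3. 6.] -- the last input is clipped to x_max = 6
```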
Fig. 14 and 15 show the ELU activation function and the hardswick activation function, respectively.
When x > 0, the ELU activation function is f(x) = x; when x ≤ 0, the ELU activation function is f(x) = α(e^x − 1) (where α is a hyperparameter).
As shown in fig. 14, the ELU activation function has a linear section when the value of x is zero or greater, and a nonlinear section when the value of x is less than zero. That is, the ELU activation function has a characteristic of being divided into a linear section and a nonlinear section.
The Hardswish activation function is f(x) = 0 for x ≤ −3, f(x) = x for x ≥ +3, and f(x) = x(x + 3)/6 for −3 < x < +3.
As shown in fig. 15, the Hardswish activation function has a linear section when the value of x is less than minus three or greater than plus three, and a nonlinear section otherwise. That is, the Hardswish activation function has a characteristic of being divided into a linear section and a nonlinear section.
However, the present disclosure is not limited to the ELU activation function and the Hardswish activation function, and there are various activation functions having characteristics divided into a linear section and a nonlinear section.
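For reference, the two definitions above can be restated directly as piecewise code; this is a plain transcription of the formulas, with α as the ELU hyper-parameter:

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    # Linear for x > 0; nonlinear (exponential) for x <= 0.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def hardswish(x: float) -> float:
    # Linear for x <= -3 (constant zero) and x >= +3 (identity); nonlinear in between.
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return x
    return x * (x + 3.0) / 6.0
```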
In particular, in the field of artificial neural networks, various custom activation functions have been proposed that combine various linear and nonlinear functions to improve the accuracy of the artificial neural network. In this case, the activation function programming method according to the example of the present disclosure may be more effective.
In the activation function programming method according to the present disclosure, the activation function conversion programming unit 3000 may distinguish between a linear section and a nonlinear section of an activation function, and further between a substantially linear section and a nonlinear section, so that the activation function may be selectively segmented into a plurality of segments. Thus, the activation function programming method according to the present disclosure is efficient and minimizes approximation error, particularly when programming an approximation of an activation function having a (substantially) linear section and a nonlinear section, and thus can provide an improvement in the operation speed of an artificial neural network model processed in the NPU 1000, minimal degradation of inference accuracy, and a reduction in the power consumption of the NPU 1000. In the activation function programming method according to the present disclosure, the activation function conversion programming unit 3000 may generate programmable parameters of at least one segment. The NPU 1000 may receive this information and process at least one programmed activation function based on it.
Coordinates of the start and end points of each of the plurality of segments may be defined as segment boundary values. That is, each segment may be expressed by its segment boundary values. Accordingly, in the activation function programming method according to the present disclosure, the programmable parameters may include segment boundary values. In various examples, an activation function programming method according to the present disclosure may further include the step of approximating at least one segment among the plurality of segments using a predetermined look-up table, a nonlinear approximation equation, or the like.
In the activation function programming method according to the present disclosure, the activation function is segmented into a plurality of segments using the segment data, and since the segmented plurality of segments can be selectively approximated as programmable segments, there may be a section determined not to be approximated as a PAF. If a look-up table, nonlinear approximation equation, or the like for this section is available in hardware in a predetermined manner, the section may be approximated using the predetermined and stored look-up table, nonlinear approximation equation, or the like.
In various examples, an activation function programming method according to the present disclosure may further include the step of determining that at least one segment among the plurality of segments is not to be approximated as a programmable segment. For example, a segment having a very complex shape or a segment having low importance in the DNN may be determined not to be approximated as a programmable segment. Such segments may be processed in another predetermined manner, or, if their number is large, they may be combined and processed in another predetermined manner.
In various examples, an activation function programming method according to the present disclosure may process the programming method of each segment in a separate manner.
An activation function programming method according to an example of the present disclosure may include the steps of selecting an activation function for the artificial neural network operation and converting it into a programmable activation function. Referring to fig. 13, as an example, the programmed activation function may include a plurality of segments each having a specific width, and the specific width may be determined based on a specific threshold, that is, a segment boundary may be placed wherever the accumulated value of the second derivative of the selected activation function reaches the threshold.
An apparatus including a programmable activation function generator according to another example of the present disclosure may be provided. The activation function conversion program may be configured to generate fragment data for segmenting the activation function, segment the activation function into a plurality of fragments using the generated fragment data, and convert at least one fragment among the plurality of fragments into a programmable fragment.
At least one of the plurality of segments may have a different width than the other segments.
The activation function conversion program may be configured to determine a number and a width of the plurality of fragments based on the fragment data, and segment the activation function into the plurality of fragments based on the determined number and width.
The segment data may include slope change data (e.g., derivative data) of the activation function.
The fragment data may include information of hardware capable of handling the activation function. The activation function conversion program may be configured to receive hardware information.
The activation function conversion program may be configured to determine a substantially linear section and a non-linear section of the activation function based on slope change data of the activation function, and segment the activation function into a plurality of segments according to the determined substantially linear section and non-linear section.
The activation function conversion program searches for programmable parameters for approximating at least one segment to a programmable segment. The activation function conversion program may be configured to approximate at least one segment to a programmable segment according to the searched optimal programmable parameters.
The apparatus may further include a PAFE unit, and the PAFE unit may be configured to approximate the at least one segment using a predetermined nonlinear approximation equation.
Hereinafter, an NPU configured to process an activation function programmed by an activation function programming method according to an example of the present disclosure will be described in detail.
For convenience of description, an NPU of an apparatus for performing an activation function programming method according to an example of the present disclosure will be described with reference to fig. 1.
Fig. 16 illustrates a PAFE cell configured to handle a programming activation function in accordance with an example of the present disclosure.
The PAFE unit 500 according to one example of the present disclosure is an example of circuitry configured to process an activation function programmed as a linear function. The activation function may be programmed by one of the various programming examples of the present disclosure described above. Hereinafter, the programmable activation function execution unit 500 is referred to as the PAFE unit 500. The activation function conversion programming unit 3000 may be configured to determine the type of programmable parameter based on the provided hardware information. For example, when the PAFE unit 500 includes only a linear function calculation circuit, the activation function conversion programming unit 3000 may operate such that all programmable segments become linear functions. For example, when the PAFE unit 500 includes a linear function calculation circuit and a quadratic function calculation circuit, the activation function conversion programming unit 3000 may operate such that each programmable segment becomes either a linear function or a quadratic function.
Memory 300 may include a segment register 310, a first register 320, and a second register 330. For example, the at least one register may be implemented by setting an address or register map of the at least one memory. For example, the at least one register may be implemented by allocating a dedicated memory or at least one dedicated register. That is, memory 300 of PAFE unit 500 may be configured to store programming activation function data.
The fragment register 310 stores information about sections of a plurality of fragments.
Specifically, coordinates of start points and end points of x-axes of sections of a plurality of fragments determined by one of the methods proposed by the activation function conversion programming unit 3000 may be stored in the fragment register 310. Coordinates of start and end points of segments of the plurality of segments may be defined as segment boundary values (SB). That is, the sections of the plurality of segments may be determined by segment boundary values SB0 through SB (N-2).
For example, to define a segment of N segments, N-1 segment boundary values SB0 through SB (N-2) may be required.
For example, a section from minus infinity (−∞) to the first segment boundary value SB0 may be defined based on the coordinates of the x-axis using the first segment boundary value SB0. Furthermore, a section from the last segment boundary value SB(N-2) to plus infinity (+∞) can be defined based on x-axis coordinates using the last segment boundary value SB(N-2). However, the present disclosure is not limited thereto, and the infinite range may also be appropriately limited by setting a maximum value and a minimum value.
Then, the sections of the segments existing between the first segment boundary value SB0 and the last segment boundary value SB(N-2) can be defined by using the intermediate segment boundary values (SB1, SB2, and so on). To this end, the segment register 310 provides the plurality of segment boundary values SB0 through SB(N-2) to the PAFE unit 500. Thus, the PAFE unit 500 may obtain information regarding the sections of the plurality of segments.
PAFE unit 500 may be configured to receive data from segment registers 310.
That is, a segment of a program activation function may be set in PAFE unit 500.
In the case of a first order polynomial, the first register 320 may be configured to store gradients A0 through A (N-1) of the plurality of programmable segments.
For example, in the case of a first order polynomial, the first register 320 may be used as a gradient register.
In other words, the first register 320 may be configured to store a specific value such as a gradient according to a programming method.
For a first order polynomial, the second register 330 may be configured to store offsets B0 through B (N-1) for a plurality of programmable segments.
For example, in the case of a first order polynomial, the second register 330 may be used as an offset register.
In other words, the second register 330 may be configured to store a specific value such as an offset according to a programming method.
Specifically, a section of N segments may be approximated as N programmable segments by the activation function conversion programming unit 3000. Furthermore, each programmable segment includes a specific gradient a and a specific offset B value. That is, a specific register of the memory 300 may selectively store a specific value.
In other words, in an example approximated by a linear function, in the section from the minimum value to the first segment boundary value SB0, the gradient of the programmable segment may be expressed as the first gradient A0, and the offset of the programmable segment as the first offset B0. Here, the minimum value Min may be minus infinity (−∞).
In the section between the last segment boundary value SB(N-2) and the maximum value, the gradient of the programmable segment may be expressed as the last gradient A(N-1), and the offset of the programmable segment as the last offset B(N-1). Here, the maximum value Max may be plus infinity (+∞).
Thus, the first register 320 may store gradients A0 through A (N-1) for each of the N programmable segments. Further, the second register 330 may store offsets B0 to B (N-1) for each of the N programmable segments.
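As a behavioral sketch only (the class and field names are illustrative, not the actual register map of the hardware), the contents of the segment register, first register, and second register for N linear programmable segments can be modeled as:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LinearPAF:
    """Programmed activation function data: N segments, each approximated as A*x + B."""
    boundaries: List[float]  # segment register: SB0 .. SB(N-2), N-1 values
    gradients: List[float]   # first register:   A0  .. A(N-1), N values
    offsets: List[float]     # second register:  B0  .. B(N-1), N values

    def __post_init__(self):
        # N segments require N-1 segment boundary values.
        assert len(self.gradients) == len(self.offsets) == len(self.boundaries) + 1
```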
The activation function conversion programming unit 3000 may be configured to provide programming activation function data to be processed by the NPU to the memory 300.
< Table 1>
Referring to <Table 1>, data for driving the programmed activation function may be configured to be generated in the activation function conversion programming unit 3000 and stored in the memory 300, for example, in the segment register 310, the first register 320, and the second register 330 of the NPU.
For example, segment register 310 may be configured to store segment boundary value SB of < table 1 >.
For example, the first register 320 may be configured to store the gradient a of < table 1 >. Gradient a may be referred to as a coefficient of the linear term.
For example, the second register 330 may be configured to store an offset B of < table 1 >. Offset B may be referred to as a bias.
Controller 100 and/or DMA 200 may instruct memory 300 to store data for the program activation function of < table 1 >. However, examples of the present disclosure are not limited thereto, and the data of the program activation function may be configured to be stored in at least one of a register inside the controller 100, a register inside the PAFE unit 500, a separate memory, and a separate register. That is, the storage location of the data programming the activation function is not limited to a particular location.
Referring to < table 1>, an example of programming activation function data is disclosed.
For example, the program activation function data may be configured to include a segment boundary value SB.
For example, the program activation function data may be configured to include a section of each segment S.
For example, the programming activation function data may include a gradient A for each segment S.
For example, the program activation function data may include an offset B for each segment S.
Further, first register 320 may output gradients A0 through a (N-1) of each of the N programmable segments to PAFE unit 500 under the control of controller 100. Further, second register 330 may output offsets B0 through B (N-1) for each of the N programmable segments to PAFE unit 500 under the control of controller 100.
Thus, PAFE unit 500 may receive gradients A0 through A (N-1) and offsets B0 through B (N-1) for each of the programmable segments. That is, PAFE unit 500 may receive information regarding a plurality of programmable segments via first register 320 and second register 330.
< Table 2>
Referring to <Table 2>, data for driving the programmed ReLU may be configured to be generated in the activation function conversion programming unit 3000 and stored in the memory 300, for example, in the segment register 310, the first register 320, and the second register 330 of the NPU.
For example, segment register 310 may be configured to store a segment boundary value SB of < table 2 >.
For example, the first register 320 may be configured to store the gradient a of < table 2 >.
For example, the second register 330 may be configured to store an offset B of < table 2 >.
In the case of a programmed ReLU, it can be programmed to have only one segment boundary value SB. As described above, the determination as having only one segment boundary value SB may be performed by an approximation method according to various examples of the present disclosure.
In the case of a programmed ReLU, only one comparator may be required for operation of the PAFE unit 500 since only the first segment boundary value SB0 is programmed. Thus, unnecessary comparators may be disabled.
With the comparator enable (En) signal of < table 2> input to the PAFE unit 500, unnecessary comparator power consumption may be reduced.
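Using the sketch above, the programmed ReLU reduces to a single boundary; the concrete values below are an illustration consistent with the text (one enabled comparator), not the literal contents of <Table 2> or <Table 4>:

```python
# Programmed ReLU: one segment boundary SB0 = 0; y = 0 for x <= 0, y = x for x > 0.
relu_paf = LinearPAF(boundaries=[0.0], gradients=[0.0, 1.0], offsets=[0.0, 0.0])

# Programmed ReLU6 (cf. <Table 4>): two boundaries; the last segment is the constant y = 6.
relu6_paf = LinearPAF(boundaries=[0.0, 6.0], gradients=[0.0, 1.0, 0.0],
                      offsets=[0.0, 0.0, 6.0])
```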
< Table 3>
Referring to <Table 3>, data for driving the programmed ReLU with clipping applied may be configured to be generated in the activation function conversion programming unit 3000 and stored in the memory 300, for example, in the segment register 310, the first register 320, and the second register 330 of the NPU.
For example, segment register 310 may be configured to store a segment boundary value SB of < table 3 >.
For example, the first register 320 may be configured to store the gradient a of < table 3 >.
For example, the second register 330 may be configured to store an offset B of < table 3 >. When clipping is applied, the minimum and maximum values of the input values of the activation function may be limited.
Further, both the data for driving the programmed ReLU of <Table 2> and the data for driving the programmed ReLU with clipping of <Table 3> may be stored in the NPU 1000 for use by the PAFE unit 500. Further, the activation function conversion programming unit 3000 may be configured to provide the NPU 1000 with both the data for driving the programmed ReLU and the data for driving the programmed ReLU with clipping.
NPU 1000 may be configured to selectively input a plurality of programming activation functions stored in NPU 1000 to PAFE unit 500 based on compiled DNN information.
For example, NPU 1000 may use the programming activation function data of < table 2> for a first artificial neural network operation and may control PAFE unit 500 to use the programming activation function data of < table 3> for a second artificial neural network operation.
< Table 4>
Referring to <Table 4>, data for driving the programmed ReLU6 may be generated in the activation function conversion programming unit 3000 and stored in the memory 300, for example, in the segment register 310, the first register 320, and the second register 330 of the NPU.
For example, segment register 310 may be configured to store segment boundary value SB of < table 4 >.
For example, the first register 320 may be configured to store the gradient A of <Table 4>.
For example, the second register 330 may be configured to store an offset B of < table 4 >.
In the case of a programmed ReLU6, it can be programmed to have two segment boundary values SB. As described above, the determination as having the two segment boundary values SB may be performed by the approximation method according to various examples of the present disclosure.
In addition, the data for driving the programmed ReLU of <Table 2>, the data for driving the programmed ReLU with clipping of <Table 3>, and the data for driving the programmed ReLU6 of <Table 4> may all be stored in the NPU 1000 for use by the PAFE unit 500. In addition, the activation function conversion programming unit 3000 may be configured to provide the NPU 1000 with all of the data for driving the programmed ReLU, the programmed ReLU with clipping, and the programmed ReLU6.
The NPU 1000 may be configured to selectively input a plurality of programming activation functions stored in the NPU 1000 according to compiled DNN information.
For example, NPU 1000 may control PAFE unit 500 to use data from a programming activation function of < table 2> for a first artificial neural network operation, from < table 3> for a subsequent second artificial neural network operation, and from < table 4> for a subsequent third artificial neural network operation. In the case of programmed ReLU6, only the first segment boundary value SB0 and the second segment boundary value SB1 are programmed and only two comparators may be required for operation of the PAFE unit 500. Thus, unnecessary comparators may be disabled.
In summary, the NPU 1000 may store a plurality of programming activation functions. NPU 1000 may selectively input data for a particular activation function in PAFE unit 500 to process a particular artificial neural network operation. Furthermore, PAFE unit 500 may input data from the programmed activation functions in real time without requiring hardware changes to handle artificial neural network operations.
Fig. 17 illustrates a PAFE unit of an NPU of a device configured to handle programming an activation function in accordance with an example of the present disclosure.
An exemplary PAFE unit 500 configured to process an activation function programmed as a linear function may be configured to include a plurality of comparators (comparator 0 through comparator (N-2), i.e., 510 through 51(N-2)), a selector 520, a multiplier 530, and an adder 540. However, examples of the present disclosure are not limited thereto, and the section of each segment may be distinguished by configuring the circuit in various ways. Furthermore, the PAFE unit 500 may be modified to include additional circuit configurations to process activation functions using programming methods other than linear functions.
In examples of the present disclosure, because the PAFE unit 500 is configured to process a linear function, it may be configured to process the linear function through the inputs of the segment register 310, the first register 320, and the second register 330. However, the PAFE unit 500 may be modified to also include additional registers to handle various approximation functions.
Each of the plurality of comparators 510 to 51 (N-2) compares the input value X calculated in the at least one processing element 400 with each of the plurality of segment boundary values SB0 to SB (N-2), respectively.
For example, if the input value X is greater than each of the segment boundary values SB0 to SB (N-2), each of the plurality of comparators 510 to 51 (N-2) may output an output value of the first level. On the other hand, if the input value X is less than or equal to each of the segment boundary values SB0 to SB (N-2), each of the plurality of comparators 510 to 51 (N-2) may output an output value of the second level.
The first level may represent a high level and the second level may represent a low level. Alternatively, the first level may represent a low level, and the second level may represent a high level.
Accordingly, the section of the segment to which the input value X belongs, among the sections of the plurality of segments, may be determined by the output values output from each of the plurality of comparators 510 to 51(N-2). These output values may be referred to as section determination data (SDD).
For example, if the first segment boundary value SB0 is-4, the first segment boundary value SB0 is input to the first comparator 510. In the first comparator 510, the input value X calculated in the processing element is input.
For example, if the second segment boundary value SB1 is-2, the second segment boundary value SB1 is input to the second comparator 511. In the second comparator 511, the input value X calculated in the processing element is input.
In other words, the input value X calculated in the processing element may be input to the plurality of comparators simultaneously.
For example, when the first segment boundary value SB0 is -4, the second segment boundary value SB1 is -2, and the input value X is -3, the first section determination data SDD0, that is, the output value of the first comparator (comparator 0; 510), is output at the first level, and the remaining section determination data SDD1 to SDD(N-2), which are the output values of the remaining comparators (comparator 1 through comparator (N-2)), may be output at the second level. Thus, by the section determination data SDD, that is, the output values output from each of the plurality of comparators 510 to 51(N-2), the input value X can be determined to belong to the segment between -4 and -2.
In the above <Table 1> to <Table 4>, the section determination data SDD0 to SDD(N-2) may correspond to the above-described segment S.
<Table 5> describes determining the segment S of the programmed activation function according to the results of the section determination data SDD0 to SDD(N-2).
< Table 5>
Segment | Range | SDD0 | SDD1 | SDD2 | …… | SDD(N-2)
Segment (S0) | min < X ≤ SB0 | L | L | L | …… | L
Segment (S1) | SB0 < X ≤ SB1 | H | L | L | …… | L
Segment (S2) | SB1 < X ≤ SB2 | H | H | L | …… | L
Segment (S(N-1)) | SB(N-2) < X ≤ max | H | H | H | …… | H
Referring to <Table 5>, the segment S illustrated in <Table 1> to <Table 4> may be determined according to the outputs of the section determination data SDD0, SDD1, SDD2, and SDD(N-2). When a particular segment S is determined, the corresponding gradient A and offset B may be selected. However, examples of the present disclosure are not limited thereto, and the corresponding segment may also be determined by configuring the segment-determination circuit in various ways. Furthermore, the PAFE unit 500 may be modified, by configuring the circuit accordingly, to process the activation function in a manner other than with comparators.
On the other hand, the operation state of each of the plurality of comparators 510 to 51 (N-2) may be determined according to each of the enable signals Comp En 1 to Comp En (N-2).
That is, if each of the plurality of enable signals Comp En 1 to Comp En (N-2) is at the first level, each of the plurality of comparators 510 to 51 (N-2) may operate to compare the input value X with the segment boundary values SB0 to SB (N-2). In contrast, if each of the plurality of enable signals Comp En 1 to Comp En (N-2) is at the second level, each of the plurality of comparators 510 to 51 (N-2) may operate not to compare the input value X with the segment boundary values SB0 to SB (N-2). That is, each comparator may be disabled.
As described above, the number of segment boundary values SB0 through SB (N-2) is determined based on the number of segments of the program-activated function. For example, when the number of segments is N, the number of segment boundary values SB0 to SB (N-2) is N-1.
For example, even when the activation function conversion programming unit 3000 programs the same activation function, the first programming activation function may be programmed to have ten fragments, and the second programming activation function may be programmed to have five fragments. Thus, even though the activation function is the same, PAFE unit 500 may control the number of comparators activated in PAFE unit 500 differently based on each programmed activation function data. Thus, the accuracy of the artificial neural network calculations and the power consumption of the NPU 1000 may also vary according to programming. That is, even if the same activation function is used, a high-performance activation function calculation function or a low-power activation function calculation function can be provided according to the user's demand.
Furthermore, the number of comparators using the segment boundary value SB as input should also vary depending on the maximum number of segment boundary values SB.
For example, when the maximum number of segment boundary values SB is ten, at least ten comparators may be provided. That is, the minimum number of comparators may equal the maximum number of segment boundary values.
Accordingly, each of the plurality of comparators 510 to 51 (N-2) may determine whether to operate based on each of the plurality of comparator enable signals Comp En 1 to Comp En (N-2). Accordingly, the power consumption of the NPU can be reduced by controlling unnecessary comparator operations according to the number of segments.
However, the number of comparators may be limited due to hardware limitations. Thus, the number of segments used to segment the activation function may be limited based on the number of comparators of the PAFE unit 500. That is, the activation function may be segmented into at most the maximum number of segments that the NPU 1000 can process, or into a number of segments corresponding to the allocated resources of the NPU 1000.
Meanwhile, according to the programming method of an example of the present disclosure, a distinction can be made between the linear sections and nonlinear sections of an activation function, and the number of segments can be minimized by providing variable segment widths while minimizing the error value. Thus, there is an advantage in that the gate count of the hardware of the PAFE unit 500 of the NPU 1000 can be minimized by minimizing the number of comparators.
Further, an activation function programming method according to an example of the present disclosure may be configured to program a specific activation function based on information on the maximum number of comparators that can be provided.
Then, according to the section determination data SDD0 to SDD(N-2), the selector 520 outputs the gradient of the programmable segment corresponding to the section of the segment to which the input value X belongs, from among the gradients A0 to A(N-1) of the plurality of programmable segments.
Specifically, the first register 320 provides the plurality of gradients A0 through A (N-1) for each of the plurality of programmable segments to the selector 520. Then, the selector 520 may determine a section of a segment to which the input value X belongs among sections of a plurality of segments according to the section determination data SDD0 to SDD (N-2) output from each of the plurality of comparators 510 to 51 (N-2). Further, the selector 520 may output gradients of the programmable segments corresponding to the section of the determined segment among the gradients A0 to a (N-1) of the plurality of programmable segments.
The selector 520 outputs an offset B of a programmable segment corresponding to a section of a segment to which the input value X belongs, among a plurality of offsets B0 to B (N-1) of the plurality of programmable segments, according to the section determination data SDD0 to SDD (N-2).
Specifically, the second register 330 provides the plurality of offsets B0 through B (N-1) for each of the plurality of programmable segments to the selector 520. Further, the selector 520 may determine a section of a segment to which the input value X belongs among sections of a plurality of segments from the section determination data SDD0 to SDD (N-2) output from each of the plurality of comparators 510 to 51 (N-2). The selector 520 may then output an offset B of the programmable segment corresponding to the section of the determined segment among the plurality of offsets B0 to B (N-1) of the plurality of programmable segments.
Thus, the selector 520 may output the gradient A and the offset B of the programmable segment corresponding to the section of the segment to which the input value X belongs.
Meanwhile, the selector 520 may be a multiplexer composed of a plurality of switching elements controlled according to the section determination data SDD0 to SDD (N-2), but the configuration of the selector 520 may be variously changed.
The program activation function calculation unit of PAFE unit 500 may refer to a circuit unit configured to receive an input value X, a gradient A, and an offset B and calculate an output value Y.
The program activation function calculator of PAFE unit 500 may include at least one multiplier 530 and adder 540.
The programmed activation function calculator of PAFE unit 500 may be a hardwired circuit.
The multiplier 530 of the programmed activation function operator multiplies the input value X by the gradient A of the programmable segment corresponding to the section of the segment to which the input value X belongs.
Specifically, the multiplier 530 multiplies the input value X calculated in the at least one processing element 400 by the gradient A of the programmable segment output from the selector 520. That is, the input value X may be a calculated value of the at least one processing element 400. However, the present disclosure is not limited thereto.
Thus, the multiplier 530 may multiply the input value X by the gradient A of the programmable segment and output the result. That is, the output of the multiplier 530 may be expressed as A×X.
Then, the adder 540 of the programmed activation function operator adds the offset B of the programmable segment corresponding to the section of the segment to which the input value X belongs to the output value of the multiplier 530.
Specifically, the adder 540 adds the offset B of the programmable segment to the value obtained by multiplying the input value X by the gradient A of the programmable segment. That is, the output of the adder 540 may be expressed as A×X + B.
Accordingly, the adder 540 may output an activation value obtained by applying the PAF to the calculated input value X.
That is, a PAFE unit 500 according to examples of the present disclosure may be a circuit configuration configured to implement an activation function programmed as a linear function.
For example, PAFE unit 500, which is pipelined with at least one processing element 400 according to examples of the present disclosure, may also be configured as hardwired circuitry configured to implement an activation function programmed as a linear function.
As described above, the PAFE unit 500 of the NPU of the apparatus for performing the activation function programming method according to the example of the present disclosure is configured with only the plurality of comparators 510 to 51(N-2), the selector 520, the multiplier 530, and the adder 540, and any programmed activation function may be applied to the input value X.
Since each of the above-described plurality of comparators 510 to 51(N-2), the selector 520, the multiplier 530, and the adder 540 is relatively simple hardware, the apparatus for performing the activation function programming method according to the example of the present disclosure has the effect of processing all activation functions with only simple hardware.
Meanwhile, a conventional activation function processing apparatus can process only predefined activation functions. However, the device for performing the activation function programming method according to examples of the present disclosure may program and apply activation functions that are not predefined, such that any programmed activation function may be applied. In particular, since the PAFE unit 500 may adjust both the number and the width of the segments according to the characteristics of the various activation functions, approximation error may be minimized while using a minimum number of comparators.
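Continuing the sketches above, the comparator–selector–multiplier–adder datapath can be mirrored behaviorally as follows; this is a software model only, not the hardwired circuit:

```python
def pafe_eval(paf: LinearPAF, x: float) -> float:
    """Behavioral model of the linear PAFE unit: Y = A*X + B for the selected segment."""
    sdd = [x > sb for sb in paf.boundaries]  # comparators: SDDi is high when X > SBi
    seg = sum(sdd)                           # selector: segment index = number of high SDDs (<Table 5>)
    return paf.gradients[seg] * x + paf.offsets[seg]  # multiplier 530 and adder 540

assert pafe_eval(relu_paf, -3.0) == 0.0
assert pafe_eval(relu_paf, 2.5) == 2.5
assert pafe_eval(relu6_paf, 9.0) == 6.0
```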
Hereinafter, an NPU of an apparatus for performing an activation function programming method according to another example of the present disclosure will be described in detail.
Since the NPU of the apparatus for performing the activation function programming method according to the example of the present disclosure and the NPU of the apparatus for performing the activation function programming method according to another example of the present disclosure are different only in technical characteristics of the PAFE unit, this will be mainly described.
Fig. 18 illustrates an NPU of an apparatus for processing a programmed activation function according to another example of the present disclosure.
Fig. 19 illustrates a PAFE unit of an NPU for processing a device to program an activation function in accordance with another example of the present disclosure.
The PAFE units 500-1 through 500-N of the NPU of the device for processing a programmed activation function may be divided into a plurality of units. Specifically, the PAFE units may include a first PAFE unit 500-1 through an N-th PAFE unit 500-N. Further, each of the first through N-th PAFE units 500-1 through 500-N may process different activation functions or the same activation function. That is, the activation functions programmed in each of the first through N-th PAFE units 500-1 through 500-N may be the same as or different from each other.
The amount of data to be processed by the PAFE units 500-1 through 500-N increases with the number of processing elements 400. Thus, the number of PAFE units 500-1 through 500-N may be determined in consideration of the number of processing elements 400.
That is, if the data bandwidth of the output values of the processing elements 400 (the input values X of the PAFE units) is greater than the maximum data bandwidth that a single PAFE unit 500 can handle, the number of PAFE units 500-1 through 500-N may be increased. Thus, a bottleneck caused by insufficient data bandwidth of the PAFE units 500-1 through 500-N may be resolved.
For example, as shown in fig. 19, a PAFE unit 500 may include a Demultiplexer (DEMUX) and a Multiplexer (MUX) as well as a plurality of PAFE units.
A demultiplexer (DEMUX) distinguishes input values X to which a nonlinear PAF should be applied from input values X to which a linear PAF should be applied.
An input value to which a non-linear PAF should be applied is assigned to first PAFE element 500-1. In addition, an input value to which a linear PAF should be applied may be assigned to the second PAFE unit 500-2.
In addition, first PAFE unit 500-1 stores a programmed activation function of the nonlinear activation function. Thus, first PAFE unit 500-1 may handle a nonlinear PAF.
In addition, second PAFE unit 500-2 stores a programmed activation function of the linear activation function. Thus, second PAFE unit 500-2 may process a linear PAF.
Further, because first PAFE unit 500-1 may be configured to handle nonlinear activation functions, it may be configured to have relatively more comparators than second PAFE unit 500-2. On the other hand, because second PAFE unit 500-2 may be configured to have a relatively smaller number of comparators than first PAFE unit 500-1, it may operate with relatively less power consumption.
One of the first and second PAFE units 500-1 and 500-2 may be selectively disabled depending on the type of programming activation function handled by the NPU 1000.
Further, multiplexer MUX may receive output values with non-linear PAFs from first PAFE unit 500-1 and output values with linear PAFs from second PAFE unit 500-2.
In addition, multiplexer MUX may collect and output the nonlinear PAF application output from first PAFE unit 500-1 and the linear PAF application output from second PAFE unit 500-2.
Thus, the multiplexer MUX may output activation values in which the linear PAF or the nonlinear PAF has been applied to the calculated input values X.
According to examples of the present disclosure, first and second PAFE units 500-1 and 500-2 may be configured to process specific sections of an activation function separately to process an activation function having both linear and non-linear sections.
For example, the ELU activation function as shown in FIG. 14 has a linear section when the X value is zero or greater, and a nonlinear section when the X value is less than zero. That is, the ELU activation function is characterized by a linear section and a nonlinear section. Here, the first PAFE unit 500-1 may be configured to handle a nonlinear section of the ELU activation function. The second PAFE unit 500-2 may be configured to process a linear section of the ELU activation function.
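A rough software analogue of this demultiplexer/multiplexer arrangement is sketched below; the ELU segment values are hypothetical chord approximations added purely for illustration, and in hardware the routing is configured per programmed activation function rather than by an if-statement:

```python
# Hypothetical coarse linear PAF for the nonlinear ELU section (alpha = 1), x in (-inf, 0].
elu_nonlinear_paf = LinearPAF(
    boundaries=[-4.0, -2.0, -1.0],
    gradients=[0.00, 0.06, 0.23, 0.63],
    offsets=[-0.98, -0.75, -0.40, 0.00],
)
# Single-segment identity PAF (y = x) for the linear ELU section, x in (0, +inf).
elu_linear_paf = LinearPAF(boundaries=[0.0], gradients=[1.0, 1.0], offsets=[0.0, 0.0])

def dual_pafe_eval(x: float) -> float:
    """DEMUX: nonlinear inputs to the first PAFE unit, linear inputs to the second."""
    if x < 0.0:
        return pafe_eval(elu_nonlinear_paf, x)  # first PAFE unit 500-1 (more comparators)
    return pafe_eval(elu_linear_paf, x)         # second PAFE unit 500-2 (fewer comparators)
```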
Hereinafter, an NPU of an apparatus for performing an activation function programming method according to another example of the present disclosure will be described in detail.
Since the NPU of the apparatus for performing the activation function programming method according to the example of the present disclosure and the NPU of the apparatus for performing the activation function programming method according to another example of the present disclosure are different only in technical characteristics of the PAF library 600, this will be mainly described.
Fig. 20 illustrates an NPU of an apparatus for processing a programmed activation function according to another example of the present disclosure.
In addition to a controller 100, a memory 300, at least one processing element 400, and a PAFE unit 500, the NPU may also include a PAF library 600.
The PAF library 600 may store PAFs that approximate activation functions. Specifically, the PAF library 600 may store the gradients A0 through A(N-1) and the offsets B0 through B(N-1) of the plurality of programmable segments that make up a PAF. As explained, the PAF library 600 may store a plurality of PAFs. In addition, the PAF library 600 may store the gradients A0 through A(N-1) and the offsets B0 through B(N-1) of the plurality of programmable segments of each of the plurality of PAFs. However, by means of the activation function conversion program, the plurality of PAFs are not limited to linear functions and can be approximated by selectively combining second-order polynomials, third-order polynomials, logarithmic functions, and the like. For example, the PAF library 600 can be configured to store each of the programming activation function data shown in <Table 2> through <Table 4>. Thus, the PAF library 600 may be configured to store a programmed ReLU, a programmed ReLU with clipping, and a programmed ReLU6. Further, the controller 100 may control, as needed, the selection of a particular activation function from the PAF library 600 and its input into the PAFE unit 500.
The plurality of programmed activation functions stored in the PAF library 600 can approximate a representative activation function. For example, representative activation functions may be Swish functions, mish functions, sigmoid functions, hyperbolic Tangent (TANH) functions, SELU functions, GELU (Gaussian error Linear Unit) functions, SOFTPLUS functions, reLU functions, leaky ReLU functions, maxout functions, ELU functions, and the like.
Accordingly, PAFE unit 500 may select a desired PAF from among the plurality of PAFs stored in PAF library 600 according to the control of controller 100. Additionally, PAFE unit 500 may import information such as gradients A0 through A (N-1) and offsets B0 through B (N-1) from multiple programmable segments of a selected PAF from PAF library 600.
As described above, an apparatus for performing an activation function programming method according to another example of the present disclosure may program and store frequently used activation functions in the PAF library 600.
Thus, in an apparatus for performing an activation function programming method according to another example of the present disclosure, the PAF library 600 may store PAFs without requiring an activation function conversion program to program all activation functions.
Accordingly, there is an advantage in that the processing speed of the apparatus for performing the activation function programming method according to another example of the present disclosure can be improved and the power consumption for driving the activation function conversion program can be reduced.
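In software terms, such a library is simply a keyed store of pre-programmed parameter sets that the controller can select from; a minimal sketch with illustrative entries (the names and the dictionary structure are assumptions):

```python
# Hypothetical PAF library: pre-programmed activation function data keyed by name.
paf_library = {
    "relu": relu_paf,
    "relu6": relu6_paf,
    "elu_nonlinear": elu_nonlinear_paf,
}

def load_paf(name: str) -> LinearPAF:
    """Select a stored PAF without re-running the activation function conversion program."""
    return paf_library[name]

print(pafe_eval(load_paf("relu6"), 7.0))  # -> 6.0
```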
Hereinafter, an NPU of an apparatus for performing an activation function programming method according to another example of the present disclosure will be described in detail.
Since the NPUs of the apparatus for performing the activation function programming method according to the example of the present disclosure and the apparatus for performing the activation function programming method according to another example of the present disclosure differ only in terms of the Processing Element (PE) array and the PAFE unit, this will be mainly described.
Fig. 21 illustrates an NPU of an apparatus for processing a programmed activation function according to another example of the present disclosure.
As shown in fig. 21, in an NPU of an apparatus for performing an activation function programming method according to another example of the present disclosure, a plurality of processing elements #0 to #N-1 may be grouped. Each group may include at least one processing element.
In other words, the plurality of processing elements may include the zeroth processing element #0 through the (N-1)-th processing element #N-1. Each of the plurality of processing elements #0 through #N-1 may be referred to as a processing element (PE) thread or a PE core. Hereinafter, the grouped processing elements will be referred to as PE cores.
On the other hand, each structure of each of the PE cores may be different from each other. For example, each of the plurality of PE cores may be one of an input fixed type, a weight fixed type, and an output fixed type.
Furthermore, each of the plurality of PE cores may be driven individually, depending on the optimization of the driving. That is, each of the plurality of PE cores is not driven simultaneously, and may be driven in sequence according to the operation of the PAFE unit.
Furthermore, the number of processing elements included in each of the plurality of PE cores, the multiply-accumulate (MAC) operator, and the Arithmetic Logic Unit (ALU) operator may be different. Thus, the size of each of the plurality of PE cores may be different.
In addition, each of the plurality of PE cores may be coupled to a PAFE unit through a Multiplexer (MUX). Specifically, a Multiplexer (MUX) receives the plurality of calculated values output from each of the plurality of PE cores and outputs at least one of the plurality of calculated values to the PAFE unit.
Further, a buffer memory may be provided between the PAFE unit 500 and the plurality of PE cores. However, the present disclosure is not limited thereto.
Thus, one PAFE unit may process multiple computed values output from each of multiple PE cores. Thus, the number of PAFE cells provided in a device for performing an activation function programming method according to another example may be minimized. Finally, this may minimize the manufacturing cost of the device for performing the activation function programming method.
Fig. 22 illustrates a PAFE cell configured to handle a program activation function according to another example of the present disclosure.
Fig. 23 illustrates a PAFE element of an NPU for a device to process a programmed activation function in accordance with another example of the present disclosure.
Each of the plurality of programmable segments of the PAF applied to the PAFE units shown in fig. 22 and fig. 23 may operate as a linear function or a quadratic function. Thus, the coefficients A, B, and C of the programmable segments described above may include the coefficient A of the quadratic term, the coefficient B of the linear term, and the offset C.
Thus, the activation function conversion programming unit 3000 may be configured to provide the programming activation function data to be processed by the NPU to the memory 300.
< Table 6>
Referring to <Table 6>, data for driving the programmed activation function may be generated in the activation function conversion programming unit 3000 and configured to be stored in the memory 300, for example, in the segment register 310, the first register 320, the second register 330, and the third register 340 of the NPU.
For example, segment register 310 may be configured to store a segment boundary value SB of < table 6 >.
For example, the first register 320 may be configured to store the coefficient a of the quadratic term of < table 6 >.
For example, the second register 330 may be configured to store the coefficient B of the linear term of <Table 6>. For example, the third register 340 may be configured to store the offset C of <Table 6>.
The controller 100 and/or the DMA 200 may instruct the memory 300 to store the data of the programmed activation function of <Table 6>. Examples of the present disclosure are not limited thereto, and the data of the programmed activation function may be configured to be stored in at least one of a register in the controller 100, a register in the PAFE unit 500', a separate memory, and a separate register. That is, the storage location of the data of the programmed activation function is not limited to a particular location.
Referring to < table 6>, an example of programming activation function data is disclosed.
For example, the program activation function data may be configured to include a segment boundary value SB.
For example, the program activation function data may be configured to include the range of each segment S.
For example, the programming activation function data may be configured to include coefficients a of quadratic terms and coefficients B of linear terms for each segment.
For example, the program activation function data may be configured to include an offset C for each segment.
An exemplary PAFE unit configured to process an activation function programmed with quadratic terms may be configured to include a plurality of comparators (comparator 0 through comparator (N-2), i.e., 510 through 51(N-2)), a selector 520, a plurality of multipliers 531, 532, and 533, and a plurality of adders 541 and 542.
Each of the plurality of comparators 510 to 51 (N-2) compares the input value X calculated in the at least one processing element 400 with each of the plurality of segment boundary values SB0 to SB (N-2). For example, when the input value X is greater than each of the plurality of segment boundary values SB0 to SB (N-2), each of the plurality of comparators 510 to 51 (N-2) may output a first level output value. In contrast, when the input value X is less than or equal to each of the plurality of segment boundary values SB0 to SB (N-2), each of the plurality of comparators 510 to 51 (N-2) may output the second level output value.
Accordingly, the section of the segment to which the input value X belongs may be determined among the sections of the plurality of segments by the output value output from each of the plurality of comparators 510 to 51 (N-2).
Meanwhile, the operation of each of the plurality of comparators 510 to 51 (N-2) may be determined by each of the plurality of comparator enable signals Comp En 1 to Comp En (N-2).
Further, according to the section determination data SDD0 to SDD(N-2), the selector 520 outputs the coefficients A, B, and C of the programmable segment corresponding to the section of the segment to which the input value X belongs, from among the coefficients A0 to A(N-1), B0 to B(N-1), and C0 to C(N-1) of the plurality of programmable segments.
Specifically, the first register 320, the second register 330, and the third register 340 provide the coefficients A0 through A(N-1) of the quadratic term, the coefficients B0 through B(N-1) of the linear term, and the offsets C0 through C(N-1), respectively, for each of the plurality of programmable segments to the selector 520.
Further, the selector 520 may determine the section of the segment to which the input value X belongs among the sections of the plurality of segments according to the section determination data SDD0 to SDD(N-2) output from each of the plurality of comparators 510 to 51(N-2).
Further, from among the coefficients A0 to A(N-1) of the quadratic terms, the coefficients B0 to B(N-1) of the linear terms, and the offsets C0 to C(N-1) of the plurality of programmable segments, the selector 520 outputs the coefficient A of the quadratic term, the coefficient B of the linear term, and the offset C of the programmable segment corresponding to the section of the determined segment.
Thus, the selector 520 may output the coefficient A of the quadratic term, the coefficient B of the linear term, and the offset C of the programmable segment corresponding to the section of the segment to which the input value X belongs.
Meanwhile, the selector 520 may be a multiplexer composed of a plurality of switching elements controlled according to the section determination data SDD, but the configuration of the selector 520 may be variously changed.
The program-activated function calculation unit of PAFE unit 500' may refer to a circuit unit configured to receive as inputs an input value X, a coefficient A of a quadratic term, a coefficient B of a linear term, and an offset C, and to calculate an output value Y.
The program activation function calculator of PAFE unit 500' may be configured to include a plurality of multipliers 531, 532, and 533 and a plurality of adders 541 and 542 to process a quadratic function or a linear function.
The program activation function calculation unit of PAFE unit 500' may be a hardwired circuit.
The plurality of multipliers of the program activation function calculator may include a first multiplier 531, a second multiplier 532, and a third multiplier 533.
The first multiplier 531 multiplies the coefficient A of the quadratic term of the programmable segment corresponding to the section of the segment to which the input value X belongs by the input value X.
Specifically, the first multiplier 531 multiplies the input value X calculated in the at least one processing element 400 by the coefficient A of the quadratic term of the programmable segment output from the selector 520.
Thus, the first multiplier 531 may multiply the input value X by the coefficient A of the quadratic term of the programmable segment and output the result. That is, the output of the first multiplier 531 may be expressed as A×X.
Then, the second multiplier 532 multiplies the output value output from the first multiplier 531 by the input value X.
Specifically, the second multiplier 532 multiplies the input value X calculated by the at least one processing element 400 by the output value output from the first multiplier 531.
Thus, the output of the second multiplier 532 may be expressed as A×X². However, the present disclosure is not limited thereto; the above configuration is merely one way of realizing A×X², and modifications may also be implemented by various circuit combinations.
The third multiplier 533 multiplies the coefficient B of the linear term of the programmable segment corresponding to the section of the segment to which the input value X belongs by the input value X.
Specifically, the third multiplier 533 multiplies the input value X calculated in the at least one processing element 400 by the coefficient B of the linear term of the programmable segment output from the selector 520.
Thus, the third multiplier 533 may multiply the input value X by the coefficient B of the linear term of the programmable segment and output the result. That is, the output of the third multiplier 533 may be expressed as B×X.
The plurality of adders may include a first adder 541 and a second adder 542.
The first adder 541 adds the output value of the third multiplier 533 to the output value of the second multiplier 532.
Specifically, the first adder 541 may output the sum of the quadratic term and the linear term of the programmable segment. That is, the output of the first adder 541 may be expressed as A×X² + B×X.
Then, the second adder 542 adds the offset C of the programmable segment corresponding to the section of the segment to which the input value X belongs to the output value of the first adder 541.
Specifically, the second adder 542 adds the offset C of the programmable segment to the sum of the quadratic term and the linear term. That is, the output of the second adder 542 may be expressed as A×X² + B×X + C.
Accordingly, the second adder 542 may output an activation value obtained by applying the activation function programmed as a quadratic function to the input value X.
According to the configuration described above, PAFE unit 500' enables processing of the operation of the second order polynomial.
Meanwhile, operations of the second multiplier 532, the third multiplier 533, and the second adder 542 may be controlled by the first enable signal EN 1.
Specifically, when the second multiplier 532, the third multiplier 533, and the second adder 542 do not operate due to the first enable signal EN1, the operation is as follows.
The first multiplier 531 multiplies the coefficient A of the quadratic term of the programmable segment corresponding to the section of the segment to which the input value X belongs by the input value X.
Specifically, the first multiplier 531 multiplies the input value X calculated in the at least one processing element 400 by the coefficient A of the quadratic term of the programmable segment output from the selector 520.
Thus, the first multiplier 531 may multiply the input value X by the coefficient A and output the result. That is, the output of the first multiplier 531 may be expressed as A×X.
Further, the second multiplier 532 and the third multiplier 533 do not operate, and the output of the first multiplier 531 is input to the first adder 541 as it is. That is, the calculator disabled by the first enable signal EN1 may be bypassed.
Then, the first adder 541 adds the coefficient B of the linear term of the programmable segment corresponding to the section to which the input value X belongs to the output value of the first multiplier 531.
Specifically, the first adder 541 adds the coefficient B of the linear term of the programmable segment to the value obtained by multiplying the input value X by the coefficient A of the quadratic term. That is, the output of the first adder 541 may be expressed as A×X+B.
Further, the second adder 542 does not operate, and the output of the first adder 541 is output as it is. That is, the calculator disabled by the first enable signal EN1 may be bypassed.
That is, the first adder 541 may output an activation value obtained by applying the activation function programmed as a linear function to the operation value, that is, the input value X.
According to the above configuration, the PAFE unit 500′ can process the operation of a first-order polynomial.
As described above, some of the plurality of multipliers and the plurality of adders may be controlled by the first enable signal EN1. Thus, depending on the first enable signal EN1, the PAFE unit may be driven not only when each of the programmable segments is a second-order polynomial but also when each is a first-order polynomial.
In other words, the at least one processing element 400 and the PAFE unit 500′ pipelined according to examples of the present disclosure may also be composed of hardwired circuitry configured to implement activation functions programmed as both quadratic and linear functions.
Thus, there is an advantage in that one PAFE unit can be used to process PAFs in various situations.
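The mode selection by the first enable signal EN1 can be modeled as follows; this is a hedged sketch of the bypass behavior described above, not the actual control logic.

```python
def pafe_unit(x, a, b, c, en1=True):
    """Models the PAFE unit 500': EN1 high -> quadratic, EN1 low -> linear."""
    m1 = a * x                 # first multiplier 531 always operates
    if en1:
        s1 = m1 * x + b * x    # multipliers 532/533 and first adder 541
        return s1 + c          # second adder 542 adds the offset C
    # EN1 low: 532, 533, and 542 are bypassed; adder 541 adds B directly
    return m1 + b              # A*X + B
```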
FIG. 24 illustrates an example of an apparatus for processing a programmed activation function in which a sigmoid activation function is approximated as a programmable activation function, according to another example of the present disclosure.
As described above, each of the plurality of programmable segments of the PAF applied in the PAFE unit of the apparatus for performing the activation function programming method according to another example of the present disclosure is a second-order polynomial. In detail, at least a portion of the sigmoid function, for example, only the range of -6.0 to 2.0, may be approximated by dividing it into three segments.
For example, the sigmoid activation function may be approximated with a PAF as follows.
In section S0, where the input value X is greater than -6.0 and less than or equal to -2.6, the programmable segment may be expressed as 0.07X²+0.08X+0.23. Furthermore, in section S1, where the input value X is greater than -2.6 and less than or equal to -0.6, the programmable segment may be expressed as 0.05X²+0.3X+0.52. In addition, in section S2, where the input value X is greater than -0.6 and less than or equal to 2.0, the programmable segment may be expressed as -0.03X²+0.26X+0.5.
Thus, the programmable parameters may be organized in the format of <Table 6>.
For example, A0 in <Table 6> may be 0.07, B0 may be 0.08, and C0 may be 0.23.
For example, A1 in <Table 6> may be 0.05, B1 may be 0.3, and C1 may be 0.52.
For example, A2 in <Table 6> may be -0.03, B2 may be 0.26, and C2 may be 0.5.
For example, SB0 in <Table 6> may be -2.6, and SB1 may be -0.6.
For example, Min in <Table 6> may be -6.0, and Max may be 2.0.
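For illustration, the three programmable segments listed above can be evaluated in software and compared against the true sigmoid. The sketch below simply transcribes the stated <Table 6> values; it is a plausibility check, not part of the disclosed circuit.

```python
import math

# (lower bound, upper bound, A, B, C) per programmable segment, per <Table 6>
SEGMENTS = [
    (-6.0, -2.6, 0.07, 0.08, 0.23),    # S0
    (-2.6, -0.6, 0.05, 0.30, 0.52),    # S1
    (-0.6,  2.0, -0.03, 0.26, 0.50),   # S2
]

def paf_sigmoid(x):
    """Evaluates the programmed (piecewise quadratic) sigmoid approximation."""
    for lo, hi, a, b, c in SEGMENTS:
        if lo < x <= hi:
            return a * x * x + b * x + c
    raise ValueError("x is outside the programmed range (Min, Max]")

x = -1.0
print(paf_sigmoid(x))                  # ~0.27 (segment S1)
print(1.0 / (1.0 + math.exp(-x)))      # ~0.269 (true sigmoid)
```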
For example, the segment boundary values SB, the coefficient A of the quadratic term, the coefficient B of the linear term, and the offset C of each segment may also be derived by using machine learning to approximate each segment with the best programmable segment, as in the activation function programming method according to the example of FIG. 12.
The coefficients in FIG. 24 are merely examples derived by machine learning and may be modified in various ways. For example, some of the programmable segments, such as S0 and S2, may correspond to linear sections, and another programmable segment, such as S1, may correspond to a nonlinear section.
In that case, the programmable segments S0 and S2 can be approximated with linear functions, and the programmable segment S1 can be approximated with a quadratic function.
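One plausible way to derive such per-segment parameters, in the spirit of the machine-learning derivation mentioned above, is an ordinary least-squares polynomial fit per section; the actual method of FIG. 12 may differ, so the following is only a sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_segment(f, lo, hi, degree=2, samples=256):
    """Least-squares fit of one programmable segment over (lo, hi].
    degree=2 returns (A, B, C) for A*x^2 + B*x + C; degree=1 returns
    (gradient, offset) for a linear segment."""
    x = np.linspace(lo, hi, samples)
    return np.polyfit(x, f(x), degree)

print(fit_segment(sigmoid, -2.6, -0.6))  # roughly [0.05, 0.30, 0.52]
```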
In some examples, a logarithmic operator may also be included at the output stage of the PAFE unit. Referring to FIG. 25, a PAFE unit including the logarithmic operator will be described in detail.
Fig. 25 is a conceptual diagram illustrating a PAFE cell of an NPU of an apparatus for processing an activation function programmed according to another example of the present disclosure.
Referring to FIG. 25, the PAFE unit 500″ may include a plurality of comparators 511 to 51(N-2) (i.e., comparators 0 to (N-2)), a selector 520, a plurality of multipliers 531, 532, and 533, a plurality of adders 541 and 542, and a logarithmic operator 550.
Since the PAFE unit shown in FIG. 25 differs from the PAFE unit shown in FIG. 23 only in whether the logarithmic operator 550 is present, this difference will be described in detail.
The operation of the logarithmic operator 550 may be controlled by the second enable signal EN2. When the second enable signal EN2 is applied to the logarithmic operator 550, the logarithmic coefficient D may be input to the logarithmic operator 550. When the logarithmic operator 550 is activated, the operators 531, 532, 533, 541, and 542 related to the coefficient A of the second-order term, the coefficient B of the first-order term, and the offset C may be disabled.
That is, the output of the logarithmic operator 550 may be expressed as log D.
That is, the logarithmic operator 550 may output an activation value obtained by applying the PAF including the logarithmic operation to the input value X.
Each of the plurality of programmable segments of the PAF applied in the PAFE unit as shown in FIG. 25 may operate as a linear function, a quadratic function, or a logarithmic function. Thus, the coefficients A, B, C, and D of the programmable segments described above may include the coefficient A of the quadratic term, the coefficient B of the linear term, the offset C, and the logarithmic coefficient D.
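A software model of the EN2-controlled logarithmic path might look as follows. Note that the exact functional form of the logarithmic output is not fixed by the text; the use of D·ln(X) below is an assumption for illustration.

```python
import math

def pafe_unit_with_log(x, a, b, c, d, en1=True, en2=False):
    """Models the PAFE unit 500'': EN2 selects the logarithmic operator 550."""
    if en2:
        # Operators 531, 532, 533, 541, and 542 are disabled; the functional
        # form D * ln(x) is an assumption here (requires x > 0).
        return d * math.log(x)
    if en1:
        return a * x * x + b * x + c   # second-order path
    return a * x + b                   # first-order path (EN1 bypass)
```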
< Table 7>
Referring to <Table 7>, the data for driving the programmed activation function may be generated in the activation function conversion programming unit 3000 and stored in the memory 300 of the NPU, for example, in the segment register 310, the first register 320, the second register 330, the third register 340, and the fourth register 350.
For example, the programmed activation function data may include the segment boundary values SB. The segment boundary values SB may be stored in the first register of the memory.
For example, the programmed activation function data may include the range of the section S for each segment.
For example, the programmed activation function data may include the coefficient A of the quadratic term for each segment. The coefficient A may be stored in the second register of the memory.
For example, the programmed activation function data may include the coefficient B of the linear term for each segment. The coefficient B may be stored in the third register of the memory.
For example, the programmed activation function data may include the offset C for each segment. The offset C may be stored in the fourth register of the memory.
For example, the programmed activation function data may include the logarithmic coefficient D for each segment. The logarithmic coefficient D may be stored in a fifth register of the memory.
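The register layout described above can be collected into a small data structure, sketched below. The field names mirror the reference numerals in the text and are not a published programming interface.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProgrammedActivationFunctionData:
    """Per-segment parameters as stored in the memory 300 (sketch only)."""
    segment_boundaries: List[float] = field(default_factory=list)  # SB, first register
    quad_coeffs: List[float] = field(default_factory=list)         # A, second register
    linear_coeffs: List[float] = field(default_factory=list)       # B, third register
    offsets: List[float] = field(default_factory=list)             # C, fourth register
    log_coeffs: List[float] = field(default_factory=list)          # D, fifth register
```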
As described above, the PAF including the logarithmic operation is applied by adding the logarithmic operator 550 to the PAFE unit. However, the operator added to the output of the PAFE unit is not limited to the logarithmic operator 550; various other types of operators may also be used.
In other words, the programmed activation function data may be determined by the operator circuit configuration of the programmed activation function calculator of the PAFE unit and the equations that it can support.
A neural processing unit according to an example of the present disclosure may include: at least one processing element configured to output an operation value by an artificial neural network operation; a programmed activation function execution unit configured to generate an activation value by applying, to the operation value, at least one programmed activation function including a plurality of programmable segments; and a controller configured to control operations of the at least one processing element and the programmed activation function execution unit.
According to another feature of the present disclosure, the neural processing unit may further include a segment register for storing information about segments of the plurality of programmable segments.
According to another feature of the present disclosure, the neural processing unit may further include a segment register for storing segment boundary values of the plurality of programmable segments.
According to another feature of the present disclosure, the programmed activation function execution unit may include a plurality of comparators, a selector, at least one multiplier, and at least one adder, all of which are hardwired.
According to another feature of the present disclosure, the neural processing unit may further include a plurality of comparators configured to compare the operation value with each of a plurality of input segment boundary values and to output section determination data.
According to another feature of the present disclosure, the neural processing unit may further include a plurality of comparators configured to determine whether to operate according to a comparator enable signal.
According to another feature of the present disclosure, the neural processing unit may further include a plurality of comparators configured to output section determination data, and the programmed activation function execution unit may be configured to generate the activation value by applying the gradient and offset of the corresponding segment among the plurality of programmable segments to the operation value according to the section determination data.
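The comparator-and-selector scheme recited above amounts to locating the section to which the operation value belongs, as in the following sketch (an actual unit would evaluate all comparators in parallel in hardwired logic; the function name is illustrative).

```python
def section_determination(x, boundaries):
    """Models the comparators: given ascending segment boundary values,
    returns the index of the programmable segment whose section contains x."""
    index = 0
    for sb in boundaries:   # one comparison per comparator
        if x > sb:
            index += 1
    return index            # the selector uses this to pick (A, B, C)

print(section_determination(-1.0, [-2.6, -0.6]))  # -> 1 (segment S1)
```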
According to another feature of the present disclosure, the at least one multiplier may multiply an input value with a gradient of the programmable segment output from the selector.
According to another feature of the present disclosure, the at least one adder may add a value output from the at least one multiplier, which is obtained by multiplying the input value with the gradient for the programmable segment, to an offset for the programmable segment.
According to another feature of the present disclosure, the selector may output, according to the section determination data, the gradient of the second-order term, the gradient of the first-order term, and the offset of the programmable segment corresponding to the section to which the input value belongs, from among those of the plurality of programmable segments.
According to another feature of the present disclosure, the at least one multiplier may include: a first multiplier for multiplying an input value by a coefficient of the second order term for the programmable segment output from the selector; a second multiplier for multiplying the output value output from the first multiplier by the input value; and a third multiplier for multiplying the input value by a coefficient for the first-order term of the programmable segment output from the selector.
According to another feature of the present disclosure, operation of the second multiplier and the third multiplier may be controlled by a first enable signal.
According to another feature of the present disclosure, the at least one adder includes: a first adder that adds an output value of the third multiplier to an output value of the second multiplier; and a second adder for adding the offset for the programmable segment output from the selector to the output value of the first adder.
According to another feature of the present disclosure, operation of the second adder may be controlled by the first enable signal.
According to another feature of the present disclosure, the programmed activation function execution unit may further include a logarithmic operator that performs a logarithmic operation on the output value of the at least one adder.
According to another feature of the present disclosure, operation of the logarithmic operator may be controlled by a second enable signal.
According to another feature of the present disclosure, the neural processing unit may further include a programmable activation function library storing gradient and offset information for configuring a plurality of programmable segments of the programmable activation function.
According to another feature of the present disclosure, the at least one processing element is connected to the programmed activation function execution unit through a multiplexer.
According to another feature of the present disclosure, the neural processing unit may further include an activation function conversion programming unit that programs an activation function into the at least one programmed activation function.
According to another feature of the present disclosure, the activation function conversion programming unit may first determine the linear and nonlinear sections of the at least one programmed activation function according to slope change data.
According to another feature of the present disclosure, the activation function conversion programming unit may determine a section in which the second derivative of the slope change data is below a threshold as the linear section.
According to another feature of the present disclosure, the activation function conversion programming unit may determine a section in which the second derivative of the slope change data is above a threshold as the nonlinear section.
According to another feature of the present disclosure, the activation function conversion programming unit may divide the nonlinear section into a plurality of sections based on an integrated value of the second derivative.
According to another feature of the present disclosure, the activation function conversion programming unit may convert the linear section of the at least one programming activation function into a programmable segment approximated by a linear function.
According to another feature of the present disclosure, the activation function conversion programming unit may convert the nonlinear section of the at least one programming activation function into a programmable segment approximated by a quadratic function.
According to another feature of the present disclosure, the activation function conversion programming unit may convert the nonlinear section of the at least one programming activation function into a programmable segment approximated by a logarithmic function.
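The section-determination rules recited in the last several features can be sketched as follows, assuming a sampled activation function, a numerical second derivative, and a user-chosen threshold; the actual activation function conversion programming unit may apply a different criterion.

```python
import numpy as np

def split_by_curvature(f, lo, hi, threshold, n_pieces=3, samples=1024):
    """Sketch: samples with |f''| below the threshold are treated as linear;
    the remaining curvature mass is cut into n_pieces of equal integrated
    |f''|, yielding candidate segment boundary values SB."""
    x = np.linspace(lo, hi, samples)
    d2 = np.abs(np.gradient(np.gradient(f(x), x), x))  # numerical 2nd derivative
    d2[d2 < threshold] = 0.0          # linear sections contribute no curvature
    cdf = np.cumsum(d2)
    cuts = [np.searchsorted(cdf, cdf[-1] * k / n_pieces)
            for k in range(1, n_pieces)]
    return x[cuts]                    # candidate boundaries SB
```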
Examples of the present disclosure disclosed in the present specification and the drawings are presented merely as specific examples to easily explain the technical content of the present disclosure and to aid understanding of the present disclosure, and are not intended to limit the scope of the present disclosure. It is obvious to those skilled in the art that other modified examples based on the technical ideas of the present disclosure can be implemented in addition to the examples disclosed herein.
[National research and development project supporting the present invention]
[Allocation identification number] 1711170646
[Allocation number] 2022-0-00957-001
[Department name] Ministry of Science and ICT
[Allocation management (professional) organization name] Information and Communications Planning and Evaluation Institute
[Research project name] PIM artificial intelligence semiconductor core technology development (design)
[Research task title] Distributed on-chip memory-computer convergence PIM semiconductor technology development for edge
[Contribution ratio] 1/1
[Name of organization performing the task] DEEPX Co., Ltd.
[Research period] 2022.04.01–2022.12.31

Claims (26)

1. A neural processing unit, comprising:
at least one processing element configured to output an operation value by an artificial neural network operation;
a programmed activation function execution unit configured to generate an activation value by applying, to the operation value, at least one programmed activation function including a plurality of programmable segments; and
a controller configured to control operation of the at least one processing element and the programmed activation function execution unit.
2. The neural processing unit of claim 1, further comprising a segment register for storing information about segments of the plurality of programmable segments.
3. The neural processing unit of claim 1, further comprising a segment register for storing segment boundary values of the plurality of programmable segments.
4. The neural processing unit of claim 1, wherein the programmed activation function execution unit comprises a plurality of comparators, a selector, at least one multiplier, and at least one adder, all of which are hardwired.
5. The neural processing unit of claim 1, further comprising a plurality of comparators configured to compare the operation value with each of a plurality of input segment boundary values and to output section determination data.
6. The neural processing unit of claim 1, further comprising a plurality of comparators configured to determine whether to operate according to a comparator enable signal.
7. The neural processing unit of claim 1, further comprising a plurality of comparators configured to output section determination data,
wherein the programmed activation function execution unit is configured to generate the activation value by applying the gradient and offset of the corresponding segment among the plurality of programmable segments to the operation value according to the section determination data.
8. The neural processing unit of claim 4, wherein the at least one multiplier multiplies the input value with a gradient of the programmable segment output from the selector.
9. The neural processing unit of claim 8, wherein the at least one adder adds a value output from the at least one multiplier obtained by multiplying the input value with the gradient for the programmable segment to an offset for the programmable segment.
10. The neural processing unit of claim 4, wherein the selector outputs, according to the section determination data, the gradient of the second-order term, the gradient of the first-order term, and the offset of the programmable segment corresponding to the section to which the input value belongs, from among those of the plurality of programmable segments.
11. The neural processing unit of claim 10, wherein the at least one multiplier comprises:
a first multiplier for multiplying an input value by a coefficient of the second order term for the programmable segment output from the selector;
a second multiplier for multiplying the output value output from the first multiplier by the input value; and
a third multiplier for multiplying the input value by a coefficient of the first-order term for the programmable segment output from the selector.
12. The neural processing unit of claim 11, wherein operation of the second multiplier and the third multiplier is controlled by a first enable signal.
13. The neural processing unit of claim 12, wherein the at least one adder comprises:
a first adder that adds an output value of the third multiplier to an output value of the second multiplier; and
a second adder for adding the offset for the programmable segment output from the selector to the output value of the first adder.
14. The neural processing unit of claim 13, wherein operation of the second adder is controlled by the first enable signal.
15. The neural processing unit of claim 4, wherein the programmed activation function execution unit further comprises a logarithmic operator that performs a logarithmic operation on the output value of the at least one adder.
16. The neural processing unit of claim 15, wherein operation of the logarithmic operator is controlled by a second enable signal.
17. The neural processing unit of claim 1, further comprising a library of programmable activation functions storing gradient and offset information for configuring a plurality of programmable segments of programmable activation functions.
18. The neural processing unit of claim 1, wherein the at least one processing element is connected to the programmed activation function execution unit through a multiplexer.
19. The neural processing unit of claim 1, further comprising an activation function conversion programming unit that programs an activation function into the at least one programmed activation function.
20. The neural processing unit of claim 19, wherein the activation function conversion programming unit preferentially determines linear and nonlinear sections of the at least one programmed activation function from slope change data.
21. The neural processing unit of claim 20, wherein the activation function conversion programming unit determines a section of the slope change data where the second derivative is below a threshold as the linear section.
22. The neural processing unit of claim 20, wherein the activation function conversion programming unit determines a section of the slope change data where the second derivative is above a threshold as the nonlinear section.
23. The neural processing unit of claim 22, wherein the activation function conversion programming unit divides the nonlinear section into a plurality of sections based on an integrated value of the second derivative.
24. The neural processing unit of claim 19, wherein the activation function conversion programming unit converts the linear section of the at least one programmed activation function into a programmable segment approximated by a linear function.
25. The neural processing unit of claim 19, wherein the activation function conversion programming unit converts the nonlinear section of the at least one programmed activation function into a programmable segment approximated by a quadratic function.
26. The neural processing unit of claim 19, wherein the activation function conversion programming unit converts the nonlinear section of the at least one programmed activation function into a programmable segment approximated by a logarithmic function.
CN202280047322.4A 2021-12-01 2022-12-01 Neural processing unit including a programmed activation function execution unit Pending CN117677958A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2021-0170040 2021-12-01
KR1020220165012A KR102651560B1 (en) 2021-12-01 2022-11-30 Neural processing unit including a programmed activation functrion execution unit
KR10-2022-0165012 2022-11-30
PCT/KR2022/019376 WO2023101472A1 (en) 2021-12-01 2022-12-01 Neural processing unit comprising programmed activation function execution unit

Publications (1)

Publication Number Publication Date
CN117677958A 2024-03-08

Family

ID=90069949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280047322.4A Pending CN117677958A (en) 2021-12-01 2022-12-01 Neural processing unit including a programmed activation function execution unit

Country Status (1)

Country Link
CN (1) CN117677958A (en)

Similar Documents

Publication Publication Date Title
Nakahara et al. A lightweight YOLOv2: A binarized CNN with a parallel support vector regression for an FPGA
CN108701250B (en) Data fixed-point method and device
Faraone et al. AddNet: Deep neural networks using FPGA-optimized multipliers
US20190244097A1 (en) Information processing apparatus and information processing method
Daghero et al. Energy-efficient deep learning inference on edge devices
US11537879B2 (en) Neural network weight discretizing method, system, device, and readable storage medium
DiCecco et al. FPGA-based training of convolutional neural networks with a reduced precision floating-point library
US11783200B2 (en) Artificial neural network implementation in field-programmable gate arrays
US11625607B2 (en) Method of structured network pruning and sparsity speed-up
Lee et al. Successive log quantization for cost-efficient neural networks using stochastic computing
CN115204355A (en) Neural processing unit capable of reusing data and method thereof
CN113407747A (en) Hardware accelerator execution method, hardware accelerator and neural network device
Posewsky et al. A flexible fpga-based inference architecture for pruned deep neural networks
CN117677958A (en) Neural processing unit including a programmed activation function execution unit
Wong et al. Low bitwidth CNN accelerator on FPGA using Winograd and block floating point arithmetic
KR102651560B1 (en) Neural processing unit including a programmed activation functrion execution unit
EP4020324A1 (en) Compressing a set of oefficients for subsequent use in a neural network
KR20210116182A (en) Softmax approximation method and apparatus
KR20240041307A (en) Neural processing unit including a programmed activation functrion execution unit
Ullah et al. Approximate Arithmetic Circuit Architectures for FPGA-based Systems
CN113191494A (en) Efficient LSTM accelerator based on FPGA
Przewlocka-Rus et al. Energy efficient hardware acceleration of neural networks with power-of-two quantisation
KR102553119B1 (en) Method for generating programmable activation function and apparatus using the same
US11836604B2 (en) Method for generating programmable activation function and apparatus using the same
CN110770696B (en) Processing core operation suppression based on contribution estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination