CN117391157A - Hardware acceleration circuit, data processing acceleration method, chip and accelerator - Google Patents

Hardware acceleration circuit, data processing acceleration method, chip and accelerator

Info

Publication number: CN117391157A
Application number: CN202210757664.7A
Authority: CN (China)
Prior art keywords: data, lookup table, module, value, bit
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: name not published at the inventor's request
Current assignee: Guangzhou Xiaopeng Autopilot Technology Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Guangzhou Xiaopeng Autopilot Technology Co Ltd
Application filed by Guangzhou Xiaopeng Autopilot Technology Co Ltd
Priority to CN202210757664.7A
Publication of CN117391157A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to a hardware acceleration circuit, a data processing acceleration method, a chip and an accelerator. The circuit comprises: a natural logarithm module for obtaining the natural logarithm value of the ith of the n data elements of a data set and the natural logarithm value of the mean square error of the n data elements, where n is greater than 1; an addition and subtraction module for obtaining the result of subtracting the natural logarithm value of the mean square error from the natural logarithm value of the ith data element; an exponential function module for obtaining the exponential function value of the subtraction result; and a multiplication module for obtaining the product of the exponential function value and the mask tensor of the ith data element, yielding the specific function value corresponding to the ith data element. The scheme provided by the embodiments of the application helps improve the accuracy of the obtained normalization function value.

Description

Hardware acceleration circuit, data processing acceleration method, chip and accelerator
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a hardware acceleration circuit, a data processing acceleration method, a chip and an accelerator.
Background
Since its birth, artificial intelligence has matured in theory and technology, and its application fields keep expanding. Taking neural-network-based deep learning as an example, the requirements on the quantity and quality of training data are high. To improve the training effect of a neural network, factors in the training data that are adverse to training can be eliminated; for example, the raw data can be processed with data normalization.
Data normalization is widely applied in deep learning and elsewhere. In the related art, the function values of a normalization function may be calculated by a general-purpose computing unit such as a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU). However, when the processing of the neural network is performed by a hardware circuit such as a Deep Learning Accelerator (DLA) or a Neural Network Processing Unit (NPU), and a normalization function layer (such as a layer normalization layer) sits in an intermediate layer of the network, job migration overhead between the DLA/NPU and the CPU/GPU is incurred; the scheme of determining the normalization function value with the CPU/GPU is therefore inefficient, increasing system bandwidth and power consumption.
Disclosure of Invention
To solve, or at least partially solve, the problems in the related art, the application provides a hardware acceleration circuit, a data processing acceleration method, a chip and an accelerator, which help increase the speed of obtaining layer normalization function values while meeting the precision requirements on the layer normalization function.
A first aspect of the present application provides a hardware acceleration circuit, comprising:
a natural logarithm module for obtaining the natural logarithm value of the ith of the n data elements of a data set and the natural logarithm value of the mean square error of the n data elements, where n is greater than 1;
an addition and subtraction module for obtaining the result of subtracting the natural logarithm value of the mean square error from the natural logarithm value of the ith data element;
an exponential function module for obtaining the exponential function value of the subtraction result;
and a multiplication module for obtaining the product of the exponential function value and the mask tensor of the ith data element, so as to obtain the specific function value corresponding to the ith data element.
A second aspect of the present application provides an artificial intelligence chip comprising a hardware acceleration circuit as described above.
A third aspect of the present application provides a data processing acceleration method, where the method includes:
obtaining the natural logarithm value of the ith of the n data elements of a data set and the natural logarithm value of the mean square error of the n data elements, where n is greater than 1;
obtaining the result of subtracting the natural logarithm value of the mean square error from the natural logarithm value of the ith data element;
obtaining the exponential function value of the subtraction result;
obtaining the product of the exponential function value and the mask tensor of the ith data element, so as to obtain the specific function value corresponding to the ith data element.
A fourth aspect of the present application provides an artificial intelligence accelerator comprising:
A processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.
A fifth aspect of the present application provides a computer readable storage medium having executable code stored thereon, which when executed by a processor of an electronic device, causes the processor to perform a method as above.
In some embodiments of the present application, the normalization function is transformed so that the variance value corresponding to at least some elements of the input data set moves from the denominator to the numerator, reducing the accuracy loss that may occur when the reciprocal approaches 0 because the summation involved in the variance calculation is large. The accuracy of the obtained layer normalization function value is thereby improved while the processing speed is preserved.
Furthermore, the natural logarithm values and the exponential function values are obtained by table lookup, avoiding complex exponential, natural logarithm and reciprocal operations; this increases the data processing speed of the layer normalization computation, so the layer normalization function value is obtained more quickly. It also reduces the area and cost of the hardware circuits that would otherwise implement the exponential, natural logarithm and reciprocal operations.
Further, the precision of the Layer Norm function is increased compared with the 8-bit DLA or other hardware architectures of the related art.
Further, based on the above transformation of the normalization function, when at least one of the exponential function value, the natural logarithm value and the like is obtained using a lookup table (LUT), a lookup table with fewer entries can be used, which is more acceptable for a hardware implementation.
Further, LUT combinational logic circuits are too expensive for 16-bit integers (INT16, 2^16 = 65536 entries), and if a time-sharing circuit is used instead, performance is unacceptable, since completing a single LUT result takes 65536 cycles. According to the embodiments of the application, INT8, INT10 or the like is used after data conversion, which greatly reduces the number of lookup cycles needed to obtain a single LUT query result and improves the processing speed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
FIG. 1 is a schematic diagram of layer normalization as shown in an embodiment of the present application;
FIG. 2 is a schematic diagram of a neural network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a neural network for classification according to an embodiment of the present application;
FIG. 4 is a block diagram of the hardware acceleration circuit of an embodiment of the present application;
FIG. 5 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application;
FIG. 6 is a block diagram of the basic look-up table circuit unit of an embodiment of the present application;
FIGS. 7-9 are block diagrams of hardware acceleration circuits according to further embodiments of the present application;
FIG. 10 is a flow chart of a data processing acceleration method according to an embodiment of the present application;
FIG. 11 is a flow chart of a data processing acceleration method according to another embodiment of the present application;
FIG. 12 is a block diagram of an artificial intelligence accelerator according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Data may be normalized during the use or training of an artificial intelligence model. Normalization refers not only to processing the data before it is input into the model; after the data enters the model, the output data of each layer in the model can also be normalized. For example, certain tools make it possible to observe whether the output results and weight distribution of each layer in the network model are within an allowable range, and after normalization the output of each layer is generally held within a certain range.
For the normalization of bounded data, the boundary of the data is fixed and determined by the number of data and their minimum and maximum values. From the original data and its count, minimum and maximum, the data can be scaled into a certain range by a certain proportion. For example, image data in [0, 255] can be processed into [-1, 1] to meet the criteria of 0 as the mean and 1 as the standard deviation, i.e., a fixed scaling of bounded data is performed.
When a neural network is trained with the gradient descent method, the feature distribution of the training data keeps changing as the depth of the network increases. To keep the feature distribution stable, normalization layers can be added, which allows a larger learning rate and a faster convergence rate. Normalization also resists overfitting to a certain extent, making the training process more stable.
By normalizing the data, at least some of the effects shown below can be obtained.
On the one hand, normalization can unify dimensions, smooth gradients between different layers or different batches, and prevent gradient explosion or gradient dispersion in the model. After data normalization, since the values of the original data are scaled into a smaller range, the gradient gaps between them shrink as a whole, which smooths the gradients and prevents the data of some layer in the model from oscillating heavily. Excessive oscillation of the data increases the changes in the derivatives of the parameters to be trained, whereas for derivatives, the smoother the better. Therefore, when the model back-propagates and differentiates, its loss also tends to be stable, and gradient explosion or gradient vanishing is less likely to occur.
On the other hand, normalization eliminates the negative influence of singular data (outliers) on model training and accelerates convergence. Singular data are data with large variance in the original data, which lie far from the other samples on the number axis. In the gradient descent method used in deep learning, the average error is obtained by averaging the loss between the model output and the labels; therefore, when back-propagating with gradient descent, the parameters are updated by the mean error.
In another respect, normalization eliminates the negative effect of noisy data on the model and prevents overfitting. Data normalization reduces the influence of noisy data on the model output; normalization removes the dimensional interference in the data, and wavelet denoising removes random noise. The gradient between a noise point and the surrounding data is often large, and data normalization shrinks that gradient, which helps suppress the noise.
For example, the data may be normalized in the following ways: mean-variance normalization, maximum normalization, absolute-maximum normalization, max-min normalization, norm normalization, quartile normalization, etc. Mean-variance normalization scales the data to the standard normal distribution N(0, 1), with a mean of 0 and a standard deviation of 1, and is a widely applied normalization method.
Normalization can be classified into: layer normalization (Layer Normalization, Layer Norm or LN for short), instance normalization (Instance Normalization, IN for short), group normalization (Group Normalization, GN for short), batch normalization (Batch Normalization, BN for short), and the like.
Fig. 1 is a schematic diagram of layer normalization as shown in an embodiment of the present application.
Referring to fig. 1, layer normalization (Layer Norm) is illustrated as an example. Layer normalization normalizes each channel (C) and the spatial dimensions (H, W) independently; it is not affected by the batch size and can be applied to recurrent neural network (RNN) models and the like. For example, LN function values may be obtained through job migration between the DLA/NPU and the CPU/GPU, implemented by invoking torch.nn.LayerNorm([out_channels, H, W]) in PyTorch, but CPU/GPU efficiency is not high, and job migration may reduce performance, increase system bandwidth and raise power consumption.
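For reference, a minimal PyTorch sketch of this call; the tensor shape below is illustrative, not taken from the patent:

```python
import torch

x = torch.randn(4, 8, 16, 16)                        # hypothetical (N, C, H, W) feature map
layer_norm = torch.nn.LayerNorm(list(x.shape[1:]))   # normalized_shape = [C, H, W]
y = layer_norm(x)

# each sample is now approximately zero-mean / unit-variance over (C, H, W)
print(y[0].mean().item(), y[0].var(unbiased=False).item())
```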
Hardware solutions for Layer Norm in Deep Learning Accelerators (DLAs) or Neural Processing Units (NPUs) are not common. INT8 is the most widely used precision in DLA/NPU, but because of the wide dynamic range of the intermediate values, it is difficult to provide an 8-bit hardware solution for Layer Norm. Specifically, layer normalization normalizes all neurons of a layer. The layer normalization function is shown in formulas (1) to (3).
y(x)_i = x_i / √var  (1)

x_i = x'_i − mean  (2)

mean = (∑_n x'_i) / n  (3)

where x'_i is the data to be normalized, y(x)_i is the normalized data, var is the variance, and √var is the mean square error. Note that √var in formula (1) can be replaced by √(var + ε); ε is used to promote numerical stability in case the denominator accidentally goes to zero.
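As a plain floating-point reference, formulas (1) to (3) can be written out as the following Python sketch (the eps variant noted above is included; this models the math only, not the hardware path):

```python
import numpy as np

def layer_norm_reference(x_prime, eps=1e-5):
    mean = x_prime.sum() / x_prime.size     # formula (3)
    x = x_prime - mean                      # formula (2)
    var = (x ** 2).sum() / x.size           # variance of the centered data
    return x / np.sqrt(var + eps)           # formula (1), with sqrt(var + eps)
```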
From formulas (1) to (3), a dedicated hardware pipeline for Layer Norm is not feasible for large-computing-power implementations (area cost) because of the complex square root and reciprocal (1/square root) in the Layer Norm function. Layer Norm is a normalization function applied over the C, H, W dimensions to obtain zero mean and unit variance, which benefits training time and network performance. During inference, the circuit must dynamically calculate the mean/variance of the feature map to accomplish Layer Norm, and the process of normalizing the data may involve complex square root and reciprocal (1/square root) operations.
Dedicated hardware pipelines for the Layer Norm function are thus considered infeasible for large-computing-power implementations; for example, an increase in computing power leads to expensive hardware costs. One approach to obtaining the Layer Norm function in the related art uses a lookup table (LUT) indexed by a 16-bit integer (INT16). However, a 16-bit LUT occupies a large memory space, e.g., 2^16 (i.e., 65536) entries, and requires a large Static Random Access Memory (SRAM)/Dynamic Random Access Memory (DRAM) to store the data, which makes the LUT combinational logic circuit too costly. Moreover, with a 16-bit LUT the processing time of a single lookup is too long: if a time-sharing circuit is used, performance is unacceptable, because completing a single LUT result takes 65536 cycles.
The embodiments of the application provide a hardware acceleration circuit, a chip, a data processing acceleration method and an accelerator that transform the layer normalization function so that at least part of the logic of the transformed function can be realized with several lookup tables occupying little storage space. This makes it possible to determine the layer normalization function value with a balance of power consumption, bandwidth, performance and precision, meeting the needs of the neural network.
For example, for an INT8 DLA/NPU, the multiplication of a multiply-accumulate (MAC) operation is typically INT8-format data times INT8-format data, with the accumulation part being a typical 32-bit convolution or matrix multiplication calculation. Directly computing the reciprocal of an integer value would incur a considerable precision penalty. The application therefore proposes a Layer Norm hardware acceleration method that transforms the Layer Norm function and expresses it in natural-logarithm form, eliminating the complex square root, reciprocal (1/square root) and similar operations; this effectively reduces the hardware cost and shrinks the lookup tables from 65536 entries to 256 or 1024 entries.
Fig. 2 is a schematic structural diagram of a neural network according to an embodiment of the present application.
Referring to fig. 2, the topology of a neural network 100 is shown, including an input layer, a hidden layer, and an output layer. The neural network 100 can receive data elements I_1, I_2 at the input layer, perform calculations or operations, and generate output data O_1, O_2 based on the calculation results.
For example, the neural network 100 may be a deep neural network (Deep Neural Networks, DNN for short) comprising one or more hidden layers. The neural network 100 in fig. 2 includes an input layer L1, two hidden layers L2, L3, and an output layer L4. Among these, DNNs include, but are not limited to, convolutional neural networks (Convolutional Neural Networks, CNN for short), recurrent neural networks (Recurrent Neural Network, RNN for short), and the like.
It should be noted that, the four layers shown in fig. 2 are only for facilitating understanding of the technical solution of the present application, and are not to be construed as limiting the present application. For example, the neural network may include more or fewer hidden layers. A normalization layer may be provided before each layer shown in fig. 2, for implementing normalization processing of input data for that layer.
Nodes of different layers of the neural network 100 may be connected to each other for data transmission. For example, one node may receive data from other nodes to perform calculations on the received data and output the calculation results to nodes of other layers.
Each node may determine its output data based on the output data received from the nodes of the previous layer and the corresponding weights. For example, in fig. 2, let w denote the weight between the first node of the first layer and the first node of the second layer, let a denote the output data of the first node of the first layer, and let b denote the bias value of the first node in the second layer; the output data of the first node of the second layer may then be represented as w * a + b (summed over all nodes of the first layer feeding it). The output data of the other nodes is calculated in a similar manner and will not be described in detail here.
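As a toy numeric check of this node computation (the weights, inputs and biases below are invented, and activation functions are omitted to match the affine form above):

```python
import numpy as np

a1 = np.array([0.5, -1.2])            # outputs of the two input-layer nodes
W2 = np.array([[0.3, 0.7],            # W2[j, k]: weight from node k of layer 1
               [-0.4, 0.1],           # to node j of layer 2
               [0.9, -0.6]])
b2 = np.array([0.1, 0.0, -0.2])       # biases of the three layer-2 nodes
a2 = W2 @ a1 + b2                     # each entry is sum_k(w_jk * a_k) + b_j
```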
In some embodiments, a normalization layer, such as an LN layer, is configured in the neural network, where the LN layer may normalize input data of a hidden layer corresponding to the LN layer, and so on.
In some embodiments, an activation function layer, such as a softmax (Soft Max) function layer, is configured in the neural network; it can convert the resulting value for each class into a probability value.
In some embodiments, the neural network is configured with a loss function layer after the softmax function layer; the loss function layer can calculate the loss as an objective function for training or learning.
It can be understood that the neural network can respond to the data to be processed, and the recognition result is obtained after the data to be processed is processed; the data to be processed may include, for example, at least one of voice data, text data, and image data.
One typical type of neural network is a neural network for classification. The neural network for classification may determine the class to which the data element belongs by calculating the probability that the data element corresponds to each class.
Fig. 3 is a schematic structural view of a neural network for classification according to an embodiment of the present application.
Referring to fig. 3, the neural network 200 for classification of the present embodiment may include a hidden layer 210, a fully connected layer (Fully Connected Layer, abbreviated FC layer) 220, a softmax function layer 230, and a loss function layer 240. At least some of the hidden layer 210, the fully connected layer 220, etc. may each be preceded by a normalization layer.
As shown in fig. 3, the neural network 200 responds to data to be classified by first normalizing it in the normalization layer and then computing in the order of the hidden layer 210 and the FC layer 220; the FC layer 220 outputs a calculation result s corresponding to the classification probabilities of the data element. The FC layer 220 may include a plurality of nodes respectively corresponding to a plurality of classes, each node outputting a result value corresponding to the probability that the data element falls into the corresponding class. For example, referring back to fig. 2, the FC layer 220 corresponds to the output layer L4 in fig. 2 and has two nodes corresponding to two classes: the output value of one node may be a result value indicating the probability that the data element belongs to the first class, and the output value of the other node a result value indicating the probability that it belongs to the second class. The FC layer 220 outputs the calculation result s to the softmax function layer 230, and the softmax function layer 230 converts the result s into a probability value y; it may also normalize the probability value y.
The softmax function layer 230 outputs the probability value y to the loss function layer 240, and the loss function layer 240 may calculate the cross-entropy loss L of the result s based on the probability value y.
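For illustration, a small sketch of what layers 230 and 240 compute; the class scores are invented, and the max-subtraction is a common numerical-stability trick rather than something the patent specifies:

```python
import numpy as np

s = np.array([2.0, 0.5])                              # FC-layer result values, one per class
y = np.exp(s - s.max()) / np.exp(s - s.max()).sum()   # softmax probabilities
loss = -np.log(y[0])                                  # cross-entropy loss L for true class 0
```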
During the back-propagation learning process, the softmax function layer 230 calculates the gradient of the cross-entropy loss L with respect to the result s; the FC layer 220 then performs the learning processing based on that gradient of the cross-entropy loss L. For example, the weights of the FC layer 220 may be updated according to a gradient descent algorithm. Further, the subsequent learning process may be performed in the hidden layer 210.
The neural network 200 may be implemented in software, in hardware circuitry, or in a combination of software and hardware. For example, in the case of a hardware circuit implementation, the normalization layer, the hidden layer 210, the FC layer 220, the softmax function layer 230 and the loss function layer 240 are all implemented by hardware circuits, which may be integrated in one artificial intelligence chip or distributed among multiple chips. Compared with realizing the normalization layer on the CPU/GPU, this configuration avoids data migration between the other layers of the neural network and processors such as the CPU/GPU, which improves the data processing efficiency of the neural network, reduces data processing delay and power consumption, and prevents the occupied bandwidth from increasing.
The following describes the technical scheme of the embodiments of the present application in detail with reference to the accompanying drawings.
Fig. 4 is a block diagram of a hardware acceleration circuit according to an embodiment of the present application. In this application, the hardware acceleration circuit may be used, for example but not limited to, for implementing the normalization layer in the neural network 200 described above, and may be, for example but not limited to, a circuit component in a CPLD (Complex Programmable Logic Device) chip, an FPGA (Field Programmable Gate Array) chip, a dedicated chip, or the like.
For ease of understanding, the layer normalization function is described below. Assume an array X whose ith element is x_i; the Layer Norm function value of x_i can be calculated as shown in formulas (1) to (3). Since the relationships shown in formulas (4) and (5) hold:

|x_i| = e^(ln |x_i|)  (4)

√var = e^(ln √var)  (5)

formula (6) can be obtained from formula (1), formula (4) and formula (5):

y(x)_i = e^(ln |x_i| − ln √var) × mask_i  (6)

where y(x)_i represents the Layer Norm function value of the ith element x_i, e is the natural constant, ln represents the logarithm based on the natural number e (the natural logarithm), x_i represents the ith element of the array X, var represents the variance, and mask_i represents a mask tensor characterizing the sign of the ith element x_i.
Referring to fig. 4, the hardware acceleration circuit of the present embodiment includes a natural logarithm module 310, an addition and subtraction module 320, and an exponential function module 330.
The natural logarithm module 310 is configured to obtain the natural logarithm value of the ith data element x_i of the n data elements of the data set and the natural logarithm value of the mean square error of the n data elements. The natural logarithm values can be calculated by a dedicated hardware circuit or determined by table lookup.
The add-subtract module 320 is configured to obtain the result of subtracting the natural logarithm value of the mean square error from the natural logarithm value of the ith data element x_i. It should be noted that in this application, the add-subtract module 320 covers three cases: including an adder without a subtractor, including a subtractor without an adder, and including both an adder and a subtractor.
The exponential function module 330 is configured to obtain an exponential function value of the subtraction result.
The multiplication module 340 is configured to obtain the product of the exponential function value and the mask tensor of the ith data element, so as to obtain the specific function value corresponding to the ith data element x_i.
The mask tensor represents the sign of the ith data element x_i, i.e., whether it is positive or negative. For example, if the ith data element is -0.5, the mask tensor represents "-"; if it is 0.5, the mask tensor represents "+". The mask tensor mask_i of the ith data element x_i can be represented by formula (7):

mask_i = 1 if x_i ≥ 0, and mask_i = −1 if x_i < 0  (7)
In this embodiment, the specific function may be a layer normalization function, which can be expressed using an exponential function and a natural logarithm. Specific functions include, but are not limited to: the Layer Norm function, the instance normalization function (Instance Norm), the group normalization function (Group Norm), and the like.
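A floating-point sketch of formulas (6) and (7) in Python; the small guard added to |x_i| is an implementation detail of the sketch, not of the circuit:

```python
import numpy as np

def layer_norm_log_domain(x, var):
    mask = np.where(x >= 0, 1.0, -1.0)     # formula (7): the sign of x_i
    ln_x = np.log(np.abs(x) + 1e-12)       # ln|x_i|, guarded against ln(0)
    ln_std = 0.5 * np.log(var)             # ln(sqrt(var)); the hardware looks this up from var
    return np.exp(ln_x - ln_std) * mask    # formula (6)
```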
In some embodiments, the data element x_i can be obtained by the acceleration circuit processing the initial data x'_i, for example by performing the processing of formula (2) above. The initial data is, for example, input data of the neural network or data output by some intermediate layer of the neural network. Alternatively, the data element x_i may be data the hardware acceleration circuit obtains from, e.g., a CPU or the like.

The case where the data element x_i is obtained by the acceleration circuit processing the initial data x'_i is described as an example. Specifically, the add-subtract module is further configured to subtract the average value of the n initial data from the ith initial data x'_i of the n initial data of the initial data set B, thereby obtaining the data set A containing the n data elements x_i.
In some embodiments, the variance of the n data elements is obtained first, and the natural logarithm value of their mean square error is then obtained from the variance. The variance of the ith data element may be calculated by the acceleration circuit, or obtained externally by the acceleration circuit, e.g., provided by the CPU.

For example, the data set A includes n data elements, with 0 ≤ i ≤ n−1. The variance of the ith data element may be calculated by the acceleration circuit in the manner shown below.
Specifically, the multiplication module is further configured to obtain the product of the ith data element with itself, x_i², i.e., the square of the ith data element.

The addition and subtraction module is further configured to accumulate the squares of the n data elements into the sum ∑_n x_i².

The hardware acceleration circuit further includes a shift module for performing a right-shift operation on the sum of the squares of the n data elements to obtain the variance var (per the formula var = ∑_n x_i² / n = (∑_n x_i² × mul) >> rshift), from which the mean square error √var of the ith data element is obtained. Here n is determined from the correction data and set by software, and mul is a scale factor. For example, the variance var is calculated by accumulating the squares x_i² of the individual data elements, then saturated to INT8 through the scale factor mul and the right-shift count rshift, yielding the mean square error √var as an integer with a bit width of 8 bits.
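A behavioral model of this variance path, with mul and rshift chosen by software so that mul / 2^rshift approximates 1/n; the concrete numbers in the comment are examples, not values from the patent:

```python
import numpy as np

def saturate_int8(v):
    return int(np.clip(v, -128, 127))      # the circuit saturates results to INT8

def variance_fixed_point(x, mul, rshift):
    acc = int(np.sum(np.asarray(x, dtype=np.int64) ** 2))  # accumulate x_i^2
    return saturate_int8((acc * mul) >> rshift)            # (sum * mul) >> rshift ~ sum / n

# e.g. for n = 50, mul = 41 and rshift = 11 give 41/2048, roughly 1/50
```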
In some embodiments, the average value mean of the n initial data may be obtained by the add-subtract module 320. Alternatively, it may be data that the hardware acceleration circuit obtains from, e.g., a CPU or the like.

The case where the average value mean of the n initial data is obtained by the add-subtract module 320 is described as an example.
The add-subtract module 320 is further configured to accumulate the n initial data x'_i to obtain their sum. The shift module is further configured to perform a right-shift operation on that sum, thereby obtaining the average value mean of the n initial data. For example, all x'_i are accumulated and saturated to INT8 through the scale factor mul and the right-shift count rshift, yielding an average value as an integer with a bit width of 8 bits, which can be expressed as:

mean = (∑_n x'_i) / n = (∑_n x'_i × mul) >> rshift
It will be appreciated that the scale factor mul and the right-shift count rshift of the two shift processes described above may differ according to actual needs, and that the two shift processes may be implemented by the same shift circuit or by two independently provided shift circuits.
In some embodiments, the hardware acceleration circuit may also perform format conversion on the initial data x'_i to reduce the bandwidth required for normalizing the data. For example, reducing the number of bits occupied by the initial data x'_i reduces the bandwidth occupied.
Specifically, the hardware acceleration circuit further comprises a first conversion circuit for obtaining the n initial data x'_i of the initial data set, converting each of them from data with a first bit width (N2) into data with a second bit width (N0), and outputting the result to the addition and subtraction module. For example, the initial data may be converted from an N2-bit (e.g., 32-bit) floating-point number into an 8-bit integer, and the 8-bit integer is used as the ith initial data x'_i.
By transforming the normalization function, the hardware acceleration circuit of this embodiment effectively eliminates the square-root function, the reciprocal function and the like, which helps reduce the complexity of the hardware acceleration circuit. In addition, eliminating the denominator improves the operation precision of the normalization function.
In some embodiments, the natural logarithm values and the exponential function value in formula (6) may be obtained by lookup tables to increase the data processing speed. For example, ln |x_i|, ln √var and e^z can be obtained by table lookup, where z = ln |x_i| − ln √var.
For example, expression (8) can be obtained by converting expression (6):

y(x)_i = LUT_2(LUT_0(x_i) − LUT_1(var)) × mask_i  (8)

where LUT_0 represents the second lookup table, LUT_1 represents the third lookup table, and LUT_2 represents the first lookup table.
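A behavioral sketch of expression (8); lut0, lut1 and lut2 stand for the integer tables described below, idx() for the index-extraction step of the conversion circuit, and all names are illustrative:

```python
def specific_function_value(x_i, var, lut0, lut1, lut2, idx):
    mask = 1 if x_i >= 0 else -1            # formula (7)
    ln_x = lut0[idx(abs(x_i))]              # second lookup table: ln|x_i|
    ln_std = lut1[idx(var)]                 # third lookup table: ln(sqrt(var)), indexed by var
    return lut2[idx(ln_x - ln_std)] * mask  # first lookup table: e^z, then apply the sign
```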
Specifically, the exponential function module includes a first lookup table module for outputting, based on the first lookup table, the exponential function value of the subtraction result, i.e., the value of e^z.

Further, the natural logarithm module may include some or all of the second lookup table module and the third lookup table module. The second lookup table module outputs, based on the second lookup table, the natural logarithm value corresponding to the ith data element, i.e., the value of ln |x_i|. The third lookup table module outputs, based on the variance of the n data elements and the third lookup table, the natural logarithm value of the mean square error of the n data elements, i.e., the value of ln √var. It should be noted that the third lookup table establishes the mapping between var and ln √var, not the mapping between √var and ln √var, so no square-root operation is needed, further reducing the complexity of the hardware acceleration circuit.
In this embodiment, referring to formula (1), the reciprocal of the accumulated squares of the data elements, (∑_n x_i²)/n, and the subsequent multiplication are transformed away, which avoids the precision loss that may arise when the reciprocal approaches 0 and helps improve the precision of the obtained layer normalization function value.
Furthermore, the square-root values, natural logarithm values and exponential function values are all obtained by table lookup, which avoids complex square-root, natural logarithm and reciprocal operations, increases the data processing speed of the layer normalization computation, and yields the Layer Norm function value faster. It also avoids the excessive hardware circuit area and cost of implementing complex square-root, natural logarithm and reciprocal operations.
Fig. 5 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application. Referring to fig. 5, the hardware acceleration circuit of the present embodiment includes: the second lookup table module 410, the third lookup table module 420, the first lookup table module 450, the subtractor 430, the conversion circuit 440, and the multiplication module 340. In this embodiment, the second lookup table module 410, the third lookup table module 420, and the first lookup table module 450 are respectively implemented by independent lookup table circuits, which are also referred to as a second lookup table circuit, a third lookup table circuit, and a first lookup table circuit. It will be appreciated that in other embodiments of the present application, some or all of the look-up table modules may also be implemented as software modules.
The second lookup table circuit is configured to output a natural logarithm value corresponding to an ith data element in the data set based on the second lookup table in response to the index value of the ith data element. The index value of a data element is data with a bit width of N0 bits.
In one embodiment, the index values of successive ith data elements are sequentially input to the second lookup table module 410, and the second lookup table circuit sequentially outputs the natural logarithm values corresponding to those data elements from the second lookup table. Each natural logarithm value in the second lookup table is data with a bit width of N1 bits.
It will be appreciated that the index value of the data may be the data itself or may be obtained by conversion from the data, for example, may be a portion of the data that is truncated from the data.
The second look-up table may be used to implement a mapping between the index values of the data elements and their natural logarithmic values. The second lookup table enables natural logarithmic values of the data elements to be determined through a preset mapping relation without complex function calculation.
Similarly, the third lookup table module 420 is configured to respond to the index value of the variance of the n data elements by outputting, based on the third lookup table, the natural logarithm value of the mean square error of the n data elements. The index value of the variance of the data elements is data with a bit width of N3 bits.

The index value of the variance of the n data elements is input to the third lookup table module 420, and the third lookup table circuit outputs the natural logarithm value of the mean square error of the n data elements from the third lookup table. Each natural logarithm value in the third lookup table is data with a bit width of N4 bits.
It will be appreciated that the index value of the data may be the data itself or may be obtained by conversion from the data, for example, may be a portion of the data that is truncated from the data.
The third lookup table may be used to implement a mapping between the index value of the variance (the square of the mean square error) and the natural logarithm value of the mean square error. It enables the natural logarithm value of the mean square error to be determined through a preset mapping relation without complex function calculation.
The subtractor 430 is configured to output a subtraction result between the output of the second lookup table and the output of the third lookup table.
In one embodiment, the subtractor 430 takes as inputs lookup results with a bit width of N5 bits, such as 10-bit data, and outputs a subtraction result with a bit width of N6 bits (e.g., 32 bits).
The hardware acceleration circuit may further include a conversion circuit 440, where the conversion circuit 440 is configured to convert a subtraction result with a bit width of N6 bits output by the add-subtract module into a corresponding index value in response to a state control signal. The index value output from the conversion circuit 440 is data having a bit width of N7 bits. For example, from the 32-bit subtraction result, an index value of 8 bits is converted.
In one embodiment, the conversion circuit 440 may include a leading-zero-count (LZC) circuit and a shifter. The LZC circuit outputs the number of leading zeros in the subtraction result to the shifter; the number of leading zeros is the number of 0s encountered from the most significant bit of the binary data down to the first 1.

In one specific implementation, the shifter uses the leading-zero count as the shift amount, shifts the subtraction result by that amount, and outputs shifted data with a bit width of N7 bits, i.e., the N7 consecutive bits of the subtraction result taken from the leading 1 toward the lower bits, as the index value of the subtraction result. It will be appreciated that the specific structure of the conversion circuit 440 may depend on the specific data structure of the index value.
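A software model of this LZC-plus-shift step, operating on the magnitude of the subtraction result and assuming N6 = 32 and N7 = 8:

```python
def index_from_subtraction(sub_result, width=32, out_bits=8):
    value = abs(sub_result) & ((1 << width) - 1)          # 32-bit magnitude
    lzc = width - value.bit_length() if value else width  # leading-zero count
    aligned = (value << lzc) & ((1 << width) - 1)         # move the leading 1 to the MSB
    return aligned >> (width - out_bits)                  # keep out_bits bits from the leading 1
```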
In some embodiments, the first lookup table to the third lookup table are respectively stored in different storage areas of the storage module, and the first lookup table circuit to the third lookup table circuit are respectively configured with a basic lookup table circuit unit, so that the lookup operations are completed independently of each other.
In this embodiment of the present application, the Memory module may be, for example, RAM (Random-Access Memory), ROM (Read-Only Memory), FLASH, etc.
In some embodiments, a lookup table for the ln() function may be generated as follows, taking ln(x_i) as an example. Since the natural logarithm rapidly approaches negative infinity (data overflow) when x_i is less than 0.01, and the differences between natural logarithm values are negligible when x_i is greater than 15, the range of x_i is limited to, for example, [0.01, 15]. First, the range of x_i is divided into 256 points. Then ln(x_i) is calculated for each point, and all these results are quantized so as to map them into [-512, 512]. All these quantized values are then filled into the table.
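The same recipe written as a short Python sketch; the sampling grid and the exact scale used to reach [-512, 512] are assumptions, since the text fixes only the range, the 256 points and the quantization target:

```python
import numpy as np

xs = np.linspace(0.01, 15.0, 256)                    # 256 sample points over [0.01, 15]
ln_vals = np.log(xs)                                 # ln(x_i) at each point
scale = 512.0 / np.abs(ln_vals).max()                # map the extreme value onto +/-512
ln_lut = np.round(ln_vals * scale).astype(np.int16)  # quantized entries in [-512, 512]
```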
The first lookup table circuit is configured to respond to the index value corresponding to the output data of the conversion circuit 440 by outputting, based on the first lookup table, the exponential function value corresponding to that output data. The index value is data with a bit width of N7 bits.

The first lookup table circuit sequentially outputs the exponential function values corresponding to the subtraction results from the first lookup table. Each exponential function value in the first lookup table is data with a bit width of N8 bits.
It will be appreciated that the index value of the data may be the data itself or may be obtained by conversion from the data, for example, may be a portion of the data that is truncated from the data.
The first lookup table may be used to implement a mapping between the subtraction result and the exponential function value. The first lookup table enables the exponential function value of the data to be determined through a preset mapping relation without complex function calculation.
The first lookup table circuit is used for responding to the index value of the subtraction result and outputting, based on the first lookup table, the exponential function value corresponding to the subtraction result. The index value of the subtraction result is data with a bit width of N7 bits.
In some embodiments, a lookup table for the exponential function may be generated as follows. The subtraction result between the natural logarithm value of the ith data element and the natural logarithm value of its mean square error is negative or 0, and once the value is sufficiently negative the differences between the exponential function values become negligible; the range of the subtraction result is therefore limited to, for example, [-10, 0]. Since the subtraction result is negative or 0, the value of the base-e exponential function falls within (0, 1].
In one implementation, the index value of a data element input to the second lookup table (LUT_0) is a fixed-point integer with a bit width of 8 (N0) bits, and each natural logarithm value output is data with a bit width of 10 (N1) bits. The index value of the variance of the n data elements input to the third lookup table (LUT_1) is data with a bit width of 8 (N3) bits, and each natural logarithm value output is data with a bit width of 10 (N4) bits. The subtraction result of the natural logarithm values is data with a bit width of 32 (N6) bits, and the index value of the subtraction result is data with a bit width of 8 (N7) bits. The first lookup table (LUT_2) takes data with a bit width of 8 (N7) bits as input and outputs data with a bit width of 8 (N8) bits. The multiplication module 340 takes 8-bit (N9) data as input and outputs 8-bit (N10) data. That is, N0, N3, N7 and N8 to N10 are all 8; N2 (the bit width of the initial data) and N6 are 32; and N1, N4 and N5 (the bit width of the subtractor inputs) are 10. In other words, the second and third lookup tables are 8-bit input/10-bit output, and the first lookup table is 8-bit input/8-bit output.
Therefore, in the embodiments of the application, the value range of each datum processed while obtaining the specific function value (such as the layer normalization function value) can be confined within a certain range, so that the scheme can be realized with narrow-bit-width data and correspondingly simple hardware circuits. For example, when the first and second lookup tables are 8-bit input/8-bit output and the third lookup table is 10-bit input/8-bit output, the memory space occupied by the three lookup tables is at most (2 × 2^8) = 512 entries plus (1 × 2^10) = 1024 entries, 1536 entries in total. Compared with the 65536 entries required by a 16-bit scheme, the size of the hardware lookup table is significantly reduced, and so is the bandwidth consumed. Moreover, within the accuracy tolerance, the lookup speed can be increased, further improving the response speed of the circuit and reducing power consumption. An 8-bit-based hardware circuit solution can effectively balance the important indicators of circuit cost, power consumption, bandwidth, performance and data precision.
It will be appreciated that N0 to N10 may take other values, for example values in the range [1, 32]. In some embodiments, N0, N1, N3, N4, N5 and N7 to N10 may take values among 9, 10, 11 and 12, i.e., values in the range [8, 12], and N0, N1, N3, N4 and N8 to N10 need not be equal.
Referring to fig. 6, in one specific implementation, the basic lookup table circuit unit 20 includes a logic circuit 21, an input terminal set 22, a control terminal set 23, and an output terminal set 24. The input terminal set 22 feeds the data of the lookup table into the logic circuit 21. The logic circuit 21 selects, according to the index value (also referred to as the address) input from the control terminal set 23, the value corresponding to that index value in the lookup table, and outputs it from the output terminal set 24. The logic circuit 21 may be, for example, a logic gate circuit or a logic switch circuit. It is understood that in this application a terminal set refers to a group of connection terminals, covering the cases of one or more terminals. When the control terminal set 23 has A control terminals and the output terminal set 24 has B output terminals, the basic lookup table circuit unit 20 is called an A-input/B-output unit.
The basic lookup table circuit unit 20 may perform lookup output based on a stored lookup table. Taking the second lookup table as an example, the lookup table is likewise A-input/B-output: the index value input at the control terminals has a bit width of A bits, and the output data is a natural logarithm value with a bit width of B bits. It will be appreciated that the second lookup table in the memory module stores only the true values of the natural logarithm values; the basic lookup table circuit unit 20 implements the mapping between the index values and those true values.
For a better understanding of the lookup process of the embodiments of the application, Table 1 below shows one specific example of the first lookup table, which is an N7-bit input/N8-bit output table with N7 and N8 both equal to 8. The input data at the control terminal of the first lookup table is an index value with a bit width of N7 bits, and the output data is an exponential function value with a bit width of N8 bits. For ease of understanding, the data in Table 1 are all presented in decimal. It will be appreciated that the first lookup table in the memory module stores only the true values of the exponential function values, and the lookup table circuit implements the mapping between the index values and those values; the subtraction results and the table integer values are listed in Table 1 only for better understanding of the application.
TABLE 1

Index value | Subtraction result | Exponential function value | Table integer value
0           | 0                  | 1.0                        | 255
1           | -0.0390625         | 0.96169                    | 246
2           | -0.078125          | 0.92485                    | 237
...         | ...                | ...                        | ...
254         | -9.960784          | 0.000047                   | 0
255         | -10                | 0.000045                   | 0
As Table 1 shows, the subtraction result is negative or 0, and its value range is defined as [-10, 0]. For table lookup, the range [-10, 0] is discretized into 256 (i.e., 2^N0) points, each with the exponential function value shown in the "Exponential function value" column; each subtraction result corresponds to an integer in [0, 255] shown in the "Index value" column, and each exponential function value corresponds to data in [0, 255] shown in the "Table integer value" column. The data in the "Exponential function value" column are stored in the first lookup table of the memory module as true values, and the lookup is performed through the index values.
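Table 1 can be regenerated with the sketch below; the even spacing over [-10, 0] and the scale-by-256, saturate-to-255 rounding are inferred from the sample rows, so treat them as assumptions:

```python
import numpy as np

z = np.linspace(0.0, -10.0, 256)        # index 0..255 mapped to subtraction results
exp_lut = np.minimum(np.round(np.exp(z) * 256), 255).astype(np.uint8)
# exp_lut[0] == 255, exp_lut[1] == 246 and exp_lut[255] == 0, matching Table 1
```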
Fig. 7 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application. In this embodiment, the first to third lookup table modules of the lookup table circuit 30 share one basic lookup table circuit unit 20, and the subtractor and the adder share one adder circuit 402.
Referring to fig. 6 and 7, the hardware acceleration circuit of the present embodiment includes a lookup table circuit 30, an add-subtract module 400, a shift circuit 600, a conversion circuit 700, a multiplication module, and a storage module 10. In this embodiment, the multiplication module includes a first multiplier 390 and a second multiplier 500 that are independently disposed, and it can be appreciated that in other embodiments, the functions of the first multiplier 390 and the second multiplier 500 can be implemented by time-division multiplexing of the same multiplier.
The lookup table circuit 30 comprises a basic lookup table circuit unit 20, which includes a logic circuit 21, an input terminal set 22, a control terminal set 23, and an output terminal set 24. The input terminal set 22 is connected to the memory module 10, and the logic circuit 21 is configured to: in a first period, respond to the index value of the ith data element input from the control terminal set 23 by outputting, based on the second lookup table, the natural logarithm value corresponding to the ith data element from the output terminal set 24. In a second period after the first period, respond to the index value of the variance of the n data elements input from the control terminal set 23 by outputting, based on the third lookup table, the natural logarithm value of the mean square error of the n data elements from the output terminal set 24. In a third period after the second period, respond to the index value of the subtraction result between the above two natural logarithm values input from the control terminal set 23 by outputting, based on the first lookup table, the exponential function value corresponding to the subtraction result from the output terminal set 24.
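A behavioral view of this time-shared operation, one lookup per period; the function and table names are illustrative:

```python
def shared_lut_pipeline(x_index, var_index, make_sub_index, lut0, lut1, lut2):
    ln_x = lut0[x_index]                     # first period: second lookup table
    ln_std = lut1[var_index]                 # second period: third lookup table
    z_index = make_sub_index(ln_x - ln_std)  # add-subtract module plus conversion circuit
    return lut2[z_index]                     # third period: first lookup table
```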
In one embodiment, the storage module 10 includes a first storage area, and the first to third lookup tables are stored in the first storage area in a time-sharing manner. Because a single storage area holds whichever of the three lookup tables is needed at the time, the storage space occupied by the lookup tables, and hence the hardware cost, is effectively reduced.
In another embodiment, the storage module 10 includes a first storage area to a third storage area, and the first lookup table to the third lookup table are respectively stored in one storage area of the three storage areas.
In a specific implementation, the basic lookup table circuit unit 20 further includes a state control terminal group for configuring the basic lookup table circuit unit 20 into an M1-bit-input, M2-bit-output state in response to a first state control signal during part of the first to third periods, and into an M3-bit-input, M4-bit-output state in response to a second state control signal during another part of the periods. At least one of the pairs (M1, M3) and (M2, M4) is unequal, i.e., M1 is not equal to M3 and/or M2 is not equal to M4. This scheme is better suited to cases where the input/output data bit widths of the first to third lookup tables are not exactly the same.
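To picture the two-state behavior, here is a small Python model of a lookup table unit that is reconfigured by a state control signal; the class name, method names, and example widths are illustrative assumptions, not taken from the patent.

```python
class BasicLutUnit:
    """Illustrative model of a basic lookup table circuit unit that can
    be reconfigured between different input/output bit-width states."""

    def __init__(self) -> None:
        self.in_bits, self.out_bits, self.table = 8, 8, [0] * 256

    def configure(self, in_bits: int, out_bits: int, table: list) -> None:
        # e.g. 8-in/10-out for the ln tables, 8-in/8-out for the exp table
        assert len(table) == 1 << in_bits
        self.in_bits, self.out_bits, self.table = in_bits, out_bits, table

    def lookup(self, index: int) -> int:
        mask = (1 << self.in_bits) - 1   # honor the configured input width
        return self.table[index & mask]
```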
In another embodiment, the basic lookup table circuit unit 20 may be fixed to an M1 bit input M2 bit output state, which is better suited for the case where the input/output data bit widths of the first to third lookup tables are the same.
It will be appreciated that in this embodiment, the lookup table circuit 30 further includes a first selector 40 and a second selector 50. The first selector 40 is configured to selectively output, to the control terminal group 23 of the basic lookup table circuit unit 20, the index value of the i-th data element, the index value of the variance of the n data elements, and the index value of the subtraction result between the natural logarithm values, which are input through different data input channels. The second selector 50 is configured to output the different lookup data output from the output terminal group 24 of the basic lookup table circuit unit 20 to the corresponding data output channels.
The addition and subtraction module 400 may be configured to obtain the i-th data element, and to obtain, before the third period, the subtraction result between the two natural logarithm values.
In some embodiments, the addition and subtraction module includes an adder for obtaining a first addition result and a subtractor for obtaining a subtraction result between the natural logarithmic value of the i-th data element and the natural logarithmic value of the mean square error of the n data elements. Wherein the adder and the subtracter are configured independently of each other, or the adder and the subtracter share the same addition unit.
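The shared-adder variant exploits the identity a - b = a + (-b): one addition unit plus a negation stage covers both operations. A one-function sketch, with an illustrative name:

```python
def shared_add_unit(a: int, b: int, subtract: bool = False) -> int:
    # One addition circuit serves both operations: when subtracting,
    # an inverter stage negates b before it reaches the adder.
    return a + (-b if subtract else b)
```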
The second multiplier 500 is used for obtaining the square value of the data element x_i.
In some embodiments, to facilitate calculation of the mean square error and/or the average value, the hardware acceleration circuit may further include a third selector 60, which is used either to output the initial data to the addition and subtraction module 400 before the first period for accumulation to obtain the average value mean, or to output the natural logarithm value ln x_i of the i-th data element, obtained by table lookup, to the addition and subtraction module for obtaining the subtraction result between the natural logarithm value of the i-th data element and the natural logarithm value of the mean square error of the n data elements.
In one embodiment, the add-subtract module 400 includes an addition circuit 402, a fourth selector 404, and an inverter circuit 406. The hardware acceleration circuit may further include a fifth selector 80, which is used either to input the average value mean of the initial data output by the shift circuit 600 into the inverter circuit 406 for obtaining each data element, or to input the natural logarithm value of the mean square error of the n data elements, obtained by table lookup, into the inverter circuit 406 for obtaining the subtraction result between the natural logarithm value of the i-th data element and the natural logarithm value of the mean square error.
The inverter circuit 406 is configured to output the negative of the average value mean of the initial data from the shift circuit 600, or the negative of the natural logarithm value of the mean square error of the n data elements.
The fourth selector 404 is used either to sequentially input the square values of the data elements x_i output from the second multiplier 500 into the addition circuit 402 to obtain the sum of squares of the data elements, or to input the output of the inverter circuit 406, namely the negative of the average value mean of the initial data or the negative of the natural logarithm value of the mean square error, into the addition circuit 402.
In one embodiment, referring to FIG. 7, the addition circuit 402 accumulates the initial data x'_i to obtain the sum of the n initial data, which is then processed by the shift circuit 600 to obtain the average value mean. The inverter circuit 406 outputs the negative of mean to the addition circuit 402, and the addition circuit 402 outputs the sum of the i-th initial data x'_i and the negative of mean, i.e., the i-th data element x_i. The i-th data element x_i is output to the second multiplier 500 and the lookup table circuit 30. The lookup table circuit 30 outputs the natural logarithm value of the i-th data element x_i through the second lookup table and inputs it to the third selector 60. Meanwhile, the second multiplier 500 squares the i-th data element x_i and outputs the square value; the addition circuit 402 accumulates the square values of the data elements to obtain the squared-accumulation result (the sum of x_i^2 over the n data elements), which is then processed by the shift circuit 600 to obtain (sum of x_i^2)/n, and this is taken as the variance var.
In some embodiments, the add-subtract module 400 and the second multiplier 500 together form a multiply-add module whose output is N6-bit (e.g., 32-bit) data. The conversion circuit 700 converts the variance var from N6-bit data to N7-bit (e.g., 8-bit or 10-bit) data and outputs it to the lookup table circuit 30, and the lookup table circuit 30 outputs the natural logarithm value of the mean square error of the n data elements through the third lookup table. The natural logarithm value of the i-th data element x_i is output to the addition circuit 402 via the third selector 60, while the natural logarithm value of the mean square error of the n data elements is output to the inverter circuit 406 via the fifth selector 80. The inverter circuit 406 outputs the negative of this natural logarithm value, which reaches the addition circuit 402 via the fourth selector 404. The addition circuit 402 then outputs the sum of the natural logarithm value of the i-th data element x_i and the negative of the natural logarithm value of the mean square error, i.e., the subtraction result between the natural logarithm value of the i-th data element x_i and the natural logarithm value of the mean square error. The add-subtract module 400 outputs the subtraction result to the lookup table circuit 30, and the lookup table circuit 30 outputs the exponential function value corresponding to the subtraction result through the first lookup table.
The first multiplier 390 multiplies the exponential function value by the mask tensor of the i-th data element to obtain the normalized function value.
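Putting the Fig. 7 dataflow together, the following Python sketch models one pass over n = 2^shift data elements, with exact math standing in for the three tables. The lambda functions, the use of abs() before the ln lookup, and the reading of the mask tensor as restoring the sign lost by taking ln|x_i| are assumptions for illustration; real hardware would index quantized 8/10-bit tables instead.

```python
import math

def normalized_values(x_init, lut2, lut3, lut1, shift):
    # Software model of the Fig. 7 dataflow for n = 2**shift inputs.
    mean = sum(x_init) >> shift                 # accumulate + right shift
    x = [xi - mean for xi in x_init]            # addition circuit + inverter
    var = sum(v * v for v in x) >> shift        # square-accumulate + shift
    ln_std = lut3(var)                          # ln of the mean square error
    out = []
    for xi in x:
        sign = 1 if xi >= 0 else -1             # mask tensor carries the sign
        diff = lut2(abs(xi)) - ln_std           # subtraction via shared adder
        out.append(sign * lut1(diff))           # exp(diff) = |x_i| / std
    return out

# Exact functions in place of the quantized tables:
vals = normalized_values(
    [12, 4, 8, 16],
    lambda v: math.log(max(v, 1e-9)),           # second table: ln(x)
    lambda v: 0.5 * math.log(max(v, 1e-9)),     # third table: ln(sqrt(var))
    math.exp,                                   # first table: exp(x)
    shift=2)
print(vals)  # approx. [0.447, -1.342, -0.447, 1.342], i.e. x_i / sqrt(var)
```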
In this embodiment, by multiplexing one basic lookup table circuit unit, only a single unit needs to be configured to serve the lookup requirements of all three lookup table modules, which effectively reduces the area and cost of the hardware acceleration circuit. The subtractor and the adder sharing one addition circuit further reduces the circuit area. Switching the conversion circuit between different states can match the different states of the basic lookup table circuit unit, which makes it convenient to implement lookup tables with different data bit widths and improves the flexibility and applicability of the hardware acceleration circuit.
Fig. 8 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application. The hardware acceleration circuit of this embodiment includes: a first lookup table module, a second lookup table module, a third lookup table module, an adder 420, a subtractor 430, and a conversion circuit 440.
Referring to fig. 8, the present embodiment differs from the hardware acceleration circuit of fig. 5 in that here the second lookup table module and the third lookup table module share the first basic lookup table circuit unit 20A, while the first lookup table module is configured with the second basic lookup table circuit unit 20B. The first basic lookup table circuit unit 20A is M1-bit input, M2-bit output, and the second basic lookup table circuit unit 20B is M3-bit input, M4-bit output, where at least one of the pairs (M1, M3) and (M2, M4) is unequal.
In this embodiment, the second lookup table and the third lookup table are both M1-bit input, M2-bit output, and the first lookup table is M3-bit input, M4-bit output. In one specific example, the second and third lookup tables are 8-bit input, 10-bit output, and the first lookup table is 8-bit input, 8-bit output.
In one embodiment, the second lookup table and the third lookup table are stored in the first storage area of the storage module in a time-sharing manner, and the first lookup table is stored in the second storage area of the storage module.
The first basic lookup table circuit unit 20A includes a first input terminal group, a first control terminal group, a first output terminal group, and a first logic circuit, where the first input terminal group is connected to the first storage area of the storage module. The first logic circuit is configured to: in a first period, in response to the index value of the i-th data element input from the first control terminal group, output the natural logarithm value corresponding to the i-th data element from the first output terminal group based on the second lookup table; and in a second period after the first period, in response to the index value of the variance of the n data elements input from the first control terminal group, output the natural logarithm value of the mean square error of the n data elements from the first output terminal group based on the third lookup table.
The second basic lookup table circuit unit 20B includes a second input terminal group, a second control terminal group, a second output terminal group, and a second logic circuit, where the second input terminal group is connected to the second storage area of the storage module. The second logic circuit is configured to: in a third period after the second period, in response to the index value of the subtraction result between the two natural logarithm values input from the second control terminal group, output the exponential function value corresponding to the subtraction result from the second output terminal group based on the first lookup table.
It will be appreciated that in this embodiment, the hardware acceleration circuit further includes a sixth selector 70. The sixth selector 70 is configured to selectively output, to the first control terminal group of the first basic lookup table circuit unit 20A, the index values of the data elements and the index value of the variance of the n data elements, which are input via different data input channels. For example, the second lookup table and the third lookup table are both 8-bit input, 10-bit output and are implemented based on the first basic lookup table circuit unit 20A, while the first lookup table is 8-bit input, 8-bit output and may therefore be implemented based on a second basic lookup table circuit unit 20B distinct from the first basic lookup table circuit unit 20A.
In this embodiment, lookup table modules having the same data bit width share one basic lookup table circuit unit, while lookup table modules having different data bit widths are configured with separate basic lookup table circuit units, which reduces the complexity of circuit control.
Fig. 9 is a block diagram of a hardware acceleration circuit according to another embodiment of the present application. The hardware acceleration circuit of this embodiment includes a first lookup table module 450, a second lookup table module 410, a third lookup table module 420, a subtractor 800, a subtractor 430, a conversion circuit 440, and a multiplier. In this embodiment, the first lookup table module 450, the second lookup table module 410, and the third lookup table module 420 are implemented by independent lookup table circuits.
Referring to fig. 9, this embodiment is similar to that of fig. 5, with the following main difference: the hardware acceleration circuit of this embodiment is provided with a subtractor 800. The subtractor 800 is configured to output the subtraction results of the initial data in an initial data set and the average value of those initial data, so as to obtain a data set containing a plurality of data elements. This subtraction narrows the value range of the data elements, so that the scheme of the present application can be implemented with data of smaller bit width and correspondingly smaller hardware circuits. The average value may be obtained by the hardware acceleration circuit processing the initial data, or may be obtained from, for example, a CPU, without limitation.
Embodiments of a data processing acceleration method to obtain layer normalization function values are also provided.
According to the data processing acceleration method provided by some embodiments of the present application, a simple LUT circuit (e.g., three 8-bit LUTs) is adopted, so that hardware cost (small-sized combinational logic circuit) can be reduced.
The data processing acceleration method provided by some embodiments of the present application adopts a simple 8-bit lookup table, and can adapt to the 8-bit data processing process of the existing 8-bit DLA hardware architecture.
According to the data processing acceleration method provided by some embodiments of the application, a simple 8-bit lookup table is adopted, so that the system bandwidth requirement can be reduced.
According to the data processing acceleration method provided by some embodiments of the present application, the memory occupation of the DRAM/SRAM can be reduced by reducing the size of the lookup table.
The data processing acceleration method provided in some embodiments of the present application can reduce the size of the lookup table to obtain smaller LUT delay when using the time-sharing circuit (e.g., 256 cycles are required for each 8-bit LUT).
The data processing acceleration method provided by some embodiments of the present application may be used for multi-step or disposable hardware implementation.
The data processing acceleration method provided by some embodiments of the present application can increase the accuracy of the Layer Norm function compared to the INT8 DLA or other hardware architecture schemes in the related art.
The data processing acceleration method provided by some embodiments of the present application is implemented using a DLA/NPU hardware accelerator, and supports end-to-end training or reasoning without using a CPU or GPU.
The data processing acceleration method provided by some embodiments of the present application may support a middle layer or a last layer of a convolutional neural network (Convolutional Neural Network, CNN) or a Transformer network (Transformer Network).
Fig. 10 is a flow chart illustrating a data processing acceleration method according to an embodiment of the present application. Referring to fig. 10, the data processing acceleration method includes steps S1010 to S1040.
In step S1010, a natural logarithmic value of an i-th data element and a natural logarithmic value of a mean square error of n data elements in n data elements of the data set are obtained, where n is greater than 1.
In step S1020, a subtraction result between the natural logarithmic value of the i-th data element and the natural logarithmic value of the mean square error is obtained.
In step S1030, an exponential function value of the subtraction result is obtained.
In step S1040, the result of the multiplication between the exponent function value and the mask tensor of the i-th data element is obtained to obtain a specific function value corresponding to the i-th data element.
In certain embodiments, the above method may further comprise the following operations. First, the square of each of the n data elements is obtained. Then, the squares of the n data elements are accumulated to obtain an addition result. Then, a right shift operation is performed on the addition result to obtain the variance of the n data elements, and the natural logarithm value of the mean square error of the n data elements is obtained from the variance.
In some embodiments, the data elements may be derived by preprocessing the initial data. Specifically, the method may further include an operation of outputting a subtraction result of the i-th initial data from among the n initial data of the initial data set and an average value of the n initial data to obtain the data set including the n data elements.
In some embodiments, the received initial data may be converted first to reduce the occupied bandwidth. Specifically, the above method may further include the operation of obtaining n initial data in the initial data set, and converting the n initial data from the data of the first bit width bit to the data of the second bit width bit, respectively.
In some embodiments, an average of the plurality of initial data may be obtained as follows. Specifically, the above method may further include the operation of obtaining the addition result of the n pieces of initial data, and performing a right shift operation on the addition result of the n pieces of initial data, thereby obtaining an average value of the n pieces of initial data.
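Since a right shift by m bits divides by 2^m, this averaging works exactly when n is a power of two; a quick check with illustrative numbers:

```python
data = [3, 5, 7, 9, 11, 13, 15, 17]   # n = 8 = 2**3 initial data
total = sum(data)                     # 80
mean = total >> 3                     # 80 >> 3 == 10 == 80 // 8
assert mean == total // len(data)
```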
In order to improve the speed-up effect of the hardware acceleration method and reduce the hardware cost, a lookup table can be used to implement the calculation of complex functions.
For example, obtaining the exponential function value of the subtraction result may include: and outputting an exponential function value of the subtraction result based on the first lookup table.
For example, obtaining the natural logarithmic value for the i-th data element, and the natural logarithmic value for the mean square error for the n data elements may include: the natural logarithm value of the i-th data element is output based on the second lookup table, and the natural logarithm value of the mean square error of the n data elements is output based on the variance of the n data elements and the third lookup table.
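A sketch of how the second and third tables might be generated is shown below. The input domain [1, 256), the 10-bit fixed-point output with 6 fractional bits, and folding the square root into the third table via ln(sqrt(var)) = 0.5 * ln(var) are all assumptions; the patent only constrains the bit widths.

```python
import math

SCALE = 1 << 6   # assumed fixed-point scale: 6 fractional bits

# Second table: index -> round(ln(index) * SCALE), assumed domain [1, 256)
second_lut = [round(math.log(max(i, 1)) * SCALE) for i in range(256)]

# Third table folds in the square root: ln(sqrt(var)) = 0.5 * ln(var),
# so one lookup on the variance yields ln of the mean square error.
third_lut = [round(0.5 * math.log(max(i, 1)) * SCALE) for i in range(256)]

print(second_lut[255])   # ln(255) * 64 = 355, fits in a 10-bit output
```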
Fig. 11 is a flow chart illustrating a data processing acceleration method according to another embodiment of the present application.
Referring to fig. 11, a data processing acceleration method includes steps S1110 to S1160.
In step S1110, a subtraction result of the i-th initial data of the n initial data of the initial data set and the average value of the n initial data is output to obtain a data set containing n data elements.
In step S1120, a natural logarithm value corresponding to the i-th data element is output based on the second lookup table.
In step S1130, the natural logarithm value of the mean square error of the n data elements is output based on the variance of the n data elements and the third lookup table.
In step S1140, a subtraction result between the natural logarithm value of the i-th data element and the natural logarithm value of the mean square error is obtained.
In step S1150, the exponent function value of the subtraction result is output based on the first lookup table.
In step S1160, the multiplication result between the exponential function value and the mask tensor is obtained, so as to obtain a specific function value corresponding to the i-th data element.
In some embodiments, the plurality of look-up tables may be look-up tables in a time-multiplexed look-up table circuit.
For example, the method may further include: in response to a state control signal, converting the subtraction result from data with a bit width of N6 bits to data with a bit width of N7 bits, so that the N7-bit data can be input into the lookup table.
The second lookup table is N0-bit input, N1-bit output; the third lookup table is N3-bit input, N4-bit output; and the first lookup table is N7-bit input, N8-bit output, where the values of N0, N1, N3, N4, N7, and N8 are in the range [8, 12].
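One simple way to model the N6-to-N7-bit conversion is a shift-and-saturate requantization, sketched below; real hardware may instead apply a software-configured scale, as with the shift amounts elsewhere in this document, so treat this as an assumption.

```python
def convert_bit_width(value: int, n6: int = 32, n7: int = 8) -> int:
    # Requantize an N6-bit signed value to N7 bits by dropping
    # low-order bits and saturating to the representable range.
    shifted = value >> (n6 - n7)
    lo, hi = -(1 << (n7 - 1)), (1 << (n7 - 1)) - 1
    return max(lo, min(hi, shifted))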
For example, the method can be used for realizing a standardized function layer of a neural network, and the neural network is used for classifying data to be processed. Wherein the data to be processed includes at least one of voice data, text data, and image data.
The relevant features in the data processing acceleration method of the embodiment of the present application may refer to the relevant content in the foregoing hardware acceleration circuit embodiment, and will not be described in detail.
In one embodiment, first, the 32-bit floating-point numbers x'_i are converted to 8-bit integers.
Then, the average value mean of the x'_i is obtained. Specifically, the n x'_i may first be accumulated to obtain a first accumulated value, where n is the number of data in the last dimension of the input tensor. Next, the first accumulated value is converted into an 8-bit integer average value by right shifting by m bits, where m is determined according to the correction data and is set by software.
Next, the difference x_hat between x'_i and mean is obtained.
Before, during, or after obtaining the difference x_hat, the variance var may be obtained. Specifically, the square values of the n differences x_hat may first be accumulated to obtain a second accumulated value, which is then converted into an 8-bit integer variance by right shifting n bits.
Then, the mask tensor mask_i corresponding to the difference x_hat is obtained.
Then, the natural logarithm value A (e.g., a 10-bit integer) corresponding to the difference x_hat is obtained through the second lookup table circuit.
Then, the natural logarithm value B (e.g., a 10-bit integer) corresponding to the square root of the variance var is obtained through the third lookup table circuit.
Next, the difference C between A and B is obtained.
Then, the exponential function value D of the difference C is obtained through the first lookup table circuit.
Next, the exponential function value D is multiplied by mask_i to obtain the normalized data E corresponding to x'_i.
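The whole sequence of steps can be traced with small numbers. In the sketch below, exact math replaces the 8/10-bit tables, n = 4 so the right shifts are by 2 bits, and the mask is taken to carry the sign of x_hat; these simplifications are assumptions for illustration only.

```python
import math

x_prime = [6, 2, 4, 8]                       # quantized inputs, n = 4
mean = sum(x_prime) >> 2                     # first accumulate + shift: 5
x_hat = [v - mean for v in x_prime]          # [1, -3, -1, 3]
var = sum(v * v for v in x_hat) >> 2         # second accumulate + shift: 5
mask = [1 if v >= 0 else -1 for v in x_hat]  # mask tensor (assumed sign)
A = [math.log(abs(v)) for v in x_hat]        # second lookup table
B = 0.5 * math.log(var)                      # third table: ln(sqrt(var))
C = [a - B for a in A]                       # difference C = A - B
D = [math.exp(c) for c in C]                 # first lookup table
E = [d * m for d, m in zip(D, mask)]         # normalized data
print(E)  # approx. [0.447, -1.342, -0.447, 1.342] = x_hat / sqrt(var)
```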
The data processing acceleration method according to the embodiments of the present application can be applied to an artificial intelligence accelerator. FIG. 12 is a schematic diagram of an artificial intelligence accelerator according to an embodiment of the present application. Referring to fig. 12, an artificial intelligence accelerator 1200 includes a memory 1210 and a processor 1220.
The processor 1220 may be a general-purpose processor, such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations. The artificial intelligence operations may include machine learning operations, brain-like operations, and the like, where the machine learning operations include neural network operations, k-means operations, support vector machine operations, and the like. The artificial intelligence processor may include, for example, one or a combination of a GPU (Graphics Processing Unit), a DLA (Deep Learning Accelerator), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processing unit), a field-programmable gate array (FPGA), and an application-specific integrated circuit (ASIC). The specific type of processor is not limited by the present application.
Memory 1210 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 1220 or other modules of the computer. The persistent storage may be a readable and writable storage device, i.e., a non-volatile device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, the persistent storage employs a mass storage device (e.g., a magnetic or optical disk, or flash memory); in other embodiments, it may be a removable storage device (e.g., a diskette or an optical drive). The system memory may be a readable and writable volatile memory device, such as dynamic random access memory, and may store instructions and data required by some or all of the processors at runtime. Furthermore, memory 1210 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks, and/or optical disks. In some implementations, memory 1210 may include readable and/or writable removable storage devices, such as compact discs (CDs), digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), read-only Blu-ray discs, super-density discs, flash memory cards (e.g., SD cards, mini SD cards, micro-SD cards), magnetic floppy disks, and the like. The computer-readable storage media do not contain carrier waves or instantaneous electronic signals transmitted by wireless or wired transmission.
Memory 1210 has stored thereon executable code that, when processed by processor 1220, causes processor 1220 to perform some or all of the methods described above.
In one possible implementation, the artificial intelligence accelerator may include multiple processors, each of which may independently run the various tasks assigned to it. The present application does not limit the type of processor or the tasks the processor runs.
It should be understood that, unless otherwise specified, each functional unit/module in the embodiments of the present application may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules described above may be implemented either in hardware or in software program modules.
If implemented in hardware, the integrated units/modules may be digital circuits, analog circuits, and the like. Physical implementations of hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the artificial intelligence processor may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise indicated, the storage module may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), and the like.
If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units/modules may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
In one possible implementation, an artificial intelligence chip is also disclosed that includes the hardware acceleration circuit described above.
In one possible implementation, a board is also disclosed, which includes a memory device, an interface device, and a control device, and the artificial intelligence chip described above; the artificial intelligent chip is respectively connected with the storage device, the control device and the interface device; a memory device for storing data; the interface device is used for realizing data transmission between the artificial intelligent chip and external equipment; and the control device is used for monitoring the state of the artificial intelligent chip.
In one possible implementation, an electronic device is disclosed that includes the artificial intelligence chip described above. The electronic device may include a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device. Vehicles include aircraft, ships, and/or cars; household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; medical devices include nuclear magnetic resonance apparatuses, B-mode ultrasound apparatuses, and/or electrocardiographs.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing part or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or non-transitory machine-readable storage medium or machine-readable storage medium) having stored thereon executable code (or a computer program or computer instruction code) which, when executed by a processor of an electronic device (or a server, etc.), causes the processor to perform part or all of the steps of the above-described methods according to the present application.
The embodiments of the present application have been described above, the foregoing description is exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (22)

1. A hardware acceleration circuit, comprising:
the natural logarithm module is used for obtaining the natural logarithm value of the ith data element in n data elements of the data set and the natural logarithm value of the mean square error of the n data elements, wherein n is greater than 1;
the addition and subtraction module is used for obtaining a subtraction operation result between the natural logarithmic value of the ith data element and the natural logarithmic value of the mean square error;
an exponential function module, configured to obtain an exponential function value of the subtraction result;
and a multiplication module, configured to obtain a multiplication result between the exponent function value and the mask tensor of the ith data element, so as to obtain a specific function value corresponding to the ith data element.
2. The hardware acceleration circuit of claim 1, wherein:
the multiplication module is also used for obtaining the square operation result of each of the n data elements;
the addition and subtraction module is further used for obtaining an addition operation result of square operation results of the n data elements;
the hardware acceleration circuit further comprises a shift module for performing a right shift operation on the addition result, thereby obtaining variances of the n data elements, and obtaining natural logarithmic values of mean square deviations of the n data elements according to the variances.
3. The hardware acceleration circuit of claim 2, wherein: the addition and subtraction module comprises an adder and a subtracter, wherein the adder is used for obtaining the addition operation result, and the subtracter is used for obtaining a subtraction operation result between the natural logarithmic value of the ith data element and the natural logarithmic value of the mean square error; wherein,
the adder and the subtracter are configured independently of each other; or,
the adder and the subtracter share the same addition unit.
4. The hardware acceleration circuit of claim 3, wherein the add-subtract module is further configured to output a subtraction result of an i-th initial data among n initial data of an initial data set and an average value of the n initial data, to obtain the data set containing the n data elements.
5. The hardware acceleration circuit of claim 4, further comprising: a first conversion circuit for obtaining the n initial data in the initial data set, converting the n initial data from data of a first bit width into data of a second bit width, and outputting the converted data to the add-subtract module.
6. The hardware acceleration circuit of claim 4, wherein the add-subtract module is further configured to obtain an addition result of the n initial data;
the shift module is further configured to perform a right shift operation on the addition result of the n initial data, so as to obtain an average value of the n initial data.
7. The hardware acceleration circuit of claim 1, wherein:
the exponential function module comprises a first lookup table module, wherein the first lookup table module is used for outputting an exponential function value of the subtraction result based on a first lookup table; and/or
The natural logarithm module comprises part or all of a second lookup table module and a third lookup table module; the second lookup table module is used for outputting the natural logarithm value of the i-th data element based on a second lookup table; the third lookup table module is used for outputting the natural logarithm value of the mean square error of the n data elements based on the variance of the n data elements and a third lookup table.
8. The hardware acceleration circuit of claim 7, comprising at least two of the first through third lookup table modules, wherein:
each of the at least two look-up table modules is configured with a basic look-up table circuit unit; or,
the at least two look-up table modules share a basic look-up table circuit unit.
9. The hardware acceleration circuit of claim 7, wherein:
the hardware acceleration circuit comprises a first lookup table module, a second lookup table module and a third lookup table module, at least part of the first lookup table module, the second lookup table module and the third lookup table module share a first basic lookup table circuit unit, and at least one other module of the first lookup table module, the second lookup table module and the third lookup table module is configured with a second basic lookup table circuit unit; the first basic lookup table circuit unit is M1 bit input M2 bit output, and the second basic lookup table circuit unit is M3 bit input M4 bit output;
wherein at least one of the pairs (M1, M3) and (M2, M4) is unequal.
10. The hardware acceleration circuit of claim 9, further comprising: a second conversion circuit for, in response to a state control signal, converting the subtraction result output by the add-subtract module from data with a bit width of N6 bits into data with a bit width of N7 bits and inputting it into the second basic lookup table circuit unit.
11. The hardware acceleration circuit of any one of claims 8 to 10, wherein: the second lookup table is N0-bit input, N1-bit output; the third lookup table is N3-bit input, N4-bit output; and the first lookup table is N7-bit input, N8-bit output;
wherein the values of N0, N1, N3, N4, N7, and N8 are in the range [8, 12].
12. An artificial intelligence chip, characterized in that the chip comprises the hardware acceleration circuit of any one of claims 1 to 11.
13. A data processing acceleration method, characterized by comprising:
obtaining a natural logarithmic value of an ith data element in n data elements of a data set and a natural logarithmic value of a mean square error of the n data elements, wherein n is greater than 1;
obtaining a subtraction result between the natural logarithmic value of the ith data element and the natural logarithmic value of the mean square error;
obtaining an exponential function value of the subtraction result;
a multiplication result between the exponent function value and the mask tensor of the ith data element is obtained to obtain a specific function value corresponding to the ith data element.
14. The method as recited in claim 13, further comprising:
obtaining the square operation result of each of the n data elements;
obtaining an addition operation result of square operation results of the n data elements;
and performing right shift operation on the addition operation result, thereby obtaining variances of the n data elements, and obtaining natural logarithm values of mean square variances of the n data elements according to the variances.
15. The method as recited in claim 14, further comprising:
outputting the subtraction result of the ith initial data in the n initial data of the initial data set and the average value of the n initial data to obtain the data set containing the n data elements.
16. The method as recited in claim 15, further comprising:
and obtaining n initial data in the initial data set, and respectively converting the n initial data from the data with the first bit width bit into the data with the second bit width bit.
17. The method as recited in claim 15, further comprising:
obtaining the addition operation results of the n initial data;
and performing right shift operation on the addition operation result of the n initial data, thereby obtaining an average value of the n initial data.
18. The method of claim 13, wherein:
the obtaining the exponential function value of the subtraction operation result includes:
outputting an exponential function value of the subtraction result based on a first lookup table;
and/or
The obtaining the natural logarithm value of the ith data element and the natural logarithm value of the mean square error includes:
outputting the natural logarithmic value of the ith data element based on the second lookup table, and outputting the natural logarithmic value of the mean square error of the n data elements based on the variance of the n data elements and the third lookup table.
19. The method as recited in claim 18, further comprising:
and responding to a state control signal, converting the subtraction result output by the addition and subtraction module from data with the bit width of N6 bits to data with the bit width of N7 bits, and inputting the data into the second basic lookup table circuit unit.
20. The method of claim 18, wherein: the second lookup table is N0 bit input and N1 bit output, the third lookup table is N3 bit input and N4 bit output, and the first lookup table is N7 bit input and N8 bit output;
wherein the values of N0, N1, N3, N4, N7, and N8 are in the range [8, 12].
21. The method according to any one of claims 13 to 20, wherein: the method is used for realizing a standardized function layer of a neural network, and the neural network is used for classifying data to be processed; wherein,
the data to be processed includes at least one of voice data, text data, and image data.
22. An artificial intelligence accelerator, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 13 to 20.
CN202210757664.7A 2022-06-30 2022-06-30 Hardware acceleration circuit, data processing acceleration method, chip and accelerator Pending CN117391157A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210757664.7A CN117391157A (en) 2022-06-30 2022-06-30 Hardware acceleration circuit, data processing acceleration method, chip and accelerator

Publications (1)

Publication Number Publication Date
CN117391157A (en) 2024-01-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination