CN110533174B - Circuit and method for data processing in neural network system - Google Patents


Info

Publication number: CN110533174B
Application number: CN201810506017.2A
Authority: CN (China)
Prior art keywords: data, circuit, nonlinear mapping, bit, shift register
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110533174A
Inventors: 费旭东, 周红, 袁宏辉
Current Assignee: Huawei Technologies Co Ltd
Original Assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN201810506017.2A

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06N: Computing arrangements based on specific computational models
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)
  • Logic Circuits (AREA)

Abstract

The application provides a circuit and a method for data processing in a neural network system, which can be compatible with data formats of multiple precisions within the hardware architecture of the neural network. The circuit comprises a serial addition circuit, a first nonlinear mapping circuit, an accumulation circuit and a second nonlinear mapping circuit. The serial addition circuit serially acquires each input data in at least one input data and the weight parameter corresponding to each input data, and performs a serial addition operation on each input data and its corresponding weight parameter to obtain at least one first data; the first nonlinear mapping circuit performs a first nonlinear mapping on each first data in the at least one first data to obtain at least one second data; the accumulation circuit accumulates the at least one second data; and the second nonlinear mapping circuit performs a second nonlinear mapping on the accumulation result output by the accumulation circuit to obtain output data.

Description

Circuit and method for data processing in neural network system
Technical Field
The present application relates to the field of circuits, and more particularly, to circuits and methods for data processing in neural network systems.
Background
Neural networks and deep learning algorithms have been very successful and are still developing rapidly, and the industry generally expects new computing approaches to help realize more general and more complex intelligent applications. In recent years, neural networks and deep learning algorithms have achieved remarkable results in image recognition applications; consequently, the optimization and efficient implementation of deep learning algorithms have drawn wide attention in the industry, and research on neural network optimization algorithms has been put into practical use.
Driven by these applications, the industry has begun to research and develop highly efficient neural network acceleration hardware and chips. Research and practice have found that, compared with the fp32 and fp16 formats of mature applications, many low-precision data formats, including INT8 and even 4-bit, 2-bit and 1-bit implementations, are fully feasible for neural network operations. These multiple precision data formats perform differently in different networks and different applications, and can each meet the requirements of specific application scenarios.
Therefore, how to accommodate multiple precision data formats in the hardware architecture of the neural network is a challenge.
Disclosure of Invention
The application provides a circuit and a method for data processing in a neural network system, which can be compatible with various precision data formats in a hardware architecture of the neural network.
In a first aspect, a circuit for data processing in a neural network system is provided, the circuit comprising a serial addition circuit, a first nonlinear mapping circuit, an accumulation circuit, and a second nonlinear mapping circuit, wherein,
the serial addition circuit is used for obtaining each input data in at least one input data and the weight parameter corresponding to each input data in series, and carrying out serial addition operation on each input data and the weight parameter corresponding to each input data to obtain at least one first data;
the first nonlinear mapping circuit performs a first nonlinear mapping on each first data in the at least one first data to obtain at least one second data, wherein the first nonlinear mapping is a base-2 exponential transformation;
the accumulation circuit is used for accumulating the at least one second data;
and the second nonlinear mapping circuit performs second nonlinear mapping on the accumulation result output by the accumulation circuit to obtain output data, wherein the second nonlinear mapping is determined according to the nonlinear mapping of the neural network and the first nonlinear mapping.
Therefore, in the embodiment of the application, the adder serially acquires the input data and the weight parameters and performs a serial addition operation on them to obtain at least one first data; the first nonlinear mapping circuit then performs a first nonlinear mapping on the at least one first data to obtain at least one second data; the accumulation circuit accumulates the at least one second data; and the second nonlinear mapping circuit performs a second nonlinear mapping on the accumulation result output by the accumulation circuit. Because the data are processed bit by bit in series, the requirement for different precisions is converted into different numbers of clock beats, so that multiple precision data formats can be made compatible in the same hardware architecture.
In a specific implementation manner, since the calculation of the neural network is performed hierarchically, the output of the previous stage is the input of the next stage. Thus, the logarithmic transformation of the input data can be implemented by the second nonlinear mapping circuit of the previous stage; that is, the accumulated sum of the previous stage is sent back to the input of the calculation unit after the second nonlinear mapping, ready for the next calculation. The weight parameters may be calculated in advance, stored in the memory, and fetched to the calculation unit during calculation.
Optionally, the second nonlinear mapping is determined from a nonlinear mapping of the neural network and an inverse mapping of the first nonlinear mapping. Specifically, when the first nonlinear mapping is an exponential function based on 2, the inverse mapping of the first nonlinear mapping is a logarithmic function based on 2.
In the embodiment of the present application, taking 8 bits as an example, although the throughput of serial processing is 8 times lower than that of parallel processing, the resource cost of the serial processing circuit is also 8 times lower than that of the parallel processing circuit. Therefore, if the same resources are used to implement 8 serial processing units, the overall performance remains essentially unchanged; the embodiment of the present application can thus ensure the efficiency of data processing while achieving multi-precision compatibility.
Optionally, in the embodiment of the present application, both the accumulation performed by the accumulator and the second nonlinear mapping performed by the second nonlinear mapping circuit require relatively high precision in practical neural network calculations, so these two stages do not need to be multi-precision compatible; the accumulator and the second nonlinear mapping circuit may therefore process data either in parallel or in series.
With reference to the first aspect, in some possible implementation manners of the first aspect, the circuit further includes:
a serial-parallel conversion circuit, configured to serially acquire at least part of the data of each first data and output the at least part of the data of each first data to the first nonlinear mapping circuit in parallel.
Alternatively, the serial-parallel conversion circuit may be a first serial-parallel conversion circuit, serially acquire the data output from the adder, and output the data in parallel to the first nonlinear mapping circuit, so that the data is subjected to first nonlinear mapping by the first nonlinear mapping circuit.
Alternatively, the first nonlinear mapping circuit may perform serial processing on the data. In particular, the process of the first nonlinear mapping of the integer portion may be serialized while the process of the first nonlinear mapping of the fractional portion may not be serialized. In one possible implementation, the serial-to-parallel circuit may be a second serial-to-parallel circuit, serially obtain a fractional portion of the data output by the adder, and output the fractional portion in parallel to the first nonlinear mapping circuit, such that the fractional portion is first nonlinear mapped by the first nonlinear mapping circuit.
With reference to the first aspect, in some possible implementations of the first aspect, the first nonlinear mapping circuit includes a nonlinear mapping unit, a shift control circuit, and a shift register circuit, where,
the nonlinear mapping unit is used for obtaining the decimal part of each first data in parallel, carrying out the first nonlinear mapping on the decimal part of each first data to obtain third data, and outputting the third data to the shift register circuit;
the shift control circuit acquires the integer part of each first data and outputs a control signal to the shift register circuit according to the integer part of each first data;
and the shift register circuit shifts the third data according to the control signal to obtain the at least one second data.
With reference to the first aspect, in some possible implementation manners of the first aspect, the control signal is k clock signals, and the shift register circuit outputs the third data in series and then serially outputs k zeros according to the k clock signals to obtain the second data, where k is the value of the integer part of the first data and k is a positive integer. This makes the shift operation simpler and helps reduce power consumption.
With reference to the first aspect, in some possible implementation manners of the first aspect, the shift register circuit shifts the third data by k bits to the left according to the control signal to obtain the second data, where k is a value of an integer part of the first data, and k is a positive integer.
With reference to the first aspect, in some possible implementations of the first aspect, the control signal is k clock signals.
With reference to the first aspect, in some possible implementation manners of the first aspect, the control signal is j clock signals, where the ith clock signal in the j clock signals corresponds to the ith bit, counted from the least significant bit to the most significant bit, of the binary number of the integer part, j is the number of bits of the binary number of the integer part, i and j are positive integers, and 1 ≤ i ≤ j;
wherein, when the ith bit of the binary number is not 0, the shift register circuit shifts the third data left by 2^(i-1) bits according to the ith clock signal;
when the ith bit of the binary number is 0, the shift register circuit does not shift the third data.
Therefore, the shift times can be reduced, the system power consumption can be reduced, and the system performance can be improved.
Optionally, in an embodiment of the present application, the method further includes: and the precision configuration controller is used for controlling clock and data routing of at least one of the serial adder, the serial-parallel conversion register, the shift controller and the shift register. In this way, the precision configuration controller can control the bit width, i.e., the precision, of the data processed by at least one of the serial adder, the serial-to-parallel register, the shift controller, and the shift register by controlling the clock and the data routing of at least one of the serial adder, the serial-to-parallel register, the shift controller, and the shift register.
In a second aspect, there is provided a method of data processing in a neural network system, the method employing any of the circuits of the first aspect to process input data, the method comprising:
serially acquiring, by the serial addition circuit, each input data in at least one input data and the weight parameter corresponding to each input data, and performing a serial addition operation on each input data and the weight parameter corresponding to each input data to obtain at least one first data;
performing, by the first nonlinear mapping circuit, a first nonlinear mapping on each first data in the at least one first data to obtain at least one second data, wherein the first nonlinear mapping is a base-2 exponential transformation;
accumulating the at least one second data by the accumulation circuit;
and performing, by the second nonlinear mapping circuit, a second nonlinear mapping on the accumulation result output by the accumulation circuit to obtain output data, wherein the second nonlinear mapping is determined according to the nonlinear mapping of the neural network and the first nonlinear mapping.
The respective steps of the method of the second aspect may refer to the respective operations of the respective modules of the circuit of the data processing of the first aspect and are not repeated here.
Drawings
Fig. 1 shows a schematic diagram of an n-level (layer) neural network computational model.
Fig. 2 shows a schematic diagram of a circuit for data processing in a conventional neural network.
Fig. 3 is a schematic diagram of a circuit for data processing in a neural network according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a circuit for data processing in a neural network according to an embodiment of the present application.
Fig. 5 shows a schematic diagram of a specific shift register according to an embodiment of the present application.
Fig. 6 shows a schematic flowchart of a method for data processing in a neural network system according to an embodiment of the present application.
Detailed Description
The technical solutions in the present application will be described below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of an n-level (layer) neural network computational model, where the neural network processes the computational correspondence formula of one of the neurons as follows:
y = f(x1*w1 + x2*w2 + … + xn*wn + b),
where xi is the input data, wi is the corresponding weight parameter, i = 1, 2, …, n, b is a constant, f() is a specific function, and i and n are positive integers. The sum x1*w1 + x2*w2 + … + xn*wn is the dot product of the two vectors (x1, x2, … xn) and (w1, w2, … wn). This operation can be performed successively on one computational unit with multiplication and accumulation.
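Purely as an illustration of this formula, the short Python sketch below evaluates one neuron; the choice of a Sigmoid for f, the variable names and the example numbers are assumptions made for the example and are not taken from the patent.

```python
import math

def neuron_output(x, w, b):
    # y = f(x1*w1 + x2*w2 + ... + xn*wn + b), with f chosen as a Sigmoid only for this example
    s = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1.0 / (1.0 + math.exp(-s))

# Example: a neuron with three inputs
print(neuron_output([1.0, 0.5, -0.25], [0.2, -0.4, 0.8], b=0.1))
```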
Fig. 2 is a schematic diagram of a circuit for data processing in a conventional neural network. The computation of the neural network is performed hierarchically, the output of the previous stage being the input of the next stage. Specifically, the output of the previous stage is taken as the data input (x1, x2, … xn); the hardware multiplier 201 multiplies x1, x2, … xn by the corresponding weight parameters; the accumulator 202 then completes the accumulation operation x1*w1 + x2*w2 + … + xn*wn + b; the nonlinear mapping unit 203 performs nonlinear mapping on the accumulation result, y = f(accumulation result), to obtain the calculation result; and finally the data output is completed.
However, the circuit in fig. 2 requires the use of a hardware multiplier 201, which increases the complexity and resources occupied by the computational unit. To simplify the multiplication, complex multiplication may be converted to addition by logarithmic transformation. Fig. 3 is a schematic diagram of a circuit for data processing in a neural network according to an embodiment of the present application.
The working principle in Fig. 3 is explained below. Assuming that data A and data B are to be multiplied, A and B may first be subjected to a logarithmic transformation to obtain a = log2(A) and b = log2(B). Here, assuming that A is input data and B is a weight parameter, the logarithmic transformation of A may be implemented by the second nonlinear mapping circuit of the previous stage; that is, the accumulated sum of the previous stage is sent back to the input of the calculation unit after the second nonlinear mapping, ready for the next calculation. The weight parameter b may be calculated in advance, stored in the memory, and fetched to the calculation unit at the time of calculation.
Then, an addition operation is performed on a and b by the adder 301: c=a+b.
Then, c is subjected to the first nonlinear mapping by the first nonlinear mapping circuit 302. Here, the first nonlinear mapping is a base-2 exponential function, namely d = 2^c. It can be seen that d = A×B at this time.
The accumulated sum of these products is then obtained by accumulating a plurality of similar products obtained previously by accumulator 303.
The accumulated sum of the accumulator outputs is then second non-linearly mapped by a second non-linear mapping circuit 304. And, after the second nonlinear mapping, the data is sent back to the input end of the computing unit, and the next computation is ready to be executed.
Here, the second nonlinear mapping is determined from a nonlinear mapping of the neural network and an inverse mapping of the first nonlinear mapping. Specifically, when the first nonlinear mapping is an exponential function based on 2, the inverse mapping of the first nonlinear mapping is a logarithmic function based on 2.
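To make the data flow of Fig. 3 concrete, the following behavioral sketch (Python, floating point, no serial bit handling) traces one neuron through the circuit. The function names and the use of ReLU as the network nonlinearity are illustrative assumptions, not part of the patent.

```python
import math

def first_mapping(c):
    # first nonlinear mapping circuit 302: base-2 exponential, d = 2^c
    return 2.0 ** c

def second_mapping(acc):
    # second nonlinear mapping circuit 304: the network nonlinearity (ReLU assumed here)
    # cascaded with the inverse of the first mapping, i.e. a base-2 logarithm
    y = max(acc, 0.0)
    return math.log2(y) if y > 0 else float("-inf")

def log_domain_neuron(inputs, weights):
    acc = 0.0
    for x, w in zip(inputs, weights):
        a, b = math.log2(x), math.log2(w)   # a comes from the previous stage, b is precomputed
        c = a + b                           # adder 301: addition replaces multiplication
        acc += first_mapping(c)             # accumulator 303 sums the products
    return second_mapping(acc)              # the result is already in the log domain

# 2**result equals ReLU(1.5*2.0 + 3.0*0.5) = 4.5
print(2.0 ** log_domain_neuron([1.5, 3.0], [2.0, 0.5]))
```

Because the output stays in the log domain, it can be fed directly as an operand of the adder in the next layer, which matches the hierarchical structure described above.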
In one scheme, data can be processed in parallel during data processing in the neural network; that is, each data processing unit used in the neural network reads data in parallel, processes the data, and outputs the data in parallel. However, it is difficult for any parallelized implementation to achieve multi-precision compatibility. For example, if an 8-bit parallel processing unit is to process 4-bit data, the 8-bit unit has to be used as a 4-bit unit, so 4-bit precision operations are implemented while half of the resources are wasted. When the processing unit reads data from the memory, circuitry implementing two read modes is needed, one for 4 bits and one for 8 bits. Moreover, for the same memory bandwidth, 4-bit data requires twice the read and processing speed of 8-bit data, so twice as many processing units are needed to process 4-bit data. Therefore, it is difficult to achieve multi-precision compatibility when processing data in parallel.
In the embodiment of the application, data can be processed in series during data processing in the neural network. For example, instead of reading out one byte at a time, 8-bit data is read out bit by bit and processed bit by bit; in this way, compatibility with multiple precisions can be achieved. Specifically, reading and processing 8-bit data takes 8 clock beats, reading and processing 4-bit data takes 4 clock beats, and so on for other data widths. The embodiment of the application therefore converts a flexible and variable precision requirement into the time domain. Because the required number of clock beats can be flexibly controlled while the underlying hardware circuit remains fixed, the embodiment of the application can flexibly achieve multi-precision compatibility.
In addition, taking 8 bits as an example, although the throughput of serial processing is 8 times lower than that of parallel processing, the resource cost of a serial processing circuit is also 8 times lower than that of a parallel processing circuit. Therefore, if the same resources are used to implement 8 serial processing units, the overall performance remains essentially unchanged; the embodiment of the application can thus ensure the efficiency of data processing while achieving multi-precision compatibility.
It should be noted that, since the accumulation performed by the accumulator 303 and the second nonlinear mapping performed by the second nonlinear mapping circuit 304 in Fig. 3 both require relatively high precision in practical neural network calculations, these two stages do not need to be multi-precision compatible; the accumulator 303 and the second nonlinear mapping circuit 304 may therefore process data either in parallel or in series, which is not limited in the embodiment of the present application.
In the embodiment of the present application, the adder 301 is required to process data serially; that is, the adder 301 may be a serial adder, which may also be referred to as a serial full adder. Specifically, the serial adder performs addition bit by bit from the least significant bit to the most significant bit, time-division multiplexing the same one-bit full adder.
As an alternative embodiment, if the serial output of the adder 301 matches the serial input of the following component, the data output by the adder 301 may be directly input to that component without any register (Reg) to buffer the output serial data.
As an alternative embodiment, if the component following the adder 301 cannot accept serial input, the calculation result of each bit of the adder 301 may be saved in a register and then output in parallel from the register to that component.
In this embodiment, the first nonlinear mapping circuit 302 performs a first nonlinear mapping on the data output by the adder 301, where the first nonlinear mapping is a base-2 exponential transformation, that is, y = 2^x. As an alternative embodiment, the data input to the circuit 302 (i.e., the data output by the adder 301) includes an integer part and a fractional part. For the integer part, the first nonlinear mapping corresponds to a shift operation; for the fractional part, it may be evaluated by a small-scale look-up table or mapping table.
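A minimal fixed-point sketch of this integer-shift / fractional-look-up split is given below. The 4-bit table contents, the fixed-point format and the function name are assumptions made for illustration and are not taken from the patent.

```python
def exp2_fixed(int_part, frac_bits, frac_width=4):
    """Approximate 2**(int_part + frac_bits / 2**frac_width) in fixed point.

    The fractional part is resolved through a small look-up table (built on the fly
    here purely for illustration) and the integer part becomes a left shift.
    """
    # assumed table: 2**(k/16) with 'frac_width' fractional bits of precision
    table = [round((2.0 ** (k / 2 ** frac_width)) * 2 ** frac_width)
             for k in range(2 ** frac_width)]
    mantissa = table[frac_bits]       # "third data": result of the fractional mapping
    return mantissa << int_part       # the integer part is realized as a shift

# x = 1010.1110b = 10.875: integer part 10, fractional part 0.1110b = 14/16
value = exp2_fixed(10, 0b1110)
print(value / 2 ** 4, 2 ** 10.875)    # fixed-point approximation vs. the exact value
```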
As an alternative embodiment, the first nonlinear mapping circuit 302 may process the data in parallel. At this time, the data processing circuit in the embodiment of the present application further includes a first serial-parallel conversion circuit, and serially acquires the data output by the adder 301 and outputs the data to the first nonlinear mapping circuit in parallel, so that the first nonlinear mapping circuit performs the first nonlinear mapping on the data.
As an alternative embodiment, the first nonlinear mapping circuit 302 may process the data serially. In particular, the process of the first nonlinear mapping of the integer portion may be serialized while the process of the first nonlinear mapping of the fractional portion may not be serialized. In a possible implementation manner, the data processing circuit in the embodiment of the present application further includes a second serial-parallel conversion circuit, and the fractional part of the data output by the adder 301 is obtained in series and output to the first nonlinear mapping circuit in parallel, so that the first nonlinear mapping circuit performs the first nonlinear mapping on the fractional part.
It should be noted that, in the embodiment of the present application, although the first nonlinear mapping of the fractional part cannot be serialized, the resource cost of processing it in parallel is acceptable. In practical application scenarios, it is usually necessary to guarantee the number of integer bits first, so as to obtain a sufficiently large value range. As a specific example, when the required precision is 1 bit to 3 bits, usually only integer bits are needed; when the required precision is 4 bits, there may be one fractional bit or none (i.e., all 4 bits are integer bits); when the required precision is 4 bits to 8 bits, there may be 1 to 4 fractional bits, with the remaining bits being integer bits.
Then, at least one data output from the first nonlinear mapping circuit is accumulated by the accumulator 303, and the accumulated result output from the accumulator 303 is subjected to a second nonlinear mapping by the second nonlinear mapping circuit 304.
It should be noted that, in the embodiment of the present application, the adder 301 may be replaced with an adding circuit that may perform the function of the adder 301, and the accumulator 303 may be replaced with an accumulating circuit that may perform the function of the accumulator 303, which is not particularly limited in the embodiment of the present application.
Therefore, by serializing the operand input and the calculation process of the adder, the embodiment of the application converts multi-precision compatible processing into the counting of clock beats; on this basis, a multi-precision compatible neural network operation circuit can be realized.
Fig. 4 is a schematic diagram of a circuit for data processing in a neural network according to an embodiment of the present application. It should be understood that Fig. 4 shows schematic modules or units of the circuit for data processing, but these modules or units are only examples; embodiments of the present application may also include other modules or units, or variations of the individual modules or units in Fig. 4. Furthermore, the example in Fig. 4 is merely provided to assist one skilled in the art in understanding and implementing the embodiments of the present application and is not intended to limit the scope of the embodiments of the present application. Equivalent changes and modifications can be made by those skilled in the art based on the examples given herein, and such changes and modifications should still fall within the scope of the embodiments of the present application.
In the embodiment of the present application, the circuit for data processing includes a serial adder 401, an accuracy configuration controller 402, a serial-parallel conversion register 403, a nonlinear mapping unit 404, a shift controller 405, a shift register 406, an accumulator 407, and a second nonlinear mapping circuit 408. Wherein the nonlinear mapping unit 404, the shift controller 405, and the shift register 406 may be components of the first nonlinear mapping circuit 302. In an alternative embodiment, the serial-parallel conversion register 403 may be an integral part of the first nonlinear mapping circuit 302, which is not limited in this embodiment. In another alternative embodiment, the shift controller 405 may be a local control circuit included in the precision configuration controller 402 for controlling the shift register 406, which is not limited in this embodiment.
In the embodiment of the present application, the precision configuration controller 402 may control the clock and data routing of at least one of the serial adder 401, the serial-parallel conversion register 403, the shift controller 405, and the shift register 406. In this way, the precision configuration controller 402 can control the bit width, i.e., the precision, of the data processed by at least one of the serial adder 401, the serial-parallel register 403, the shift controller 405, and the shift register 406 by controlling the clock and the data routing of at least one of the serial adder 401, the serial-parallel register 403, the shift controller 405, and the shift register 406.
It should be appreciated that local control circuitry corresponding to control of the serial adder 401, serial to parallel register 403, shift controller 405, or shift register 406 may be included in the precision configuration controller. Thus, by the local control circuit corresponding to the serial adder 401, the serial-parallel register 403, the shift controller 405, or the shift register 406, control, that is, clock and data routing, of the serial adder 401, the serial-parallel register 403, the shift controller 405, or the shift register 406, respectively, can be realized. For example, the control of the serial adder 401 may be realized by a corresponding local control circuit of the serial adder 401. For another example, the shift register 406 may be controlled by a local control circuit corresponding to the shift register 406.
The serial adder 401 obtains each input data in the at least one input data and the weight parameter corresponding to each input data in serial, and performs serial addition operation on each input data and the weight parameter corresponding to each input data to obtain at least one first data.
Here, the input data is, for example, the data a in the above, and the weight parameter corresponding to each input data is, for example, the weight parameter B in the above. In this embodiment of the present application, one bit of input data and a weight parameter may be referred to as an operand, specifically, one bit of input data may be referred to as a serial operand 1, and one bit of a weight parameter corresponding to input data may be referred to as a serial operand 2.
Specifically, serial operand 1 and serial operand 2 are read synchronously from the memory, 1 bit each per clock cycle, as the input of the serial adder 401. Thus, the precision configuration controller 402 can control the reading so that two operands with a precision of M bits are completely read after M clock cycles, where M is a positive integer and may be 8, for example. Here, the precision is in fact the bit width of the data. In addition, the manner of storing the operands in the memory needs to be adapted to the serial read mode; for example, each byte stored at an address of the memory may be composed of one bit from each of 8 operands, so that each byte read from the memory corresponds to one bit of each of the 8 operands.
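The storage arrangement described above can be illustrated with the following sketch, which packs one bit of each of 8 operands into every byte so that a single byte read yields one bit of all 8 operands. The packing scheme and function names are assumed for illustration; other layouts are possible.

```python
def pack_bit_planes(operands, width=8):
    """Pack 8 'width'-bit operands so that the byte at address n holds bit n (LSB first) of every operand."""
    assert len(operands) == 8
    memory = []
    for n in range(width):                        # one byte per bit position
        byte = 0
        for slot, value in enumerate(operands):
            byte |= ((value >> n) & 1) << slot    # bit n of operand 'slot' goes into bit 'slot' of the byte
        memory.append(byte)
    return memory

def read_serial(memory, slot):
    """Read operand 'slot' back one bit per clock cycle (one byte read per cycle)."""
    return [(byte >> slot) & 1 for byte in memory]

mem = pack_bit_planes([0b01110101, 0b00111001, 0, 0, 0, 0, 0, 0])
print(read_serial(mem, 0))    # bits of A = 0111.0101, least significant bit first
print(read_serial(mem, 1))    # bits of B = 0011.1001, least significant bit first
```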
Then, the serial operand 1 and the serial operand 2 are sent to the serial adder 401 to complete one-bit addition operation, specifically, each bit of the input data and the weight parameter sequentially enters the serial adder 401, and after M clock cycles, the addition operation of the full bit width of the input data and the weight parameter can be completed, so that the sum of the input data and the weight parameter, that is, the first data, is obtained.
Specifically, for a single full adder, assume that the nth bit of the input data is denoted as An, the nth bit of the weight parameter is denoted as Bn, the carry from the lower bit is denoted as Cn, the current output bit is denoted as Yn, and the carry to the upper bit is denoted as Cn+1. Here, n is a positive integer and is less than or equal to the bit width of the input data or the weight parameter.
Before starting the calculation, the serial adder 401 needs to be initialized, i.e., the register holding the carry bit needs to be cleared. Alternatively, the initialization of the serial adder 401 may be done by the precision configuration controller 402. Specifically, the precision configuration controller 402 may zero out the registers used by the serial adder 401 to hold the input operands and the lower carry bit. Then, the current bits of the two operands are added, plus the carry bit from the lower bit, to obtain the current data output and the carry output to the upper bit. Writing the operational law in the form (Cn+1, Yn) = f(An, Bn, Cn):
(An, Bn, Cn) = (0,0,0): (Cn+1, Yn) = (0,0)
(An, Bn, Cn) = (0,1,0): (Cn+1, Yn) = (0,1)
(An, Bn, Cn) = (1,0,0): (Cn+1, Yn) = (0,1)
(An, Bn, Cn) = (1,1,0): (Cn+1, Yn) = (1,0)
(An, Bn, Cn) = (0,0,1): (Cn+1, Yn) = (0,1)
(An, Bn, Cn) = (0,1,1): (Cn+1, Yn) = (1,0)
(An, Bn, Cn) = (1,0,1): (Cn+1, Yn) = (1,0)
(An, Bn, Cn) = (1,1,1): (Cn+1, Yn) = (1,1)
at this time, the carry bit to the upper bits may be saved in the buffer register for the next addition operation. Meanwhile, when the current data output is decimal, the current data output may be saved into the decimal place serial-parallel conversion register 403.
Then, n is incremented by 1 and the next addition operation is performed, until all bits of the input data and the weight parameter have been calculated. As an example, if the bit width of the input data and the weight parameter is 8 bits, the calculation process ends after 8 clock cycles.
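A behavioral sketch of this bit-serial addition is given below (Python); operands are supplied least significant bit first, and the carry variable plays the role of the buffer register mentioned above. It illustrates the principle only and does not model clocking or the precision configuration controller.

```python
def serial_add(a_bits, b_bits):
    """Bit-serial addition, least significant bit first, time-multiplexing one full adder."""
    carry = 0                               # initialization: the carry register is cleared
    out = []
    for an, bn in zip(a_bits, b_bits):      # one bit pair per clock cycle
        total = an + bn + carry
        out.append(total & 1)               # Yn: current output bit
        carry = total >> 1                  # Cn+1: carry saved for the next cycle
    return out                              # serial result, LSB first (final carry-out not modelled)

# A = 0111.0101 and B = 0011.1001, both supplied LSB first
a = [1, 0, 1, 0, 1, 1, 1, 0]
b = [1, 0, 0, 1, 1, 1, 0, 0]
print(serial_add(a, b))                     # [0, 1, 1, 1, 0, 1, 0, 1], i.e. 1010.1110
```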
In the embodiment of the present application, the result calculated by the serial adder 401 may be output serially, bit by bit. Assuming that the serial operands are input to the serial adder 401 in right-to-left order (i.e., fractional part first, integer part later), the output is also in right-to-left order, with the fractional part of the first data output first and the integer part of the first data output last.
The serial-parallel conversion register 403 serially acquires the fractional part of the first data and outputs the fractional part in parallel to the nonlinear mapping unit 404. Specifically, the fractional part may be input to the serial-parallel conversion register 403 in order from right to left. Specifically, if the fractional part is n bits, the precision configuration controller 402 may control such that all the fractional bits are saved in the serial-parallel conversion register 403 after n clock cycles, and after the n clock cycles, the fractional part may be output in parallel.
The nonlinear mapping unit 404 acquires the fractional part of the first data output from the serial-parallel conversion register 403 in parallel, performs the first nonlinear mapping on the fractional part of the first data, obtains third data, and outputs the third data to the shift register 406. Specifically, the first nonlinear mapping may be referred to in the above description, and in order to avoid repetition, a description is omitted here.
The shift controller 405 acquires the integer parts of the first data and outputs a control signal to the shift register 406 according to the integer part of each of the first data. The shift controller 405 may also be referred to herein as an integer floating point shift controller.
Specifically, the integer part of the first data has a value k, where k is a positive integer, and the control signal may be k clock signals. Alternatively, the binary number of the integer part of the first data may be represented as j bits, j being a positive integer, where the control signal may be j clock signals.
The shift register 406 shifts the acquired third data according to the control signal output by the shift controller 405, to obtain second data. The shift register may also be referred to herein as a decimal shift register.
In one possible implementation, when the control signal is k clock signals, the shift register 406 may serially output the third data, and serially output k 0 s according to the k clock signals after outputting the third data, so as to complete the shift of the third data. This can make the shift operation simpler, and is advantageous in reducing power consumption.
In another possible implementation, shift register 406 may shift the third data left by k bits to obtain second data, which may then be output in parallel.
Specifically, when the control signal is k clock signals, the shift register 406 may shift the third data left k times according to the k clock signals, one bit at a time, to complete the shift of the third data.
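Both variants described above (serially appending k zeros, or a direct left shift by k bits) amount to multiplying the third data by 2^k. The tiny sketch below illustrates this, under the assumption that in the serial variant the output stream is read most-significant-bit first, so the k trailing zeros become the low-order bits; the function names are illustrative only.

```python
def shift_by_appending_zeros(third_data_bits, k):
    """Serial variant: output the data bits and then k zeros (stream read MSB first)."""
    return third_data_bits + [0] * k

def shift_by_left_shift(third_data, k):
    """Parallel variant: shift the value left by k bits."""
    return third_data << k

bits = [1, 1, 1, 1]                            # third data 1111, MSB first
print(shift_by_appending_zeros(bits, 10))      # 1111 followed by ten zeros
print(bin(shift_by_left_shift(0b1111, 10)))    # 0b11110000000000, the same value
```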
When the control signal is j clock signals, the shift register 406 may shift the third data left up to j times according to the j clock signals to complete the shift of the third data. Specifically, each of the j clock signals corresponds to one bit of the binary number of the integer part. In this case, the shift amount differs from one clock signal to another and is a power of 2, such as 1 bit, 2 bits, 4 bits or 8 bits. That is, if a certain bit of the binary number is not 0, the third data needs to be shifted, and the shift amount equals the weight of that bit; if a bit is 0, the third data is not shifted. Therefore, the number of shift operations can be reduced, the system power consumption can be reduced, and the system performance can be improved.
In a specific implementation manner, the ith clock signal in the j clock signals corresponds to the ith bit, counted from the least significant bit to the most significant bit, of the binary number, where i is a positive integer and 1 ≤ i ≤ j. In this case, when the ith bit of the binary number is not 0, the shift register 406 shifts the third data left by 2^(i-1) bits according to the ith clock signal; when the ith bit of the binary number is 0, the shift register circuit does not shift the third data.
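The following sketch shows the control logic implied here: each nonzero bit i of the integer part triggers one shift of 2^(i-1) bits, so the number of shift operations equals the number of nonzero bits rather than the value of the integer part itself. Variable names are illustrative only.

```python
def shift_by_binary_digits(third_data, integer_part):
    """Shift 'third_data' left by 'integer_part' bits using one shift per nonzero binary digit."""
    value = third_data
    i = 1
    while integer_part:
        if integer_part & 1:              # the ith bit (from the low end) is not 0
            value <<= 1 << (i - 1)        # shift left by 2^(i-1) bits
        integer_part >>= 1                # move on to the next clock signal / next bit
        i += 1
    return value

# Integer part 1011b = 11: shifts of 1, 2 and 8 bits; the 3rd clock signal causes no shift.
print(bin(shift_by_binary_digits(0b1111, 0b1011)))   # 0b111100000000000 (1111 shifted left by 11 bits)
```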
In a particular implementation, the shift register 406 may include a plurality of flip-flops whose total word size is sufficient to cover the size of the nonlinearly transformed data. Here, the input of each flip-flop is no longer connected directly only to the previous flip-flop, but to the output of a multiplexer (MUX) whose inputs come from the flip-flops located 2^u positions before it, where u is an integer greater than or equal to 0.
Fig. 5 shows a schematic diagram of a specific shift register according to an embodiment of the present application. The shift register is a 16-bit shift register and comprises 16 flip-flops, namely D0, D1, …, D14, D15, where the input of each flip-flop comes from the output of the 1st, 2nd, 4th or 8th flip-flop preceding it. For example, the 16th flip-flop D15 may take its input from one of D14, D13, D11 and D7 through MUX1, thereby implementing a 1-bit, 2-bit, 4-bit or 8-bit data shift. For another example, the 15th flip-flop D14 may take its input from one of D13, D12, D10 and D6 through MUX2, likewise implementing a 1-bit, 2-bit, 4-bit or 8-bit data shift. For another example, the 6th flip-flop D5 may take its input from one of D4, D3 and D1 through MUX3, thereby implementing a 1-bit, 2-bit or 4-bit data shift; here one input of MUX3 is 0 (for the 8-bit case). The connections of the other bits can be deduced by analogy, and for the low-order flip-flops, e.g., D0, D1, D2, etc., the inputs of their MUXs can include 0.
Specifically, the switching control of the MUX may be controlled by the above j clock signals. As a specific example, when the binary number of the integer part is represented as 1011, the shift controller 405 may serially acquire each bit of the integer part, for example, the shift controller 405 may sequentially acquire four bits of 1,0,1 from the lower order to the upper order.
When the shift controller 405 acquires "1" of the first bit from the low bit to the high bit, a 1 st clock signal, which may be a high level clock signal, may be output to the shift register 406, indicating that the shift register shifts the third data left by 1 bit, where the 1 st clock signal may control the MUX of the flip-flop D1, and may take D0 as an input of D1.
When the shift controller 405 obtains "1" of the second bit from the lower bit to the upper bit, a 2 nd clock signal may be output to the shift register 406, where the clock signal may be a high level clock signal, indicating that the shift register shifts the third data by 2 bits to the left, where the 2 nd clock signal may control the MUX of the flip-flop D3, and may take D1 as an input of D3.
When the shift controller 405 acquires "0" of the third bit from the low bit to the high bit, a 3 rd clock signal, which may be a low level clock signal, may be output to the shift register 406, indicating that the shift register does not shift the third data.
When the shift controller 405 acquires "1" of the fourth bit from the lower bit to the upper bit, a 4 th clock signal may be output to the shift register 406, which may be a high level clock signal, indicating that the shift register shifts the third data by 8 bits to the left, where the 4 th clock signal may control the MUX of the flip-flop D11, and may take D3 as an input of D11. At this time, through the traversal of all non-zero bits, the left shift of 11 (i.e., 8+2+1) bits can be implemented for the third data, and the second data is obtained.
When the shift register 406 completes shifting the third data, the precision configuration controller 402 may send an enable signal to the shift register 406, such that the shift register 406 outputs a shifted result (i.e., the second data) to the accumulator 407, where the output is an accumulation term.
The accumulator 407 accumulates the accumulation terms output by the shift register 406 one by one. Alternatively, in the embodiment of the present application, the accumulator 407 may be controlled not to accumulate when needed; for example, when the accumulation term is 0, the previous accumulation result is kept unchanged. Alternatively, the accumulator 407 may be controlled to perform subtraction when needed; for example, when the accumulation term is negative, the magnitude of the current accumulation term may be subtracted from the previous accumulation result.
The second nonlinear mapping circuit 408 performs a second nonlinear mapping on the result output by the accumulator 407. In general, the second nonlinear mapping may be customized according to the characteristics of the neural network and the type of the nonlinear transformation of the preceding stage (i.e., the first nonlinear mapping). As an example, the second nonlinear mapping may comprise a cascade of two nonlinear transformations: one is the original nonlinear transformation of the neural network, such as Sigmoid or ReLU, and the other is the inverse of the first nonlinear mapping; when the first nonlinear mapping is a base-2 exponential transformation, this inverse is a base-2 logarithmic transformation.
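As a small numerical illustration of this cascade, assuming Sigmoid as the network nonlinearity, it could look as follows; the names are illustrative only.

```python
import math

def second_nonlinear_mapping(acc):
    y = 1.0 / (1.0 + math.exp(-acc))   # original nonlinearity of the network (Sigmoid assumed)
    return math.log2(y)                # inverse of the base-2 exponential used by the preceding stage

# The returned value is already in the log domain, so it can be sent back (after
# parallel-serial conversion) as an operand for the serial adder of the next layer.
print(second_nonlinear_mapping(4.5))
```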
In addition, in the embodiment of the present application, the data output by the second nonlinear mapping circuit 408 may be converted in parallel and serial and sent back to the memory again in serial form, so as to perform the next calculation.
It should be noted that, in the embodiment of the present application, any one of the above devices 401 to 408 may be replaced by a circuit module or a circuit unit that can perform the function of the device, which is not particularly limited in the embodiment of the present application.
Therefore, in the embodiment of the application, the adder serially acquires the input data and the weight parameters and performs a serial addition operation on them to obtain at least one first data; the first nonlinear mapping circuit then performs a first nonlinear mapping on the at least one first data to obtain at least one second data; the accumulation circuit accumulates the at least one second data; and the second nonlinear mapping circuit performs a second nonlinear mapping on the accumulation result output by the accumulation circuit. Because the data are processed bit by bit in series, the requirement for different precisions is converted into different numbers of clock beats, so that multiple precision data formats can be made compatible in the same hardware architecture.
The circuit for data processing in the neural network system provided by the embodiment of the application can be applied in cloud server scenarios, can be an independent processing chip in a terminal (such as a mobile phone) application, or can be a module (for example, one implemented based on an ASIC) in the terminal processor chip. Specifically, the information input to the circuit may come from various sources that require intelligent processing, such as voice, images and natural language, and may undergo necessary preprocessing (such as sampling, analog-to-digital conversion and feature extraction) to form the data to be processed by the neural network operation. In addition, the information output by the circuit may be passed to other subsequent processing modules or software, for example to produce graphics or other understandable representations. In the cloud application mode, for example, the processing units before and after the data processing circuit in the neural network system may be borne by other server operation units; in the terminal application environment, the processing units before and after the data processing circuit may be implemented by other software and hardware parts of the terminal (such as sensors and interface circuits).
An example of data processing in the neural network of the embodiment of the present application will be specifically described below taking an operand with 8-bit precision as an example.
Step 1, two operands A and B are read out of the memory bit by bit simultaneously, where A = 0111.0101 and B = 0011.1001; data A and B each have a 4-bit integer part and a 4-bit fractional part. One bit is read out per clock cycle, in order from right to left, giving the bit streams 10101110 and 10011100, and after 8 clock cycles data A and B have been read out completely. Here, the precondition for serial read-out is that the data are stored in the memory in a serial-friendly manner; for example, the first bits of A and B are stored in the same byte and the memory is accessed byte by byte, so that when this byte is read out of the memory, the first bits of A and B are read out simultaneously.
Step 2, two serial operands (i.e. the two operands read out in step 1) are sent to the serial adder to complete one bit addition. Initialization of the serial adder is accomplished by a precision configuration controller that clears a register of the full adder that holds the low carry bit, indicating that there is no carry initially. When operation is started, the serial adder firstly performs addition 1+1=0 of the first bit, and the carry value is 1; then adding 0+0=0 of the second bit, combining the carry value from the first bit, so that the output result is 1, and the carry value is 0; and so on, the addition operation of all 8-bit data is completed, and the serial output result is 01110101, which represents 1010.1110.
Step 3, the result 1010.1110 of the operation performed by the serial adder is divided into two parts according to the output order: the fractional part 1110 is output first, and the integer part 1010 is output afterwards.
Step 4, the fraction part (1110) enters the fraction bit serial-parallel conversion register bit by bit, and then outputs the fraction part in parallel. Here, the serial-in-parallel-out process is controlled by the precision configuration controller, and since the decimal place has 4 bits, the precision configuration controller can control the serial-parallel conversion register to output 4 decimal places in parallel after inputting 4 clocks.
Step 5, the integer part (1010) enters the integer floating point shift controller, which controls the shift amount of the fractional shift Reg according to the input integer value; the shift amount corresponding to 1010 is 8+2=10 bits. The bit width of the integer floating point shift controller is provided by the precision configuration controller, which also controls the start and end of the shift.
Step 6, the content of the fractional serial-parallel conversion register is input into the nonlinear transformation unit 1, and after the nonlinear transformation the result is output to the fractional shift Reg. The nonlinear transformation may be an exponential transformation; for example, the result of the transformation is 1111. The implementation may use a look-up table or a combinational logic circuit.
Step 7, the data 1111 output from the nonlinear transformation unit 1 is input to the fractional shift Reg, and the fractional shift Reg is controlled to shift according to the value 10 provided by the integer floating point shift controller.
Step 8, the result of the shift is 11110000000000 (1111 shifted to the left by 10 bits). The precision configuration controller outputs an enable signal after the shift is completed, and the shifted result is output; this output is an accumulation term.
Step 9, the accumulator accumulates the output accumulation terms one by one.
Step 10, the nonlinear transformation unit 2 performs another nonlinear transformation on the result from the accumulator. This transformation may take the form of a cascade of two transformations: the first is the original nonlinear transformation of the neural network, e.g., Sigmoid or ReLU, and the other is a logarithmic transformation (since the preceding nonlinear transformation unit 1 is an exponential transformation, its inverse is a logarithmic transformation).
Step 11, the output data of the nonlinear conversion unit 2 is subjected to parallel-serial conversion and is sent back to the memory in a serial form.
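The following sketch retraces steps 1 to 8 of this example numerically; the table output 1111 is taken from the text above, while the bit-level details (register widths, control signals, serial timing) are deliberately omitted.

```python
A_BITS = "01110101"   # A = 0111.0101 (4 integer bits, 4 fractional bits)
B_BITS = "00111001"   # B = 0011.1001

# Steps 1-2: the bit-serial addition is modelled here simply as integer addition of the raw bit patterns.
total = int(A_BITS, 2) + int(B_BITS, 2)
assert format(total, "08b") == "10101110"              # i.e. 1010.1110

# Step 3: split the sum into integer and fractional parts.
integer_part, fractional_part = total >> 4, total & 0b1111
assert (integer_part, fractional_part) == (0b1010, 0b1110)

# Steps 4 and 6: the fractional part goes through nonlinear transformation unit 1;
# the table value 1111 is taken directly from the text above.
third_data = 0b1111

# Steps 5, 7 and 8: shift by the integer value 1010b = 8 + 2 = 10 to form the accumulation term.
accumulation_term = third_data << integer_part
assert format(accumulation_term, "b") == "11110000000000"
print(format(accumulation_term, "b"))
```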
An example of data processing in the neural network according to the embodiment of the present application will be specifically described below by taking an operand with 3-bit precision as an example.
Step 1, two operands A and B are read out of the memory bit by bit simultaneously, where A = 11.0 and B = 01.1, both having a 2-bit integer part and a 1-bit fractional part. One bit is read out per clock cycle, in order from right to left, giving the bit streams 011 and 110, and after 3 clock cycles the data have been read out completely.
Step 2, the two serial operands (i.e., the two operands read out in step 1) are sent to the serial adder to complete bit-by-bit addition. When the operation starts, the addition 0+1=1 of the first bit is performed first, with a carry value of 0; then the addition 1+1 of the second bit is performed and combined with the carry value from the first bit, so the output is 0 and the carry value is 1. Continuing in this way, the addition of all 3-bit data is completed, and the serial output result is 1001, which represents 100.1.
Step 3, the result 100.1 of the serial adder operation is divided into two parts according to the output order: the fractional part 1 is output first, and the integer part 100 is output afterwards.
Step 4, the fractional part (1) enters the fractional serial-parallel conversion register, and the fractional part is then output in parallel. Here, the serial-in parallel-out process is controlled by the precision configuration controller; since the fractional part has 1 bit, the precision configuration controller can control the serial-parallel conversion register to output the 1 fractional bit in parallel after 1 clock of input.
Step 5, the integer part (100) enters the integer floating point shift controller, which controls the shift amount of the fractional shift Reg according to the input integer value; the shift amount corresponding to 100 is 4 bits.
Step 6, the content of the fractional serial-parallel conversion register is input into the nonlinear transformation unit 1, and after the nonlinear transformation the result is output to the fractional shift Reg. The nonlinear transformation may be an exponential transformation; for example, the result after transformation is 1.
Step 7, data 1 outputted from the nonlinear conversion means 1 is inputted to the fractional shift Reg. And controlling the decimal shift Reg to shift according to the value 4 of the integer floating point shift controller.
Step 8, the result of the shift is 10000 (4 bits are shifted compared with 1), the precision configuration controller outputs an enabling signal after the shift is completed, and the shifted result is output, and the output is an accumulation item.
Step 9, the accumulator accumulates the output accumulation terms one by one.
Step 10, the nonlinear transformation unit 2 performs another nonlinear transformation on the result from the accumulator. This transformation may take the form of a cascade of two transformations: the first is the original nonlinear transformation of the neural network, e.g., Sigmoid or ReLU, and the other is a logarithmic transformation (since the preceding nonlinear transformation unit 1 is an exponential transformation, its inverse is a logarithmic transformation).
Step 11, the output data of the nonlinear conversion unit 2 is subjected to parallel-serial conversion and is sent back to the memory in a serial form.
Therefore, by serializing the operand input and the calculation process of the adder, the embodiment of the application converts multi-precision compatible processing into the counting of clock beats; on this basis, a multi-precision compatible neural network operation architecture can be realized.
Fig. 6 shows a schematic flowchart of a method for data processing in a neural network system according to an embodiment of the present application. The method is performed by a data processing circuit that may include a serial addition circuit, a first nonlinear mapping circuit, an accumulation circuit, and a second nonlinear mapping circuit, the method comprising:
610, serially acquiring each input data in at least one input data and a weight parameter corresponding to each input data through the serial addition circuit, and performing serial addition operation on each input data and the weight parameter corresponding to each input data to obtain at least one first data;
620, performing, by the first nonlinear mapping circuit, a first nonlinear mapping on each first data in the at least one first data to obtain at least one second data, wherein the first nonlinear mapping is a base-2 exponential transformation;
630, accumulating the at least one second data by the accumulating circuit;
640, performing a second nonlinear mapping on the accumulation result output by the accumulation circuit by using the second nonlinear mapping circuit to obtain output data, where the second nonlinear mapping is determined according to the nonlinear mapping of the neural network and the first nonlinear mapping.
Optionally, the data processing circuit further includes a serial-parallel conversion circuit, and the method further includes:
and acquiring at least part of the data of each first data through the serial and outputting the at least part of the data of each first data to the first nonlinear mapping circuit in parallel.
Optionally, the first nonlinear mapping circuit includes a nonlinear mapping unit, a shift control circuit and a shift register circuit, where the performing, by the first nonlinear mapping circuit, first nonlinear mapping on each first data in the at least one first data to obtain at least one second data includes:
the decimal part of each first data is obtained in parallel through the nonlinear mapping unit, the first nonlinear mapping is performed on the decimal part of each first data to obtain third data, and the third data is output to the shift register circuit;
Acquiring an integer part of each first data through the shift control circuit, and outputting a control signal to the shift register circuit according to the integer part of each first data;
and the shift register circuit shifts the third data according to the control signal to obtain the at least one second data.
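To make the division of labour between the three sub-circuits concrete, the following sketch separates a first data value into its decimal (fractional) part, which feeds the nonlinear mapping unit, and its integer part, which drives the shift control circuit; the shift register then shifts the mapped value. This is a behavioural sketch only: the fixed-point width FRAC_BITS, the use of a floating-point 2^x in place of a lookup table, and the assumption of non-negative first data are all illustrative.

```python
FRAC_BITS = 8  # assumed fixed-point width of the mapped fractional value

def first_nonlinear_mapping(first_data: float) -> int:
    """Behavioural model: 2^first_data realised as a fractional mapping plus an integer shift.
    Assumes first_data is non-negative."""
    int_part = int(first_data)             # fed to the shift control circuit
    frac_part = first_data - int_part      # fed to the nonlinear mapping unit
    third_data = round((2.0 ** frac_part) * (1 << FRAC_BITS))  # 2^frac in fixed point
    return third_data << int_part          # shift register circuit: shift left by the integer part

# 2^4.5 is about 22.63; the result is returned in units of 2^-FRAC_BITS
print(first_nonlinear_mapping(4.5) / (1 << FRAC_BITS))
```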
Optionally, the control signal is k clock signals, and the shifting, by the shift register circuit, of the third data according to the control signal to obtain the at least one second data includes:
the shift register circuit outputs the third data in series and, after outputting the third data, outputs k zeros in series according to the k clock signals to obtain the second data, wherein k is the value of the integer part of the first data, and k is a positive integer.
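One way to picture this behaviour is bit-serial streaming: the third data is clocked out bit by bit, then k additional zero bits are clocked out, which arithmetically multiplies the received value by 2^k. The sketch below models the stream as a Python list of bits; the most-significant-bit-first ordering and the function name are assumptions made purely for illustration.

```python
def serial_shift_out(third_data: int, width: int, k: int) -> int:
    """Stream `third_data` bit by bit (MSB first is assumed), then k zero bits on the
    k clock signals; reassembling the stream yields third_data shifted left by k bits."""
    stream = [(third_data >> i) & 1 for i in reversed(range(width))]  # data bits, MSB first
    stream += [0] * k                                                 # k zeros after the data
    value = 0
    for bit in stream:                                                # receiver rebuilds the word
        value = (value << 1) | bit
    return value

assert serial_shift_out(0b1, width=1, k=4) == 0b10000  # matches the 1 -> 10000 example above
```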
Optionally, the shift register circuit shifts the third data according to the control signal to obtain the at least one second data, including:
and the shift register circuit shifts the third data by k bits leftwards according to the control signal to obtain the second data, wherein k is the value of the integer part of the first data, and k is a positive integer.
Optionally, the control signal is k clock signals.
Optionally, the control signal is j clock signals, wherein,
the i-th clock signal in the j clock signals corresponds to the i-th bit, counted from the least significant bit to the most significant bit, of the binary number of the integer part, j is the number of bits of the binary number of the integer part, i and j are positive integers, and 1 ≤ i ≤ j;
the shift register circuit shifts the third data by k bits to the left according to the control signal to obtain the second data, including:
when the i-th bit of the binary number is not 0, the shift register circuit shifts the third data leftwards by 2^(i-1) bits according to the i-th clock signal;
when the i-th bit of the binary number is 0, the shift register circuit does not shift the third data.
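This variant behaves like a barrel shifter: the shift amount k is taken in binary, and each set bit of k contributes one partial shift of 2^(i-1) bits, the partial shifts composing to a total shift of k bits. A minimal sketch of that behaviour (bit indexing starting at 1, as in the text; the function name is an assumption):

```python
def barrel_shift(third_data: int, k: int) -> int:
    """Shift `third_data` left by k bits, one stage per binary digit of k:
    the i-th stage (i starting at 1) shifts by 2**(i-1) bits when bit i of k is 1."""
    value = third_data
    i = 1
    while (k >> (i - 1)) != 0:
        if (k >> (i - 1)) & 1:         # i-th bit of k (from the least significant bit) is not 0
            value <<= 1 << (i - 1)     # shift left by 2**(i-1) bits on the i-th clock signal
        i += 1                         # a 0 bit contributes no shift in this stage
    return value

assert barrel_shift(0b1, 5) == 0b1 << 5   # stages of 1 and 4 bits compose to a 5-bit shift
```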
Specifically, the data processing circuit is, for example, the circuit shown in Fig. 3 or Fig. 4 above. That is, the circuit for data processing in Fig. 3 or Fig. 4 can implement the corresponding processes of the method embodiment shown in Fig. 6; for details, refer to the description above. To avoid repetition, the details are not repeated here.
Therefore, in the embodiments of this application, the adder serially acquires the input data and the weight parameters and performs serial addition on them to obtain at least one first data; the first nonlinear mapping circuit then performs the first nonlinear mapping on the at least one first data to obtain at least one second data; the accumulation circuit accumulates the at least one second data; and the second nonlinear mapping circuit performs the second nonlinear mapping on the accumulation result output by the accumulation circuit.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
It should be understood that the first, second, etc. descriptions in the embodiments of the present application are only used for illustrating and distinguishing the description objects, and no order is used, nor does it indicate that the number of the devices in the embodiments of the present application is particularly limited, and no limitation in the embodiments of the present application should be construed.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other medium capable of storing program code.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A circuit for data processing in a neural network system, characterized by comprising a serial addition circuit, a first nonlinear mapping circuit, an accumulation circuit and a second nonlinear mapping circuit, wherein,
the serial addition circuit is used for obtaining each input data in at least one input data and the weight parameter corresponding to each input data in series, and carrying out serial addition operation on each input data and the weight parameter corresponding to each input data to obtain at least one first data;
the first nonlinear mapping circuit is used for performing first nonlinear mapping on each first data in the at least one first data to obtain at least one second data, wherein the first nonlinear mapping is a base-2 exponential transformation;
the accumulation circuit is used for accumulating the at least one second data;
the second nonlinear mapping circuit is used for performing second nonlinear mapping on the accumulation result output by the accumulation circuit to obtain output data, wherein the second nonlinear mapping is determined according to the nonlinear mapping of the neural network and the first nonlinear mapping;
the circuit further comprises:
the serial-parallel conversion circuit is used for serially acquiring at least part of the data of each first data and outputting the at least part of the data of each first data to the first nonlinear mapping circuit in parallel;
The first nonlinear mapping circuit comprises a nonlinear mapping unit, a shift control circuit and a shift register circuit, wherein,
the nonlinear mapping unit is used for obtaining the decimal part of each first data in parallel, carrying out the first nonlinear mapping on the decimal part of each first data to obtain third data, and outputting the third data to the shift register circuit;
the shift control circuit acquires the integer part of each first data and outputs a control signal to the shift register circuit according to the integer part of each first data;
and the shift register circuit shifts the third data according to the control signal to obtain the at least one second data.
2. The circuit according to claim 1, wherein the control signal is k clock signals, the shift register circuit serially outputs the third data and, after outputting the third data, serially outputs k zeros according to the k clock signals to obtain the second data, wherein k is the value of the integer part of the first data, and k is a positive integer.
3. The circuit of claim 1, wherein the shift register circuit shifts the third data by k bits to the left according to the control signal to obtain the second data, wherein k is a value of an integer part of the first data, and k is a positive integer.
4. A circuit according to claim 3, wherein the control signals are k clock signals.
5. A circuit according to claim 3, wherein the control signal is j clock signals, wherein an i-th clock signal of the j clock signals corresponds to an i-th bit, counted from the least significant bit to the most significant bit, of the binary number of the integer part, j is the number of bits of the binary number of the integer part, i and j are positive integers, and 1 ≤ i ≤ j;
wherein, when the i-th bit of the binary number is not 0, the shift register circuit shifts the third data leftwards by 2^(i-1) bits according to the i-th clock signal;
when the i-th bit of the binary number is 0, the shift register circuit does not shift the third data.
6. A method of data processing in a neural network system, the method performed by a data processing circuit comprising a serial addition circuit, a first nonlinear mapping circuit, an accumulation circuit, and a second nonlinear mapping circuit, the method comprising:
serially acquiring, through the serial addition circuit, each input data in at least one input data and a weight parameter corresponding to each input data, and performing serial addition operation on each input data and the weight parameter corresponding to each input data to obtain at least one first data;
performing first nonlinear mapping on each first data in the at least one first data through the first nonlinear mapping circuit to obtain at least one second data, wherein the first nonlinear mapping is a base-2 exponential transformation;
accumulating the at least one second data by the accumulating circuit;
performing second nonlinear mapping on the accumulation result output by the accumulation circuit through the second nonlinear mapping circuit to obtain output data, wherein the second nonlinear mapping is determined according to the nonlinear mapping of the neural network and the first nonlinear mapping;
the data processing circuit further comprises a serial-to-parallel conversion circuit, the method further comprising:
acquiring at least part of the data of each first data through the serial-parallel conversion circuit, and outputting the at least part of the data of each first data to the first nonlinear mapping circuit in parallel;
the first nonlinear mapping circuit includes a nonlinear mapping unit, a shift control circuit and a shift register circuit, wherein the first nonlinear mapping circuit performs first nonlinear mapping on each first data in the at least one first data to obtain at least one second data, and the method includes:
the decimal part of each first data is obtained in parallel through the nonlinear mapping unit, the first nonlinear mapping is performed on the decimal part of each first data to obtain third data, and the third data is output to the shift register circuit;
acquiring an integer part of each first data through the shift control circuit, and outputting a control signal to the shift register circuit according to the integer part of each first data;
and the shift register circuit shifts the third data according to the control signal to obtain the at least one second data.
7. The method of claim 6, wherein the control signal is k clock signals, the shift register circuit shifts the third data according to the control signal to obtain the at least one second data, comprising:
the shift register circuit outputs the third data in series and, after outputting the third data, outputs k zeros in series according to the k clock signals to obtain the second data, wherein k is the value of the integer part of the first data, and k is a positive integer.
8. The method of claim 6, wherein the shift register circuit shifts the third data according to the control signal to obtain the at least one second data, comprising:
And the shift register circuit shifts the third data by k bits leftwards according to the control signal to obtain the second data, wherein k is the value of the integer part of the first data, and k is a positive integer.
9. The method of claim 8, wherein the control signal is k clock signals.
10. The method of claim 8, wherein the control signals are j clock signals, wherein,
the i-th clock signal in the j clock signals corresponds to the i-th bit, counted from the least significant bit to the most significant bit, of the binary number of the integer part, j is the number of bits of the binary number of the integer part, i and j are positive integers, and 1 ≤ i ≤ j;
the shift register circuit shifts the third data by k bits to the left according to the control signal to obtain the second data, including:
when the i-th bit of the binary number is not 0, the shift register circuit shifts the third data leftwards by 2^(i-1) bits according to the i-th clock signal;
when the i-th bit of the binary number is 0, the shift register circuit does not shift the third data.
CN201810506017.2A 2018-05-24 2018-05-24 Circuit and method for data processing in neural network system Active CN110533174B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810506017.2A CN110533174B (en) 2018-05-24 2018-05-24 Circuit and method for data processing in neural network system


Publications (2)

Publication Number Publication Date
CN110533174A CN110533174A (en) 2019-12-03
CN110533174B true CN110533174B (en) 2023-05-12

Family

ID=68657764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810506017.2A Active CN110533174B (en) 2018-05-24 2018-05-24 Circuit and method for data processing in neural network system

Country Status (1)

Country Link
CN (1) CN110533174B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5376920B2 (en) * 2008-12-04 2013-12-25 キヤノン株式会社 Convolution operation circuit, hierarchical convolution operation circuit, and object recognition device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN203299808U (en) * 2013-04-16 2013-11-20 西华大学 Serial bit summator
CN104915177A (en) * 2015-05-22 2015-09-16 浪潮(北京)电子信息产业有限公司 Mixed type summator and efficient mixed type summator
CN105844330A (en) * 2016-03-22 2016-08-10 华为技术有限公司 Data processing method of neural network processor and neural network processor

Also Published As

Publication number Publication date
CN110533174A (en) 2019-12-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant