CN210109863U - Multiplier, device, neural network chip and electronic equipment

Info

Publication number: CN210109863U
Application number: CN201921433511.7U
Authority: CN
Inventors: Not disclosed (inventor name withheld)
Original assignee: Shanghai Cambricon Information Technology Co Ltd
Current assignee: Shanghai Cambricon Information Technology Co Ltd
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2020-02-21
Anticipated expiration: 2029-08-30

Abstract

The application provides a multiplier, a device, a neural network chip and an electronic device. The multiplier comprises a regular signed number encoding circuit, a compression tree group circuit and an accumulation circuit; the output end of the regular signed number encoding circuit is connected with the input end of the compression tree group circuit, and the output end of the compression tree group circuit is connected with the input end of the accumulation circuit. The multiplier can compress the partial products of the target code through the compression tree group circuit to obtain a target operation result, and can perform regular signed number encoding on the received data to be processed, which reduces the number of effective partial products in the multiplication operation and therefore the complexity of the multiplication operation.

Description

Multiplier, device, neural network chip and electronic equipment

Technical Field

The present application relates to the field of computer technologies, and in particular, to a multiplier, a multiplier device, a neural network chip, and an electronic device.

Background

With the continuous development of digital electronic technology, the rapid development of various Artificial Intelligence (AI) chips places ever higher requirements on high-performance digital multipliers. Neural network algorithms are among the algorithms most widely used by intelligent chips, and multiplication performed by a multiplier is a common operation in them.

At present, a multiplier circuit takes every three bits of the multiplier operand as one code, obtains partial products according to the multiplicand, and compresses all the partial products by using a Wallace tree to obtain the multiplication result. However, in this conventional technique, the number of non-zero values in the code is large, so the number of corresponding partial products generated is also large, and the complexity of the multiplication operation implemented by the multiplier is high.

Summary of the Utility Model

In view of the foregoing, it is desirable to provide a multiplier, a chip and an electronic device capable of reducing the number of effective partial products obtained during multiplication to reduce the complexity of multiplication of the multiplier.

An embodiment of the present application provides a multiplier, where the multiplier includes a regular signed number encoding circuit, a compression tree group circuit and an accumulation circuit, wherein the output end of the regular signed number encoding circuit is connected with the input end of the compression tree group circuit, the output end of the compression tree group circuit is connected with the input end of the accumulation circuit, and the compression tree group circuit comprises a counter and a data processing module;

the regular signed number coding circuit is used for carrying out regular signed number coding processing on received data to obtain a partial product of a target code, the counter is used for obtaining the number of high levels in each row of numerical values of the partial product of the target code, the data processing module is used for carrying out logic operation processing on all signals output by the counter to obtain an accumulation operation result, and the accumulation circuit is used for carrying out accumulation processing on the accumulation operation result.

In one embodiment, the regular signed number encoding circuit comprises a regular signed number encoding unit and a partial product obtaining unit, wherein the output end of the regular signed number encoding unit is connected with the input end of the partial product obtaining unit; the regular signed number encoding unit is used for receiving first data and performing regular signed number encoding processing on the first data to obtain a target code, and the partial product obtaining unit is used for receiving second data, obtaining original partial products according to the target code and the second data, performing sign bit extension processing on the original partial products to obtain partial products after sign bit extension, and obtaining the partial products of the target code according to the partial products after sign bit extension.

In one embodiment, the regular signed number encoding unit includes: a data input port and a target code output port; the data input port is configured to receive the first data to be subjected to regular signed number encoding processing, and the target code output port is configured to output the target code obtained by performing regular signed number encoding processing on the received first data.

In one embodiment, the partial product obtaining unit includes: a target code input port, a data input port, and a partial product output port; the target code input port is configured to receive the target code, the data input port is configured to receive the second data, and the partial product output port is configured to output a partial product of the target code obtained according to the target code and the second data.

In one embodiment, the accumulation circuit comprises: an adder for performing addition on the accumulation operation result.

In one embodiment, the adder comprises: a carry signal input port, a sum signal input port and an operation result output port; the carry signal input port is used for receiving a carry signal, the sum signal input port is used for receiving a sum signal, and the operation result output port is used for outputting the target operation result obtained by accumulating the carry signal and the sum signal.

In the multiplier provided by this embodiment, the regular signed number encoding circuit may perform regular signed number encoding processing on the received data to obtain a target code and obtain the partial products of the target code according to the target code; the compression tree group circuit may accumulate the partial products of the target code to obtain an accumulation operation result; and the accumulation circuit may accumulate the accumulation operation result output by the compression tree group circuit again to obtain the target operation result of the multiplication operation.

The machine learning arithmetic device provided by the embodiment of the application comprises one or more multipliers; the machine learning arithmetic device is used for acquiring data to be operated and control information from other processing devices, executing specified machine learning arithmetic and transmitting an execution result to other processing devices through an I/O interface;

when the machine learning arithmetic device comprises a plurality of multipliers, the plurality of multipliers are connected through a preset specific structure and transmit data;

the multipliers are interconnected through a PCIE bus and transmit data so as to support larger-scale machine learning operation; a plurality of multipliers share the same control system or own respective control systems; a plurality of multipliers share a memory or own respective memories; the interconnection mode of a plurality of multipliers is any interconnection topology.

The combined processing device provided by the embodiment of the application comprises the machine learning arithmetic device, a universal interconnection interface and other processing devices; the machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user; the combined processing device may further include a storage device, which is connected to the machine learning arithmetic device and the other processing devices, respectively, and is configured to store data of the machine learning arithmetic device and the other processing devices.

The neural network chip provided by the embodiment of the application comprises the multiplier, the machine learning arithmetic device or the combined processing device.

The neural network chip packaging structure provided by the embodiment of the application comprises the neural network chip.

The board card provided by the embodiment of the application comprises the neural network chip packaging structure.

The embodiment of the application provides an electronic device, which comprises the neural network chip or the board card.

The chip provided by the embodiment of the application comprises at least one multiplier as described in any one of the above.

An electronic device provided by the embodiment of the application comprises the chip.

Drawings

Fig. 1 is a schematic structural diagram of a multiplier according to an embodiment;

Fig. 2 is a schematic diagram of a specific structure of a multiplier according to another embodiment;

Fig. 3 is a schematic diagram illustrating a distribution rule of the partial products of the target code obtained by an 8-bit fixed-point number multiplication operation according to another embodiment;

Fig. 4 is a schematic diagram of a connection structure of the compressor sub-circuits for performing an 8-bit fixed-point number multiplication according to another embodiment;

Fig. 5 is a schematic diagram illustrating a distribution rule of the partial products of all target codes obtained by performing multiplication with 8-bit fixed-point numbers according to another embodiment;

Fig. 6 is a flowchart illustrating a data processing method according to an embodiment;

Fig. 7 is a block diagram of a combined processing device according to an embodiment;

Fig. 8 is a block diagram of another combined processing device according to an embodiment;

Fig. 9 is a schematic structural diagram of a board card according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The multiplier provided by the application can be applied to an AI chip, a Field-Programmable Gate Array (FPGA) chip or other hardware circuit devices for multiplication processing, and a specific structural schematic diagram of the multiplier is shown in FIG. 1.

Fig. 1 is a block diagram of a multiplier according to an embodiment. As shown in fig. 1, the multiplier includes: a regular signed number encoding circuit 11, a compression tree group circuit 12, and an accumulation circuit 13; the output end of the regular signed number encoding circuit 11 is connected with the input end of the compression tree group circuit 12, and the output end of the compression tree group circuit 12 is connected with the input end of the accumulation circuit 13. The regular signed number encoding circuit 11 is configured to perform regular signed number encoding processing on received data to obtain a partial product of a target code, the compression tree group circuit 12 is configured to perform accumulation processing on the partial product of the target code to obtain an accumulation operation result, and the accumulation circuit 13 is configured to perform accumulation processing on the accumulation operation result.

Specifically, the regular signed number encoding circuit 11 may include a plurality of data processing units having different functions, and the data received by the regular signed number encoding circuit 11 may include the multiplier and the multiplicand of a multiplication operation. Optionally, the data processing units with different functions may be data processing units with a regular signed number encoding function. Optionally, the multiplier and the multiplicand may be multi-bit fixed-point numbers with the same bit width. Optionally, the multiplier may accumulate the partial products of the target code obtained by the regular signed number encoding circuit 11 through the compression tree group circuit 12 to obtain an accumulation operation result, and accumulate the accumulation operation result again through the accumulation circuit 13 to obtain the target operation result of the multiplication operation. Optionally, the regular signed number encoding process described above may be characterized as a data processing procedure that encodes data using the values 0, -1 and 1.

It should be noted that the multiplier provided in this embodiment can process fixed-point numbers with a fixed bit width. Optionally, each of the data processing units with different functions may have one input port, and the input ports of the data processing units may have the same function; each data processing unit may also have one output port, and the output ports of the data processing units may have different functions; the circuit structures of the data processing units with different functions may be different.

In the multiplier provided by this embodiment, the regular signed number encoding circuit performs regular signed number encoding on the received data to obtain the partial products of the target code, the compression tree group circuit accumulates the partial products of the target code to obtain an accumulation operation result, and the accumulation circuit accumulates the result output by the compression tree group circuit again to obtain the target operation result of the multiplication operation. Meanwhile, because the multiplier compresses the partial products of the target code through the compression tree group circuit, the delay of the compression circuit in the multiplier is effectively reduced, which improves the operation performance of the multiplier and the overall performance of the chip.
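As an illustration only, the three-stage data flow of this embodiment (encoding, compression, accumulation) can be sketched as a behavioral model in Python. The sketch assumes a non-negative N-bit multiplier and a signed multiplicand, and it substitutes a generic 3:2 carry-save reduction for the compression tree group circuit, whose internal logic is not modeled here; the function names `encode`, `carry_save` and `multiply` are hypothetical and do not come from the embodiment.

```python
def encode(multiplier, n):
    """Recode a non-negative n-bit multiplier into n + 1 digits in {-1, 0, 1}, LSB first."""
    d = [(multiplier >> k) & 1 for k in range(n)] + [0]   # spare top digit
    k = 0
    while k < n:
        if d[k] == 1 and d[k + 1] == 1:
            end = k
            while d[end] == 1:                 # find the end of the run of 1s
                end += 1
            d[k] = -1                          # run "1...1" becomes "1 0...0 -1"
            d[k + 1:end] = [0] * (end - k - 1)
            d[end] = 1
            k = end                            # the new 1 may start another run
        else:
            k += 1
    return d


def carry_save(rows, mask):
    """Generic 3:2 carry-save reduction to two rows (stand-in for the compression stage)."""
    rows = list(rows)
    while len(rows) > 2:
        a, b, c = rows.pop(), rows.pop(), rows.pop()
        rows.append(a ^ b ^ c)                                    # sum row
        rows.append((((a & b) | (a & c) | (b & c)) << 1) & mask)  # carry row
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]


def multiply(multiplier, multiplicand, n=8):
    """Return the 2n-bit two's-complement pattern of multiplier * multiplicand."""
    mask = (1 << (2 * n)) - 1
    digits = encode(multiplier, n)
    # Only non-zero digits produce an effective partial product.
    rows = [((digit * multiplicand) << i) & mask
            for i, digit in enumerate(digits) if digit]
    sum_row, carry_row = carry_save(rows, mask)
    return (sum_row + carry_row) & mask        # final accumulation


product = multiply(110, -23)
print(hex(product))
```

Zero digits of the code contribute no row, which is how the reduction in effective partial products shows up in this model.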

Fig. 2 is a schematic structural diagram of a multiplier provided in another embodiment, where the multiplier includes a regular signed number encoding circuit 11, and the regular signed number encoding circuit 11 includes: a regular signed number encoding unit 111 and a partial product acquisition unit 112; the output end of the regular signed number encoding unit 111 is connected with the input end of the partial product obtaining unit 112. The regular signed number encoding unit 111 is configured to receive first data, perform regular signed number encoding processing on the first data to obtain a target code, and the partial product obtaining unit 112 is configured to receive second data, obtain an original partial product according to the target code and the second data, perform sign bit extension processing on the original partial product to obtain a partial product after sign bit extension, and obtain a partial product of the target code according to the partial product after sign bit extension.

Specifically, the regular signed number encoding unit 111 may receive first data, which may be the multiplier in a multiplication operation, and the multiplier may be a fixed-point number; the partial product obtaining unit 112 may receive second data, which may be the multiplicand in the multiplication operation, and the multiplicand may be a fixed-point number; the partial products of the target code may be obtained according to the multiplicand and the target code. It should be noted that the regular signed number encoding process can be characterized as follows: for an N-bit multiplier, the bits are processed from the low-order bits to the high-order bits; if there are l (l ≥ 2) consecutive bits of value 1, these l consecutive 1 bits can be converted into the data "1(0)_{l-1}(-1)", that is, a 1 followed by (l-1) zeros and a -1, and the remaining (N-l) bits are combined with the converted (l+1) digits to obtain new data; the new data is then used as the initial data of the next stage of conversion processing, until the new data obtained after the conversion processing contains no run of l (l ≥ 2) consecutive bits of value 1. After an N-bit multiplier is subjected to regular signed number encoding processing, the bit width of the obtained target code may be equal to (N+1). Further, in the regular signed number encoding process, the data 11 can be converted into (100 - 001), that is, the data 11 can be equivalently converted into 10(-1); the data 111 can be converted into (1000 - 0001), that is, the data 111 can be equivalently converted into 100(-1); and so on, other runs of l (l ≥ 2) consecutive bits of value 1 are converted in a similar manner.

For example, suppose the multiplier received by the regular signed number encoding unit 111 is "001010101101110". The first new data obtained by performing the first-stage conversion processing on the multiplier is "0010101011100(-1)0"; the second new data obtained by performing the second-stage conversion processing on the first new data is "0010101100(-1)00(-1)0"; the third new data obtained by performing the third-stage conversion processing on the second new data is "0010110(-1)00(-1)00(-1)0"; the fourth new data obtained by performing the fourth-stage conversion processing on the third new data is "00110(-1)0(-1)00(-1)00(-1)0"; and the fifth new data obtained by performing the fifth-stage conversion processing on the fourth new data is "010(-1)0(-1)0(-1)00(-1)00(-1)0". Since the fifth new data contains no run of l (l ≥ 2) consecutive bits of value 1, it may be called the initial code; after one bit is padded onto the initial code, the regular signed number encoding process is complete and the intermediate code is obtained, where the bit width of the initial code may be equal to the bit width of the multiplier. Optionally, after the regular signed number encoding unit 111 performs the regular signed number encoding processing on the multiplier and obtains the new data (i.e., the initial code), if the highest bit and the second-highest bit of the new data are "10" or "01", the regular signed number encoding unit 111 may pad one bit of value 0 above the highest bit of the new data, so that the three highest bits of the corresponding intermediate code are "010" or "001", respectively. Optionally, the bit width of the intermediate code may be equal to the bit width of the data currently processed by the multiplier plus 1. Optionally, the bit width of the target code may be equal to the bit width N of the fixed-point number currently received by the multiplier plus 1, and may also be equal to the number of original partial products.
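The staged conversion described above can be sketched in Python as follows. This is a minimal sketch, assuming the multiplier is given as a bit string of a non-negative value; the names `regular_signed_encode`, `to_text` and `value_of` are hypothetical helpers introduced only for illustration. The result is one digit wider than the input, which corresponds to padding one 0 above the initial code.

```python
def regular_signed_encode(bits):
    """Recode a bit string into digits in {-1, 0, 1}, MSB first, N + 1 digits wide."""
    # Work LSB first; the spare top digit receives a possible final carry.
    d = [int(c) for c in reversed(bits)] + [0]
    k = 0
    while k < len(d) - 1:
        if d[k] == 1 and d[k + 1] == 1:
            end = k
            while d[end] == 1:      # the digit above a run is always 0, so this stops
                end += 1
            d[k] = -1               # lowest bit of the run becomes -1
            for j in range(k + 1, end):
                d[j] = 0            # the other l - 1 positions of the run become 0
            d[end] = 1              # a 1 appears just above the run
            k = end                 # that 1 may join the next run (next stage)
        else:
            k += 1
    return d[::-1]


def to_text(digits):
    """Render digits in the notation used in the description, e.g. 10(-1)."""
    return "".join("(-1)" if x == -1 else str(x) for x in digits)


def value_of(digits):
    """Numeric value of an MSB-first signed-digit list."""
    total = 0
    for x in digits:
        total = 2 * total + x
    return total


code = regular_signed_encode("001010101101110")
print(to_text(code))    # the fifth-stage result above with one 0 padded in front
```

Each pass through the `if` branch corresponds to one conversion stage of the example, and `value_of(code)` equals the value of the original bit string, so the recoding is value-preserving.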

It should be noted that the partial product obtaining unit 112 may obtain the corresponding original partial products according to each digit of the target code and the received multiplicand, perform sign bit extension processing on each original partial product to obtain the partial products after sign bit extension, and perform shift processing on the partial products after sign bit extension to obtain the partial products of the target code. Optionally, the number of partial products after sign bit extension may be equal to the number of original partial products. Optionally, the original partial product may be a partial product without sign bit extension, and its bit width may be equal to N; the bit width of the partial product after sign bit extension may be equal to the sum of the bit width of the original partial product and the number of sign extension bits, and may also be equal to 2N, that is, the number of sign extension bits may be equal to N. Optionally, the number of partial products of the target code may be equal to the number of partial products after sign bit extension.
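The three steps performed by the partial product obtaining unit (original partial product, sign bit extension, shift) can be sketched as follows. This is a behavioral sketch under stated assumptions: the target code is passed as a list of digits with the lowest digit first, the multiplicand is a signed Python integer, and masking to 2N bits stands in for the sign bit extension to a 2N-bit two's-complement pattern. The function name `partial_products` is hypothetical.

```python
def partial_products(code_lsb_first, multiplicand, n):
    """Return one 2n-bit row per digit of the target code."""
    width = 2 * n
    mask = (1 << width) - 1
    rows = []
    for i, digit in enumerate(code_lsb_first):
        original = digit * multiplicand      # 0, +X or -X depending on the digit
        extended = original & mask           # 2n-bit two's-complement pattern (sign-extended)
        shifted = (extended << i) & mask     # align with the weight of digit i
        rows.append(shifted)
    return rows


# Digits of 110 = 128 - 16 - 2, lowest digit first (N = 8, so 9 digits).
code = [0, -1, 0, 0, -1, 0, 0, 1, 0]
rows = partial_products(code, -23, 8)
print([hex(r) for r in rows])
print(hex(sum(rows) & 0xFFFF))               # equals (110 * -23) modulo 2**16
```

Rows belonging to a 0 digit are all zero, so only three of the nine rows in this example are effective partial products.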

In the distribution rule of the partial products of all target codes, the bit width M₀ of the partial product of the first target code may be equal to the bit width 2N of the first corresponding partial product after sign bit extension; the bit width M₁ of the partial product of the second target code may be one bit less than the bit width 2N of the second corresponding partial product after sign bit extension, where the omitted bit may be the highest bit of that partial product after sign bit extension and may not take part in the final addition operation; and so on. The bit width M_i of the partial product of each target code may be one bit less than the bit width M_{i-1} of the partial product of the previous target code, and may also be equal to the bit width 2N of the corresponding partial product after sign bit extension minus (i-1), where i denotes the number of the partial product of the target code, counted from 1.

In the multiplier provided by this embodiment, the regular signed number encoding unit may perform regular signed number encoding on the received data to obtain the target code; the partial product obtaining unit obtains the original partial products according to each digit of the target code, performs sign bit extension on the original partial products to obtain the partial products after sign bit extension, and obtains the partial products of the target code from them; the compression tree group circuit then compresses the partial products of the target code to obtain an operation result, and the accumulation circuit accumulates the operation result to obtain the target operation result of the multiplication operation. Because the regular signed number encoding circuit performs regular signed number encoding on the received data, the number of effective partial products obtained during the multiplication operation is reduced, which reduces the complexity of the multiplication operation, improves its operation efficiency, and effectively reduces the power consumption of the multiplier; meanwhile, the multiplier can effectively reduce the delay of the compression circuit and improve the operation performance of the multiplier and the overall performance of the chip.

In one embodiment, the regular signed number encoding unit 111 includes: a data input port 1111 and a target code output port 1112; the data input port 1111 is configured to receive the first data to be subjected to regular signed number encoding, and the target code output port 1112 is configured to output the target code obtained by performing regular signed number encoding on the received first data.

Specifically, if the regular signed number encoding unit 111 receives first data through the data input port 1111, the first data may be the multiplier in a multiplication operation; the regular signed number encoding unit 111 may perform regular signed number encoding processing on the received multiplier to obtain a target code, and output the target code through the target code output port 1112. It should be noted that the target code may include three values, namely -1, 0 and 1.

In the multiplier provided by this embodiment, the regular signed number encoding unit may perform regular signed number encoding on the received data to obtain the target code, the partial product obtaining unit may obtain a corresponding partial product of the target code according to each bit value in the target code, and compress the partial product of the target code through the compression tree group circuit to obtain an operation result, and accumulate the operation result through the accumulation circuit to obtain a target operation result of multiplication; meanwhile, the multiplier can effectively reduce the delay of a compression circuit and improve the operational performance of the multiplier and the overall performance of a chip.

In one embodiment, the multiplier comprises a partial product obtaining unit 112, and the partial product obtaining unit 112 comprises: a target code input port 1121, a data input port 1122, and a partial product output port 1123; the target code input port 1121 is configured to receive the target code, the data input port 1122 is configured to receive the second data, and the partial product output port 1123 is configured to output a partial product of the target code obtained according to the target code and the second data.

Specifically, the partial product obtaining unit 112 may receive, through the target code input port 1121, each digit of the target code output by the regular signed number encoding unit 111, where each digit may be -1, 0 or 1; obtain an original partial product according to these three possible values in the target code and the second data received through the data input port 1122; perform sign bit extension processing on the original partial product to obtain a partial product after sign bit extension; perform shift processing on the partial product after sign bit extension to obtain a partial product of the target code; and then output the partial product of the target code through the partial product output port 1123. Optionally, the second data received by the data input port 1122 may be the multiplicand in a multiplication operation, and the multiplicand may be a fixed-point number. Optionally, each bit in the original partial product and in the partial product after sign bit extension is binary data 0 or 1, where 0 may represent a low-level signal and 1 may represent a high-level signal.

It should be noted that, the values of the sign bit extension bits in the partial product after the sign bit extension are all equal, and may be equal to the highest-order value in the original partial product, or may be understood that the values of the high N bits in the partial product after the sign bit extension are all equal, where N represents the bit width of the data currently received by the multiplier. Optionally, the sign bit extension bit may have a bit width equal to N-1.

For example, if the multiplier currently processes an 8-bit by 8-bit fixed-point multiplication and an original partial product obtained by the partial product obtaining unit 112 is "p₈p₇p₆p₅p₄p₃p₂p₁p₀", the partial product after sign bit extension can be represented as "p₈p₈p₈p₈p₈p₈p₈p₈p₇p₆p₅p₄p₃p₂p₁p₀".
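The sign bit extension of this example can be sketched in Python as follows, assuming the original partial product is held as a 9-bit pattern (p₈ down to p₀) inside an integer and is extended to 16 bits; the function name `sign_extend` and its parameters are hypothetical.

```python
def sign_extend(pattern, from_bits, to_bits):
    """Copy the top bit of a from_bits-wide pattern into all higher bits up to to_bits."""
    pattern &= (1 << from_bits) - 1
    sign = (pattern >> (from_bits - 1)) & 1
    if sign:
        # Set every bit from position from_bits up to to_bits - 1.
        pattern |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return pattern


extended = sign_extend(0b101101001, 9, 16)   # p8 = 1
print(format(extended, "016b"))              # 1111111101101001
```

In the printed pattern the eight highest bits all repeat p₈, followed by p₇ to p₀ unchanged, which matches the layout given in the example.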

In the multiplier provided by this embodiment, the partial product obtaining unit may obtain the corresponding partial product after sign bit expansion according to each bit value included in the target code, perform carry shift processing on the partial product after sign bit expansion to obtain the partial product of the target code, perform compression processing on the partial product after sign bit expansion through the compression tree group circuit to obtain an operation result, and perform accumulation processing on the operation result through the accumulation circuit to obtain a target operation result of multiplication operation, where the number of effective partial products that can be obtained by the multiplier is small, so that the complexity of realizing multiplication operation by the multiplier is reduced, the operation efficiency of multiplication operation is improved, and the power consumption of the multiplier is effectively reduced; meanwhile, the multiplier can effectively reduce the delay of a compression circuit and improve the operational performance of the multiplier and the overall performance of a chip.

In one embodiment, continuing with the detailed structural diagram of the multiplier shown in fig. 2, the multiplier includes the compression tree group circuit 12, and the compression tree group circuit 12 includes a plurality of compressor sub-circuits 121 to 12n, which are used for accumulating the values of each column in the partial products of all the received target codes to obtain the accumulation operation result.

Specifically, each compressor sub-circuit in the compression tree group circuit 12 may include a plurality of basic logic devices with different functions; in addition, each of the compressor sub-circuits 121 to 12n may be understood as a circuit that performs logic operation processing on a multi-bit input signal to obtain a two-bit output signal. Optionally, the number n of compressor sub-circuits included in the compression tree group circuit 12 may be equal to the bit width of the partial product after sign bit extension, that is, 2N; the n compressor sub-circuits may process data in parallel, although they may be connected in series. Optionally, each compressor sub-circuit in the compression tree group circuit 12 may add up the values of one column of the partial products of all target codes, and each compressor sub-circuit may output two signals, namely a carry signal Carry_i and a sum signal Sum_i, where i may denote the number of the corresponding compressor sub-circuit, the first compressor sub-circuit being numbered 0. Optionally, the number of input signals received by each compressor sub-circuit may be equal to the number of target codes, and may also be equal to the number of partial products after sign bit extension or the number of partial products of the target code. Optionally, the distribution rule of the partial products of all target codes may be characterized as follows: the partial product of the first target code may be equal to the first corresponding partial product after sign bit extension, and, starting from the partial product of the second target code, the partial product of each target code may be equal to the lower (2N-i) bits of the corresponding partial product after sign bit extension, where i denotes the number of the partial product of the target code, counted from 0; meanwhile, the highest bits of the partial products of all target codes are located in the same column, the second-highest bits are located in the same column, and so on.

It should be noted that the signals received by each compressor sub-circuit in the compression tree group circuit 12 may include a carry input signal Cin_i and the partial product input signals after sign bit extension. Optionally, the partial product input signal of the target code received by each compressor sub-circuit may be the values of one column of the partial products of all target codes, and the carry output signal Cout_i output by each compressor sub-circuit may be 1 bit wide. Optionally, the carry input signal Cin_i received by each compressor sub-circuit in the compression tree group circuit 12 may be the carry output signal Cout_{i-1} output by the previous compressor sub-circuit, and the carry input signal Cin₀ received by the first compressor sub-circuit is 0.

For example, if the multiplier is currently processing an 8-bit by 8-bit fixed-point multiplication, each partial product after sign bit extension obtained by the partial product obtaining unit 112 in the multiplier may be represented as "p_{i,15} p_{i,14} p_{i,13} p_{i,12} p_{i,11} p_{i,10} p_{i,9} p_{i,8} p_{i,7} p_{i,6} p_{i,5} p_{i,4} p_{i,3} p_{i,2} p_{i,1} p_{i,0}" (i = 1, …, n; n = 9), and the corresponding partial product of the target code is obtained from each partial product after sign bit extension, where i may denote the number of the partial product after sign bit extension, counted from 1. When the partial products after sign bit extension are accumulated by the plurality of compressor sub-circuits 121 to 12n, the distribution rule of the partial products of the 9 target codes may be as shown in fig. 3, in which "○" may represent a value in the original partial product and "●" may represent a sign bit value in the partial product after sign bit extension. From the rightmost column to the leftmost column, a total of 16 compressor sub-circuits are required to accumulate the partial products of the 4 target codes; the connection circuit diagram of the 16 compressor sub-circuits is shown in fig. 4, where Compressor_s denotes a compressor sub-circuit, s is the number of the compressor sub-circuit, counted from 0, and a solid line connecting two compressor sub-circuits may indicate that the carry output signal of the lower-numbered compressor sub-circuit is input to the adjacent higher-numbered compressor sub-circuit.
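The column-wise arrangement in this 8-bit example, where compressor sub-circuit s receives bit s of every partial product, can be sketched as follows. The sketch assumes the partial products are already available as 16-bit integer rows; the function name `columns` and the sample rows (digits of 110 applied to a multiplicand of -23) are chosen only for illustration and do not come from the embodiment.

```python
def columns(rows, width=16):
    """Column s holds bit s of every row, i.e. the inputs of compressor sub-circuit s."""
    return [[(row >> s) & 1 for row in rows] for s in range(width)]


# Nine 16-bit rows for an 8-bit case: one row per digit of a 9-digit target code.
digits = [0, -1, 0, 0, -1, 0, 0, 1, 0]            # lowest digit first, value 110
rows = [((d * -23) << i) & 0xFFFF for i, d in enumerate(digits)]

cols = columns(rows)
for s, col in enumerate(cols):
    print(s, col, "high levels:", sum(col))
```

Weighting the number of high levels in column s by 2^s and summing over all 16 columns gives the same total as adding the rows directly, which is why column-wise counting can replace row-wise addition.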

According to the multiplier provided by the embodiment, the partial product of the target code can be compressed through the compression tree group circuit, so that the delay of the compression circuit in the multiplier is effectively reduced, and the operation performance of the multiplier and the overall performance of a chip are improved; meanwhile, the number of effective partial products which can be obtained by the multiplier is small, so that the complexity of the multiplier for realizing multiplication is reduced, the operation efficiency of the multiplication is improved, and the power consumption of the multiplier is effectively reduced.

In one embodiment, continuing with the detailed structural diagram of the multiplier shown in fig. 2, the multiplier includes the compressor sub-circuits 121 to 12n, and the compressor sub-circuits 121 to 12n include: counters 1211 to 12n1 and data processing modules 1212 to 12n2; the counters 1211 to 12n1 are used for obtaining the number of high levels in the received input signals, and the data processing modules 1212 to 12n2 are used for performing logic operation processing on the received input signals.

Specifically, the output end of each counter may be connected to the input end of the corresponding data processing module. Optionally, each compressor sub-circuit may include one counter and one data processing module. The plurality of compressor sub-circuits 121 to 12n in the compression tree group circuit 12 may be connected in series, where an output terminal of the data processing module in each compressor sub-circuit may be connected to an input terminal of the counter in the next compressor sub-circuit; nevertheless, the compressor sub-circuits 121 to 12n may process the columns of values of all the partial products of the target code in parallel. Here n may represent the number of counters included in the compressor sub-circuits 121 to 12n, and also the number of data processing modules included in them. Optionally, the counter in each compressor sub-circuit may receive the corresponding column of values in all the partial products of the target code and obtain the number of high levels in that column. In addition, the data processing module in each compressor sub-circuit may receive the output signals of the counter in the same compressor sub-circuit and perform logic operations on them to obtain two output signals, namely the sum bit output signal Sum_i and the carry output signal Carry_i, where i may be the number of the compressor sub-circuit starting from 1.

It should be noted that the counter included in each compressor sub-circuit may include a plurality of input ports, and the number of input ports may be equal to the data bit width N currently processed by the multiplier plus 1, where each input port may receive one of the values in the corresponding column of all the partial products of the target code. Meanwhile, the counter may include a plurality of output ports, the number of which may be equal to (N+3). One of the output ports may output the carry output signal Cout_i of the counter, and the remaining output ports may output the signals Num_m indicating the number of high levels in the input signals received by the counter, where m may represent a possible number of high levels in the input signals received by the counter.

For example, if the partial product obtaining unit 112 obtains n partial products of the target code, the counter in each compressor sub-circuit can receive n values. If the corresponding column of all the partial products of the target code contains fewer than n values, that is, only m (m < n) values, the n-m missing values can be replaced by the value 0. Optionally, if one column of values in all the partial products of the target code is I₀, I₁, …, I_{n-1}, Num_m may be equal to the result of an OR logic operation performed by an OR gate on A₁, A₂, … and A_l, where m > l, m = 0, 1, ..., n, and A_l is the result of an AND operation, performed by an AND circuit, on the n-l values obtained by inverting any n-l of the n values in the column and the remaining l values that are not inverted. Cout_i may be equal to |Num_m (m ≥ 2), where "|" denotes the OR operator; correspondingly, the larger n is, the more bits Cout_i has. Additionally, the number of bits N_Cout of the Cout_i output by each compressor sub-circuit may be equal to floor(n/2) - 1, where floor(·) denotes rounding down. Each bit value of Cout_i may have the same weight as the value of Carry_i. In the present embodiment, Cout_i takes priority over Carry_i for carrying; that is, if one column of values in all the partial products of the target code contains q values of 1 and therefore produces floor(q/2) carry bits, the carries are first placed in Cout_i and then in Carry_i.

For example, if the partial product obtaining unit 112 obtains 9 partial products of the target code, the distribution rule of the 9 partial products may again be as shown in fig. 3. The blank positions below each column may be padded with the value 0, and the distribution rule after padding with the value 0 is shown in fig. 5.

The multiplier can accumulate the 9 partial products of the target code through 16 compressor sub-circuits. The counter in each compressor sub-circuit receives one column of values of the 9 partial products, and the number of high-level signals among the 9 values in each column can be 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9. Therefore, apart from one carry output signal port, each counter has 10 count output ports, whose outputs can be represented by Num_m, namely Num_0, Num_1, Num_2, Num_3, Num_4, Num_5, Num_6, Num_7, Num_8 and Num_9 in this example. If a column of values in the 9 partial products of the target code is I₀, I₁, I₂, I₃, I₄, I₅, I₆, I₇, I₈, the values of Num_0 to Num_9 can be obtained by the following logical operation formula, and the other output signal Cout_i of the counter can be obtained from Num_2 to Num_9 by the following logical operation formula.

Num_9＝(I₀&I₁&I₂&I₃&I₄&I₅&I₆&I₇&I₈)；

W＝floor(n+Cin_i-1)，l＝floor((m+l)/2)-1。

Alternatively, the value of l may be calculated by the formula l = floor((m + l)/2) - 1.

Here m represents the value m in Num_m, and the maximum bit width of the carry output signal Cout_i that can be output by the counter in one compressor sub-circuit is l. When the bit width p of the signal W is greater than the maximum bit width l of Cout_i, l bits of the p-bit signal may be output as the carry output signal Cout_i, and the remaining (p-l) bits are written into the carry output signal Carry_i output by the data processing module in the compressor sub-circuit, where (p-l) may be equal to 1 or 0; otherwise, when the bit width p of W is less than or equal to the maximum bit width l of Cout_i, the p-bit signal may be output as the carry output signal Cout_i. In addition, Num_m may be equal to the result of an OR logic operation over a set of formulas, one for each combination of values in a column of the 9 partial products of the target code, each formula being an AND logic operation on inverted and non-inverted values of the column such that Num_m is at high level when the column contains m high-level values. Optionally, the bit width of the carry input signal is equal to the bit width of the carry output signal.
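The counter outputs described above can be modelled behaviourally. The following Python sketch is illustrative only and is not the claimed circuit. It assumes that each Num_m is the OR, over every choice of m column positions, of the AND of those m values with the inverses of the remaining values, which agrees with the formula given for Num_9, and it collapses Cout_i into a single flag equal to the OR of Num_2 to Num_n. The names `counter_outputs` and `carry_flag` are hypothetical.

```python
from itertools import combinations


def counter_outputs(column):
    """Return the one-hot list [Num_0, ..., Num_n] for one column of bits."""
    n = len(column)
    nums = []
    for m in range(n + 1):
        num_m = 0
        # One AND term per way of choosing which m inputs are high.
        for high in combinations(range(n), m):
            term = 1
            for k in range(n):
                term &= column[k] if k in high else 1 - column[k]
            num_m |= term  # OR of all AND terms
        nums.append(num_m)
    return nums


def carry_flag(nums):
    """OR of Num_2..Num_n (assumed single-bit stand-in for Cout_i)."""
    flag = 0
    for num_m in nums[2:]:
        flag |= num_m
    return flag


column = [1, 0, 1, 1, 0, 0, 1, 0, 1]  # nine values I0..I8, five of them high
nums = counter_outputs(column)
print(nums)              # only Num_5 is high
print(carry_flag(nums))  # 1
```

In this model exactly one Num_m is high for any column, which is the one-hot property that the data processing module relies on.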

In addition, the data processing module included in each compressor sub-circuit may receive a plurality of input signals, which may be denoted as Num_0, …, Num_m and Cin_i, where Cin_i may represent the carry input signal of the data processing module. The input signals Num_0, …, Num_m and Cin_i received by the data processing module may correspond, respectively, to the signals Num_0, …, Num_m output by the counter in the same compressor sub-circuit and to the signal Cout output by the counter in the previous compressor sub-circuit, where the carry input signal Cin_i received in the first compressor sub-circuit may be equal to 0. Optionally, the data processing module included in each compressor sub-circuit may output two signals, namely the sum bit output signal Sum_i and the carry output signal Carry_i. Optionally, each data processing module may determine its output signals from the received input signals through a logic circuit of AND gates and OR gates.

If only one signal Num_l is at high level among all the signals Num_0, …, Num_l, …, Num_m output by the counter in a compressor sub-circuit, and the sum of the value l corresponding to that high-level signal (that is, the number of high-level signals input to the counter) and the number of non-zero carry input signals Cin_i received by the data processing module is odd, the sum bit output signal Sum_i output by the data processing module in that compressor sub-circuit may be equal to the high-level signal 1; if this sum is even, Sum_i may be equal to the low-level signal 0. Meanwhile, if, in a compressor sub-circuit, the sum of the number of high-level signals indicated by the Num_m signals output by the counter and the number of Cin_i input signals received by the data processing module is odd and, after being divided by 2 and rounded down, is greater than the number of bits of Cout_i output by the counter, the carry output signal Carry_i output by the data processing module in that compressor sub-circuit may be equal to the high-level signal 1; otherwise, Carry_i may be equal to the low-level signal 0.

Illustratively, continuing the above example, let the carry input signal of the data processing module in the compressor sub-circuit be Cin_i. If (((Num_1 | Num_3 | Num_5 | Num_7 | Num_9) == 1) & (Cin_i == 0)) | (((Num_0 | Num_2 | Num_4 | Num_6 | Num_8) == 1) & (Cin_i == 1)), the sum bit output signal Sum_i may be equal to the high-level signal 1; otherwise it may be equal to the low-level signal 0. Optionally, the maximum bit width of the carry output signal Cout_i that can be output by the counter in the compressor sub-circuit is l; when the bit width p of the signal W is greater than l, l bits of the p-bit signal may be output as the carry output signal Cout_i, and the remaining (p-l) bits are written into the carry output signal Carry_i output by the data processing module in the compressor sub-circuit.
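The sum bit rule can be checked with a short model. The Python sketch below is illustrative: it assumes a one-hot list Num_0 to Num_9 and a single-bit carry input, and it follows the parity rule stated above, under which Sum_i is high when the number of high column values plus the carry input is odd. The names `sum_bit` and `one_hot` are hypothetical.

```python
def sum_bit(nums, cin):
    """Sum_i from one-hot counts Num_0..Num_9 and a one-bit carry input."""
    odd = nums[1] | nums[3] | nums[5] | nums[7] | nums[9]
    even = nums[0] | nums[2] | nums[4] | nums[6] | nums[8]
    return int((odd == 1 and cin == 0) or (even == 1 and cin == 1))


def one_hot(count, size=10):
    """Helper: list in which only position `count` is high."""
    return [1 if m == count else 0 for m in range(size)]


# The gate-level expression agrees with the parity of (count + cin).
for count in range(10):
    for cin in (0, 1):
        assert sum_bit(one_hot(count), cin) == (count + cin) % 2
```

A carry input wider than one bit would need the parity of all of its asserted bits; the single-bit case is used here only to keep the sketch short.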

Optionally, the counter and the data processing module included in each compressor sub-circuit may finally output two signals, Carry_i and Sum_i, where i may represent the number, starting from 0, of the compressor sub-circuit in the compression tree group circuit 12.

In one embodiment, continuing with the detailed structural diagram of the multiplier shown in fig. 2, the multiplier includes the accumulation circuit 13, and the accumulation circuit 13 includes an adder 131, where the adder 131 is configured to add the accumulated results.

Specifically, adders 131 of different bit widths may be used, and the adder 131 may be a carry look-ahead adder. Optionally, the adder 131 may receive the two signals output by the compression tree group circuit 12, add them, and output the target operation result of the multiplication operation.

According to the multiplier provided by the embodiment, the partial product of the target code can be compressed through the compression tree group circuit to obtain two paths of output signals, and the two output signals are subjected to addition operation through the accumulation circuit to obtain the target operation result of multiplication operation, so that the delay of the compression circuit in the multiplier can be effectively reduced, and the operation performance of the multiplier and the overall performance of a chip are improved; meanwhile, the number of effective partial products which can be obtained by the multiplier is small, so that the complexity of the multiplier for realizing multiplication is reduced, the operation efficiency of the multiplication is improved, and the power consumption of the multiplier is effectively reduced.

In one embodiment, continuing with the detailed structural diagram of the multiplier shown in fig. 2, the multiplier includes the adder 131, and the adder 131 includes: a carry signal input port 1311, a sum signal input port 1312, and an operation result output port 1313; the carry signal input port 1311 is configured to receive a carry signal, the sum bit signal input port 1312 is configured to receive a sum bit signal, and the operation result output port 1313 is configured to output the target operation result obtained by performing accumulation processing on the carry signal and the sum bit signal.

Specifically, the adder 131 may receive the Carry signal Carry output by the compression tree group circuit 12 through the Carry signal input port 1311, receive the Sum signal Sum output by the compression tree group circuit 12 through the Sum signal input port 1312, add the Carry signal Carry and the Sum signal Sum, and output the result through the operation result output port 1313.

It should be noted that, during multiplication, the multiplier may use adders 131 of different bit widths to add the carry output signal Carry and the sum bit output signal Sum output by the compression tree group circuit 12, where the bit width of the data that the adder 131 can process may be equal to 2 times the bit width N of the data currently processed by the multiplier. Optionally, each compressor sub-circuit in the compression tree group circuit 12 may output a carry output signal Carry_i and a sum bit output signal Sum_i (i = 0, …, 2N-1, where i is the number of each compressor sub-circuit, starting from 0). Optionally, the carry output signal received by the adder 131 may be Carry = {[Carry_0 : Carry_{2N-2}], 0}; that is, the bit width of the carry output signal Carry received by the adder 131 is 2N, the first 2N-1 bit values of Carry correspond to the carry output signals of the first 2N-1 compressor sub-circuits in the compression tree group circuit 12, and the last bit value of Carry may be replaced by the value 0. Optionally, the sum bit output signal Sum received by the adder 131 has a bit width of 2N, and the values in Sum may be equal to the sum bit output signals of the individual compressor sub-circuits in the compression tree group circuit 12.

For example, if the multiplier is currently processing an 8-bit by 8-bit fixed-point multiplication, the adder 131 may be a 16-bit carry look-ahead adder. With continued reference to fig. 4, the compression tree group circuit 12 may output the sum bit output signal Sum and the carry output signal Carry of the 16 compressor sub-circuits. The sum bit signal received by the 16-bit carry look-ahead adder may be the complete sum bit signal Sum output by the compression tree group circuit 12, whereas the received carry signal may consist of all the carry output signals except the one output by the last compressor sub-circuit in the compression tree group circuit 12, combined with the value 0.
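The operand assembly for the adder can be illustrated as follows. This Python sketch is an assumption-laden model, not the adder 131 itself: it assumes the value 0 combined with the carry signal occupies the least significant position, so that Carry_i is weighted as column i+1, and it produces the Sum and Carry test bits with an ordinary 3:2 carry-save step, which is a stand-in technique and not the compression tree group circuit. The name `final_add` is hypothetical.

```python
def final_add(sum_bits, carry_bits):
    """Add Sum and {Carry[0..2N-2], 0}; bit lists are least significant first."""
    width = len(sum_bits)  # 2N
    sum_word = sum(bit << i for i, bit in enumerate(sum_bits))
    # Carry_i is weighted as column i+1; the last carry bit is dropped.
    carry_word = sum(bit << (i + 1) for i, bit in enumerate(carry_bits[:-1]))
    return (sum_word + carry_word) % (1 << width)


# Stand-in data: a 3:2 carry-save step over three 16-bit words.
a, b, c = 0x1234, 0x0FF0, 0x00FF
width = 16
sum_bits = [((a ^ b ^ c) >> i) & 1 for i in range(width)]
carry_bits = [(((a & b) | (a & c) | (b & c)) >> i) & 1 for i in range(width)]
result = final_add(sum_bits, carry_bits)
print(hex(result))
```

Because the carry word is shifted by one position, a 2N-bit adder suffices and the carry of the last column falls outside the result width.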

According to the multiplier provided by the embodiment, the accumulation circuit can be used for performing accumulation operation on two paths of signals output by the compression tree group circuit and outputting a target operation result of the multiplication operation, so that the delay of the compression circuit in the multiplier can be effectively reduced, and the operation performance of the multiplier and the overall performance of a chip are improved; meanwhile, the number of effective partial products which can be obtained by the multiplier is small, so that the complexity of the multiplier for realizing multiplication is reduced, the operation efficiency of the multiplication is improved, and the power consumption of the multiplier is effectively reduced.

Fig. 6 is a flowchart illustrating a data processing method according to an embodiment, which may be processed by the multipliers shown in fig. 1 and fig. 2, where the embodiment relates to a process of data multiplication. As shown in fig. 6, the method includes:

S101, receiving data to be processed.

Specifically, the multiplier may receive data to be processed through the regular signed number encoding circuit, the data to be processed may be fixed-point numbers, and the fixed-point numbers may be multipliers and multiplicands in the multiplication operation. Wherein the bit width of the multiplier may be equal to the bit width of the multiplicand.

S102, performing regular signed number coding processing on the data to be processed to obtain a target code.

Specifically, the multiplier may perform regular signed number encoding processing on a multiplier in the received multiplication operation through a regular signed number encoding unit, so as to obtain a target code. Wherein, the bit width of the target code may be equal to the bit width N plus 1 of the multiplier in the multiplication operation.

Optionally, the step of performing regular signed number coding processing on the data to be processed in S102 to obtain the target code may include: converting l consecutive bit values 1 in the data to be processed into (l+1) bit values in which the highest bit is 1, the lowest bit is -1 and the remaining bits are 0, to obtain the target code, where l is greater than or equal to 2.

It should be noted that the regular signed number coding processing can be characterized as follows: for an N-bit multiplier, processing proceeds from the lower-order values to the higher-order values; if there are l (l ≥ 2) consecutive bit values 1, these l consecutive bit values 1 can be converted into the data "1(0)_{l-1}(-1)", that is, a 1 followed by l-1 zeros and a -1, and the remaining (N-l) bit values are combined with the converted (l+1) bit values to obtain new data. The new data is then used as the initial data of the next stage of conversion processing, until the new data obtained after conversion no longer contains l (l ≥ 2) consecutive bit values 1. After the N-bit multiplier is subjected to regular signed number coding processing, the bit width of the obtained target code can be equal to (N+1).
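The conversion rule can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the encoding circuit: the multiplier is treated as a non-negative N-bit integer, digits are stored least significant first, and the names `regular_signed_encode` and `value_of` are hypothetical.

```python
def regular_signed_encode(value, n_bits):
    """Encode a non-negative n_bits-wide integer into n_bits + 1 digits in {-1, 0, 1}."""
    digits = [(value >> k) & 1 for k in range(n_bits)] + [0]
    k = 0
    while k < n_bits:
        if digits[k] == 1:
            end = k
            while end + 1 < n_bits and digits[end + 1] == 1:
                end += 1
            if end - k + 1 >= 2:          # run of l >= 2 consecutive ones
                digits[k] = -1            # lowest digit of the run becomes -1
                for j in range(k + 1, end + 1):
                    digits[j] = 0         # middle digits become 0
                digits[end + 1] = 1       # one digit above the run becomes 1
                k = end + 1               # the new 1 may start another run
                continue
        k += 1
    return digits


def value_of(digits):
    """Numeric value of a signed-digit list (least significant digit first)."""
    return sum(d << k for k, d in enumerate(digits))


print(regular_signed_encode(0b0110111, 8))  # [-1, 0, 0, -1, 0, 0, 1, 0, 0]
```

In the printed example the value 55 is represented as 64 - 8 - 1, so only three of the nine digits are non-zero, compared with five non-zero bits in the binary form.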

S103, performing hierarchical processing on the data to be processed and the target code to obtain a partial product of the target code.

Specifically, the multiplier may obtain the partial product of the target code according to the received multiplicand in the multiplication operation and the target code obtained by the regular signed number encoding unit through the partial product obtaining unit. Optionally, the target code may include three values, which are-1, 1 and 0, respectively. Optionally, the bit width of the target code may be equal to the bit width N plus 1 of the data to be processed received by the multiplier. Alternatively, the number of partial products of the target code may be equal to the bit width of the target code. Optionally, the hierarchical processing may be three-level processing, the first-level processing may be conversion processing, the second-level processing may be sign bit extension processing, and the third-level processing may be shift processing.

S104, compressing the partial product of the target code to obtain a target operation result.

Specifically, the multiplier may perform compression processing on each column number value in the partial product of all target codes through a compression tree group circuit to obtain a compression result, and output the compression result. Alternatively, the compression process may be a logical operation process.

The data processing method provided by the embodiment receives data to be processed, performs regular signed number coding processing on the data to be processed to obtain a target code, and obtains a partial product of the target code according to the data to be processed and the target code, and performs compression processing on the partial product of the target code to obtain a target operation result, and the method can perform regular signed number coding processing on the received data to be processed, so that the number of effective partial products obtained in a multiplication operation process is reduced, the complexity of multiplication operation is reduced, the operation efficiency of multiplication operation is improved, and the power consumption of a multiplier is effectively reduced; meanwhile, the method can effectively reduce the delay of a compression circuit in the multiplier, and improve the operational performance of the multiplier and the overall performance of a chip.

Another embodiment provides a data processing method, in which the step S103 performs hierarchical processing on the data to be processed and the target code to obtain a partial product of the target code, including:

S1031, performing conversion processing according to the data to be processed and the target code to obtain an original partial product.

It should be noted that the number of original partial products may be equal to the bit width of the target code. Alternatively, the conversion process may be characterized as converting the target code into the original partial product based on the data to be processed, which may be a multiplicand.

Illustratively, if the partial product obtaining unit receives an 8-bit multiplicand "x₇x₆x₅x₄x₃x₂x₁x₀" (i.e., X), the partial product obtaining unit may directly obtain the corresponding original partial products from the multiplicand X and the three values -1, 0 and 1 contained in the target code: the original partial product may be -X when the value of a bit in the target code is -1, 0 when the value of a bit in the target code is 0, and X when the value of a bit in the target code is 1.

S1032, sign bit expansion processing is carried out on the original partial product, and the partial product after sign bit expansion is obtained.

Specifically, the bit width of the partial product after sign bit extension may be equal to 2 times of the bit width N of the data currently processed by the multiplier, the bit width of the original partial product may be equal to N, and the bit number of the sign bit extension bit may be equal to N. Optionally, the sign bit extension processing may be understood as filling a value of the sign bit extension bit with a value of a sign bit in the original partial product, where the value of the sign bit may be a highest-order value in the original partial product, and obtaining a 2N-bit-wide sign bit extended partial product. Optionally, in the distribution rule of the partial products after all sign bit extensions, the highest-order numerical value in the partial products after all sign bit extensions may be located in the same column, the lowest-order numerical value may be located in the same column, and other corresponding numerical values may also correspond to the same column.

Illustratively, if the multiplier is currently processing an 8-bit by 8-bit fixed-point multiplication and an original partial product obtained by the partial product obtaining unit is p₈p₇p₆p₅p₄p₃p₂p₁p₀, the partial product after sign bit extension can be expressed as p₈p₈p₈p₈p₈p₈p₈p₈p₇p₆p₅p₄p₃p₂p₁p₀.

S1033, shifting the partial product after the sign bit expansion to obtain the partial product of the target code.

Specifically, each partial product after sign bit expansion is shifted to obtain a corresponding partial product of the target code. Alternatively, the number of partial products after sign bit extension may be equal to the number of partial products of the target code.

Optionally, the step of performing shift processing on the partial product after sign bit extension in S1033 to obtain the partial product of the target code may specifically include: performing a left shift operation on the partial product after sign bit extension to obtain the partial product of the target code.

It should be noted that each bit value in the target code may correspond to a number, and the numbering starts from 1. Optionally, the target code may include high-order values and low-order values. Optionally, the numbering of the sign-bit-extended partial products may start from 1 at the partial product obtained from the lowest-order value. Optionally, the first partial product of the target code may be equal to the first sign-bit-extended partial product. Starting from the second partial product of the target code, each partial product of the target code may be obtained by left-shifting the corresponding sign-bit-extended partial product by (i-1) bits, so that only its (2N-i+1) lower-order bit values are retained within the 2N-bit width; equivalently, the bit values shifted beyond the 2N-bit width do not take part in the final accumulation operation, where i represents the number corresponding to each bit value in the target code.
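The three-level processing (conversion, sign bit extension, shift) can be sketched as follows. This Python model is illustrative only: it assumes the multiplicand is a signed integer that fits in N bits, that the target code is given as a list of digits in {-1, 0, 1} with the lowest-order digit first, and that two's-complement wrapping to 2N bits stands in for sign bit extension. The name `partial_products` is hypothetical.

```python
def partial_products(multiplicand, digits, n_bits):
    """Return one 2N-bit word per target-code digit (lowest-order digit first)."""
    width = 2 * n_bits
    mask = (1 << width) - 1
    products = []
    for position, digit in enumerate(digits):   # position = i - 1
        original = digit * multiplicand         # conversion: -X, 0 or X
        extended = original & mask              # sign extension to 2N bits
        # Left shift by (i - 1); bits pushed past 2N bits are discarded.
        products.append((extended << position) & mask)
    return products


# Target code of the multiplier 7 for N = 8: 7 = 8 - 1, nine digits.
digits = [-1, 0, 0, 1, 0, 0, 0, 0, 0]
pps = partial_products(5, digits, 8)
print(sum(pps) & 0xFFFF)  # 35
```

Only two of the nine partial products are non-zero here, which is the reduction in effective partial products that the encoding is intended to achieve.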

According to the data processing method provided by the embodiment, an original partial product is obtained according to data to be processed and a target code, sign bit expansion processing is carried out on the original partial product to obtain a partial product after sign bit expansion, shift processing is carried out on the partial product after sign bit expansion to obtain a partial product of the target code, accumulation processing is further carried out on the partial product of the target code, and a target operation result is output; meanwhile, the method can carry out regular signed number coding processing on the received data to be processed, thereby reducing the number of effective partial products obtained in the multiplication process, reducing the complexity of multiplication, improving the operation efficiency of multiplication and effectively reducing the power consumption of the multiplier.

In one embodiment, the step of compressing the partial product of the target code in S104 to obtain the target operation result may be specifically implemented by the following processes:

S1041, receiving each column of values in all the partial products of the target code, and obtaining the number of high-level signals in each column of values.

Specifically, the multiplier may receive a corresponding column of values in the partial product of all target codes through each compressor sub-circuit in the compression tree group circuit, and the number of high-level signals in the received column of values may be obtained through a counter in the compressor sub-circuit.

It should be noted that the counter may also obtain the carry input signal Cin_i received by the data processing module in the next compressor sub-circuit.

S1042, carrying out XOR logic operation according to the number of the high level signals to obtain two paths of output signals.

Specifically, the multiplier may perform an XOR logic operation on the number of high-level signals in a column of values of all the partial products of the target code, obtained by the counter in each compressor sub-circuit, and the carry input signal Cin_i received by the data processing module in that compressor sub-circuit, to obtain the two output signals of each compressor sub-circuit, namely the carry output signal Carry_i and the sum bit output signal Sum_i, where i may represent the number, starting from 0, of each compressor sub-circuit in the compression tree group circuit.

S1043, accumulating the two paths of output signals to obtain the target operation result.

Specifically, the multiplier may, through the accumulation circuit, accumulate the carry output signal Carry_i and the sum bit output signal Sum_i obtained by each compressor sub-circuit in the compression tree group circuit, and output the target operation result of the multiplication operation. Optionally, the accumulation processing may be understood as the accumulation circuit adding all the carry output signals Carry_i output by the compression tree group circuit to all the sum bit signals in which the last sum bit signal Sum_{2N-1} has been replaced by the value 0.

In the data processing method provided by this embodiment, each column of values in partial products of all target codes is received, the number of high level signals in each column of values is obtained, an exclusive or logic operation is performed according to the number of high level signals in each column of values, two paths of output signals are obtained, and the two paths of output signals are accumulated to obtain a target operation result, so that the delay of a compression circuit in a multiplier can be effectively reduced, and the operation performance of the multiplier and the overall performance of a chip are improved; meanwhile, the method can carry out regular signed number coding processing on the received data to be processed, thereby reducing the number of effective partial products obtained in the multiplication process, reducing the complexity of multiplication, improving the operation efficiency of multiplication and effectively reducing the power consumption of the multiplier.
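Steps S1041 to S1043 can be modelled end to end with a short behavioural sketch. The Python code below is a simplification under stated assumptions, not the claimed circuit: the carry input of each column is modelled as a count of asserted carry bits, up to `max_cout` carries are passed to the next column through Cout, at most one further carry is written to Carry_i, and the final addition follows the adder description in which Carry is shifted by one position and the last carry is dropped. The names `compress_columns`, `accumulate` and `max_cout` are hypothetical.

```python
def compress_columns(partial_products, width, max_cout):
    """Return (sum_bits, carry_bits), least significant column first."""
    sum_bits, carry_bits = [], []
    cin = 0  # the first compressor sub-circuit receives a carry input of 0
    for col in range(width):
        ones = sum((pp >> col) & 1 for pp in partial_products)  # S1041
        total = ones + cin
        sum_bits.append(total & 1)       # S1042: parity (XOR) gives Sum_i
        carries = total >> 1             # carries owed to column col + 1
        cout = min(carries, max_cout)    # Cout takes priority
        carry = carries - cout           # the remainder goes to Carry_i
        if carry > 1:
            raise ValueError("max_cout too small for this column height")
        carry_bits.append(carry)
        cin = cout
    return sum_bits, carry_bits


def accumulate(sum_bits, carry_bits):
    """S1043: add Sum and Carry, with Carry weighted one column higher."""
    width = len(sum_bits)
    total = sum(bit << i for i, bit in enumerate(sum_bits))
    total += sum(bit << (i + 1) for i, bit in enumerate(carry_bits[:-1]))
    return total % (1 << width)


pps = [0xFFFF] * 9  # nine 16-bit partial products, every column full
sum_bits, carry_bits = compress_columns(pps, 16, max_cout=7)
print(accumulate(sum_bits, carry_bits) == sum(pps) % (1 << 16))  # True
```

The check at the end relies on the column identity total = Sum_i + 2 × (Cout + Carry_i), so the model preserves the sum of the partial products modulo 2^(2N) whenever `max_cout` is large enough that at most one carry remains for Carry_i.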

The embodiment of the application also provides a machine learning operation device, which includes one or more of the multipliers mentioned in the application and is used for obtaining data to be operated on and control information from other processing devices, executing a specified machine learning operation, and transmitting the execution result to peripheral devices through an I/O interface. The peripheral devices are, for example, cameras, displays, mice, keyboards, network cards, wifi interfaces and servers. When more than one multiplier is included, the multipliers can be linked and can transmit data through a specific structure, for example by being interconnected and transmitting data over a PCIE bus, to support larger-scale machine learning operations. In this case, the multipliers may share the same control system or have separate control systems; the memory may be shared, or each accelerator may have its own memory. In addition, the interconnection mode can be any interconnection topology.

The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.

The embodiment of the application also provides a combined processing device, which includes the machine learning arithmetic device, a universal interconnection interface and other processing devices. The machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user. Fig. 7 is a schematic diagram of the combined processing device.

The other processing devices include one or more types of general-purpose or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs) and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, including data transfer, and perform basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices can also cooperate with the machine learning arithmetic device to complete operation tasks.

The universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and the other processing devices. The machine learning arithmetic device obtains the required input data from the other processing devices and writes it into a storage device on the machine learning arithmetic device; it can obtain control instructions from the other processing devices and write them into a control cache on the chip of the machine learning arithmetic device; it can also read the data in the storage module of the machine learning arithmetic device and transmit it to the other processing devices.

Optionally, as shown in Fig. 8, the combined processing device may further include a storage device, which is connected to the machine learning arithmetic device and the other processing devices, respectively. The storage device is used for storing data of the machine learning arithmetic device and the other processing devices, and is particularly suitable for data to be operated on that cannot be held entirely in the internal storage of the machine learning arithmetic device or the other processing devices.

The combined processing device can be used as the SOC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle, or video monitoring equipment, which effectively reduces the core area of the control part, increases the processing speed, and reduces the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a WiFi interface.

In some embodiments, a chip is also provided, which includes the above machine learning arithmetic device or the combined processing device.

In some embodiments, a chip package structure is provided, which includes the above chip.

In some embodiments, a board card is provided, which includes the above chip package structure. As shown in Fig. 9, the board card may include, in addition to the chip 389, other supporting components, including but not limited to: a memory device 390, a receiving device 391, and a control device 392.

The memory device 390 is connected to the chip in the chip package structure through a bus and is used for storing data. The memory device may include a plurality of groups of memory units 393. Each group of memory units is connected with the chip through a bus. It is understood that each group of memory units may be DDR SDRAM (double data rate synchronous dynamic random access memory).

DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read out on both the rising and falling edges of the clock pulse. In one embodiment, the memory device may include 4 groups of memory units. Each group of memory units may include a plurality of DDR4 chips. In one embodiment, the chip may internally include four 72-bit DDR4 controllers; in each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are adopted in each group of memory units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
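The 25600 MB/s figure can be checked with simple arithmetic. The following Python sketch is illustrative only and is not part of the described embodiment; the function and constant names are hypothetical. It assumes that "DDR4-3200" denotes 3200 million transfers per second and that only the 64 data bits of the 72-bit controller carry payload.

```python
# Illustrative check of the theoretical DDR4 bandwidth quoted in the text.
# All names are hypothetical; they are not taken from the application.

CONTROLLER_BITS = 72                     # width of one DDR4 controller
ECC_BITS = 8                             # bits reserved for ECC checking
DATA_BITS = CONTROLLER_BITS - ECC_BITS   # 64 bits carry payload data


def ddr_peak_bandwidth_mb_s(transfers_mt_s: int, data_bits: int) -> float:
    """Peak bandwidth in MB/s: mega-transfers per second times bytes per transfer."""
    return transfers_mt_s * data_bits / 8


# DDR4-3200 is assumed to mean 3200 MT/s (two transfers per 1600 MHz clock cycle).
per_group_mb_s = ddr_peak_bandwidth_mb_s(3200, DATA_BITS)
print(per_group_mb_s)
```

Under these assumptions the result is the bandwidth of a single 64-bit data path; the ECC bits do not contribute to the payload rate.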

In one embodiment, each group of memory units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is arranged in the chip and is used for controlling the data transmission and data storage of each memory unit.

The receiving device is electrically connected with the chip in the chip package structure. The receiving device is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the receiving device may be a standard PCIE interface: the data to be processed is transmitted from the server to the chip through the standard PCIE interface, thereby implementing data transfer. Preferably, when a PCIE 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the receiving device may also be another interface; the present application does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the calculation result of the chip is likewise transmitted back to the external device (e.g., a server) by the receiving device.
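The PCIE figure can be reproduced in the same way. The sketch below is illustrative only, with hypothetical function names; it assumes the PCIE 3.0 signalling rate of 8 GT/s per lane and the 128b/130b line encoding defined for that generation. The 16000 MB/s value quoted in the text corresponds to the raw rate in one direction, before encoding overhead is subtracted.

```python
# Illustrative check of the theoretical PCIE 3.0 x16 bandwidth quoted in the text.
# Function names are hypothetical; they are not taken from the application.


def pcie_raw_mb_s(gt_per_s: float, lanes: int) -> float:
    """Raw one-direction rate in MB/s: one bit per transfer per lane, 8 bits per byte."""
    return gt_per_s * 1000 * lanes / 8


def pcie_effective_mb_s(gt_per_s: float, lanes: int,
                        payload_bits: int = 128, line_bits: int = 130) -> float:
    """Rate after line-encoding overhead (128b/130b for PCIE 3.0)."""
    return pcie_raw_mb_s(gt_per_s, lanes) * payload_bits / line_bits


raw_mb_s = pcie_raw_mb_s(8.0, 16)              # the figure quoted in the text
effective_mb_s = pcie_effective_mb_s(8.0, 16)  # slightly lower after encoding
print(raw_mb_s, round(effective_mb_s))
```

Protocol overhead such as packet headers would lower the achievable throughput further, so both values are upper bounds.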

The control device is electrically connected with the chip. The control device is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads. Therefore, the chip can be in different working states such as multi-load and light-load. The control device can regulate the working states of the plurality of processing chips, the plurality of processing cores, and/or the plurality of processing circuits in the chip.

In some embodiments, an electronic device is provided that includes the above board card.

The electronic device may be a multiplier, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camera, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

The vehicle comprises an airplane, a ship and/or a motor vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.

It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of circuit combinations, but those skilled in the art should understand that the present application is not limited by the described circuit combinations, because some circuits may be implemented in other ways or structures according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all alternative embodiments, and that the devices and modules referred to are not necessarily required for this application.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A multiplier, characterized in that it comprises: the device comprises a regular signed number coding circuit, a compression tree group circuit and an accumulation circuit, wherein the output end of the regular signed number coding circuit is connected with the input end of the compression tree group circuit, the output end of the compression tree group circuit is connected with the input end of the accumulation circuit, and the compression tree group circuit comprises: a counter and a data processing module;

2. The multiplier of claim 1, wherein the regular signed number encoding circuit comprises: the system comprises a regular signed number coding unit and a partial product acquisition unit, wherein the output end of the regular signed number coding unit is connected with the input end of the partial product acquisition unit; the regular signed number coding unit is used for receiving first data and performing regular signed number coding processing on the first data to obtain target codes, the partial product obtaining unit is used for receiving second data, obtaining original partial products according to the target codes and the second data, performing sign bit expansion processing on the original partial products to obtain partial products after sign bit expansion, and obtaining the partial products of the target codes according to the partial products after sign bit expansion.

3. The multiplier of claim 2, wherein the regular signed number encoding unit comprises: a data input port and a target code output port; the data input port is configured to receive the first data to be subjected to regular signed number encoding processing, and the target code output port is configured to output the target code obtained by performing regular signed number encoding processing on the received first data.

4. The multiplier according to claim 2 or 3, wherein the partial product obtaining unit comprises: a target code input port, a data input port, and a partial product output port; the target code input port is configured to receive the target code, the data input port is configured to receive the second data, and the partial product output port is configured to output a partial product of the target code obtained according to the target code and the second data.

5. The multiplier of claim 1, wherein the accumulation circuit comprises: an adder for adding the result of the addition operation.

6. A machine learning operation device, wherein the machine learning operation device comprises one or more multipliers according to any one of claims 1 to 5, and is configured to obtain input data and control information to be operated from other processing devices except the multipliers in the machine learning operation device, execute a specified machine learning operation, and transmit an execution result to other processing devices except the multipliers in the machine learning operation device through an I/O interface;

when the machine learning arithmetic device comprises a plurality of multipliers, the multipliers are connected through a preset structure and transmit data;

7. A combined processing apparatus, characterized in that the combined processing apparatus comprises the machine learning arithmetic apparatus according to claim 6, a common interconnection interface, and processing means other than the machine learning arithmetic apparatus in the combined processing apparatus;

and the machine learning arithmetic device interacts with other processing devices except the machine learning arithmetic device in the combined processing device to jointly complete the calculation operation designated by the user.

8. The combined processing device according to claim 7, further comprising: a storage device, which is connected to the machine learning arithmetic device and to the processing devices in the combined processing device other than the machine learning arithmetic device and the storage device, respectively, and is used for storing data of the machine learning arithmetic device and of the processing devices in the combined processing device other than the machine learning arithmetic device and the storage device.

9. A neural network chip, comprising the machine learning computation device of claim 6 or the combined processing device of claim 8.

10. An electronic device, characterized in that it comprises a chip according to claim 9.