CN111258546A - Multiplier, data processing method, chip and electronic equipment - Google Patents

Info

Publication number
CN111258546A
Authority
CN
China
Prior art keywords
partial product
data
circuit
multiplier
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811450841.7A
Other languages
Chinese (zh)
Other versions
CN111258546B (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN201811450841.7A
Priority to PCT/CN2019/120994 (WO2020108486A1)
Publication of CN111258546A
Application granted
Publication of CN111258546B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52: Multiplying; Dividing
    • G06F7/523: Multiplying only
    • G06F7/527: Multiplying only in serial-parallel fashion, i.e. one operand being entered serially and the other in parallel
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a multiplier, a data processing method, a chip and an electronic device. The multiplier comprises an encoding circuit, a deformed Wallace tree group circuit and an accumulation circuit; the output end of the encoding circuit is connected with the input end of the deformed Wallace tree group circuit, and the output end of the deformed Wallace tree group circuit is connected with the input end of the accumulation circuit. The multiplier can eliminate the processing of 0 values while fully preserving the accuracy of its operation, thereby effectively reducing its power consumption.

Description

Multiplier, data processing method, chip and electronic equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a multiplier, a data processing method, a chip, and an electronic device.
Background
With the continuous development of digital electronic technology, the rapid development of various Artificial Intelligence (AI) chips places ever higher demands on high-performance digital multipliers. Neural network algorithms are among the algorithms most widely used by intelligent chips, and multiplication performed by a multiplier is a common operation in them.
At present, a multiplier uses the Booth algorithm to obtain partial products, compresses the partial products through a Wallace tree, accumulates the compression results with an adder, and outputs the final result.
However, in the conventional technique, after the sign bit extension of the partial products and the negate-and-add-one operation have been eliminated, signal toggling is still required when the partial products are compressed by the Wallace tree, even if the sign bit of a partial product and the data bit corresponding to negate-and-add-one are both 0, which increases the power consumption of the multiplier.
Disclosure of Invention
In view of the above, it is desirable to provide a multiplier, a data processing method, a chip and an electronic device.
An embodiment of the present invention provides a multiplier, where the multiplier includes an encoding circuit, a deformed Wallace tree group circuit and an accumulation circuit; the output end of the encoding circuit is connected with the input end of the deformed Wallace tree group circuit, and the output end of the deformed Wallace tree group circuit is connected with the input end of the accumulation circuit;
the encoding circuit is used for encoding received data to obtain a partial product of a target code, the deformed Wallace tree group circuit is used for accumulating the partial product of the target code, and the accumulation circuit is used for accumulating received input data.
In one embodiment, the encoding circuit includes a Booth encoding sub-circuit and a partial product obtaining sub-circuit, wherein the output end of the Booth encoding sub-circuit is connected with the input end of the partial product obtaining sub-circuit; the Booth encoding sub-circuit is used for performing Booth encoding on received data to obtain an encoded signal, and the partial product obtaining sub-circuit is used for obtaining an original partial product according to the encoded signal and performing optimization processing on the original partial product to obtain the partial product of the target code.
In one embodiment, the Booth encoding sub-circuit includes a data input port and an encoded signal output port; the data input port is used for receiving the data to be Booth encoded, and the encoded signal output port is used for outputting the encoded signal obtained after the received data is Booth encoded.
In one embodiment, the partial product obtaining sub-circuit includes a sign bit correction extension unit and a correction negation unit, wherein the sign bit correction extension unit is used for performing sign bit extension elimination on the original partial product to obtain a partial product after sign bit extension elimination, and the correction negation unit is used for performing negation-elimination plus-one processing on the original partial product to obtain a plus-one bit value.
In one embodiment, the deformed Wallace tree group circuit includes a plurality of deformed Wallace tree sub-circuits, which are used for performing corrected accumulation processing on the partial product of the target code.
In one embodiment, the accumulation circuit includes an adder, and the adder is used for adding two received data items of the same bit width.
In one embodiment, the adder includes a carry output signal input port, a sum bit output signal input port and a result output port; the carry output signal input port is used for receiving a carry output signal, the sum bit output signal input port is used for receiving a sum bit output signal, and the result output port is used for outputting the result of accumulating the carry output signal and the sum bit output signal.
In the multiplier provided by this embodiment, the encoding circuit encodes the received data to obtain the partial product of the target code, the deformed Wallace tree group circuit accumulates the partial product of the target code, and the accumulation circuit accumulates the result obtained by the deformed Wallace tree group circuit again to obtain the final operation result.
The embodiment of the invention provides a data processing method, which comprises the following steps:
receiving data to be processed;
coding the data to be processed to obtain a coding result, and obtaining a partial product of a target code according to the data to be processed and the coding result;
and correcting and accumulating the partial product of the target code to obtain an operation result.
In one embodiment, the encoding the data to be processed to obtain an encoding result, and obtaining a partial product of a target code by performing optimization processing according to the data to be processed and the encoding result, includes:
performing Booth coding processing on the data to be processed to obtain a coding signal;
and obtaining the partial product of the target code through optimization processing according to the data to be processed and the coding signal.
In one embodiment, the obtaining the partial product of the target code according to the data to be processed and the code signal and through an optimization process includes:
obtaining an original partial product according to the data to be processed and the coded signal;
carrying out sign bit elimination extension processing on the original partial product to obtain a partial product subjected to sign bit elimination extension;
obtaining a plus-one bit value in the partial product of the target code according to the encoded signal;
and obtaining the partial product of the target code according to the partial product after eliminating sign bit extension and the plus one bit value.
According to the data processing method provided by the embodiment, the data to be processed is received, the data to be processed is coded to obtain a coding result, the partial product of the target code is obtained through optimization processing according to the data to be processed and the coding result, and the partial product of the target code is corrected and accumulated to obtain an operation result.
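The three steps of the method (encode, obtain partial products, accumulate) can be illustrated with a minimal software sketch. It assumes signed two's-complement operands of even bit width `n` and a plain radix-4 Booth recoding; the function names `booth_digits` and `booth_multiply` are hypothetical, and the sketch sums the partial products directly rather than modelling the sign bit extension elimination or the Wallace tree described in this document.

```python
def booth_digits(y: int, n: int) -> list[int]:
    """Radix-4 Booth digits (each in {-2, -1, 0, 1, 2}) of an n-bit two's-complement y."""
    # Append one 0 bit after the least significant bit (the padding bit y[-1]).
    padded = (y & ((1 << n) - 1)) << 1
    digits = []
    for i in range(n // 2):
        group = (padded >> (2 * i)) & 0b111  # bits y[2i+1], y[2i], y[2i-1]
        hi, mid, lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digits.append(-2 * hi + mid + lo)
    return digits


def booth_multiply(x: int, y: int, n: int = 8) -> int:
    """Multiply two signed n-bit integers by summing Booth partial products."""
    # Partial product i is digit_i * x, weighted by 4**i.
    return sum((d * x) << (2 * i) for i, d in enumerate(booth_digits(y, n)))
```

For 8-bit operands this produces four partial products per multiplication, one per encoded signal, and their weighted sum equals the ordinary product.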
The embodiment of the present invention provides a machine learning arithmetic device, which includes one or more multipliers described in the first aspect; the machine learning arithmetic device is used for acquiring data to be operated and control information from other processing devices, executing specified machine learning arithmetic and transmitting an execution result to other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of multipliers, the multipliers can be linked through a specific structure and transmit data;
the multipliers are interconnected through a PCIE bus and transmit data so as to support larger-scale machine learning operation; a plurality of multipliers share the same control system or own respective control systems; a plurality of multipliers share a memory or own respective memories; the interconnection mode of a plurality of multipliers is any interconnection topology.
The combined processing device provided by the embodiment of the invention comprises the machine learning processing device, the universal interconnection interface and other processing devices; the machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user; the combined processing device may further include a storage device, which is connected to the machine learning arithmetic device and the other processing device, respectively, and is configured to store data of the machine learning arithmetic device and the other processing device.
The neural network chip provided by the embodiment of the invention comprises the multiplier, the machine learning arithmetic device or the combined processing device.
The neural network chip packaging structure provided by the embodiment of the invention comprises the neural network chip.
The board card provided by the embodiment of the invention comprises the neural network chip packaging structure.
The embodiment of the application provides an electronic device, which comprises the neural network chip or the board card.
An embodiment of the present invention provides a chip, including at least one multiplier as described in any one of the above.
The electronic equipment provided by the embodiment of the invention comprises the chip.
Drawings
Fig. 1 is a schematic structural diagram of a multiplier according to an embodiment;
Fig. 2 is a schematic diagram of a specific structure of a multiplier according to another embodiment;
Fig. 3 is a schematic diagram illustrating a distribution rule of all partial products of a target code obtained by 8-bit data multiplication according to another embodiment;
Fig. 4 is a schematic diagram illustrating a connection structure of a deformed Wallace tree sub-circuit for performing an 8-bit data multiplication according to another embodiment;
Fig. 5 is a schematic flow chart of a data processing method according to an embodiment;
Fig. 6 is a flowchart illustrating a method for obtaining a partial product of a target code according to another embodiment;
Fig. 7 is a flowchart illustrating a specific method for obtaining a partial product of a target code according to another embodiment;
Fig. 8 is a block diagram of a combined processing device according to an embodiment;
Fig. 9 is a block diagram of another combined processing device according to an embodiment;
Fig. 10 is a schematic structural diagram of a board card according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The multiplier provided by the application can be applied to an AI chip, a Field-Programmable Gate Array (FPGA) chip or other hardware circuit devices for multiplication processing, and a specific structural schematic diagram of the multiplier is shown in FIG. 1.
Fig. 1 is a schematic diagram of a specific structure of a multiplier according to an embodiment. As shown in Fig. 1, the multiplier includes an encoding circuit 11, a deformed Wallace tree group circuit 12 and an accumulation circuit 13, wherein the output end of the encoding circuit 11 is connected with the input end of the deformed Wallace tree group circuit 12, and the output end of the deformed Wallace tree group circuit 12 is connected with the input end of the accumulation circuit 13. The encoding circuit 11 is configured to encode received data to obtain a partial product of a target code, the deformed Wallace tree group circuit 12 is configured to accumulate the partial product of the target code, and the accumulation circuit 13 is configured to accumulate received input data.
Specifically, the encoding circuit 11 may include a plurality of data processing units with different functions, and the data received by the encoding circuit 11 may serve as the multiplier or as the multiplicand in the subsequent multiplication operation. Optionally, the data processing units with different functions may be data processing units with a binary encoding function. Optionally, the multiplier and the multiplicand may be fixed-point numbers with multi-bit widths. Optionally, the deformed Wallace tree group circuit 12 may accumulate the values in the partial products of the target code obtained by the encoding circuit 11 to obtain an accumulation result, and the accumulation circuit 13 may accumulate that result again to obtain the final result of the multiplication operation.
It should be noted that, when the multiplier performs the same multiplication, the multiplier and the multiplicand received by the encoding circuit 11 are data with the same bit width, and in this embodiment, the multiplier may process data with a fixed bit width, and the fixed bit width may be equal to 8 bits, 16 bits, 32 bits, or may be equal to 64 bits, which is not limited in this embodiment. Optionally, there may be one input port of the data processing unit with different functions, the function of each input port of each data processing unit may be the same, there may also be one output port, the function of each output port of each data processing unit may be different, and the circuit structures of the data processing units with different functions may be different.
In the multiplier provided by this embodiment, the encoding circuit encodes the received data to obtain the partial product of the target code, the deformed Wallace tree group circuit accumulates the partial product of the target code, and the accumulation circuit accumulates the result obtained by the deformed Wallace tree group circuit again to obtain the final operation result. In this way, the processing of 0 values can be eliminated while fully preserving the accuracy of the multiplier, which effectively reduces the power consumption of the multiplier. In addition, because the multiplier uses the deformed Wallace tree group circuit to accumulate the partial product of the target code, the delay of the multiplier is reduced and the area of the AI chip occupied by the multiplier is effectively reduced.
Fig. 2 is a schematic diagram of a specific structure of a multiplier according to another embodiment. In this embodiment, the multiplier includes the encoding circuit 11, and the encoding circuit 11 includes a Booth encoding sub-circuit 111 and a partial product obtaining sub-circuit 112, wherein the output end of the Booth encoding sub-circuit 111 is connected with the input end of the partial product obtaining sub-circuit 112; the Booth encoding sub-circuit 111 is configured to perform Booth encoding on the received data to obtain an encoded signal, and the partial product obtaining sub-circuit 112 is configured to obtain an original partial product according to the encoded signal and perform optimization processing on the original partial product to obtain the partial product of the target code.
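The two-stage accumulation described above (tree compression followed by a final adder) can be sketched in software as carry-save reduction. This section does not specify the internal structure of the deformed Wallace tree sub-circuits, so the sketch below uses an ordinary 3:2 compressor scheme as an assumption; the names `compress_3_to_2`, `wallace_reduce` and `final_add` are hypothetical.

```python
def compress_3_to_2(a: int, b: int, c: int) -> tuple[int, int]:
    """Bitwise full adders: reduce three addends to a sum word and a carry word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def wallace_reduce(rows: list[int], width: int) -> tuple[int, int]:
    """Apply 3:2 compression until two rows remain (arithmetic modulo 2**width)."""
    mask = (1 << width) - 1
    rows = [r & mask for r in rows]
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        next_rows = []
        for k in range(0, full, 3):
            s, c = compress_3_to_2(rows[k], rows[k + 1], rows[k + 2])
            next_rows += [s & mask, c & mask]
        next_rows += rows[full:]  # rows that do not fill a group pass through
        rows = next_rows
    rows += [0] * (2 - len(rows))
    return rows[0], rows[1]


def final_add(sum_word: int, carry_word: int, width: int) -> int:
    """Final carry-propagate addition, playing the role of the accumulation circuit."""
    return (sum_word + carry_word) & ((1 << width) - 1)
```

The tree produces a sum word and a carry word without propagating carries, and only the final adder performs a full-width carry-propagate addition, which matches the adder described earlier as receiving a carry output signal and a sum bit output signal.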
Specifically, the Booth encoding sub-circuit 111 may receive data and perform Booth encoding on the data to obtain encoded signals. Optionally, the number of encoded signals may be equal to 1/2 of the bit width N of the data currently received by the multiplier, and may also be equal to the number of original partial products. Optionally, the data received by the Booth encoding sub-circuit 111 may be the multiplier in a multiplication operation, where the multiplier may be a floating point number, and the partial product obtaining sub-circuit 112 may receive the multiplicand in the multiplication operation, where the multiplicand may also be a floating point number. Before Booth encoding, the Booth encoding sub-circuit 111 may automatically perform bit padding on the received data, where bit padding means appending one bit of value 0 after the least significant bit of the data. Illustratively, if the multiplier circuit currently handles 8-bit by 8-bit fixed-point multiplication and the multiplier operand is y_7y_6y_5y_4y_3y_2y_1y_0, then before Booth encoding the Booth encoding sub-circuit 111 may automatically pad the operand to obtain y_7y_6y_5y_4y_3y_2y_1y_0 0.
In addition, the partial product obtaining sub-circuit 112 may obtain a corresponding original partial product according to each encoded signal, and perform optimization processing on each original partial product to obtain a partial product of the target code. Optionally, the original partial product may be a partial product without sign bit extension, and its bit width may be equal to N + 1, where N represents the bit width of the data currently processed by the multiplier. Optionally, the optimization processing may include sign bit extension elimination and, for negative partial products, negation-elimination plus-one bit processing.
In the multiplier provided by this embodiment, the Booth encoding sub-circuit may perform Booth encoding on the received data to obtain encoded signals, and the partial product obtaining sub-circuit may then obtain a corresponding original partial product according to each encoded signal and perform optimization processing on the original partial product to obtain a partial product of the target code. On the basis of this optimization processing, the accuracy of the multiplication result is ensured and the power consumption of the multiplier is effectively reduced.
In one embodiment, continuing with the specific structural diagram of the multiplier shown in Fig. 2, the encoding circuit 11 includes the Booth encoding sub-circuit 111, and the Booth encoding sub-circuit 111 includes a data input port 1111 and an encoded signal output port 1112; the data input port 1111 is used for receiving the data to be Booth encoded, and the encoded signal output port 1112 is used for outputting the encoded signal obtained by Booth encoding the received data.
Specifically, if the Booth encoding sub-circuit 111 receives data through the data input port 1111, it may automatically perform bit padding on the data to obtain data one bit wider than the original data; the Booth encoding sub-circuit 111 may then perform Booth encoding on the padded data to obtain a plurality of encoded signals, which are output through the encoded signal output port 1112. Optionally, the Booth encoding sub-circuit 111 may receive the multiplier of the multiplication operation through the data input port 1111 and perform Booth encoding on the multiplier. Optionally, each time Booth encoding is performed, the padded data may be divided into multiple groups of data to be encoded, and the Booth encoding sub-circuit 111 may Booth encode all of the groups at the same time. Optionally, the grouping principle is that every 3 adjacent bits of the padded data form one group of data to be encoded, and the highest-order bit of each group is also used as the lowest-order bit of the next group. Optionally, the encoding rule used by the Booth encoding sub-circuit 111 can be seen in Table 1, where y_{2i+1}, y_{2i} and y_{2i-1} represent the bits of each group of data to be encoded, X represents the multiplicand received by the encoding circuit 11, and PP_i (i = 0, 1, 2, ..., n) is the encoded signal obtained after Booth encoding the corresponding group of data to be encoded.
Optionally, as shown in Table 1, the encoded signals obtained after Booth encoding may fall into five classes, defined as -2X, 2X, -X, X and 0, respectively. Illustratively, if the multiplicand received by the encoding circuit 11 is x_7x_6x_5x_4x_3x_2x_1x_0, then X can be represented as x_7x_6x_5x_4x_3x_2x_1x_0.
TABLE 1
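[Table 1 (Booth encoding rule mapping y_{2i+1}, y_{2i}, y_{2i-1} to PP_i) is provided as image BDA0001886618840000081 in the original publication.]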
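Table 1 is available only as an image in this publication, so its exact contents cannot be quoted here. As a hedged illustration, the sketch below uses the textbook radix-4 Booth encoding table, which is assumed (not confirmed by the text) to correspond to Table 1; the names `BOOTH_TABLE` and `encoded_signal` are hypothetical.

```python
# Textbook radix-4 Booth table, keyed by (y[2i+1], y[2i], y[2i-1]).
# Assumption: this matches Table 1, which is only available as an image.
BOOTH_TABLE = {
    (0, 0, 0): 0,
    (0, 0, 1): 1,
    (0, 1, 0): 1,
    (0, 1, 1): 2,
    (1, 0, 0): -2,
    (1, 0, 1): -1,
    (1, 1, 0): -1,
    (1, 1, 1): 0,
}

# Names of the five encoded-signal classes in terms of the multiplicand X.
SIGNAL_NAMES = {0: "0", 1: "X", 2: "2X", -1: "-X", -2: "-2X"}


def encoded_signal(group: tuple[int, int, int]) -> str:
    """Return the encoded-signal class for one group (y[2i+1], y[2i], y[2i-1])."""
    return SIGNAL_NAMES[BOOTH_TABLE[group]]
```

Each table entry equals -2*y[2i+1] + y[2i] + y[2i-1], which is why exactly five distinct multiples of X can occur.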
Illustratively, if the Booth encoding sub-circuit 111 receives the 8-bit fixed-point multiplier y_7y_6y_5y_4y_3y_2y_1y_0, the data after bit padding may be y_7y_6y_5y_4y_3y_2y_1y_0 0. When i = 0, y_{2i+1} = y_1, y_{2i} = y_0 and y_{2i-1} = y_{-1}, where y_{-1} represents the padding bit 0 appended after y_0 (that is, the padded multiplier may be written as y_7y_6y_5y_4y_3y_2y_1y_0y_{-1}). During Booth encoding, the four groups to be encoded, y_{-1}y_0y_1, y_1y_2y_3, y_3y_4y_5 and y_5y_6y_7, may each be Booth encoded to obtain 4 encoded signals, where the highest-order bit of each group is also used as the lowest-order bit of the next group.
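The padding and grouping in this example can be reproduced on a concrete bit string. The sketch below is illustrative only: `booth_groups` is a hypothetical helper, the multiplier is given MSB first, and each group is written in the order y[2i-1] y[2i] y[2i+1] used in the example above.

```python
def booth_groups(multiplier_bits: str) -> list[str]:
    """Split an MSB-first bit string into overlapping 3-bit groups, lowest group first.

    Each returned group is ordered y[2i-1] y[2i] y[2i+1].
    """
    n = len(multiplier_bits)
    if n % 2 or set(multiplier_bits) - {"0", "1"}:
        raise ValueError("expected a binary string of even length")
    # Index 0 holds the padding bit y[-1] = 0; index k + 1 holds y[k].
    lsb_first = "0" + multiplier_bits[::-1]
    return [lsb_first[2 * i: 2 * i + 3] for i in range(n // 2)]


groups = booth_groups("10110101")  # y7..y0 = 1 0 1 1 0 1 0 1
print(groups)  # ['010', '010', '011', '101']
```

An 8-bit multiplier yields four groups, and the last bit of each group reappears as the first bit of the next one, which is the overlap the text describes.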
In the multiplier provided by this embodiment, the Booth encoding sub-circuit may perform Booth encoding on the received data to obtain encoded signals, the partial product obtaining sub-circuit may obtain the partial product of the target code according to each encoded signal, the deformed Wallace tree group circuit accumulates the partial product of the target code, and the accumulation circuit accumulates the result obtained by the deformed Wallace tree group circuit again to obtain the final operation result. The processing of 0 values can thus be eliminated while fully preserving the accuracy of the multiplier, which effectively reduces the power consumption of the multiplier. In addition, because the multiplier uses the deformed Wallace tree group circuit to accumulate the partial products of the target code, the delay of the multiplier is reduced and the area of the AI chip occupied by the multiplier is effectively reduced.
In one embodiment, continuing with the specific structural diagram of the multiplier shown in Fig. 2, the multiplier includes the partial product obtaining sub-circuit 112, and the partial product obtaining sub-circuit 112 includes a sign bit correction extension unit 1121 and a correction negation unit 1122, where the sign bit correction extension unit 1121 is configured to perform sign bit extension elimination on the original partial product to obtain a partial product after sign bit extension elimination, and the correction negation unit 1122 is configured to perform negation-elimination plus-one processing on the original partial product to obtain a plus-one bit value.
Specifically, when the sign bit correction extension unit 1121 performs sign bit extension elimination on the original partial product, it may add 1 to the two most significant bits of the original partial product and apply a decision step, so as to obtain the partial product after sign bit extension elimination. Optionally, the bit width of the partial product after sign bit extension elimination may be equal to (M + 1), where M represents the bit width of the original partial product and may be equal to N + 1, and N represents the bit width of the data received by the multiplier. Optionally, the partial product after sign bit extension elimination has one more bit (the additional bit) than the original partial product, and the additional bit may be located at its most significant position. When 1 is added to the two most significant bits of the original partial product, the resulting sum bits may be the two bits immediately below the most significant bit of the partial product after sign bit extension elimination; meanwhile, the sign bit correction extension unit 1121 may apply the decision step to the two most significant bits of the original partial product to determine the additional bit.
It should be noted that, if the most significant bit of the original partial product is denoted A and the next most significant bit is denoted B, then after 1 is added to A and B, the additional bit of the partial product after sign bit extension elimination can be obtained, and this additional bit may be denoted Q. Optionally, Q may be determined jointly by A and B, and the determination rule (that is, the decision rule) is given in Table 2.
TABLE 2
[Table 2 (decision rule determining Q from A and B) is provided as image BDA0001886618840000101 in the original publication.]
Illustratively, if the multiplier currently processes 8-bit by 8-bit data multiplication, one of the original partial products obtained is z9i z8i z7i z6i z5i z4i z3i z2i z1i, and the corresponding partial product after sign bit extension elimination is z10i z9i' z8i' z7i z6i z5i z4i z3i z2i z1i, then adding 1 to the most significant bit z9i and the next most significant bit z8i of the original partial product gives the values of the corresponding bits z9i' and z8i' in the partial product after sign bit extension elimination; that is, z9i' and z8i' may be equal to the sum bits obtained by adding 1 to the value formed by z9i and z8i. The additional bit z10i in the partial product after sign bit extension elimination can be determined from the most significant bit z9i, the next most significant bit z8i and the decision rule in Table 2. Optionally, in the Booth encoding process, the number of encoded signals obtained may be equal to the number of original partial products, and may also be equal to the number of partial products after sign bit extension elimination.
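Because Table 2 is available only as an image, the exact rule for the additional bit Q cannot be reproduced here. The sketch below instead demonstrates the textbook identity that makes sign bit extension unnecessary in general: an m-bit two's-complement pattern equals the same pattern with its top bit flipped, minus a fixed constant. This is a swapped-in standard technique, not the add-1 and decision rule of this document, and all function names are hypothetical.

```python
def signed_value(pattern: int, m: int) -> int:
    """Interpret an m-bit pattern as a two's-complement integer."""
    return pattern - (1 << m) if pattern >> (m - 1) else pattern


def sum_with_sign_extension(patterns: list[int], m: int) -> int:
    """Reference sum: row i is sign-extended and weighted by 4**i."""
    return sum(signed_value(p, m) << (2 * i) for i, p in enumerate(patterns))


def sum_without_sign_extension(patterns: list[int], m: int) -> int:
    """Same sum using only non-negative rows plus one precomputable constant."""
    top = 1 << (m - 1)
    # Flipping the top bit turns each row into a non-negative number.
    flipped = sum((p ^ top) << (2 * i) for i, p in enumerate(patterns))
    # The constant depends only on m and the number of rows, not on the data.
    correction = sum(top << (2 * i) for i in range(len(patterns)))
    return flipped - correction
```

Since the correction term is data independent, it can be folded into the rows ahead of time, which is the general idea behind replacing long runs of sign extension bits with a few corrected high-order bits.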
In addition, the multiplier may perform negation-elimination plus-one processing on each original partial product through the correction negation unit 1122 to obtain a plus-one bit value, and the partial product obtaining sub-circuit 112 combines each partial product after sign bit extension elimination with the corresponding plus-one bit value to obtain a partial product of the target code. Optionally, the correction negation unit 1122 may obtain the corresponding plus-one bit value according to the encoded signal corresponding to each original partial product. Optionally, the bit width of a partial product of the target code may be equal to the bit width of the partial product after sign bit extension elimination, or may be equal to that bit width plus 1, where the extra bit may be called the plus-one bit; the plus-one bit value in each partial product of the target code may be located two bit positions below the least significant bit of the partial product after sign bit extension elimination. Optionally, the number of columns occupied by all partial products of the target code may be equal to 2 times the bit width of the data currently processed by the multiplier.
It should be noted that the correction negation unit 1122 may obtain a corresponding plus-one bit value according to each encoded signal. In addition, the number of plus-one bit values obtained may be equal to the number of encoded signals, and may also be equal to the number of partial products after sign bit extension elimination.
Optionally, in the distribution rule of all partial products of the target code, the first partial product of the target code may be equal to the first partial product after sign bit extension elimination. Starting from the second partial product of the target code, each partial product of the target code may be equal to the corresponding partial product after sign bit extension elimination combined with the plus-one bit value corresponding to the previous partial product after sign bit extension elimination, where the plus-one bit value may be located two bit positions below the lowest bit value of the partial product with which it is combined. The last partial product of the target code, however, may be equal to the plus-one bit value corresponding to the last partial product after sign bit extension elimination; in other words, the last plus-one bit value has no partial product after sign bit extension elimination with which it can be combined. Meanwhile, in the distribution rule of all partial products of the target code, the lowest bit value of the first partial product of the target code may be located in the same column as the lowest bit value of the second partial product of the target code, and starting from the third partial product of the target code, the lowest bit value of each partial product of the target code may be located in the same column as the value two bit positions above the lowest bit of the previous partial product of the target code. Illustratively, continuing with the above example, when the multiplier performs an 8-bit by 8-bit multiplication, the distribution rule of all partial products of the target code is as shown in fig. 3, in which the symbol given in the original inline image (Figure BDA0001886618840000111, not reproduced here) indicates a plus-one bit value, "●" indicates the extra bit value Q obtained after the sign bit extension elimination processing, and "○" indicates the values of the other bits, excluding the extra bit value Q, in the partial product after sign bit extension elimination.
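The distribution rule can be made concrete with a short Python sketch that lists which columns each partial product of the target code occupies and how many values fall into each column. It assumes N = 8, four coded signals, 10-bit partial products after sign bit extension elimination, and column 0 as the lowest column; `target_rows` and `column_heights` are hypothetical names, and the two correction inputs are not counted.

```python
def target_rows(n=8):
    """Columns (0 = lowest) occupied by each partial product of the target code."""
    k = n // 2                                  # number of coded signals
    rows = []
    for i in range(k + 1):
        cols = set()
        if i < k:                               # partial product i, n + 2 bits wide
            cols.update(range(2 * i, 2 * i + n + 2))
        if i > 0:                               # plus-one bit value of partial product i - 1
            cols.add(2 * (i - 1))
        rows.append(cols)
    return rows


def column_heights(n=8):
    """Number of values that each of the 2n columns has to add up."""
    rows = target_rows(n)
    return [sum(col in row for row in rows) for col in range(2 * n)]
```

For N = 8 this gives the column heights 2, 1, 3, 2, 4, 3, 5, 4, 4, 4, 3, 3, 2, 2, 1, 1 from the lowest column upward, so no column receives more than five values, that is, the number of partial products after sign bit extension elimination plus 1.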
Optionally, the partial product after the first sign bit is removed from the extension may correspond to the lowest-order coded signal in the low-order coded signal, and so on, and the partial product after the last sign bit is removed from the extension may correspond to the highest-order coded signal in the high-order coded signal, where the low-order coded signal may be a corresponding coded signal obtained by performing booth coding on low-order data in the data received by the multiplier, and the high-order coded signal may be a corresponding coded signal obtained by performing booth coding on high-order data in the data received by the multiplier. Alternatively, if the multiplier receives data with a bit width of N bits, the lower N/2 bit data may be referred to as lower data, and the upper N/2 bit data may be referred to as upper data.
Optionally, the multiplier may first obtain the plus-one bit values through the modified negation unit 1122, which eliminates the invert-then-add-one operation on the original partial products, and then perform sign bit extension elimination processing on the original partial products through the modified sign bit extension unit 1121 to obtain the partial products after sign bit extension elimination.
In the multiplier provided by this embodiment, the partial product obtaining sub-circuit may obtain a corresponding original partial product according to a coding signal obtained by the booth coding sub-circuit, and perform sign bit elimination extension processing and one bit addition processing after negation elimination on the original partial product to obtain a partial product of a target code, so as to ensure that the accuracy of a multiplication result can be improved and the power consumption of the multiplier can be effectively reduced on the basis of sign bit elimination extension processing and one bit addition processing after negation elimination of the multiplier.
In one embodiment, continuing with the specific structural diagram of the multiplier shown in fig. 2, the multiplier includes the deformed Wallace tree group circuit 12, and the deformed Wallace tree group circuit 12 includes deformed Wallace tree sub-circuits 121 to 12n, where the plurality of deformed Wallace tree sub-circuits 121 to 12n are used for performing correction and accumulation processing on the partial products of the target code.
Specifically, the circuit structure of the deformed Wallace tree sub-circuits 121 to 12n may be implemented by a combination of full adders and/or half adders; in addition, a deformed Wallace tree sub-circuit can be understood as a circuit that processes a multi-bit input signal, adding the multiple input bits to obtain a two-bit output signal. Optionally, the number n of deformed Wallace tree sub-circuits included in the deformed Wallace tree group circuit 12 may be equal to 2 times the bit width of the data currently processed by the multiplier; the n deformed Wallace tree sub-circuits may process the partial products of the target code in parallel, although they are connected in series. Optionally, each deformed Wallace tree sub-circuit in the deformed Wallace tree group circuit 12 may add up one column of all partial products of the target code, and each deformed Wallace tree sub-circuit may output two signals, namely a carry output signal and a sum output signal, where the carry output signal may be Carry_i or 0, the sum output signal may be Sum_i, and i denotes the number of the deformed Wallace tree sub-circuit, the first deformed Wallace tree sub-circuit being numbered 0. Optionally, the number of input signals received by each deformed Wallace tree sub-circuit may be equal to 1, 2, …, or n, where n may be equal to the number of partial products after sign bit extension elimination plus 1; it can be understood that the number of input signals may differ between deformed Wallace tree sub-circuits, and the internal structure of each deformed Wallace tree sub-circuit may also differ.
In addition, while the multiplier adds up each column of all partial products of the target code, two deformed Wallace tree sub-circuits in the deformed Wallace tree group circuit 12 add 1 to two of the columns (that is, the correction plus-1 processing). In other words, the input signals of the two deformed Wallace tree sub-circuits corresponding to these two columns include, in addition to the values from the partial products after sign bit extension elimination (and, where present, the plus-one bit value), one additional correction signal whose value is 1. In the present embodiment, if the n deformed Wallace tree sub-circuits in the deformed Wallace tree group circuit 12 are numbered 1, 2, …, i, …, n, the deformed Wallace tree group circuit 12 may add 1 to the corresponding two columns of the partial products of the target code through the i-th and the n-th deformed Wallace tree sub-circuits; correspondingly, if the columns of all partial products of the target code are numbered 1, 2, …, n from the lowest bit value to the highest bit value, then i may be equal to n/2.
For example, if the multiplier currently processes an 8-bit by 8-bit fixed-point multiplication, the distribution rule of all partial products of the target code obtained by the partial product acquisition sub-circuit 112 may be as shown in fig. 3. Each deformed Wallace tree sub-circuit may receive all the values of its corresponding column in all partial products of the target code, and the multiplier needs to perform the correction processing through the 8th and 16th deformed Wallace tree sub-circuits (counting from 1), each of which, in addition to receiving all the values of its corresponding column, has one more input port that receives a signal of 1. In this case, the connection of the 16 deformed Wallace tree sub-circuits in the deformed Wallace tree group circuit 12, including the two deformed Wallace tree sub-circuits that implement the correction plus-1 processing, is as shown in fig. 4, where Wallace_i denotes a deformed Wallace tree sub-circuit and i is its number, starting from 0. A solid line between two adjacent deformed Wallace tree sub-circuits indicates that the deformed Wallace tree sub-circuit with the higher number has an output carry connection signal, and a dotted line indicates that it has none. Optionally, the carry connection signal may be understood as the carry signal that each deformed Wallace tree sub-circuit passes as an input to the next adjacent deformed Wallace tree sub-circuit.
It should be noted that the carry connection signal of each deformed Wallace tree sub-circuit may be used as the carry input signal of the next deformed Wallace tree sub-circuit, and the carry input signal of the first deformed Wallace tree sub-circuit may be equal to 0. Optionally, the number of bits N_Cout of the carry connection signal output by each deformed Wallace tree sub-circuit may be equal to floor((N_I + N_Cin)/2) - 1, where N_I denotes the number of data input bits of the deformed Wallace tree sub-circuit (i.e., including the input signal and the carry input signal), N_Cin denotes the number of bits of the carry input signal of the deformed Wallace tree sub-circuit, floor(·) denotes the round-down function, and N_Cout denotes the minimum number of bits of the carry connection signal that is output. In addition, the carry output signals of the second deformed Wallace tree sub-circuit 122 and of the penultimate deformed Wallace tree sub-circuit 12(n-1) in the deformed Wallace tree group circuit 12 may each be equal to 0.
In the multiplier provided by this embodiment, the partial products of the target code are accumulated by the deformed Wallace tree group circuit, and the accumulation circuit then accumulates the results output by the deformed Wallace tree group circuit to obtain the final operation result. The processing of 0 values can thus be eliminated while the operation accuracy of the multiplier is fully guaranteed, which effectively reduces the power consumption of the multiplier; in addition, accumulating the partial products of the target code with the deformed Wallace tree group circuit effectively reduces the area of the AI chip occupied by the multiplier.
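The behaviour of a single column sub-circuit can be sketched in Python as a chain of full adders and half adders. This is a functional sketch only, not the circuit of fig. 4: the order in which bits are paired is arbitrary, `compress_column` is a hypothetical name, and N_I is taken here as the number of column inputs without the carry inputs, so that N_I + N_Cin is the total number of bits entering the column.

```python
def compress_column(inputs, carry_in=()):
    """Reduce one column of bits to (sum bit, carry bit, carry connection bits).

    The sum bit keeps the weight of the column; the carry bit and every
    carry connection bit have the weight of the next higher column.
    """
    pool = list(inputs) + list(carry_in)
    carries = []
    while len(pool) > 1:
        if len(pool) >= 3:                       # full adder: 3 bits in
            a, b, c = pool.pop(), pool.pop(), pool.pop()
            s = a ^ b ^ c
            cy = (a & b) | (a & c) | (b & c)
        else:                                    # half adder: 2 bits in
            a, b = pool.pop(), pool.pop()
            s, cy = a ^ b, a & b
        pool.append(s)
        carries.append(cy)
    sum_bit = pool[0] if pool else 0
    carry_bit = carries.pop() if carries else 0  # output toward the final adder
    return sum_bit, carry_bit, carries           # the rest go to the next column
```

With T = N_I + N_Cin bits entering the column, the sketch produces floor(T/2) carries in total; one of them is the carry output signal, which leaves floor(T/2) - 1 carry connection bits whenever at least two bits enter the column.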
In one embodiment, continuing with the detailed structural diagram of the multiplier shown in fig. 2, the multiplier includes the accumulation circuit 13, and the accumulation circuit 13 includes: and the adder 131 is used for adding the received two data with the same bit width.
Specifically, the adder 131 may be an adder with different bit widths, and the adder 131 may be a carry look ahead adder. Optionally, the adder 131 may receive the two signals output by the deformed wallace tree group circuit 12, perform addition operation on the two output signals, and output a final multiplication result.
According to the multiplier provided by the embodiment, the two paths of signals output by the malformed Wallace tree group circuit can be accumulated through the accumulation circuit, and the final multiplication result is output.
In one embodiment, the adder 131 includes: carry output signal input port 1311, sum bit output signal input port 1312, and result output port 1313, where carry output signal input port 1311 is configured to receive a carry output signal, sum bit output signal input port 1312 is configured to receive a sum bit output signal, and result output port 1313 is configured to output a result of accumulation of the carry output signal and the sum bit output signal.
Specifically, the adder 131 may receive the Carry output signal Carry output by the deformed wallace tree group circuit 12 through the Carry output signal input port 1311, receive the Sum output signal Sum output by the deformed wallace tree group circuit 12 through the Sum output signal input port 1312, add the Carry output signal Carry and the Sum output signal Sum, and output the result through the result output port 1313.
It should be noted that, during the multiplication, the multiplier may use adders 131 of different bit widths to add the carry output signal Carry and the sum output signal Sum output by the deformed Wallace tree group circuit 12, where the bit width of the data that the adder 131 can process may be equal to 2 times the bit width N of the data currently processed by the multiplier. Optionally, each deformed Wallace tree sub-circuit in the deformed Wallace tree group circuit 12 may output a carry output signal Carry_i and a sum output signal Sum_i (i = 0, …, 2N-1, where i is the number of the deformed Wallace tree sub-circuit, starting from 0). Optionally, the carry output signal received by the adder 131 may be Carry = {[Carry_0 : Carry_(2N-2)], 0}; that is, the bit width of the carry output signal Carry received by the adder 131 is 2N, the first 2N-1 bit values of Carry correspond to the carry output signals of the first 2N-1 deformed Wallace tree sub-circuits in the deformed Wallace tree group circuit 12, and the last bit value of Carry may be replaced by 0. Optionally, the bit width of the sum output signal Sum received by the adder 131 is 2N, and the values in Sum may be equal to the sum output signals of the deformed Wallace tree sub-circuits in the deformed Wallace tree group circuit 12.
Illustratively, if the multiplier is currently processing an 8-bit by 8-bit multiplication, the adder 131 may be a 16-bit carry look-ahead adder. With continued reference to fig. 4, the deformed Wallace tree group circuit 12 may output the sum output signals Sum and the carry output signals Carry of the 16 deformed Wallace tree sub-circuits. The sum output signal received by the 16-bit carry look-ahead adder may be the complete sum output signal Sum output by the deformed Wallace tree group circuit 12, whereas the carry output signal it receives may be formed by combining a 0 with all carry output signals of the deformed Wallace tree group circuit 12 except the carry output signal output by the last deformed Wallace tree sub-circuit.
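How the two words fed to the adder might be assembled can be sketched in Python as follows. The sketch assumes that Carry_i carries the weight of column i + 1, so that the 0 combined into Carry occupies the lowest bit position; this alignment and the name `final_add` are assumptions for illustration.

```python
def final_add(sum_bits, carry_bits):
    """Add the Sum and Carry words of a 2N-column group circuit.

    sum_bits[i] and carry_bits[i] are the outputs of column i, with
    i = 0 the lowest column.
    """
    width = len(sum_bits)                       # 2N columns
    sum_word = sum(bit << i for i, bit in enumerate(sum_bits))
    # The carry of the last column is dropped; every other carry moves up
    # one column, which leaves a 0 in the lowest position of the word.
    carry_word = sum(bit << (i + 1) for i, bit in enumerate(carry_bits[:-1]))
    return (sum_word + carry_word) & ((1 << width) - 1)
```

Both words have the same bit width 2N, which matches the requirement that the adder 131 adds two data of the same bit width.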
According to the multiplier provided by the embodiment, two paths of signals output by the malformed Wallace tree group circuit can be accumulated through the accumulation circuit, and a final multiplication result is output, so that the process of processing a 0 value can be eliminated on the premise that the operation accuracy of the multiplier is completely ensured, and the power consumption of the multiplier is effectively reduced; in addition, the multiplier can also use a deformed Wallace tree group circuit to perform accumulation processing on the partial product of the target code, so that the delay of the multiplier is reduced, and meanwhile, the area of an AI chip occupied by the multiplier is effectively reduced.
Fig. 5 is a flowchart illustrating a data processing method according to an embodiment, which may be processed by the multipliers shown in fig. 1 and fig. 2, where the embodiment relates to a process of data multiplication. As shown in fig. 5, the method includes:
S101, receiving data to be processed.
In particular, the multiplier may receive data to be processed, which may be a multiplier and a multiplicand in a multiplication operation, through the encoding circuit. Optionally, the bit widths of the multiplier to be processed and the multiplicand received by the encoding circuit may be 8 bits, 16 bits, 32 bits or 64 bits, which is not limited in this embodiment. Wherein, the bit width of the multiplier to be processed can be equal to the bit width of the multiplicand to be processed.
S102, coding the data to be processed to obtain a coding result, and obtaining a partial product of a target code according to the data to be processed and the coding result.
Specifically, the multiplier may perform binary coding on the received multiplier to be processed through a coding circuit to obtain a result of the binary coding. Alternatively, the number of binary-coded results may be equal to 1/2 times the bit width of the data currently being processed by the multiplier. Alternatively, the number of partial products of the target encoding may be equal to the number of binary encoding results.
S103, correcting and accumulating the partial product of the target code to obtain an operation result.
Specifically, the multiplier may accumulate the values of each column of all partial products of the target code through the deformed Wallace tree sub-circuits; during the accumulation, two of the deformed Wallace tree sub-circuits in the deformed Wallace tree group circuit may perform the correction plus-1 processing, and the deformed Wallace tree group circuit outputs the carry output signals and sum output signals obtained after the correction plus-1 processing. Finally, the accumulation circuit accumulates all carry output signals Carry_i of the deformed Wallace tree group circuit, with the last carry output signal replaced by 0, together with all sum output signals, and outputs the final operation result. It should be noted that, if the multiplier is currently processing an N-bit data operation, 2N deformed Wallace tree sub-circuits are connected in series in the deformed Wallace tree group circuit, and the sub-circuits are numbered starting from 0, the deformed Wallace tree group circuit may perform the plus-1 processing through the (N-1)-th and the (2N-1)-th deformed Wallace tree sub-circuits.
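The whole method can be checked arithmetically with a short Python sketch. It is a behavioural model under stated assumptions, not the circuit itself: table 1 is taken to be the standard radix-4 Booth recoding, the sign bit extension elimination is modelled as adding 2^N + 2^(N-1) to each signed partial product, and the correction 1s enter columns N-1 and 2N-1 (numbered from 0). The names `booth_digits` and `multiply` are hypothetical. Over N/2 partial products the added constants sum to 2^(2N-1) - 2^(N-1), so together with the two correction 1s they total 2^(2N) and vanish in a 2N-bit result.

```python
def booth_digits(y, n):
    """Radix-4 Booth digits of an n-bit two's-complement multiplier."""
    padded = (y & ((1 << n) - 1)) << 1              # implicit 0 below bit 0
    digits = []
    for i in range(n // 2):
        hi, mid, lo = ((padded >> (2 * i + k)) & 1 for k in (2, 1, 0))
        digits.append(-2 * hi + mid + lo)           # value in {-2, -1, 0, 1, 2}
    return digits


def multiply(x, y, n=8):
    """Signed n-bit by n-bit product built from target-coded partial products."""
    pp_mask = (1 << (n + 1)) - 1
    total = 0
    for i, d in enumerate(booth_digits(y, n)):
        magnitude = abs(d) * x                      # 0, X, or X shifted left once
        pp = (~magnitude if d < 0 else magnitude) & pp_mask  # inverted, +1 deferred
        plus_one = 1 if d < 0 else 0                # the plus-one bit value
        signed_pp = pp - (1 << (n + 1)) if (pp >> n) & 1 else pp
        ext_free = signed_pp + (1 << (n - 1)) + (1 << n)     # no sign extension needed
        total += (ext_free + plus_one) << (2 * i)
    total += (1 << (n - 1)) + (1 << (2 * n - 1))    # the two correction 1s
    product = total & ((1 << (2 * n)) - 1)          # keep 2n columns
    return product - (1 << (2 * n)) if product >> (2 * n - 1) else product
```

For example, `multiply(-128, 127)` can be compared against `-128 * 127`; the model agrees with ordinary signed multiplication for every pair of 8-bit operands.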
According to the data processing method provided by the embodiment, the data to be processed is received, the data to be processed is coded to obtain a coding result, the partial product of the target code is obtained through optimization processing according to the data to be processed and the coding result, and the partial product of the target code is corrected and accumulated to obtain an operation result, so that the process of processing a 0 value is eliminated on the premise that the operation accuracy of the multiplier is completely guaranteed, and the power consumption of the multiplier is effectively reduced; in addition, the method can accumulate the partial product of the target code by using a full adder with fewer stages through the deformed Wallace tree group circuit, thereby reducing the delay of the multiplier and effectively reducing the area of the AI chip occupied by the multiplier.
As shown in fig. 6, a multiplication method according to another embodiment is provided, where the data to be processed in the above step S102 is encoded to obtain an encoding result, and a partial product of a target code is obtained through optimization processing according to the data to be processed and the encoding result, and the method includes:
S1021, performing booth coding processing on the data to be processed to obtain a coded signal.
Specifically, the multiplier may perform booth coding processing on the received multiplier to be processed through the booth coding sub-circuit to obtain coded signals. Optionally, in the booth coding process, every 3 bits of the input multiplier may be coded into one coded signal; the coding rule may refer to table 1, from which it can be seen that the booth coding sub-circuit obtains five different types of coded signals when coding the multiplier, defined as -2X, 2X, -X, X, and 0, respectively.
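As a minimal sketch, and assuming that table 1 (not reproduced in this section) is the standard radix-4 Booth recoding, the coding of the multiplier into coded signals can be written in Python as follows; `booth_radix4` is a hypothetical name, and each returned digit d stands for the coded signal d·X.

```python
def booth_radix4(y, n=8):
    """Return n/2 Booth digits, lowest first, for an n-bit two's-complement y."""
    padded = (y & ((1 << n) - 1)) << 1       # append the implicit 0 below bit 0
    digits = []
    for i in range(n // 2):
        group = (padded >> (2 * i)) & 0b111  # 3 bits; neighbouring groups share 1 bit
        hi, mid, lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digits.append(-2 * hi + mid + lo)    # one of -2, -1, 0, 1, 2
    return digits
```

An 8-bit multiplier therefore yields four coded signals, and the digits weighted by powers of 4 sum back to the signed value of the multiplier.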
S1022, obtaining the partial product of the target code through optimization processing according to the data to be processed and the code signal.
Specifically, the optimization process may include a sign bit extension elimination process and an inversion elimination and bit addition process.
According to the data processing method provided by the embodiment, booth coding processing is performed on the data to be processed to obtain a coding signal, the partial product of the target code is obtained through optimization processing according to the data to be processed and the coding signal, and then the partial product of the target code is accumulated to obtain an operation result.
In one embodiment, as shown in fig. 7, the obtaining the partial product of the target code according to the data to be processed and the code signal in S1022 through an optimization process includes:
S1022a, obtaining an original partial product according to the data to be processed and the coded signal.
Specifically, the number of original partial products may be equal to the number of coded signals. Illustratively, if the partial product acquisition sub-circuit receives an 8-bit multiplicand x7x6x5x4x3x2x1x0 (i.e., X), it may directly obtain the corresponding original partial product from the multiplicand X and the five types of coded signals -2X, 2X, -X, X, and 0. When the coded signal is -2X, the original partial product may be obtained by shifting X left by one bit, inverting the result, and then adding 1; when the coded signal is 2X, the original partial product may be obtained by shifting X left by one bit; when the coded signal is -X, the original partial product may be obtained by inverting X bit by bit and then adding 1; when the coded signal is X, the original partial product may be X with one additional bit placed above its highest bit, where the value of the additional bit may be equal to the sign bit value of X; and when the coded signal is +0, the original partial product may be 0, that is, each bit value of the 9-bit partial product is equal to 0.
S1022b, sign bit removal expansion processing is performed on the original partial product to obtain a partial product after sign bit removal expansion.
Specifically, the multiplier may, through the modified sign bit extension unit, add 1 to the highest bit value and the next highest bit value of each original partial product and perform judgment processing according to the highest bit value and the next highest bit value of the original partial product, so as to obtain the partial product after sign bit extension elimination. Optionally, the bit width of the partial product after sign bit extension elimination may be equal to the bit width of the original partial product plus 1.
It should be noted that the highest bit value Q of the partial product after sign bit extension elimination is determined from the highest bit value and the next highest bit value of the corresponding original partial product; the determination rule may refer to table 2. The two bit values immediately below Q may be equal to the two sum signals obtained by adding 1 to the corresponding two bit values (the highest and the next highest) of the original partial product.
S1022c, obtaining the plus-one bit value in the partial product of the target code according to the coded signal.
Specifically, the modified negation unit in the multiplier may obtain the corresponding plus-one bit values according to all received coded signals. Optionally, the rule for determining the plus-one bit value may be described as follows: if, in the multiplication operation, the multiplicand received by the multiplier is X and the multiplier is Y, and booth coding of the multiplier yields five types of coded signals, defined as -2X, 2X, -X, X, and 0, the modified negation unit 1122 may directly obtain the corresponding plus-one bit value from the coded signal. The plus-one bit value may be 1 when the coded signal is -2X, 0 when the coded signal is 2X, 1 when the coded signal is -X, 0 when the coded signal is X, and 0 when the coded signal is ±0.
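The relation between a coded signal, its partial product, and its plus-one bit value can be sketched in Python as follows. The sketch assumes that the "add 1" of the negative cases is deferred into the plus-one bit value, so the 9-bit pattern for -X or -2X is only inverted; `original_partial_product` and `to_signed` are hypothetical names, and the coded signal is represented by a digit d in {-2, -1, 0, 1, 2} standing for d·X.

```python
def to_signed(pattern, width):
    """Interpret a width-bit pattern as a two's-complement integer."""
    return pattern - (1 << width) if (pattern >> (width - 1)) & 1 else pattern


def original_partial_product(x, digit, n=8):
    """Return ((n+1)-bit pattern, plus-one bit value) for the coded signal digit * X."""
    mask = (1 << (n + 1)) - 1
    magnitude = abs(digit) * x               # 0, X, or X shifted left by one bit
    if digit < 0:
        return (~magnitude) & mask, 1        # inverted only; the +1 is carried separately
    return magnitude & mask, 0               # X, 2X and 0 need no plus-one


pattern, plus_one = original_partial_product(-3, -2)
print(to_signed(pattern, 9) + plus_one)      # value represented by the pair
```

Keeping the increment as a separate bit lets the column adders absorb it instead of requiring a carry-propagating adder for every negative partial product.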
S1022d, obtaining the partial product of the target code according to the partial product after eliminating sign bit expansion and the plus one bit value.
Specifically, the partial product acquisition sub-circuit may combine all partial products after sign bit extension elimination with all corresponding plus-one bit values to obtain the partial products of the target code. In the distribution rule of the partial products of the target code, the first partial product of the target code may be equal to the first partial product after sign bit extension elimination. Starting from the second partial product of the target code, each partial product of the target code may be equal to the corresponding partial product after sign bit extension elimination combined with the plus-one bit value corresponding to the previous partial product after sign bit extension elimination, where the plus-one bit value may be located two bit positions below the lowest bit value of the partial product with which it is combined. The last partial product of the target code, however, may be equal to the plus-one bit value corresponding to the last partial product after sign bit extension elimination; in other words, the last plus-one bit value has no partial product after sign bit extension elimination with which it can be combined. Meanwhile, in the distribution rule of all partial products of the target code, the lowest bit value of the first partial product of the target code may be located in the same column as the lowest bit value of the second partial product of the target code, and starting from the third partial product of the target code, the lowest bit value of each partial product of the target code may be located in the same column as the value two bit positions above the lowest bit of the previous partial product of the target code.
For example, if the multiplier is currently processing an 8-bit by 8-bit multiplication, the distribution rule of all partial products of the target code may again refer to fig. 3.
According to the data processing method provided by the embodiment, an original partial product is obtained according to the data to be processed and the coding signal, sign bit elimination extension processing is performed on the original partial product to obtain a partial product with sign bit elimination extension, a plus one value in the partial product of a target code is obtained according to the coding signal, the partial product of the target code is obtained according to the partial product with sign bit elimination extension and the plus one value, accumulation processing can be performed on the partial product of the target code, and an operation result is obtained.
The embodiment of the application also provides a machine learning operation device, which includes one or more of the multipliers described in this application and is used for acquiring data to be operated on and control information from other processing devices, executing a specified machine learning operation, and transmitting the execution result to peripheral equipment through an I/O interface. The peripheral equipment may be, for example, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or a server. When more than one multiplier is included, the multipliers may be linked and transmit data through a specific structure, for example interconnected through a PCIE bus, so as to support larger-scale machine learning operations. In this case, the multipliers may share the same control system or have separate control systems, and may share memory or each have its own memory. In addition, the interconnection mode may be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
The embodiment of the application also provides a combined processing device, which includes the machine learning arithmetic device, a universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user. Fig. 8 is a schematic diagram of the combined processing device.
The other processing devices include one or more types of general-purpose or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, including data transfer, and complete basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices can also cooperate with the machine learning arithmetic device to complete calculation tasks.
And the universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and other processing devices. The machine learning arithmetic device obtains the required input data from other processing devices and writes the input data into a storage device on the machine learning arithmetic device; control instructions can be obtained from other processing devices and written into a control cache on a machine learning arithmetic device chip; the data in the storage module of the machine learning arithmetic device can also be read and transmitted to other processing devices.
Alternatively, as shown in fig. 9, the configuration may further include a storage device, and the storage device is connected to the machine learning arithmetic device and the other processing device, respectively. The storage device is used for storing data in the machine learning arithmetic device and the other processing device, and is particularly suitable for data which is required to be calculated and cannot be stored in the internal storage of the machine learning arithmetic device or the other processing device.
The combined processing device can be used as the SOC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle, or video monitoring equipment, which effectively reduces the core area of the control part, increases the processing speed, and reduces the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a wifi interface.
In some embodiments, a chip is also claimed, which includes the above machine learning arithmetic device or the combined processing device.
In some embodiments, a chip package structure is provided, which includes the above chip.
In some embodiments, a board card is provided, which includes the above chip package structure. As shown in fig. 10, the board card may include, in addition to the chip 389, other supporting components, including but not limited to: a memory device 390, a receiving device 391, and a control device 392.
The memory device 390 is connected to the chip in the chip package structure through a bus and is used for storing data. The memory device may include a plurality of groups of memory cells 393. Each group of memory cells is connected to the chip through a bus. It is understood that each group of memory cells may be DDR SDRAM (Double Data Rate SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read on both the rising and falling edges of the clock pulse. In one embodiment, the memory device may include 4 groups of memory cells. Each group of memory cells may include a plurality of DDR4 chips. In one embodiment, the chip may internally include four 72-bit DDR4 controllers; in each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC check. It can be understood that when DDR4-3200 chips are used in each group of memory cells, the theoretical bandwidth of data transmission can reach 25600 MB/s.
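The 25600 MB/s figure can be reproduced with a short calculation. The sketch below is illustrative only: the function name and parameters are hypothetical, and it assumes the figure refers to a single controller moving 64 data bits per transfer at 3200 mega-transfers per second, with the 8 ECC bits excluded.

```python
def ddr_peak_bandwidth_mb_s(transfers_mt_s: int, data_bits: int) -> float:
    """Theoretical peak bandwidth in MB/s for one memory controller.

    transfers_mt_s: transfer rate in mega-transfers per second (3200 for DDR4-3200).
    data_bits: bits carrying payload per transfer (ECC bits are not counted).
    """
    bytes_per_transfer = data_bits / 8
    return transfers_mt_s * bytes_per_transfer


# Assumption: 64 of the 72 controller bits carry data, as stated in the text.
per_controller = ddr_peak_bandwidth_mb_s(3200, 64)
```

Under these assumptions `per_controller` evaluates to 25600.0, matching the value stated in the description.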
In one embodiment, each group of memory cells includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is arranged in the chip and is used to control data transmission to and data storage in each memory cell.
The receiving device is electrically connected to the chip in the chip package structure. The receiving device is used to implement data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the receiving device may be a standard PCIE interface: the data to be processed is transmitted from the server to the chip through the standard PCIE interface, thereby implementing data transfer. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the receiving device may be another interface; the present application does not limit the specific form of such an interface, provided that the interface unit can implement the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., a server) by the receiving device.
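The 16000 MB/s figure corresponds to the raw signalling rate of a PCIE 3.0 x16 link. The sketch below is a rough illustration with hypothetical function and parameter names; it assumes 8 GT/s per lane and also shows the slightly lower figure obtained once the 128b/130b line encoding used by PCIE 3.0 is taken into account.

```python
def pcie_bandwidth_mb_s(gt_per_s: float, lanes: int, with_encoding: bool = False) -> float:
    """Theoretical one-direction link bandwidth in MB/s.

    gt_per_s: per-lane signalling rate in giga-transfers per second (8.0 for PCIE 3.0).
    lanes: number of lanes (16 for an x16 link).
    with_encoding: if True, account for 128b/130b line-encoding overhead.
    """
    raw = gt_per_s * 1000 * lanes / 8  # one bit per transfer per lane, 8 bits per byte
    return raw * 128 / 130 if with_encoding else raw


raw_rate = pcie_bandwidth_mb_s(8.0, 16)            # raw figure quoted in the text
effective_rate = pcie_bandwidth_mb_s(8.0, 16, True)  # after 128b/130b encoding
```

Here `raw_rate` is 16000.0, while `effective_rate` is roughly 15754 MB/s; protocol overhead above the physical layer would reduce the usable throughput further.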
The control device is electrically connected to the chip and is used to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads. Therefore, the chip can be in different working states, such as multi-load and light-load states. The control device can regulate the working states of the plurality of processing chips, the plurality of processing cores, and/or the plurality of processing circuits in the chip.
In some embodiments, an electronic device is provided that includes the above board card.
The electronic device may be a data processing device, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a driving recorder, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.
It should be noted that, for simplicity of description, the foregoing embodiments are described as a series of circuit combinations, but those skilled in the art should understand that the present application is not limited to the described circuit combinations, because some circuits may be implemented in other ways or with other structures according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all optional embodiments, and that the devices and modules involved are not necessarily required by the present application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The above embodiments express only several implementations of the present invention, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (17)

1. A multiplier, characterized in that it comprises: an encoding circuit, a modified Wallace tree group circuit, and an accumulation circuit, wherein the output end of the encoding circuit is connected to the input end of the modified Wallace tree group circuit, and the output end of the modified Wallace tree group circuit is connected to the input end of the accumulation circuit;
the encoding circuit is used for encoding received data to obtain a partial product of a target code, the modified Wallace tree group circuit is used for accumulating the partial product of the target code, and the accumulation circuit is used for accumulating received input data.
2. The multiplier of claim 1, wherein the encoding circuit comprises: a Booth encoding sub-circuit and a partial product acquisition sub-circuit, wherein the output end of the Booth encoding sub-circuit is connected to the input end of the partial product acquisition sub-circuit; the Booth encoding sub-circuit is used for performing Booth encoding on received data to obtain an encoded signal, and the partial product acquisition sub-circuit is used for obtaining an original partial product according to the encoded signal and performing optimization processing on the original partial product to obtain the partial product of the target code.
3. The multiplier of claim 2, wherein the Booth encoding sub-circuit comprises: a data input port and an encoded signal output port, wherein the data input port is used for receiving the data to be Booth-encoded, and the encoded signal output port is used for outputting the encoded signal obtained after the received data has been Booth-encoded.
4. The multiplier of claim 2, wherein the partial product acquisition sub-circuit comprises: a modified sign bit extension unit and a modified negation unit, wherein the modified sign bit extension unit is used for performing sign-bit-extension elimination on the original partial product to obtain a partial product with sign bit extension eliminated; and the modified negation unit is used for eliminating the add-one operation that follows inversion of the original partial product, to obtain a plus-one bit value.
5. The multiplier of claim 1, wherein the modified Wallace tree group circuit comprises: a plurality of modified Wallace tree sub-circuits for performing modified accumulation processing on the partial product of the target code.
6. The multiplier of claim 1, wherein the accumulation circuit comprises: an adder, wherein the adder is used for adding two received data of the same bit width.
7. The multiplier of claim 6, wherein the adder comprises: a carry output signal input port, a sum bit output signal input port, and a result output port, wherein the carry output signal input port is used for receiving a carry output signal, the sum bit output signal input port is used for receiving a sum bit output signal, and the result output port is used for outputting the result of accumulating the carry output signal and the sum bit output signal.
8. A method of data processing, the method comprising:
receiving data to be processed;
encoding the data to be processed to obtain an encoding result, and obtaining a partial product of a target code according to the data to be processed and the encoding result;
and performing modified accumulation processing on the partial product of the target code to obtain an operation result.
9. The method of claim 8, wherein the encoding the data to be processed to obtain an encoding result, and obtaining a partial product of a target code through optimization processing according to the data to be processed and the encoding result, comprises:
performing Booth encoding on the data to be processed to obtain an encoded signal;
and obtaining the partial product of the target code through optimization processing according to the data to be processed and the encoded signal.
10. The method of claim 9, wherein the obtaining the partial product of the target code through optimization processing according to the data to be processed and the encoded signal comprises:
obtaining an original partial product according to the data to be processed and the encoded signal;
performing sign-bit-extension elimination on the original partial product to obtain a partial product with sign bit extension eliminated;
obtaining a plus-one bit value in the partial product of the target code according to the encoded signal;
and obtaining the partial product of the target code according to the partial product with sign bit extension eliminated and the plus-one bit value.
11. A machine learning arithmetic device, wherein the machine learning arithmetic device comprises one or more multipliers according to any one of claims 1 to 7, and is used for acquiring input data to be operated on and control information from other processing devices, executing a specified machine learning operation, and transmitting the execution result to the other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of multipliers, the plurality of multipliers can be connected through a specific structure and transmit data;
the multipliers are interconnected through a PCIE bus and transmit data so as to support larger-scale machine learning operations; the plurality of multipliers share the same control system or have their own control systems; the plurality of multipliers share a memory or have their own memories; the interconnection mode of the plurality of multipliers is any interconnection topology.
12. A combined processing apparatus, characterized in that the combined processing apparatus comprises the machine learning arithmetic apparatus according to claim 11, a universal interconnect interface and other processing apparatus;
and the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user.
13. The combined processing device according to claim 12, further comprising: a storage device connected to the machine learning arithmetic device and to the other processing device, respectively, for storing data of the machine learning arithmetic device and the other processing device.
14. A neural network chip, wherein the neural network chip comprises the machine learning arithmetic device of claim 11 or the combined processing device of claim 12.
15. An electronic device, characterized in that it comprises a chip according to claim 14.
16. A board card, characterized in that the board card comprises: a memory device, a receiving device, a control device, and the neural network chip according to claim 14;
wherein the neural network chip is connected to the memory device, the control device, and the receiving device, respectively;
the memory device is used for storing data;
the receiving device is used for realizing data transmission between the chip and external equipment;
and the control device is used for monitoring the state of the chip.
17. The board card of claim 16, wherein
the memory device includes: a plurality of groups of memory cells, each group of memory cells being connected to the chip through a bus, the memory cells being DDR SDRAM;
the chip includes: a DDR controller used for controlling data transmission to and data storage in each memory cell;
the receiving device is a standard PCIE interface.
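The encoding and accumulation steps recited in claims 8 to 10 can be illustrated with a small software model. The following is a minimal sketch under stated assumptions, not the claimed circuit: it assumes radix-4 Booth encoding (the claims only say "Booth encoding"), all function names are hypothetical, a negative partial product is formed by bit inversion with the add-one carried as a separate plus-one bit, and ordinary integer addition stands in for the modified Wallace tree and the final adder. Sign-bit-extension elimination is not modelled.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list[int]:
    """Recode a signed `width`-bit multiplier into radix-4 Booth digits in {-2..2}."""
    assert width % 2 == 0, "this sketch assumes an even bit width"
    bits = (multiplier & ((1 << width) - 1)) << 1  # append the implicit 0 below the LSB
    digits = []
    for i in range(0, width, 2):
        window = (bits >> i) & 0b111               # y[i+1], y[i], y[i-1]
        y_hi, y_mid, y_lo = (window >> 2) & 1, (window >> 1) & 1, window & 1
        digits.append(-2 * y_hi + y_mid + y_lo)
    return digits


def partial_product(multiplicand: int, digit: int, pp_width: int) -> tuple[int, int]:
    """Return (bits, plus_one): negative digits use inversion, with the +1 kept separate."""
    mask = (1 << pp_width) - 1
    magnitude = (multiplicand * abs(digit)) & mask  # two's-complement bit pattern
    if digit < 0:
        return (~magnitude) & mask, 1
    return magnitude, 0


def booth_multiply(x: int, y: int, width: int = 8) -> int:
    """Multiply two signed `width`-bit integers via Booth partial products."""
    pp_width = 2 * width
    total = 0
    for k, digit in enumerate(booth_radix4_digits(y, width)):
        bits, plus_one = partial_product(x, digit, pp_width)
        total += (bits + plus_one) << (2 * k)       # stand-in for tree accumulation
    total &= (1 << pp_width) - 1
    # Reinterpret the 2*width-bit pattern as a signed value.
    return total - (1 << pp_width) if total >> (pp_width - 1) else total
```

For example, `booth_multiply(-3, 5)` recodes 5 into the digits 1, 1, 0, 0 and sums the two shifted copies of -3. A hardware implementation would compress the partial products and the plus-one bits in a carry-save tree rather than adding them one at a time.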
CN201811450841.7A 2018-11-30 2018-11-30 Multiplier, data processing method, chip and electronic equipment Active CN111258546B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811450841.7A CN111258546B (en) 2018-11-30 2018-11-30 Multiplier, data processing method, chip and electronic equipment
PCT/CN2019/120994 WO2020108486A1 (en) 2018-11-30 2019-11-26 Data processing apparatus and method, chip, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811450841.7A CN111258546B (en) 2018-11-30 2018-11-30 Multiplier, data processing method, chip and electronic equipment

Publications (2)

Publication Number Publication Date
CN111258546A true CN111258546A (en) 2020-06-09
CN111258546B CN111258546B (en) 2022-08-09

Family

ID=70946417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811450841.7A Active CN111258546B (en) 2018-11-30 2018-11-30 Multiplier, data processing method, chip and electronic equipment

Country Status (1)

Country Link
CN (1) CN111258546B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1056939A (en) * 1990-05-31 1991-12-11 三星电子株式会社 Parallel multiplier using skip array and modified Wallace tree
US6434587B1 (en) * 1999-06-14 2002-08-13 Intel Corporation Fast 16-B early termination implementation for 32-B multiply-accumulate unit
CN101882127A (en) * 2010-06-02 2010-11-10 湖南大学 Multi-core processor
CN102722352A (en) * 2012-05-21 2012-10-10 华南理工大学 Booth multiplier
CN105739945A (en) * 2016-01-22 2016-07-06 南京航空航天大学 Modified Booth coding multiplier based on modified partial product array
CN106897046A (en) * 2017-01-24 2017-06-27 青岛朗思信息科技有限公司 Fixed-point multiply-accumulator

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
M. J. RAO: "A high speed and area efficient Booth recoded Wallace tree multiplier for fast arithmetic circuits", 《2012 ASIA PACIFIC CONFERENCE ON POSTGRADUATE RESEARCH IN MICROELECTRONICS AND ELECTRONICS》 *
M.RAVINDRA: "DESIGN AND IMPLEMENTATION OF 32 BIT HIGH LEVEL WALLACE TREE MUTIPLIER", 《INTERNATIONAL JOURNAL OF TECHNICAL RESEARCH AND APPLICATIONS》 *
张清宇: "余数系统中算法单元及关键技术研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Also Published As

Publication number Publication date
CN111258546B (en) 2022-08-09

Similar Documents

Publication Publication Date Title
CN110413254B (en) Data processor, method, chip and electronic equipment
CN110673823B (en) Multiplier, data processing method and chip
CN110515587B (en) Multiplier, data processing method, chip and electronic equipment
CN110554854A (en) Data processor, method, chip and electronic equipment
CN111381808A (en) Multiplier, data processing method, chip and electronic equipment
CN111258541B (en) Multiplier, data processing method, chip and electronic equipment
CN111258633B (en) Multiplier, data processing method, chip and electronic equipment
CN111258544B (en) Multiplier, data processing method, chip and electronic equipment
CN113031912A (en) Multiplier, data processing method, device and chip
CN111258542B (en) Multiplier, data processing method, chip and electronic equipment
CN209879493U (en) Multiplier and method for generating a digital signal
CN210109863U (en) Multiplier, device, neural network chip and electronic equipment
CN111258545B (en) Multiplier, data processing method, chip and electronic equipment
CN209895329U (en) Multiplier and method for generating a digital signal
CN210006031U (en) Multiplier and method for generating a digital signal
CN110515586B (en) Multiplier, data processing method, chip and electronic equipment
CN210109789U (en) Data processor
CN210006029U (en) Data processor
CN111258546B (en) Multiplier, data processing method, chip and electronic equipment
CN113031915A (en) Multiplier, data processing method, device and chip
CN110647307A (en) Data processor, method, chip and electronic equipment
CN209962284U (en) Multiplier, device, chip and electronic equipment
CN110688087A (en) Data processor, method, chip and electronic equipment
CN111258540B (en) Multiplier, data processing method, chip and electronic equipment
CN210006082U (en) Multiplier, device, neural network chip and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant