CN112801276B - Data processing method, processor and electronic equipment

Data processing method, processor and electronic equipment

Info

Publication number: CN112801276B
Application number: CN202110172431.6A
Authority: CN (China)
Prior art keywords: data, unit, multiplier, weight data, processing data
Legal status: Active (granted)
Other versions: CN112801276A
Other languages: Chinese (zh)
Inventors: 裴京 (Pei Jing), 马骋 (Ma Cheng), 王松 (Wang Song)
Assignee (current and original): Tsinghua University

Application filed by Tsinghua University, with priority to CN202110172431.6A. Publication of application CN112801276A; application granted; publication of granted patent CN112801276B.


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443: Sum of products
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure relates to a data processing method, a processor, and an electronic device. The processor includes a plurality of computing cores, each computing core including a dendrite unit, an axon unit, and a storage unit. In the method, the axon unit reads a plurality of processing data and a plurality of weight data in one access to the storage unit, flexibly distributes the plurality of processing data and the plurality of weight data to the two output ports of the axon unit through the distributor of the axon unit, and the multiplier-adder array multiplies and/or adds the plurality of processing data and the plurality of weight data in parallel. For the same volume of data, the method reduces the number of memory accesses, improves the flexibility and efficiency of data processing, and lowers computation power consumption.

Description

Data processing method, processor and electronic equipment
Technical Field
The present disclosure relates to the field of information processing technologies, and in particular, to a data processing method, a processor, and an electronic device.
Background
In the technical field of artificial intelligence, neural network algorithms are applied in more and more industries and have achieved good results. However, with the rapid development of neural network algorithms, their complexity keeps growing. Whether for Artificial Neural Networks (ANNs), Spiking Neural Networks (SNNs), or other models with dynamic properties such as the Leaky Integrate-and-Fire (LIF) model, the computational scale of the models keeps increasing. Processing these models on artificial intelligence chips requires significant computational resources and consumes a great deal of time and power.
Disclosure of Invention
In view of this, the present disclosure provides a data processing method, a processor and an electronic device.
According to an aspect of the present disclosure, there is provided a data processing method applied to a processor including a plurality of computing cores, each including a dendrite unit, an axon unit, and a storage unit; wherein the axon unit comprises a distributor, the dendrite unit comprises a multiplier-adder array, and the method realizes multiplication and/or addition operation of processing data and weight data;
the method comprises the following steps: the axon unit acquires the processing data and the weight data from the storage unit; the axon unit distributes the processing data and the weight data to two output ports of the axon unit through the distributor; the two output ports of the axon unit are connected with the input port of a multiplier-adder array in the dendrite, and the multiplier-adder array is used for carrying out multiplication and/or addition operation according to the processing data and the weight data sent by the input port to obtain an operation result;
wherein the axon unit reads a plurality of processing data and a plurality of weight data in one access to the storage unit, and distributes the plurality of processing data and the plurality of weight data to two output ports of the axon unit through the distributor to multiply and/or add the plurality of processing data and the plurality of weight data in parallel through the multiplier-adder array.
In one possible implementation, the method further includes: the axon unit determines a control signal according to the total bit width of the processing data and the weight data, the bit width of the processing data and the weight data and the operation type in the two output ports, and transmits the control signal to the dendrite unit; and the dendritic unit starts a multiplier-adder in the multiplier-adder array according to the received control signal, and multiplies and/or adds the processing data and the weight data to obtain an operation result.
In one possible implementation form of the method, the control signal comprises a first enable signal and a second enable signal; the multiplier-adder array includes a plurality of multiplier-adders each including a plurality of multipliers performing multiplication operations of different precisions and a plurality of adders performing addition operations of different precisions, the first enable signal is used for determining enabled multipliers and/or adders in each enabled multiplier-adder; the second enabling signal is used for determining the number of rows and the number of columns of the enabled multiplier-adder.
In a possible implementation manner, the first enable signal is determined according to a bit width and an operation type of the processing data and the weight data, and the second enable signal is determined according to a total bit width, a bit width and an operation type of the processing data and the weight data in the two output ports.
In one possible implementation, the axon unit distributing the processing data and the weight data to two output ports of the axon unit through the distributor includes: the two output ports are a first output port for transmitting processing data and a second output port for transmitting weight data, and the bit width of the first output port is greater than that of the second output port; the axon unit distributes the processing data to the first output port and distributes the weight data to the second output port through the distributor; and under the condition that the bit width of the weight data is greater than the bit width of the second output port and the bit width of the processing data is not greater than the bit width of the second output port, the axon unit distributes the processing data to the second output port and distributes the weight data to the first output port through the distributor.
According to another aspect of the present disclosure, there is provided a processor comprising a plurality of computing cores, each computing core comprising a dendrite unit, an axon unit, and a storage unit; wherein the axon unit comprises a distributor, the dendrite unit comprises a multiplier-adder array, and the processor is used for realizing multiplication and/or addition operations of processing data and weight data, as follows:
the axon unit acquires the processing data and the weight data from the storage unit; the axon unit distributes the processing data and the weight data to two output ports of the axon unit through the distributor; the two output ports of the axon unit are connected to the input ports of the multiplier-adder array in the dendrite unit, and the multiplier-adder array is used for carrying out multiplication and/or addition operations according to the processing data and the weight data sent from the input ports to obtain an operation result;
wherein the axon unit reads a plurality of processing data and a plurality of weight data in one access to the storage unit, and distributes the plurality of processing data and the plurality of weight data to two output ports of the axon unit through the distributor to multiply and/or add the plurality of processing data and the plurality of weight data in parallel through the multiplier-adder array.
In one possible implementation manner, the axon unit determines a control signal according to a total bit width of the processing data and the weight data, a bit width of the processing data and the weight data, and an operation type in the two output ports, and transmits the control signal to the dendrite unit; and the dendritic unit starts a multiplier-adder in the multiplier-adder array according to the received control signal, and multiplies and/or adds the processing data and the weight data to obtain an operation result.
In one possible implementation, the control signal comprises a first enable signal and a second enable signal; the multiplier-adder array comprises a plurality of multiplier-adders, each multiplier-adder comprises a plurality of multipliers for carrying out multiplication operations with different precisions and a plurality of adders for carrying out addition operations with different precisions, and the first enable signal is used for determining the enabled multipliers and/or adders in each enabled multiplier-adder; and the second enable signal is used for determining the number of rows and the number of columns of the enabled multiplier-adders.
In a possible implementation manner, the first enable signal is determined according to a bit width and an operation type of the processing data and the weight data, and the second enable signal is determined according to a total bit width, a bit width and an operation type of the processing data and the weight data in the two output ports.
In one possible implementation, the axon unit distributing the processing data and the weight data to two output ports of the axon unit through the distributor includes: the two output ports are a first output port for transmitting processing data and a second output port for transmitting weight data, and the bit width of the first output port is greater than that of the second output port; the axon unit distributes the processing data to a first output port and distributes the weight data to a second output port through the distributor; and under the condition that the bit width of the weight data is greater than the bit width of the second output port and the bit width of the processing data is not greater than the bit width of the second output port, the axon unit distributes the processing data to the second output port and distributes the weight data to the first output port through the distributor.
According to another aspect of the present disclosure, there is provided an artificial intelligence chip, the chip comprising a processor as described above.
According to another aspect of the present disclosure, there is provided an electronic device, the electronic device comprises the artificial intelligence chip.
The axon unit reads a plurality of processing data and a plurality of weight data in one access to the storage unit, flexibly distributes the processing data and the weight data to the two output ports of the axon unit through the distributor, and performs multiplication and/or addition operations on the processing data and the weight data in parallel through the multiplier-adder array. For the same volume of data, this reduces the number of memory accesses, improves the flexibility and efficiency of data processing, and lowers computation power consumption.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a schematic diagram of a processor according to an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a data processing method according to an embodiment of the present disclosure;
FIG. 3 shows a flowchart of a data processing method according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of data allocation according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of the output ports of an axon unit according to an embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of the parallel operation of a group of multiplier-adders in a multiplier-adder array according to an embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of determining the working mode of a multiplier-adder according to an embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of a multiplier-adder according to an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a multiplier-adder array according to an embodiment of the present disclosure;
FIG. 10 shows a block diagram of an electronic device according to an embodiment of the present disclosure;
FIG. 11 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is to be understood that the described embodiments are only some embodiments, but not all embodiments, of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
A multiplier-adder, or multiply-accumulate unit (MAC), completes one multiply-add operation per clock cycle; the number of multiply-add operations performed per unit time may be used as an indicator of processor performance.
With the continuing growth of neural network scale and of the amount of data to be computed, the volume of processing data and weight data keeps increasing. The multiplier-adder, as the basic operation unit in a neural network, needs to read data from a storage unit frequently for multiply-add operations; these frequent storage-unit accesses spend a large number of clock cycles on data reads during operation, resulting in low computation efficiency and high power consumption. For example, when a processor executes simple instructions over large amounts of data, it frequently performs large data input/output operations while its arithmetic units sit idle.
In the related art, as the multiplier-adders in a processor are invoked more and more often, the number of multiplier-adders integrated in the processor also increases, and with it the area and power consumption that the multiplier-adders occupy.
In order to solve the above technical problems, the present application discloses a data processing method, a processor, and an electronic device that can read a plurality of processing data and a plurality of weight data at a time for parallel operation. For the same volume of data, the number of memory accesses is reduced, the flexibility and efficiency of data processing are improved, and computation power consumption is reduced.
Fig. 1 shows a schematic diagram of a processor according to an embodiment of the present disclosure, which, as shown in fig. 1, includes a plurality of computing cores, each including a processing component and a storage component. The processing means may comprise a dendrite unit, an axon unit, a soma unit, a routing unit. The storage part may include a plurality of storage units.
In one possible implementation, the processor may be a brain-inspired computing processor, i.e., a processor that draws on the brain's processing mode, improving processing efficiency and reducing power consumption by simulating the way neurons in the brain transmit and process information. The processor may include multiple computing cores that may independently handle different tasks, such as a convolution operation task, a pooling task, or a fully connected task; the cores may also process the same task in parallel, i.e., each computing core may process a different assigned portion of the same task, for example, some layers of a convolution task in a multilayer convolutional neural network. It should be noted that the present disclosure does not limit the number of computing cores in a processor or the tasks executed by the computing cores.
Within a computing core, processing components and storage components may be provided. The processing components may include a dendrite unit, an axon unit, a soma unit, and a routing unit. The processing components can simulate the way neurons of the brain process information: the axon unit is used for caching data (for example, processing data and weight data) and providing the dendrite unit with data to be operated on; the dendrite unit is used for receiving the data sent by the axon unit and performing multiply-accumulate operations on it; the soma unit is used for integrating and transforming signals, completing various nonlinear processing operations on the basis of the dendrite unit's results; and the routing unit is used for transmitting information to and from other computing cores. The processing components in a computing core may perform read-write access to the multiple storage units of the storage component for data interaction, and may respectively undertake their own data processing and/or data transmission tasks to obtain data processing results or communicate with other computing cores. The present disclosure does not limit the field of application of the processing components.
In one possible implementation, the storage component may include two or more storage units, and each storage unit may be a Static Random-Access Memory (SRAM). For example, the storage units may be SRAMs with a read/write width of 32 B and a capacity of 64 KB. The present disclosure does not limit the read/write width and capacity of the storage units.
In one possible implementation, fig. 2 shows a schematic diagram of a data processing method according to an embodiment of the present disclosure. As shown in fig. 2, each compute core includes a dendrite unit, an axon unit, and a storage unit; wherein the axon unit comprises a distributor, the dendrite unit comprises a multiplier-adder array, and the method realizes multiplication and/or addition operation of processing data and weight data.
For example, as shown in fig. 2, the axon unit may be in connected communication with the storage component and the dendrite unit, e.g., the axon unit may read the processing data and the weight data from the storage component and transmit the read processing data and the weight data to the dendrite unit.
The axon unit includes a distributor for allocating the data the axon unit reads from the storage unit. For example, the distributor may perform a data allocation operation on the processing data and weight data read from the storage unit by the axon unit, after which the axon unit transmits the allocated processing data and weight data to the dendrite unit.
The dendrite unit includes a multiplier-adder array for multiplying and/or adding data (e.g., processing data and weight data) received by the dendrite unit to obtain a multiplication-addition result. For example, as shown in fig. 2, the multiplier-adder array includes k × n multiplier-adders arranged in n rows and k columns, and the number of multiplier-adders in the multiplier-adder array is not limited by the present disclosure.
It should be appreciated that multiplying and/or adding the process data and the weight data may include: accumulation operation of the processing data, addition operation of the processing data and the weight data, multiplication operation of the processing data and the weight data, weighted summation operation (convolution operation) of the processing data and the weight data, and the like, which are not limited in the present disclosure.
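To make these operation families concrete, the following is a minimal scalar sketch in Python (illustrative only, not part of the patent; the function names are assumptions):

    # Illustrative scalar forms of the operations named above.
    def accumulate(xs):
        return sum(xs)                      # accumulation of processing data

    def add(x, w):
        return x + w                        # addition of processing and weight data

    def multiply(x, w):
        return x * w                        # multiplication of processing and weight data

    def weighted_sum(xs, ws):
        # weighted summation (one convolution step)
        return sum(x * w for x, w in zip(xs, ws))

    print(weighted_sum([1, 2, 3], [4, 5, 6]))   # 4 + 10 + 18 = 32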
In one possible implementation, FIG. 3 shows a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in FIG. 3, the method includes:
in step S1, the axon unit acquires the processing data and the weight data from the storage unit;
in step S2, the axon unit distributes the processing data and the weight data to two output ports of the axon unit through the distributor;
in step S3, two output ports of the axon unit are connected to an input port of a multiplier-adder array in a dendrite, where the multiplier-adder array is configured to perform multiplication and/or addition operation according to the processing data and the weight data sent from the input port to obtain an operation result.
In step S1, the storage unit may store the processing data and the weight data, and the storage unit is accessed through the axon unit to read the processing data and the weight data.
In one possible implementation, the axon unit reads a plurality of processing data and a plurality of weight data in one access to the storage unit, and distributes the plurality of processing data and the plurality of weight data to the two output ports of the axon unit through the distributor, so as to multiply and/or add the plurality of processing data and the plurality of weight data in parallel through the multiplier-adder array.
For example, suppose a multilayer convolution operation is performed on a plurality of images, and the image data and the weight data of each layer are stored in a storage unit. The axon unit accesses the head addresses of the image data and of each layer's weight data in the storage unit, and can read the plurality of image data (i.e., a plurality of processing data) and the weight data of each layer (i.e., a plurality of weight data) from the storage unit at one time, in the reading order set in the storage unit (for example, from low addresses to high addresses, which the present disclosure does not limit).
This reading mode allows the processing data and the weight data to be stored in the same storage unit, and because the axon unit reads a plurality of processing data and a plurality of weight data from the storage unit at a time, the number of accesses to the computing core's storage unit is reduced.
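As a rough illustration of why batched reads cut access counts, the following Python sketch compares per-operand accesses with wide batched reads; the 256-bit (32 B) read width follows the SRAM example above, while the function names and counting model are assumptions:

    import math

    def accesses_per_operand(num_x, num_w):
        # one storage-unit access per processing datum and per weight datum
        return num_x + num_w

    def accesses_batched(num_x, num_w, x_bits, w_bits, read_bits=256):
        # each wide read returns as many operands as fit in one 32 B line
        per_read_x = read_bits // x_bits
        per_read_w = read_bits // w_bits
        return math.ceil(num_x / per_read_x) + math.ceil(num_w / per_read_w)

    # 1024 int8 processing data and 1024 int8 weight data:
    print(accesses_per_operand(1024, 1024))        # 2048 accesses
    print(accesses_batched(1024, 1024, 8, 8))      # 64 accesses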
In step S2, the axon unit may control the distributor through the axon control command, perform data distribution on the processing data and the weight data read by the axon unit, and transmit the distributed processing data and weight data to two output ports of the axon unit.
In one possible implementation, the axon unit includes two output ports, the two output ports being a first output port for transmitting processing data and a second output port for transmitting weight data, wherein the bit width of the first output port is greater than that of the second output port;
the axon unit distributes the processing data to a first output port and distributes the weight data to a second output port through the distributor;
wherein, in case that the bit width of the weight data is larger than the bit width of the second output port, and the bit width of the processing data is not larger than the bit width of the second output port, the axon unit distributes the processing data to a second output port and distributes the weight data to a first output port through the distributor.
For example, FIG. 4 shows a schematic diagram of data allocation according to an embodiment of the present disclosure. As shown in FIG. 4, the distributor of the axon unit may include two output ports, an A port (i.e., the first output port) and a B port (i.e., the second output port). The bit width of the A port of the distributor is 32 bits (int[0:7] × 4, equivalent to int32), and the bit width of the B port is 9 bits (int[0:8], i.e., int8 plus a sign bit, that is, int9). The distributor can distribute the received processing data x and weight data w to the A port or the B port, and the axon unit transmits the allocated weight data w and processing data x to the dendrite unit through the A port and the B port for multiply-add operations.
The computing core performs data processing on the processing data and the weight data, for example convolution operations on a plurality of image data. In most cases the data volume of the processing data is larger than that of the weight data, so the A port with the larger bit width can be set as the port for transmitting processing data, and the B port with the smaller bit width as the port for transmitting weight data.
It should be understood that in the case where the bit width of the weight data is greater than the bit width of the B port, and the bit width of the processing data is less than or equal to the bit width of the B port, the a port having a larger bit width ratio may be set as the port for transmitting the weight data, and the B port having a smaller bit width ratio may be set as the port for transmitting the processing data.
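A minimal sketch of this port-assignment rule, assuming the 32-bit A port and 9-bit B port of the running example (the function and its return format are illustrative, not the patent's interface):

    def assign_ports(x_bits, w_bits, a_width=32, b_width=9):
        # Default: processing data x -> wider A port, weight data w -> narrower B port.
        # Swap when the weights do not fit in B but the processing data do.
        if w_bits > b_width and x_bits <= b_width:
            return {"A": "weight data w", "B": "processing data x"}
        return {"A": "processing data x", "B": "weight data w"}

    print(assign_ports(x_bits=8, w_bits=2))    # {'A': 'processing data x', 'B': 'weight data w'}
    print(assign_ports(x_bits=8, w_bits=32))   # {'A': 'weight data w', 'B': 'processing data x'}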
For example, FIG. 5 shows a schematic diagram of the output ports of an axon unit according to an embodiment of the disclosure. As shown in FIG. 5, the axon unit performs data allocation on the weight data w and the processing data x read from the storage unit, and sends the data to be operated on (processing data and weight data) allocated to the A port (i.e., the first output port) and the B port (i.e., the second output port) to the dendrite unit. By sending a control signal to the dendrite unit, the axon unit can control the N multiplier-adders of the dendrite unit to multiply and/or add the received data to be operated on to obtain an operation result.
As shown in FIG. 5, it is assumed that the A port is the port for transmitting processing data x and has a port bit width of 32 bits (i.e., A[0:31]), and the B port is the port for transmitting weight data w and has a port bit width of 9 bits (i.e., B[0:8]).
When the data type of the data to be operated is int32, the distributor of the axon unit distributes the data to be operated with the data type of int32 to the port A, and transmits the data to be operated to the dendrite unit through the port A. For example, the distributor of the axon unit may distribute the read data of type int32 of the data to be subjected to the accumulation operation to the a port, and transmit the data to be subjected to the accumulation operation to the dendrite unit through the a port.
Or, in the case that the data type to be calculated is int8 (or int 9), the distributor of the axon unit distributes the processing data with the data type of int8 to the a port, and distributes the weight data with the data type of int8 (or int9, that is, int8 with a symbol) to the B port. The bit width of the a port of the axon unit is 32 bits, and 4 processing data with bit width type int8 (8 bits) can be read at a time. For example, the distributor of the axon unit may distribute the processing data to be subjected to convolution operation and having a data type of int8 to the a port, distribute the weight data having a data type of int8 to the B port, and transmit the read data to be subjected to convolution operation to the dendrite unit through the a port and the B port, so that 4 pieces of processing data having a data type of int8 and one piece of weight data having a data type of int8 (or int 9) may be transmitted at one time.
Or, in the case that the data type to be calculated is int2, the distributor of the axon unit distributes the processing data with the data type of int2 to the a port, and distributes the weight data with the data type of int2 to the B port. Wherein, the bit width of the A port of the axon unit is 32 bits, processing data with 16 bit width types of int2 (2 bits) can be read at one time; the bit width of the B port of the axon unit is 9 bits, and 4 bits wide weight data with type int2 (2 bits) can be read at a time. For example, the distributor of the axon unit may distribute the processing data to be subjected to convolution operation and having a data type of int2 to the a port, distribute the weight data having a data type of int2 to the B port, and transmit the read data to be subjected to convolution operation to the dendrite unit through the a port and the B port, so that 16 pieces of processing data having a data type of int2 and 4 pieces of weight data having a data type of int2 may be transmitted at one time.
Or, in the case that the data to be operated on are processing data of type int2 and weight data of type int8 (or int9), the distributor of the axon unit distributes the processing data of type int2 to the A port and the weight data of type int8 (or int9) to the B port. The bit width of the A port of the axon unit is 32 bits, and 16 processing data of bit-width type int2 (2 bits) can be read at one time; the bit width of the B port of the axon unit is 9 bits, and 1 weight datum of type int8 (or int9) can be read at a time. For example, the distributor of the axon unit may distribute the processing data of type int2 to be convolved to the A port and the weight data of type int8 (or int9) to the B port, and transmit the read data to the dendrite unit through the A port and the B port, so that 16 processing data of type int2 and 1 weight datum of type int8 (or int9) can be transmitted at one time.
Or, in the case that the data to be operated on are processing data of type int32 and weight data of type int8 (or int9), the distributor of the axon unit distributes the processing data of type int32 to the A port and the weight data of type int8 (or int9) to the B port. The bit width of the A port of the axon unit is 32 bits, and 1 processing datum of bit-width type int32 can be read at one time; the bit width of the B port of the axon unit is 9 bits, and 1 weight datum of type int8 (or int9) can be read at a time. The axon unit transmits the read data to be operated on to the dendrite unit through the A port and the B port.
Or, in the case where weight data of type int32 and processing data of type int8 (or int9) are to be operated on (that is, where the bit width of the int32 weight data is greater than the bit width of the B port and the bit width of the int8 processing data is not greater than the bit width of the B port), the distributor of the axon unit distributes the weight data of type int32 to the A port and the processing data of type int8 (or int9) to the B port. The bit width of the A port of the axon unit is 32 bits, and 1 weight datum of bit-width type int32 can be read at one time; the bit width of the B port of the axon unit is 9 bits, and 1 processing datum of type int8 (or int9) can be read at a time. The axon unit transmits the read data to be operated on to the dendrite unit through the A port and the B port.
It will be appreciated that the A port and B port of the distributor are virtual ports, through which the processing data x and the weight data w may be transmitted to the dendrite unit. The distributor of the axon unit may be implemented as hardware pointers. The present disclosure does not limit the port bit widths of the distributor or its specific form.
By arranging the distributor in the axon unit, and without changing the bit widths of the axon unit's two output ports, the distributor can flexibly allocate the processing data and the weight data to the first output port or the second output port according to their bit widths. It can thus flexibly support the allocation and transmission of data of different precisions (i.e., data of different bit widths), improving the flexibility and effectiveness of computation.
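The per-transfer operand counts enumerated above follow directly from integer division of the port width by the element width, as this hedged sketch shows (port widths from the running example; the helper is an illustrative assumption):

    def operands_per_transfer(port_bits, elem_bits):
        # number of elements of a given bit width that fit in one port word
        return port_bits // elem_bits

    print(operands_per_transfer(32, 2))    # 16 x int2 on the 32-bit A port
    print(operands_per_transfer(32, 8))    # 4 x int8 on the A port
    print(operands_per_transfer(32, 32))   # 1 x int32 on the A port
    print(operands_per_transfer(9, 2))     # 4 x int2 on the 9-bit B port
    print(operands_per_transfer(9, 9))     # 1 x int9 on the B port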
In step S3, as shown in FIG. 5, the two output ports (A port and B port) of the axon unit are connected to the input ports of the multiplier-adder array in the dendrite unit, and the multiplier-adder array (including multiplier-adder 1 to multiplier-adder N) of the dendrite unit can multiply and/or add the weight data and the processing data sent from the A port and the B port to obtain an operation result.
As shown in FIG. 2, the dendrite unit includes a multiplier-adder array, which may include k × n multiplier-adders; the number of multiplier-adders in the multiplier-adder array is not limited in the present disclosure.
in one possible implementation, the multiplier-adder array may be logically based on a crossbar array (crossbar), which may operate on t processed data at a time, and n sets of weight data to obtain n sets of operation results. Wherein, each group of weight data can correspond to k convolution kernels, or k weight data), the method described in the embodiment of the present application may be executed to perform operations on k weight data and t processing data in each set of weight data.
For example, assuming that the multiplier-adder array includes n groups of multiplier-adders with k multiplier-adders in each group, FIG. 6 shows a schematic diagram of the parallel operation of one group (any one of the n groups) of multiplier-adders in the multiplier-adder array according to an embodiment of the disclosure. As shown in FIG. 6, a group of multiplier-adders may include k multiplier-adders, and each column in the figure may represent one multiplier-adder, i.e., multiplier-adder 1 through multiplier-adder k.
A group of multiply-add units can read t processed data and k convolution kernels at a time (i.e., one instruction cycle, including t sub-times, i.e., sub-time 1 to sub-time t). The data processing method according to the embodiment of the present application may be performed once per instruction cycle.
The t processing data can be represented as $x_1, x_2, \dots, x_t$; that is, each multiplier-adder reads the processing data $x_1$ at sub-time 1, reads $x_2$ at sub-time 2, and so on, reading $x_t$ at sub-time t.
The k convolution kernels may be denoted as $w_1, w_2, \dots, w_k$. The group of multiplier-adders can read the k convolution kernels in parallel at one time: multiplier-adder 1 reads convolution kernel $w_1$, multiplier-adder 2 reads $w_2$, and so on, up to multiplier-adder k reading $w_k$.
It should be understood that the number k of convolution kernels can be set according to the performance of the computation kernel (e.g., the number of multipliers and adders, the read bit width of the storage unit, etc.), and the disclosure is not limited thereto.
As shown in FIG. 6, the group of multiplier-adders can perform a convolution operation at a time on t processing data ($x_1 \sim x_t$) and k convolution kernels ($w_1 \sim w_k$). The operation result of multiplier-adder 1 is $\sum_{i=1}^{t} x_i \cdot w_1$, the operation result of multiplier-adder 2 is $\sum_{i=1}^{t} x_i \cdot w_2$, and by analogy, the operation result of multiplier-adder k is $\sum_{i=1}^{t} x_i \cdot w_k$, where each $x_i$ ($i = 1 \sim t$) can be operated on in parallel with each of the convolution kernels $w_1 \sim w_k$.
The multiplier-adder array may include n such groups of multiplier-adders as shown in FIG. 6, so that the array can perform multiply-add operations on the t processing data and the n sets of weight data in parallel to obtain n sets of operation results, where each set of weight data may include k convolution kernels. The operation of each group of the multiplier-adder array can refer to the method shown in FIG. 6 and will not be described in detail here.
The multiplier-adder array processes the multiply-add operations of a plurality of processing data and a plurality of weight data in parallel, avoiding the situation where a multiplier-adder accesses the storage unit once for every multiply-add of a single processing datum and a single weight datum. For the same volume of data to be operated on, this reduces the number of storage-unit accesses and improves computation efficiency.
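Functionally, the array computes, for each group and each of its k multiplier-adders, the sum over the t sub-times; the following pure-Python stand-in sketches this behavior (the nested-list layout and names are assumptions, and hardware would evaluate the groups in parallel rather than in a loop):

    def crossbar_mac(x, weight_sets):
        # x: t processing data; weight_sets: n groups of k weights each.
        # Returns n groups of k results, result[j] = sum_i x_i * w_j.
        return [[sum(xi * wj for xi in x) for wj in group]
                for group in weight_sets]

    x = [1, 2, 3]                    # t = 3 processing data
    w = [[1, -1], [2, 0]]            # n = 2 groups, k = 2 kernels per group
    print(crossbar_mac(x, w))        # [[6, -6], [12, 0]]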
In one possible implementation manner, the axon unit determines a control signal according to a total bit width of the processing data and the weight data, a bit width of the processing data and the weight data, and an operation type in the two output ports, and transmits the control signal to the dendrite unit;
and the dendritic unit starts a multiplier-adder in the multiplier-adder array according to the received control signal, and multiplies and/or adds the processing data and the weight data to obtain an operation result.
For example, as shown in FIG. 5, in the process in which the axon unit distributes the processing data x and the weight data w to the two output ports of the axon unit through the distributor, i.e., to the A port serving as the first output port and the B port serving as the second output port, the axon unit determines the control signal according to the total bit width of the processing data and the weight data in the A port and the B port, the bit widths of the processing data and the weight data, and the operation type.
For example, assuming that the bit width of the A port is 32 bits, the bit width of the B port is 9 bits, the processing data are of type int8 (i.e., the bit width of the processing data is 8), and the weight data are of type int2 (i.e., the bit width of the weight data is 2), the axon unit may transmit the int8 processing data and the int2 weight data to the dendrite unit for multiplication.
The sum of the bit widths of the processing data of the plurality of int8 types which can be read by the port A of the axon unit at one time is not more than the bit width of the port A; the sum of bit widths of the weight data of multiple int2 types that can be read by the B port at a time is not greater than the bit width of the B port.
When the two output ports of the axon unit transmit 4 processing data of type int8 and 4 weight data of type int2 to the dendrite unit at a time for multiplication, the axon unit may determine the control signal according to the total bit width of the processing data (8 bits × 4 = 32 bits), the total bit width of the weight data (2 bits × 4 = 8 bits), the bit width of the processing data (8 bits), the bit width of the weight data (2 bits), and the operation type of multiplying int8 data by int2 data.
And the axon unit transmits the control signal to the dendrite unit, so that the dendrite unit can start the multiplier-adder in the multiplier-adder array according to the received control signal, and multiply the processing data of the int8 type and the weight data of the int2 type in parallel to obtain an operation result.
The axon unit sends a control signal to the dendritic unit, so that the starting of the multiplier-adder in the multiplier-adder array in the dendritic unit can be flexibly controlled, the operation of data with different precisions can be flexibly supported, and the parallelism and the flexibility of calculation are increased.
In one possible implementation, the control signal includes a first enable signal and a second enable signal;
the multiplier-adder array comprises a plurality of multiplier-adders, each multiplier-adder comprises a plurality of multipliers for carrying out multiplication operations with different precisions and a plurality of adders for carrying out addition operations with different precisions, and the first enabling signal is used for determining enabled multipliers and/or adders in each enabled multiplier-adder;
the second enabling signal is used for determining the number of rows and the number of columns of the enabled multiplier-adder.
In one possible implementation, fig. 7 shows a schematic diagram of determining a multiplier-adder operating mode according to an embodiment of the disclosure. As shown in fig. 7, the axon unit performs bit width judgment according to the data types of the processing data and the weight data, and determines a calculation mode and a corresponding first enable signal. The axon unit sends the first enable signal to the dendrite unit, and each multiplier-adder of the dendrite unit can determine the working mode of the multiplier-adder according to the sent first enable signal.
The calculation modes may include: int32 addition, int2 × int2 multiplication, int2 × int9 multiplication (int9 being signed int8), and int9 × int9 multiplication.
Int32 addition represents a calculation mode in which processing data of type int32 are accumulated by the dendrite unit; the axon unit sends the first enable signal corresponding to this calculation mode to the dendrite unit, which can control the corresponding multiplier-adders in the multiplier-adder array to be in the working mode with the 32-bit adder (accumulator) turned on.
Int2 × int2 multiplication represents a calculation mode in which the dendrite unit multiplies processing data of type int2 by weight data of type int2; the axon unit sends the first enable signal corresponding to this calculation mode to the dendrite unit, which can control the corresponding multiplier-adders in the multiplier-adder array to be in the working mode with the XOR gates and AND gates (namely, the 2-bit multipliers) turned on.
Int2 × int9 multiplication represents a calculation mode in which the dendrite unit multiplies processing data of type int2 by weight data of type int9, or processing data of type int9 by weight data of type int2; the axon unit sends the first enable signal corresponding to this calculation mode to the dendrite unit, which can control the corresponding multiplier-adders in the multiplier-adder array to be in the working mode with the multiplier and adder turned on.
Int9 × int9 multiplication represents a calculation mode in which the dendrite unit multiplies processing data of type int9 by weight data of type int9; the axon unit sends the first enable signal corresponding to this calculation mode to the dendrite unit, which can control the corresponding multiplier-adders in the multiplier-adder array to be in the working mode with the multiplier and adder turned on.
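The bit-width judgment of FIG. 7 can be summarized as a mapping from operand types and operation type to a calculation mode; the sketch below is an illustrative reading of that mapping (the signal labels S0, S, and S1 follow the FIG. 8 discussion below, and the function itself is an assumption):

    def first_enable_signal(x_type, w_type, op):
        if op == "add" and x_type == "int32":
            return "S1"   # turn on the 32-bit adder (accumulator)
        if op == "mul" and x_type == "int2" and w_type == "int2":
            return "S0"   # turn on the XOR/AND-gate 2-bit multipliers
        if op == "mul" and {x_type, w_type} <= {"int2", "int8", "int9"}:
            return "S"    # turn on the Mult8 multiplier and adder
        raise ValueError("mode not covered by this sketch")

    print(first_enable_signal("int32", "int32", "add"))   # S1
    print(first_enable_signal("int2", "int2", "mul"))     # S0
    print(first_enable_signal("int2", "int9", "mul"))     # S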
In one possible implementation, FIG. 8 shows a schematic diagram of a multiplier-adder according to an embodiment of the disclosure. As shown in FIG. 8, the multiplier-adder Mult8_2 includes an adder Add32_16 unit, a multiplier Mult8 unit (i.e., an int8-precision multiplier), and four Mult2 units (i.e., int2-precision multipliers) composed of XOR gates and AND gates. The present disclosure does not limit the number of Mult2 units composed of XOR gates and AND gates in the multiplier-adder.
When the first enable signal received by the multiplier-adder is S1 (corresponding to the int32 addition calculation mode), the adder Add32 of the Add32_16 unit can be turned on by the first enable signal S1 and placed in a working state, accumulating the processing data sent through the A port. In this case, the multiplier Mult8 unit and the Mult2 units composed of XOR gates and AND gates in FIG. 8 are in an off state.
When the first enable signal received by the dendrite unit is S (corresponding to the int9 × int9 multiplication mode, or the int2 × int9 multiplication mode), the multiplier Mult8 unit can be turned on by the first enable signal S and placed in a working state, multiplying the processing data and the weight data sent through the A port and the B port. In this case, the Mult2 units composed of XOR gates and AND gates on the right side of FIG. 8 are in an off state.
When the first enable signal received by the dendrite unit is S0 (corresponding to the int2 × int2 multiplication mode), the Mult2 units composed of XOR gates and AND gates can be turned on by the first enable signal S0 and placed in a working state, multiplying the processing data of type int2 and the weight data sent through the A port and the B port. In this case, the multiplier Mult8 unit on the left side of FIG. 8 is in an off state.
In summary, when the data type of the data to be operated on includes int9, for example an 8-bit signed or unsigned data type, the Mult8 unit can be placed in the on state; when the data type of the data to be operated on is int2 (2 bits), for example binary (0,1) or ternary (-1,0,1) data, the Mult2 units can be placed in the working state.
In one possible implementation, for the same amount of data, for example when multiplying data of types int2 and int2, the multiplier Mult8 unit and the adder Add32 on the left side of FIG. 8 may be used; the energy consumption of a multiplier-adder array composed of 32 such multiplier-adders is estimated as follows:
Mult8: 0.2 pJ × 1 = 0.2 pJ
Add32: 0.1 pJ
MACs: 0.3 pJ × 32 = 9.6 pJ
At 400 MHz: 9.6 pJ × 400 MHz = 3.84 mW
Alternatively, the Mult2 units and the adder Add16 on the right side of FIG. 8 can be used; the energy consumption of a multiplier-adder array composed of 32 such multiplier-adders is estimated as follows:
Mult2: 0.05 pJ ÷ 16 × 4 ≈ 0.01 pJ
Add16: 0.05 pJ × 4 = 0.2 pJ
MACs: 0.21 pJ × 32 ≈ 6.7 pJ
At 400 MHz: 6.7 pJ × 400 MHz ≈ 2.7 mW
Comparing the two estimates shows that using the Mult2 units reduces power consumption by 3.84 - 2.7 = 1.14 mW for the same workload. Here, 400 MHz is the working frequency of the computing core, i.e., the rate at which instructions are executed per unit time; it may be determined by the specific performance of the computing core hardware, which the present disclosure does not limit.
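The arithmetic behind this comparison can be checked numerically; the sketch below reproduces the two estimates from the per-operation energies quoted above (the small differences from 6.7 pJ, 2.7 mW, and 1.14 mW come from rounding in the text):

    FREQ_HZ = 400e6    # 400 MHz operating frequency

    def array_power_mw(energy_per_mac_pj, num_macs=32):
        # power = energy per cycle x frequency; 1 pJ * 1 Hz = 1e-12 W
        return energy_per_mac_pj * num_macs * FREQ_HZ * 1e-12 * 1e3

    mult8_path = array_power_mw(0.2 + 0.1)                  # Mult8 + Add32
    mult2_path = array_power_mw(0.05 / 16 * 4 + 0.05 * 4)   # Mult2 + Add16
    print(round(mult8_path, 2))                 # 3.84 mW
    print(round(mult2_path, 2))                 # ~2.72 mW (text rounds to 2.7)
    print(round(mult8_path - mult2_path, 2))    # ~1.12 mW saved (text: 1.14)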
According to the first enable signal, the dendrite unit can determine different calculation modes for data of different precisions and turn on different multipliers and adders, i.e., the multipliers performing multiplications of different precisions and the adders performing additions of different precisions within each multiplier-adder. It can thus flexibly support operations on data of different precisions, improving the computation efficiency of the chip and reducing the power consumption of the multiplier-adder array.
In one possible implementation, the dendrite unit determines the number of multiplier-adders turned on in the multiplier-adder array according to the second enable signal.
For example, the second enable signal may include a row enable signal and a column enable signal, and the number of multiplier-accumulator sets that are turned on in the multiplier-accumulator array and the number of multiplier-accumulators that are turned on in each set are determined based on the row enable signal and the column enable signal.
The row enable signal is used to control the number of groups of the multiplier-adder array that are turned on (i.e., the number of rows turned on) and which specific rows are turned on; for example, 1 group, 2 groups, 3 groups, up to n groups, etc. can be flexibly turned on according to the row enable signal. The column enable signal is used to control the number of multiplier-adders turned on in each group (i.e., the number of columns turned on) and which specific columns are turned on; for example, 1, 2, 3, up to k columns, etc. can be flexibly turned on per group according to the column enable signal. The present disclosure does not limit the number of groups in the multiplier-adder array or the number of multiplier-adders per group.
Wherein each multiplier-adder in the multiplier-adder array may comprise at least one enable terminal. And controlling the enabling end of each multiplier-adder in the multiplier-adder array according to the row enabling signal and the column enabling signal so that each multiplier-adder is in an opening or closing state.
For example, FIG. 9 shows a schematic diagram of a multiplier-adder array according to an embodiment of the disclosure. As shown in FIG. 9, the multiplier-adder array includes 128 multiplier-adders arranged in a 32 × 4 array, i.e., the multiplier-adders are arranged in 4 groups (rows), with 32 multiplier-adders in each group.
Each multiplier-adder includes a first input terminal, a second input terminal, an enable terminal, and an output terminal. The first and second input terminals are used for inputting the data to be multiply-added; for example, the first input terminal InA (InA_0 to InA_3) is used for inputting processing data (data sent from the A port), and the second input terminal InB (InB_0 to InB_31) is used for inputting weight data (data sent from the B port). The output terminal V is used for outputting the result of the multiplication and/or addition operation. The enable terminal en is used for turning on a multiplier-adder so that it is in a working state; multiplier-adders that are not turned on are in an off state.
As shown in FIG. 9, the row enable signal (ver_mac_en[0:3]) and the column enable signal (hor_mac_en[0:31]) pass through AND gates and then act on the enable terminal en of each multiplier-adder. The row enable signal comprises four variables, ver_mac_en_0 to ver_mac_en_3, which respectively control the on state of the four horizontal groups of multiplier-adders in FIG. 9; the column enable signal comprises 32 variables, hor_mac_en_0 to hor_mac_en_31, which respectively control the number of multiplier-adders turned on in each group, i.e., the on state of the 32 vertical columns in FIG. 9.
For example, assuming that the enable terminal en of the multiplier-adder is high to turn on the multiplier-adder, in case that it is required to turn on the multiplier-adder of the first group, the hor _ mac _ en _0 of the row enable signal may be set to 1 (high level), and the ver _ mac _ en _0 of the column enable signal may also be 1 (high level), after the row enable signal hor _ mac _ en _0 and the column enable signal ver _ mac _ en _0 pass through the and gate (ver _ mac _ en _0 and ver \mac _ en \0 = 1) are high level, which acts on the enable terminal en of the first multiplier-adder of the first group to turn on the multiplier-adder.
Conversely, when the first multiplier-adder of the first group is not needed, ver_mac_en_0 of the row enable signal may be set to 0 (low level), or hor_mac_en_0 of the column enable signal may be set to 0 (low level); after the two signals pass through the AND gate (ver_mac_en_0 & hor_mac_en_0 = 0), the resulting low level keeps the multiplier-adder off.
It should be understood that the present disclosure does not limit the way the enable terminal en of the multiplier-adder is controlled; for example, the enable terminal en may instead be active-low, so that a low level turns the multiplier-adder on.
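As a minimal sketch of the row/column gating just described (not taken from the patent text; the signal names and the 4 × 32 shape follow the fig. 9 example, and Python is used only for illustration), the per-multiplier-adder enable can be modeled as the AND of one row-enable bit and one column-enable bit:

ROWS, COLS = 4, 32  # 4 groups (rows) of 32 multiplier-adders (columns)

def mac_enables(ver_mac_en, hor_mac_en):
    """ver_mac_en: ROWS row-enable bits; hor_mac_en: COLS column-enable bits.
    Returns a ROWS x COLS grid of per-multiplier-adder enable bits (en)."""
    assert len(ver_mac_en) == ROWS and len(hor_mac_en) == COLS
    return [[r & c for c in hor_mac_en] for r in ver_mac_en]

# Turn on only the first group (ver_mac_en_0 = 1) with all 32 columns enabled.
en = mac_enables([1, 0, 0, 0], [1] * 32)
assert sum(map(sum, en)) == 32  # exactly 32 multiplier-adders receive en = 1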
Since each multiplier-adder in the multiplier-adder array has an enable terminal en, the dendrite unit can turn each multiplier-adder on or off according to the second enable signal (the row enable signal and the column enable signal) it receives from the axon unit.
Therefore, the number of multiplier-adders turned on in the multiplier-adder array is determined by the second enable signal, and all or only part of the multiplier-adders may be turned on; for example, the array may be configured with 8, 16, 32, etc. groups, or with 32, 64, 128, etc. multiplier-adders per group. Multiplier-adders that do not participate in an operation can be turned off, reducing the power consumption of the multiplier-adder array.
In a possible implementation manner, the first enable signal is determined according to the bit width and operation type of the processing data and the weight data, and the second enable signal is determined according to the total bit width of the processing data and the weight data in the two output ports, their bit width, and the operation type.
In one possible implementation, the axon unit may set the first enable signal and the second enable signal according to the total bit width of the processing data and the weight data in the two output ports, the data type of the processing data and the weight data (i.e., the bit width: int32, int8/int9, int2), and the operation type (convolution, addition, multiplication).
The first enable signal may be determined according to the data type (int32, int8/int9, int2) and the operation type (convolution, addition, multiplication) of the processing data and the weight data; the second enable signal may be determined according to the total bit width of the plurality of processing data and the plurality of weight data in the two output ports of the axon unit, the data type (int32, int8/int9, int2) of the processing data and the weight data, and the operation type (convolution, addition, multiplication).
For example, the axon unit may send the data to be multiplied (processing data and weight data) to the dendrite unit in parallel through the A port and the B port.
Suppose the axon unit sends 8 processing data (x_1 to x_8) of data type int2 through the A port (bit width 32), and sends 4 weight data (w_1 to w_4) of data type int2 through the B port (bit width 9). The axon unit may determine the first enable signal and the second enable signal based on the processing data (x_1 to x_8), the weight data (w_1 to w_4), and the multiplication operation mode, and send both signals to the dendrite unit.
Specifically, the first enable signal may be determined from the data type int2 of the processing data (x_1 to x_8) and the weight data (w_1 to w_4) and the multiplication operation type; the second enable signal may be determined from the total bit width 16 of the processing data (x_1 to x_8) on the A port of the axon unit (2 × 8 for int2 data), the total bit width 8 of the weight data (w_1 to w_4) on the B port (2 × 4 for int2 data), the data type int2 of the processing data and the weight data, and the multiplication operation type.
It should be understood that the total bit width of the data sent by the A port does not exceed 32 bits; that is, the number of int2 data that the A port can transmit at a time may be any integer not exceeding 16, and 8 is merely taken as an example above. Provided the total bit width of the data sent by each output port does not exceed the bit width of that port, the present disclosure does not limit the number of data sent by each port at a time.
In one example, the number of multiplier-adders to turn on is obtained by dividing the total bit width of the data in the A port by the bit width of each A-port datum (giving the number of A-port data), dividing the total bit width of the data in the B port by the bit width of each B-port datum (giving the number of B-port data), and multiplying the two results. For example, the dendrite unit turns on 32 (8 × 4) multiplier-adders in the multiplier-adder array according to the received second enable signal, and, according to the received first enable signal, enables in each turned-on multiplier-adder the Mult2 unit composed of XOR gates and AND gates. The specific rows and columns in which multiplier-adders are turned on may be determined according to their idle state, which is not limited in the present application. The adders and/or multipliers enabled in a multiplier-adder should satisfy the required operation type and operation precision, where the operation precision is determined by the bit widths of the processing data and the weight data: for example, when the processing data and the weight data are both int2 and the operation mode is multiplication, the 2-bit × 2-bit multiplier (Mult2) unit is enabled.
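A minimal Python sketch of this counting rule (illustrative only; the function name is an assumption, not the patent's implementation):

def macs_to_enable(a_total_bits, a_data_bits, b_total_bits, b_data_bits):
    # Multiplier-adders to enable = (number of A-port data) x (number of B-port data).
    assert a_total_bits % a_data_bits == 0 and b_total_bits % b_data_bits == 0
    num_a = a_total_bits // a_data_bits  # e.g. 16 // 2 = 8 int2 processing data
    num_b = b_total_bits // b_data_bits  # e.g. 8 // 2 = 4 int2 weight data
    return num_a * num_b

assert macs_to_enable(16, 2, 8, 2) == 32  # int2 example: 8 x 4 multiplier-adders
assert macs_to_enable(32, 8, 8, 8) == 4   # int8 example below: 4 x 1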
The 32 multiplier-adders turned on in the multiplier-adder array can then multiply the received processing data (x_1 to x_8) and weight data (w_1 to w_4) in parallel.
For example, the multiplier-adder array may be turned on as 4 groups of 8 multiplier-adders each. The first group of 8 multiplier-adders performs the multiplication of the processing data (x_1 to x_8) with the weight data w_1, where the first multiplier-adder of the first group performs the multiplication of x_1 and w_1; the second group of 8 multiplier-adders performs the multiplication of the processing data (x_1 to x_8) with the weight data w_2; the third group, with the weight data w_3; and the fourth group, with the weight data w_4.
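This group-per-weight mapping can be sketched as follows (a Python illustration with arbitrary int2 sample values in the range -2..1; not taken from the patent):

x = [1, -2, 1, 0, -1, 1, 1, -2]  # processing data x_1..x_8
w = [1, -1, 1, 0]                # weight data w_1..w_4

# Group g multiplies every x_i by the shared weight w_g; all 32 products
# can be computed in parallel by the 4 x 8 block of multiplier-adders.
products = [[xi * wg for xi in x] for wg in w]
assert products[0][0] == x[0] * w[0]  # first multiplier-adder: x_1 * w_1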
As another example, the axon unit may send the data to be multiplied (processing data and weight data) to the dendrite unit in parallel through the A port and the B port.
Suppose the axon unit sends 4 processing data (x_1 to x_4) of data type int8 through the A port (bit width 32), and sends 1 weight data (w_1) of data type int8 through the B port (bit width 9). The axon unit may determine the first enable signal and the second enable signal based on the processing data (x_1 to x_4), the weight data (w_1), and the multiplication operation mode, and send both signals to the dendrite unit.
The first enable signal may be determined from the data type int8 of the processing data (x_1 to x_4) and the weight data (w_1) and the multiplication operation type; the second enable signal may be determined from the total bit width 32 of the processing data (x_1 to x_4) on the A port of the axon unit (8 × 4 for int8 data), the total bit width 8 of the weight data (w_1) on the B port (8 × 1 for int8 data), the data type int8, and the multiplication operation type.
The dendrite unit turns on 4 (4 × 1) multiplier-adders in the multiplier-adder array according to the received second enable signal, and enables the Mult8 multiplier unit of each turned-on multiplier-adder according to the received first enable signal.
The 4 multiplier-adders turned on in the multiplier-adder array can then multiply the received processing data (x_1 to x_4) and weight data (w_1) in parallel.
For example, the multiplier-adder array may turn on 1 group of 4 multiplier-adders: the first multiplier-adder performs the multiplication of x_1 and w_1; the second, of x_2 and w_1; the third, of x_3 and w_1; and the fourth, of x_4 and w_1.
As a further example, the axon unit may send data to be accumulated to the dendrite unit through the A port.
Suppose the axon unit sends 1 processing data (x_1) of data type int32 through the A port (bit width 32). The axon unit may determine the first enable signal and the second enable signal based on the processing data (x_1) and the accumulation operation mode, and send both signals to the dendrite unit.
The first enable signal may be determined from the data type int32 of the processing data (x_1) and the accumulation operation type; the second enable signal may be determined from the total bit width 32 of the processing data (x_1) on the A port of the axon unit (32 × 1 for int32 data), the data type int32, and the accumulation operation type.
The dendrite unit turns on 1 multiplier-adder in the multiplier-adder array according to the received second enable signal, and enables the adder of that multiplier-adder according to the received first enable signal. The turned-on multiplier-adder can then accumulate the received processing data (x_1).
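Taken together, the three examples suggest how the first enable signal selects a sub-unit by data type and operation type. The following Python sketch uses a hypothetical encoding (the name "Mult32" and the dictionary dispatch are assumptions, not from the patent text):

def first_enable(op_type, data_bits):
    # Select which sub-unit each turned-on multiplier-adder enables.
    if op_type == "multiply":
        return {2: "Mult2", 8: "Mult8", 32: "Mult32"}[data_bits]  # Mult32 is assumed
    if op_type == "accumulate":
        return "Adder"
    raise ValueError(f"unsupported operation: {op_type}")

assert first_enable("multiply", 2) == "Mult2"     # int2 x int2 example
assert first_enable("multiply", 8) == "Mult8"     # int8 x int8 example
assert first_enable("accumulate", 32) == "Adder"  # int32 accumulation example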
Through the enable signals, the dendrite unit can set the number of groups in the multiplier-adder array, the number of multiplier-adders in each group, and the working mode of each multiplier-adder according to the operation requirements, which increases the parallelism and flexibility of the computation. Furthermore, the dendrite unit can turn off the multiplier-adders that do not participate in the operation according to the enable signals, reducing the power consumption of the multiplier-adder array.
In one possible implementation, an embodiment of the disclosure further provides a processor. The processor includes a plurality of computing cores, each computing core including a dendrite unit, an axon unit, and a storage unit, wherein the axon unit includes a distributor and the dendrite unit includes a multiplier-adder array. The processor is used to implement multiplication and/or addition of the processing data and the weight data, as follows: the axon unit acquires the processing data and the weight data from the storage unit; the axon unit distributes the processing data and the weight data to the two output ports of the axon unit through the distributor; the two output ports of the axon unit are connected to the input ports of the multiplier-adder array in the dendrite unit, and the multiplier-adder array performs multiplication and/or addition according to the processing data and the weight data sent by the input ports to obtain an operation result.
The axon unit reads a plurality of processing data and a plurality of weight data in one access to the storage unit, and distributes them to the two output ports of the axon unit through the distributor, so that the plurality of processing data and the plurality of weight data are multiplied and/or added in parallel by the multiplier-adder array.
In one possible implementation manner, the axon unit determines a control signal according to the total bit width of the processing data and the weight data in the two output ports, the bit width of the processing data and the weight data, and the operation type, and transmits the control signal to the dendrite unit; the dendrite unit turns on multiplier-adders in the multiplier-adder array according to the received control signal, and performs multiplication and/or addition on the processing data and the weight data to obtain an operation result.
In one possible implementation, the control signal includes a first enable signal and a second enable signal. The multiplier-adder array includes a plurality of multiplier-adders, each multiplier-adder including a plurality of multipliers for performing multiplication operations of different precision and a plurality of adders for performing addition operations of different precision; the first enable signal is used to determine the multipliers and/or adders enabled in each enabled multiplier-adder, and the second enable signal is used to determine the number of rows and columns of enabled multiplier-adders.
In a possible implementation manner, the first enable signal is determined according to the bit width and operation type of the processing data and the weight data, and the second enable signal is determined according to the total bit width of the processing data and the weight data in the two output ports, their bit width, and the operation type.
In one possible implementation, the axon unit distributing the processing data and the weight data to the two output ports of the axon unit through the distributor includes: the two output ports are a first output port for transmitting processing data and a second output port for transmitting weight data, the bit width of the first output port being greater than that of the second output port; the axon unit distributes the processing data to the first output port and the weight data to the second output port through the distributor; and, in the case that the bit width of the weight data is greater than the bit width of the second output port while the bit width of the processing data is not greater than the bit width of the second output port, the axon unit distributes the processing data to the second output port and the weight data to the first output port through the distributor.
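This port-assignment rule can be sketched as follows (a Python illustration assuming the port widths of the earlier examples, first/A port 32 bits and second/B port 9 bits; not the patent's implementation):

FIRST_PORT_BITS, SECOND_PORT_BITS = 32, 9

def distribute(processing_bits, weight_bits):
    # Return which data stream each port carries: (first port, second port).
    if weight_bits > SECOND_PORT_BITS and processing_bits <= SECOND_PORT_BITS:
        # Swap: wide weight data takes the wide port, narrow processing data
        # takes the narrow port.
        return ("weight data", "processing data")
    return ("processing data", "weight data")  # default assignment

assert distribute(processing_bits=8, weight_bits=32) == ("weight data", "processing data")
assert distribute(processing_bits=16, weight_bits=8) == ("processing data", "weight data")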
In a possible implementation manner, an embodiment of the present disclosure further provides an artificial intelligence chip, where the chip includes one or more processors as described above.
In one possible implementation, the disclosed embodiments provide an electronic device, including one or more of the chips described above.
Fig. 10 is a block diagram illustrating a combined processing device 1200 according to an embodiment of the present disclosure. As shown in fig. 10, the combined processing device 1200 includes a computing processing device 1202 (e.g., an artificial intelligence processor including multiple computing cores as described above), an interface device 1204, other processing devices 1206, and a storage device 1208. Depending on the application scenario, one or more computing devices 1210 (e.g., computing cores) may be included in the computing processing device.
In one possible implementation, a computing processing device of the present disclosure may be configured to perform operations specified by a user. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware architecture of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or as part of a hardware structure of an artificial intelligence processor core, computing processing devices of the present disclosure may be considered to have a single core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through an interface device to collectively perform user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), an artificial intelligence processor, and the like. These processors may include, but are not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined based on actual needs. As mentioned previously, the computing processing device of the present disclosure alone can be considered to have a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing processing device and the other processing devices may be considered to form a heterogeneous multi-core structure.
In one or more embodiments, the other processing devices can interface the computing processing device of the present disclosure (which can be embodied as an artificial intelligence computing device, e.g., a computing device associated with neural network operations) with external data and controls, performing basic controls including, but not limited to, data handling and turning the computing device on and/or off. In other embodiments, the other processing devices may cooperate with the computing processing device to collectively perform computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and the other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write the input data into a storage device (or memory) on the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or optionally, the interface device may also read data from the storage device of the computing processing device and transmit the data to the other processing devices.
Additionally or alternatively, the combined processing device of the present disclosure may further include a storage device. As shown in the figure, the storage means is connected to the computing processing means and the further processing means, respectively. In one or more embodiments, the storage device may be used to store data for the computing processing device and/or the other processing devices. For example, the data may be data that is not fully retained within internal or on-chip storage of a computing processing device or other processing device.
According to different application scenarios, the artificial intelligence processor of the present disclosure can be used in a server, a cloud server, a server cluster, a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound instruments, and/or electrocardiographs.
Fig. 11 shows a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, electronic device 1900 may be provided as a server. Referring to fig. 11, an electronic device 1900 includes a processing component 1922 (e.g., an artificial intelligence processor including multiple computing cores) that further includes one or more computing cores and memory resources, represented by memory 1932, for storing instructions, such as applications, that are executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may further include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the scheme of the embodiment of the disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the embodiments may be combined arbitrarily; for the sake of brevity, not all possible combinations of the technical features are described, but any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The electronic device or processor of the present disclosure may also be applied to the fields of the internet, internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or the processor disclosed by the disclosure can also be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, a computationally powerful electronic device or processor according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or processor may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements of the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A data processing method is applied to a processor, wherein the processor comprises a plurality of computing cores, and each computing core comprises a dendritic unit, an axon unit and a storage unit; wherein the axon unit comprises a distributor, the dendrite unit comprises a multiplier-adder array, and the method realizes multiplication and/or addition operation of processing data and weight data;
the method comprises the following steps:
the axon unit acquires the processing data and the weight data from the storage unit;
the axon unit distributes the processing data and the weight data to two output ports of the axon unit through the distributor;
the two output ports of the axon unit are connected with the input port of a multiplier-adder array in the dendrite unit, and the multiplier-adder array is used for multiplying and/or adding operation according to the processing data and the weight data sent by the input port to obtain an operation result;
wherein the axon unit reads a plurality of processing data and a plurality of weight data in one access to the storage unit, and distributes the plurality of processing data and the plurality of weight data to two output ports of the axon unit through the distributor to multiply and/or add the plurality of processing data and the plurality of weight data in parallel through the multiplier-adder array.
2. The method of claim 1, further comprising: the axon unit determines a control signal according to the total bit width of the processing data and the weight data, the bit width of the processing data and the weight data and the operation type in the two output ports, and transmits the control signal to the dendrite unit;
and the dendritic unit starts a multiplier-adder in the multiplier-adder array according to the received control signal, and performs multiplication and/or addition operation on the processing data and the weight data to obtain an operation result.
3. The method of claim 1 or 2, wherein the control signal comprises a first enable signal and a second enable signal;
the multiplier-adder array comprises a plurality of multiplier-adders, each multiplier-adder comprising a plurality of multipliers for performing multiplication operations of different precision and a plurality of adders for performing addition operations of different precision, the first enable signal for determining enabled multipliers and/or adders in each enabled multiplier-adder;
and the second enabling signal is used for determining the number of rows and the number of columns of the enabled multiplier-adder.
4. The method of claim 3, wherein the first enable signal is determined according to a bit width of the processing data and the weight data and a type of operation, and the second enable signal is determined according to a total bit width of the processing data and the weight data, a bit width of the processing data and the weight data, and a type of operation in the two output ports.
5. The method of claim 1, wherein the axonal unit distributes the processing data and the weight data to two output ports of axonal unit through the distributor, comprising:
the two output ports are a first output port for transmitting processing data and a second output port for transmitting weight data, and the bit width of the first output port is greater than that of the second output port;
the axon unit distributes the processing data to a first output port and distributes the weight data to a second output port through the distributor;
and under the condition that the bit width of the weight data is greater than the bit width of the second output port and the bit width of the processing data is not greater than the bit width of the second output port, the axon unit distributes the processing data to the second output port through the distributor and distributes the weight data to the first output port.
6. A processor comprising a plurality of compute cores, each compute core comprising a dendrite unit, an axon unit, and a storage unit; wherein the axon unit comprises a distributor and the dendrite unit comprises a multiplier-adder array;
the processor is used for realizing multiplication and/or addition operation of the processing data and the weight data, and comprises the following steps:
the axon unit acquires the processing data and the weight data from the storage unit;
the axon unit distributes the processing data and the weight data to two output ports of the axon unit through the distributor;
the two output ports of the axon unit are connected with the input port of a multiplier-adder array in the dendrite unit, and the multiplier-adder array is used for carrying out multiplication and/or addition operation according to the processing data and the weight data sent by the input port to obtain an operation result;
wherein the axon unit reads a plurality of processing data and a plurality of weight data in one access to the storage unit, and distributes the plurality of processing data and the plurality of weight data to two output ports of the axon unit through the distributor to multiply and/or add the plurality of processing data and the plurality of weight data in parallel through the multiplier-adder array.
7. The processor according to claim 6, wherein the axon unit determines a control signal according to a total bit width of the processing data and the weight data, a bit width of the processing data and the weight data, and an operation type in the two output ports, and transmits the control signal to the dendrite unit;
and the dendritic unit starts a multiplier-adder in the multiplier-adder array according to the received control signal, and performs multiplication and/or addition operation on the processing data and the weight data to obtain an operation result.
8. The processor according to claim 6 or 7, wherein the control signal comprises a first enable signal and a second enable signal;
the multiplier-adder array comprises a plurality of multiplier-adders, each multiplier-adder comprising a plurality of multipliers for performing multiplication operations of different precision and a plurality of adders for performing addition operations of different precision, the first enable signal for determining enabled multipliers and/or adders in each enabled multiplier-adder;
and the second enabling signal is used for determining the number of rows and the number of columns of the enabled multiplier-adder.
9. The processor of claim 8, wherein the first enable signal is determined according to a bit width of the processing data and the weight data and a type of operation, and wherein the second enable signal is determined according to a total bit width of the processing data and the weight data, a bit width of the processing data and the weight data, and a type of operation in the two output ports.
10. The processor of claim 6, wherein the axon unit distributes the processing data and the weight data to two output ports of the axon unit through the distributor, comprising:
the two output ports are a first output port for transmitting processing data and a second output port for transmitting weight data, and the bit width of the first output port is greater than that of the second output port;
the axon unit distributes the processing data to a first output port and distributes the weight data to a second output port through the distributor;
and under the condition that the bit width of the weight data is greater than the bit width of the second output port and the bit width of the processing data is not greater than the bit width of the second output port, the axon unit distributes the processing data to the second output port through the distributor and distributes the weight data to the first output port.
11. An artificial intelligence chip, wherein the chip comprises a processor according to any one of claims 6 to 10.
12. An electronic device, characterized in that the electronic device comprises an artificial intelligence chip according to claim 11.
CN202110172431.6A 2021-02-08 2021-02-08 Data processing method, processor and electronic equipment Active CN112801276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110172431.6A CN112801276B (en) 2021-02-08 2021-02-08 Data processing method, processor and electronic equipment

Publications (2)

Publication Number Publication Date
CN112801276A (en) 2021-05-14
CN112801276B (en) 2022-12-02

Family

ID=75814801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110172431.6A Active CN112801276B (en) 2021-02-08 2021-02-08 Data processing method, processor and electronic equipment

Country Status (1)

Country Link
CN (1) CN112801276B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554162B (en) * 2021-07-23 2022-12-20 上海新氦类脑智能科技有限公司 Axon input extension method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110221808B (en) * 2019-06-03 2020-10-09 深圳芯英科技有限公司 Vector multiply-add operation preprocessing method, multiplier-adder and computer readable medium
CN111242277B (en) * 2019-12-27 2023-05-05 中国电子科技集团公司第五十二研究所 Convolutional neural network accelerator supporting sparse pruning based on FPGA design
CN212112470U (en) * 2020-04-24 2020-12-08 科大讯飞股份有限公司 Matrix multiplication circuit


Similar Documents

Publication Publication Date Title
CN109189474B (en) Neural network processing device and method for executing vector addition instruction
CN111176727B (en) Computing device and computing method
CN117933314A (en) Processing device, processing method, chip and electronic device
CN112799599B (en) Data storage method, computing core, chip and electronic equipment
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
CN112801276B (en) Data processing method, processor and electronic equipment
CN113837922A (en) Computing device, data processing method and related product
CN112084023A (en) Data parallel processing method, electronic equipment and computer readable storage medium
CN112817898A (en) Data transmission method, processor, chip and electronic equipment
CN115437602A (en) Arbitrary-precision calculation accelerator, integrated circuit device, board card and method
CN115373646A (en) Information expansion method, device and related product
CN111353124A (en) Operation method, operation device, computer equipment and storage medium
CN112596881B (en) Storage component and artificial intelligence processor
CN113469333B (en) Artificial intelligence processor, method and related products for executing neural network model
CN113746471B (en) Arithmetic circuit, chip and board card
CN112232498B (en) Data processing device, integrated circuit chip, electronic equipment, board card and method
CN114692847B (en) Data processing circuit, data processing method and related products
CN111290788B (en) Operation method, operation device, computer equipment and storage medium
CN115237371A (en) Computing device, data processing method and related product
CN113791754A (en) Arithmetic circuit, chip and board card
CN115600657A (en) Processing device, equipment and method and related products thereof
CN113934678A (en) Computing device, integrated circuit chip, board card, equipment and computing method
CN112801278A (en) Data processing method, processor, chip and electronic equipment
CN114565075A (en) Apparatus, method and readable storage medium for supporting multiple access modes
CN115081605A (en) Buffer memory, device and board card for temporarily storing neuron data in Winograd convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant