WO2023116923A1 - Integrated storage and calculation device and calculation method - Google Patents

Integrated storage and calculation device and calculation method

Info

Publication number
WO2023116923A1
WO2023116923A1 PCT/CN2022/141634 CN2022141634W WO2023116923A1 WO 2023116923 A1 WO2023116923 A1 WO 2023116923A1 CN 2022141634 W CN2022141634 W CN 2022141634W WO 2023116923 A1 WO2023116923 A1 WO 2023116923A1
Authority
WO
WIPO (PCT)
Prior art keywords
calculation
bit
data
storage
module
Prior art date
Application number
PCT/CN2022/141634
Other languages
English (en)
French (fr)
Inventor
华幸成
曾重
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023116923A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/491 Computations with decimal numbers radix 12 or 20.
    • G06F7/498 Computations with decimal numbers radix 12 or 20. using counter-type accumulators
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/065 Analogue means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of chip technology, and in particular to an integrated storage and calculation device and calculation method.
  • in recent years, neural networks (NN) have developed rapidly and are widely used in robotics, speech recognition, image recognition, natural language processing, and expert systems.
  • the core calculation of a neural network is matrix-vector multiplication, which is both computationally intensive and memory-intensive.
  • general-purpose chips have obvious shortcomings in power consumption, performance, and size when used for neural network calculations. Therefore, to improve the computational efficiency of neural networks, special-purpose chips (neural network accelerators) need to be customized for neural network calculations.
  • the integrated storage and calculation device not only retains the storage and read/write functions of the storage circuit itself, but also supports multiply-add operations in parallel, which reduces the amount of data movement and improves energy efficiency, providing an efficient solution for the design of neural network accelerators.
  • the integrated storage and calculation device usually needs to expand multi-bit data into single-bit/low-bit (such as 2-bit or 4-bit) data for calculation according to the data bit width, and then combine the calculation results, so the number of expansion calculations is large, resulting in high overhead.
  • the embodiments of the present application provide an integrated storage and calculation device and a calculation method applied to that device, which can reduce overhead and improve calculation efficiency when performing neural network calculations.
  • an embodiment of the present application provides an integrated storage and calculation device, which includes a bit width calculation module, a calculation module, and a result processing module.
  • the calculation module includes a calculation array, and the calculation array includes a plurality of storage calculation units for storing weight data.
  • the bit width calculation module is used to calculate multiple input data, obtain multiple valid data, and input the multiple valid data to the calculation module. The multiple input data correspond to the multiple valid data one to one: first input data among the multiple input data corresponds to first valid data among the multiple valid data, and the bit width of the first input data is larger than the bit width of the first valid data.
  • the calculation module is used to obtain the calculation result of each column in the calculation array according to the bits of the multiple valid data and the weight data, and input the calculation result of each column to the result processing module, wherein the calculation result of one column is the sum of the products calculated from the same bit of the multiple valid data and the storage calculation units of that column.
  • the result processing module is used to perform weighted calculation on the calculation results of each column to obtain the final result.
  • compared with the prior art, in which multi-bit input data is expanded into multiple single-bit/low-bit input data for input and calculation according to the data bit width, resulting in too many expansion calculations and large overhead, the method of the present application can dynamically calculate the valid data of the input data and calculate only the valid bits of the input data, which effectively reduces the number of calculations performed by the calculation module, reduces the calculation overhead, and improves the calculation efficiency of the integrated storage and calculation device.
  • the bit width calculation module is specifically used to perform mask calculation on the multiple input data to obtain a mask value, determine the multiple valid data according to the valid bits of the mask value, and input the multiple valid data to the calculation module bit by bit, so that the calculation module performs calculation on the multiple valid data bit by bit. In this way, the bit width calculation module obtains the valid data of the input data through mask calculation and inputs the valid data to the calculation module bit by bit, thereby greatly reducing the number of calculations performed by the calculation array.
  • when the calculation array receives the Nth bits respectively corresponding to the multiple valid data, where N is an integer greater than or equal to 0, the calculation array is used to calculate the products of the Nth bits respectively corresponding to the multiple valid data and the bits of the weight data; the calculation module also includes an accumulation circuit, which is used to add the products calculated by the storage calculation units in the same column of the calculation array, to obtain the sum of the products calculated by the storage calculation units of each column in the calculation array.
  • the calculation module calculates the Nth bits corresponding to the multiple valid data each time, so the number of calculations performed by the calculation module corresponds to the bit width of the valid data. Since the bit width of the valid data is less than the bit width of the input data, the number of calculations performed by the calculation array can be effectively reduced.
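  • The saving described above can be illustrated with a minimal Python sketch. It assumes that the mask value is the bitwise OR of all input data (the description later gives logical OR as one example of mask calculation) and that one pass through the calculation array is needed per valid mask bit. The function name `count_array_passes` is hypothetical and is not taken from the application.

```python
def count_array_passes(inputs, bit_width):
    """Return (passes without masking, passes with masking) for a bit-serial array.

    Assumption: the mask is the bitwise OR of all inputs, and a bit position
    is treated as valid only if the mask has a 1 at that position.
    """
    mask = 0
    for x in inputs:
        mask |= x
    full_passes = bit_width               # one pass per bit of the input data
    valid_passes = bin(mask).count("1")   # one pass per valid bit of the mask
    return full_passes, valid_passes


# Three 8-bit inputs whose upper five bits are all zero.
print(count_array_passes([0b00000101, 0b00000110, 0b00000001], 8))
```

  • Under these assumptions the three inputs above need 3 array passes instead of 8, because only the three lowest bit positions contain a 1 in at least one input.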
  • the weight data includes multiple kinds of weight data, and the integrated storage and calculation device also includes a weight bit width configuration module; the weight bit width configuration module is used to store bit width information of the multiple kinds of weight data, and the bit width information includes the bit width of each kind of weight data and the identification of the starting column in the calculation array corresponding to each kind of weight data, wherein the bit widths of at least two kinds of weight data among the multiple kinds of weight data are different.
  • in the prior art, the bit width of weight data is fixed, so mixed-precision calculation of weight data cannot be achieved, resulting in low calculation efficiency. In contrast, by configuring the bit width information of different weight data, the calculation method provided by this application can deploy and calculate weight data of multiple bit widths in a single calculation array, thereby supporting mixed-precision calculation of weight data and effectively improving the calculation efficiency of the integrated storage and calculation device.
  • the integrated storage and calculation device further includes a control module, and the control module is used to write the multiple kinds of weight data into the multiple storage calculation units according to the bit width information. Therefore, in the calculation method provided by this application, the control module can deploy weight data to each storage calculation unit in the calculation array according to the bit width information, so that a single calculation array contains weight data of multiple bit widths, which realizes mixed-precision calculation of weight data and improves the calculation efficiency of the integrated storage and calculation device.
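  • As a rough illustration of the bit width information and of how a control module might use it, the following Python sketch stores, for each kind of weight data, its bit width and starting column, and writes the weight bits into a small array. The names `WeightBitWidthInfo` and `write_weights`, the use of 1-bit storage calculation units, and the least-significant-bit-first column order are all assumptions made for this example; the application does not specify them.

```python
from dataclasses import dataclass


@dataclass
class WeightBitWidthInfo:
    bit_width: int     # bit width of this kind of weight data
    start_column: int  # identification of its starting column in the array


def write_weights(n_rows, n_cols, weights_by_kind, infos):
    """Write several kinds of weight data into one array of 1-bit cells.

    weights_by_kind[k][r] is the unsigned weight of kind k stored in row r.
    Assumption: column start_column + b holds bit b (least significant first).
    """
    array = [[0] * n_cols for _ in range(n_rows)]
    for info, weights in zip(infos, weights_by_kind):
        for row, w in enumerate(weights):
            for b in range(info.bit_width):
                array[row][info.start_column + b] = (w >> b) & 1
    return array


# One 4-bit kind starting at column 0 and one 2-bit kind starting at column 4.
infos = [WeightBitWidthInfo(4, 0), WeightBitWidthInfo(2, 4)]
array = write_weights(2, 6, [[0b1010, 0b0011], [0b10, 0b01]], infos)
for row in array:
    print(row)
```

  • In this sketch a single 6-column array holds both a 4-bit and a 2-bit kind of weight data, which is the mixed-precision deployment the paragraph above describes.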
  • the control module is further configured to determine the valid bits of the mask value bit by bit, and generate a first control signal and a second control signal when any bit of the mask value is determined to be valid.
  • the first control signal is used to instruct the calculation module to calculate the sum of the products of the storage calculation units of each column in the calculation array, and the second control signal is used to instruct the result processing module to perform, according to the bit width information, weighted calculation on the sums of the products of the multiple columns of storage calculation units corresponding to each kind of weight data in the calculation array, to obtain multiple weighted results corresponding to the Nth bits of the multiple valid data, wherein each weighted result corresponds to one kind of weight data.
  • the control module can generate control signals according to the valid bits of the mask value to control the calculation module and the result processing module. Since the number of valid bits of the mask value is the same as the bit width of the valid data, which is usually smaller than the bit width of the input data, generating the control signals according to the valid bits of the mask value reduces the number of calculations performed by the calculation module and reduces the calculation overhead.
  • the control module is further configured to generate a third control signal when it is determined that the bit width of the mask value is equal to the bit width of the input data.
  • the third control signal is used to instruct the result processing module to perform weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the multiple weighted results of each bit of the multiple valid data, to obtain the final result, wherein the final result includes the weighted result of each kind of weight data.
  • the result processing module performs weighted calculation according to the bit width information and the bit weights of the valid bits of the mask value, which accurately converts the calculation results of multiple single-bit valid data and multi-bit weight data into the calculation result of multi-bit input data and multi-bit weight data, while effectively reducing the number of calculations and the overhead.
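  • The whole flow of the first aspect can be sketched end to end in Python. This is only an illustrative model under stated assumptions, not the circuit of the application: the mask is taken to be the bitwise OR of the inputs, all data are unsigned, each storage calculation unit holds 1 bit, columns are ordered least significant bit first, and the function name `cim_dot` is hypothetical. The three commented steps loosely mirror the first, second, and third control signals.

```python
def cim_dot(inputs, weights_by_kind, widths, in_bits):
    """Model a masked bit-serial multiply-accumulate over one calculation array.

    weights_by_kind[k][r]: unsigned weight of kind k stored in row r.
    widths[k]: bit width of kind k (1-bit cells, LSB-first columns assumed).
    Returns one dot product per kind of weight data.
    """
    mask = 0
    for x in inputs:
        mask |= x  # assumed mask calculation: bitwise OR of all inputs

    # One array column per weight bit: (kind, bit index, cell bit per row).
    columns = []
    for k, (ws, width) in enumerate(zip(weights_by_kind, widths)):
        for b in range(width):
            columns.append((k, b, [(w >> b) & 1 for w in ws]))

    partials = []  # (input bit position, weighted result per kind)
    for n in range(in_bits):  # the mask is scanned bit by bit
        if not (mask >> n) & 1:
            continue  # invalid bit: no array pass is issued
        x_bits = [(x >> n) & 1 for x in inputs]
        # Step 1: sum of products for every column.
        col_sums = [sum(xb * cb for xb, cb in zip(x_bits, cells))
                    for _, _, cells in columns]
        # Step 2: weight the column sums of each kind by the weight bit position.
        per_kind = [0] * len(widths)
        for (k, b, _), s in zip(columns, col_sums):
            per_kind[k] += s << b
        partials.append((n, per_kind))

    # Step 3: weight each partial result by the bit weight 2**n of its mask bit.
    final = [0] * len(widths)
    for n, per_kind in partials:
        for k, value in enumerate(per_kind):
            final[k] += value << n
    return final


inputs = [5, 6, 1]
weights = [[3, 10, 7], [1, 2, 3]]  # a 4-bit kind and a 2-bit kind
print(cim_dot(inputs, weights, [4, 2], 8))
print([sum(x * w for x, w in zip(inputs, ws)) for ws in weights])
```

  • Both printed lists should agree, since the weighted combination reproduces the ordinary dot product of the inputs with each kind of weight data while only the three valid bit positions of the mask trigger an array pass.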
  • the embodiment of the present application provides a calculation method, which is applied to an integrated storage and calculation device.
  • the integrated storage and calculation device includes a calculation array, the calculation array includes a plurality of storage calculation units, and the storage calculation units are used to store weight data.
  • the method includes: calculating multiple input data to obtain multiple valid data, wherein the multiple input data correspond to the multiple valid data one to one, first input data among the multiple input data corresponds to first valid data among the multiple valid data, and the bit width of the first input data is greater than the bit width of the first valid data; obtaining the calculation result of each column in the calculation array according to the bits of the multiple valid data and the weight data, wherein the calculation result of one column is the sum of the products calculated from the same bit of the multiple valid data and the storage calculation units of that column; and performing weighted calculation on the calculation result of each column to obtain the final result.
  • the beneficial effects achieved in the second aspect can refer to the beneficial effects in the first aspect.
  • calculating multiple input data to obtain multiple valid data includes: performing mask calculation on the multiple input data to obtain a mask value, and determining the multiple valid data according to the valid bits of the mask value; obtaining the calculation result of each column in the calculation array according to the bits of the multiple valid data and the weight data includes: calculating the multiple valid data bit by bit with the bits of the weight data to obtain the calculation result of each column in the calculation array.
  • obtaining the calculation result of each column in the calculation array includes: when the calculation array receives the Nth bits respectively corresponding to the multiple valid data, where N is an integer greater than or equal to 0, calculating the products of the Nth bits respectively corresponding to the multiple valid data and the bits of the weight data, and adding the products calculated by the storage calculation units in the same column of the calculation array to obtain the sum of the products calculated by the storage calculation units of each column in the calculation array.
  • the method further includes: storing bit width information of various weight data, where the bit width information includes the bit width of each weight data and the identification of the starting column corresponding to each weight data in the calculation array , wherein the bit widths of at least two kinds of weight data among the multiple kinds of weight data are different.
  • the weight data includes multiple types of weight data
  • the method further includes: writing the multiple types of weight data into multiple storage computing units according to the bit width information.
  • the method further includes: determining the valid bits of the mask value bit by bit, and generating a first control signal and a second control signal when any bit of the mask value is determined to be valid.
  • the first control signal is used to calculate the sum of the products of the storage calculation units of each column in the calculation array, and the second control signal is used to perform, according to the bit width information, weighted calculation on the sums of the products of the multiple columns of storage calculation units corresponding to each kind of weight data in the calculation array, to obtain multiple weighted results corresponding to the Nth bits of the multiple valid data, wherein each weighted result corresponds to one kind of weight data.
  • the method further includes: when determining that the bit width of the mask value is equal to the bit width of the input data, generating a third control signal, wherein the third control signal is used to perform weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the multiple weighted results of each bit of the multiple valid data, to obtain the final result, and the final result includes the weighted result of each kind of weight data.
  • a computer-readable storage medium stores computer instructions, and when the computer instructions are run on an electronic device, the electronic device executes the method described in the second aspect and any possible design of the second aspect.
  • a computer program product when the computer program product is run on a computer, causes an electronic device to execute the method described in the second aspect and any possible design of the second aspect.
  • Fig. 1 is a schematic diagram of an analog computing array
  • Fig. 2 is a schematic diagram of a digital computing array
  • Fig. 3 is a schematic structural diagram of an integrated storage and calculation device provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a computing array provided by an embodiment of the present application.
  • FIG. 5 is a schematic flow chart of a calculation method provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of calculating effective data provided by the embodiment of the present application.
  • FIG. 7 is a schematic diagram of a computing module provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a control module provided by an embodiment of the present application.
  • FIG. 9 is a schematic flow chart of a calculation method provided in an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of an integrated storage and calculation device provided by an embodiment of the present application.
  • Artificial neural network (ANN): also called a neural network (NN) for short, a mathematical or computational model that imitates the structure and function of a biological neural network (the central nervous system, such as the brain) and is used to estimate or approximate functions.
  • the neural network is composed of a large number of interconnected nodes (neurons); each node represents a specific output function, called the activation function, and the connection between every two nodes represents a weighted value for the signal passing through that connection, called weight data.
  • Neural network accelerator: an application specific integrated circuit (ASIC) chip suitable for artificial neural network inference or training, which is used to perform neural network calculations and improve the computational efficiency of neural networks.
  • Integrated storage and calculation: algorithms are embedded in the memory, and calculations are moved from the central processing unit (CPU) into the storage computing units (cells) of the memory, which can greatly reduce the data exchange time and the data access energy consumption during calculation.
  • Figure 1 shows a schematic diagram of an analog computing array constructed from analog devices.
  • the analog devices can be understood as storage computing units arranged in an array; analog devices located in the same row share a word line, and analog devices located in the same column share a bit line.
  • the conductance of an analog device can be understood as weight data, the voltage can be understood as input data, and the input voltage on the same word line is the same.
  • the current value output by each bit line represents the sum of the products of the conductance and the voltage of the analog devices sharing that bit line (located in the same column), that is, the sum of the products of the weight data of the column and the input data.
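  • The bit-line relation just described is a plain multiply-accumulate per column, which a short Python sketch can make concrete. The function name `bitline_currents` and the numeric values are hypothetical, and ideal devices (no wire resistance, no noise) are assumed.

```python
def bitline_currents(conductance, voltages):
    """Return the current on each bit line of an ideal analog array.

    conductance[r][c] is the conductance of the device at row r, column c
    (the weight data); voltages[r] is the voltage on word line r (the input
    data). Each bit-line current is the sum over rows of conductance * voltage.
    """
    n_rows = len(conductance)
    n_cols = len(conductance[0])
    return [sum(conductance[r][c] * voltages[r] for r in range(n_rows))
            for c in range(n_cols)]


G = [[1.0, 2.0],
     [3.0, 4.0]]
V = [0.5, 1.0]
print(bitline_currents(G, V))
```

  • For the values above, the first bit line carries 1.0 × 0.5 + 3.0 × 1.0 and the second carries 2.0 × 0.5 + 4.0 × 1.0, so every column computes its sum of products in parallel.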
  • FIG. 2 is a schematic diagram of a digital computing array built with digital devices.
  • each storage computing unit stores one weight data, and the input unit inputs input data to each storage computing unit in the digital computing array.
  • the input data of the storage computing units located in the same row is the same; the multiplication of the weight data and the input data is performed in the storage computing unit, and the multiplication results in the same column are accumulated by the peripheral accumulation circuit to obtain the sum of the products of the weight data of each column and the multiple input data.
  • Both implementations can input multiple input data in parallel on the row, and perform multiple multiplication and accumulation calculations on the column in parallel.
  • Bit width: the number of bits (binary digits), indicating the number of binary digits transmitted by the bus at one time.
  • a bit is the smallest unit of data storage in a computer.
  • 11010100 is an 8-bit binary number, that is, the bit width is 8 bits, which can be called 8-bit data.
  • Computing array (crossbar, XB): In this application, it refers to a computing array constructed by storage computing units, and each computing array includes several rows and several columns.
  • Bit weight: the unit value corresponding to each fixed position in a number is called the bit weight.
  • the magnitude of the value represented by a "1" in a certain position is called the bit weight of that position.
  • for example, the bit weight of the second digit from right to left in a decimal number is 10 and the bit weight of the third digit is 100; the bit weight of the second digit from right to left in a binary number is 2 and the bit weight of the third digit is 4.
  • for a number in base N, the bit weight of the i-th digit from right to left in the integer part is $N^{i-1}$, and the bit weight of the j-th digit from left to right in the fractional part is $N^{-j}$.
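  • The bit weight definition can be checked with a few lines of Python. The helper names below are hypothetical; digit positions are counted from 1, as in the examples above, and `Fraction` is used so that fractional bit weights stay exact.

```python
from fractions import Fraction


def integer_bit_weight(radix, i):
    """Bit weight of the i-th integer digit counted from the right (i >= 1)."""
    return radix ** (i - 1)


def fractional_bit_weight(radix, j):
    """Bit weight of the j-th fractional digit counted from the left (j >= 1)."""
    return Fraction(1, radix ** j)


def value_from_bits(bits):
    """Value of a binary string as the sum of the bit weights of its 1 digits."""
    n = len(bits)
    return sum(integer_bit_weight(2, n - pos)
               for pos, digit in enumerate(bits) if digit == "1")


print(integer_bit_weight(10, 2), integer_bit_weight(10, 3))
print(integer_bit_weight(2, 2), integer_bit_weight(2, 3))
print(fractional_bit_weight(2, 1))
print(value_from_bits("11010100"))
```

  • The last call sums the bit weights 128, 64, 16, and 4 of the 8-bit number 11010100, which is the same weighting the result processing module applies when it recombines single-bit partial results.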
  • the terms “first” and “second” are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the quantity of the indicated technical features. Thus, a feature defined as “first” or “second” may explicitly or implicitly include one or more of these features.
  • the meaning of “coupling” refers to the direct or indirect connection of two or more circuit elements; for example, the coupling of A and B may mean that A is directly connected to B, or that A is connected to B through C.
  • when the neural network accelerator uses an integrated storage and calculation device for calculation and the calculation array is an analog calculation array constructed with analog devices, the analog calculation array is limited by the precision of the analog devices and of the analog-to-digital converter (ADC) and digital-to-analog converter (DAC).
  • for example, both the input data and the weight data use 16 bits and each storage and calculation unit stores 2 bits of data, so a 16-bit weight data needs to be stored in 8 storage and calculation units, which can be understood as 8 columns of storage and calculation units representing one column of weight data.
  • the 16-bit input data is expressed as a 0/1 voltage sequence of length 16, and in each clock cycle 1 bit of each input data is input in parallel for calculation, starting from the low bit. That is, the storage and calculation units calculate once per clock cycle, each time calculating the product of 1-bit input data and 2-bit weight data, so 16 clock cycles are required to complete the calculation of 16-bit input data and 16-bit weight data.
  • after the storage and calculation units complete a calculation in a clock cycle, each column of storage and calculation units yields one sum of products (the sum of the multiple products obtained after the same single bit of the multiple input data is input and calculated in parallel). After 16 clock cycles, each column of storage and calculation units outputs the sum of the 16 sums of products obtained from the 16 calculations. Combining the 8 sums output by the 8 consecutive columns of storage and calculation units by shifting and adding gives the sum of the products of each column of weight data and the multiple input data, which can be understood as I1 in Figure 1.
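  • The 16-bit baseline described above can be modelled in Python as follows. This is a sketch of the fixed-bit-width scheme the application seeks to improve, not of the claimed method. It assumes unsigned data, least-significant-slice-first columns, and that each cycle's column sum is weighted by the bit position of the input bit before accumulation; the names `slice_weight` and `bit_serial_mac` are hypothetical.

```python
CELL_BITS = 2   # bits stored per storage and calculation unit
IN_BITS = 16    # bit width of the input data
W_BITS = 16     # bit width of the weight data


def slice_weight(w):
    """Split a 16-bit weight into eight 2-bit slices, least significant first."""
    return [(w >> (CELL_BITS * c)) & 0b11 for c in range(W_BITS // CELL_BITS)]


def bit_serial_mac(inputs, weights):
    """Return (sum of input*weight products, clock cycles used)."""
    slices = [slice_weight(w) for w in weights]  # slices[row][column]
    n_cols = W_BITS // CELL_BITS
    col_totals = [0] * n_cols
    cycles = 0
    for n in range(IN_BITS):  # one input bit per clock cycle, low bit first
        for c in range(n_cols):
            # Sum of products of the n-th input bits and the 2-bit cells of column c.
            s = sum(((x >> n) & 1) * slices[r][c] for r, x in enumerate(inputs))
            col_totals[c] += s << n  # assumed weighting by the input bit position
        cycles += 1
    # Shift-and-add the 8 column totals according to their slice position.
    result = sum(t << (CELL_BITS * c) for c, t in enumerate(col_totals))
    return result, cycles


inputs = [1000, 65535, 0]
weights = [12345, 65535, 42]
result, cycles = bit_serial_mac(inputs, weights)
print(result == sum(x * w for x, w in zip(inputs, weights)), cycles)
```

  • Note that 16 cycles are spent regardless of the input values, even when the high bits of every input are zero, which is the overhead the valid-data approach targets.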
  • when the calculation array is a digital calculation array constructed with digital devices, the digital calculation array usually tends to perform single-bit/low-bit calculations, so multi-bit calculations need to be implemented through multiple single-bit/low-bit calculations.
  • for example, both the input data and the weight data use 4 bits and the storage and calculation unit is a single-bit multiplier, that is, each storage and calculation unit stores 1 bit of data, so a 4-bit weight data needs to be stored in 4 storage and calculation units, which can be understood as 4 columns of storage and calculation units representing one column of weight data.
  • the input data is input bit by bit into the storage computing units located in the same row, and a single bit of each input data is multiplied by all the bits of the weight data, that is, by 4 storage computing units (which together store one weight data). Each storage computing unit calculates the product of 1-bit input data and 1-bit weight data, so the result of the product is 4-bit data (the product of a single bit of input data and 4 storage computing units), which is output to the peripheral accumulation circuit.
  • after each calculation, the peripheral accumulation circuit adds, within the same column of weight data, the multiple product results obtained after the same single bit of the multiple input data is input and calculated in parallel, to obtain the 4 multiply-accumulate results corresponding to the 4 bits of the multiple input data. Finally, the peripheral accumulation circuit shifts and sums the 4 multiply-accumulate results to obtain the sum of the products of a column of weight data and the multiple input data.
  • the bit width of the weight data is also fixed; that is, no matter whether the value of the weight data is large or small, the number of storage and calculation units required to deploy it to the calculation array is the same, resulting in low calculation efficiency.
  • the integrated storage and calculation device in this application can be understood as a chip, such as a neural network accelerator.
  • in the current technology, multi-bit input data is expanded into multiple single-bit/low-bit input data for input and calculation according to the data bit width, and the input data bit width and the weight data bit width are fixed, resulting in large calculation overhead and low calculation efficiency.
  • in this application, when the integrated storage and calculation device is used for neural network calculations, the bit width calculation module calculates multiple input data to obtain multiple valid data in one-to-one correspondence with the multiple input data and inputs the multiple valid data to the calculation module; the calculation module then obtains the calculation result of each column in the calculation array according to the bits of the multiple valid data and the weight data and inputs the calculation result of each column to the result processing module; finally, the result processing module performs weighted calculation on the calculation results of each column to obtain the final result. Therefore, the number of expansion calculations of the calculation array is effectively reduced, the calculation overhead is reduced, and the calculation efficiency is improved.
  • the integrated storage and calculation device proposed in the embodiment of the present application can be applied to computing scenarios, for example, neural network computing scenarios.
  • the integrated storage and calculation device performs calculations on weight data of multiple neural networks and multiple input data.
  • FIG. 3 shows a schematic structural diagram of an integrated storage and calculation device.
  • the integrated storage and calculation device may be a chip, and the chip 300 is taken as an example in FIG. 3.
  • the chip 300 includes a data processing unit (processing element, PE) 301, a data exchange module (switch) 302, an input and output module (TxRx) 303, and the like.
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the chip 300 .
  • the chip 300 may include more or fewer components than shown in the figure, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • the data processing unit 301 may include one or more data processing units, and one data processing unit includes multiple computing engines.
  • a part of the calculation engine is used to complete the multiplication and addition calculation of the neural network.
  • the calculation engine used to complete the multiplication and addition calculation of the neural network includes a bit width calculation module 3011, a calculation module 3012, a weight bit width configuration module 3013, a control module 3014, and a result processing module 3015.
  • Another part of the calculation engine is used to complete calculations such as activation, dot product, dot addition and division in the neural network.
  • the bit width calculation module 3011 can be used to calculate the valid data of the input data, for example, perform logical OR calculation on multiple input data to obtain a mask value, determine multiple valid data of the multiple input data according to the mask value, and The multiple valid data obtained by calculation are input to the calculation module.
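  • One possible reading of this step is sketched below in Python: the mask is the logical OR of the inputs, the valid bits are the positions where the mask is 1, and the valid data of each input consists of its bits at those positions. Packing the selected bits into a shorter integer is an assumption made for this example, as are the names `compute_mask`, `valid_bit_positions`, and `extract_valid_data`.

```python
def compute_mask(inputs):
    """Bitwise OR of all input data."""
    mask = 0
    for x in inputs:
        mask |= x
    return mask


def valid_bit_positions(mask, bit_width):
    """Positions (0 = least significant) at which the mask has a 1."""
    return [n for n in range(bit_width) if (mask >> n) & 1]


def extract_valid_data(inputs, bit_width):
    """Return (mask, valid positions, valid data packed LSB-first)."""
    mask = compute_mask(inputs)
    positions = valid_bit_positions(mask, bit_width)
    valid = []
    for x in inputs:
        packed = 0
        for k, pos in enumerate(positions):
            packed |= ((x >> pos) & 1) << k  # keep only bits at valid positions
        valid.append(packed)
    return mask, positions, valid


mask, positions, valid = extract_valid_data(
    [0b00010100, 0b00000100, 0b00010000], 8)
print(bin(mask), positions, [bin(v) for v in valid])
```

  • Here the three 8-bit inputs reduce to 2-bit valid data, and the recorded positions (2 and 4) preserve the bit weights needed later to recombine the per-bit results.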
  • the calculation module 3012 includes a calculation array and an accumulation circuit.
  • The calculation array includes a plurality of storage calculation units arranged in an array, and each storage calculation unit can be used to store bits of weight data, for example 1 bit, 2 bits or 4 bits of a multi-bit weight data.
  • For example, the computing array includes 8 columns of storage computing units, and each column includes 8 storage computing units. Taking each storage computing unit storing 1 bit and the weight data being 4 bits wide as an example, one 4-bit weight data is stored across 4 storage computing units. That is, 4 columns of storage computing units represent one column of weight data, and one column of weight data includes 8 pieces of 4-bit weight data.
  • The calculation array can be used to calculate on multiple valid data and multiple weight data. For example, it multiplies the same bit (single bit/low bit) of the multiple valid data by the bit of weight data stored in each storage calculation unit to obtain multiple product results (one calculation yields as many product results as there are storage computing units in the computing array), and inputs the product results to the accumulation circuit.
  • The accumulation circuit can be used to accumulate the product results output by the calculation array. For example, it accumulates the product results obtained by the storage calculation units in the same column to obtain the sum of the products of each column of storage calculation units, and inputs the resulting sums of products to the result processing module 3015.
  • the weight bit width configuration module 3013 can be used to store bit width information of multiple weight data, and one column of weight data is a kind of weight data, so it can be understood that the weight bit width configuration module 3013 is used to store bit width information of multiple columns of weight data .
  • the bit width of weight data in the same column is the same, and the bit width of weight data in different columns may be the same or different.
  • The bit width information includes the bit width of each type of weight data and the identifier of the starting column of each type of weight data in the calculation array; that is, it includes the bit width of each column of weight data and the identifier of the starting column of each column of weight data in the calculation array. Take the 8×8 computing array shown in FIG. 4 as an example.
  • The computing array includes the 0th column of storage computing units, the 1st column of storage computing units, ..., and the 7th column of storage computing units.
  • In the bit width information stored by the weight bit width configuration module 3013, the bit width of the 0th column of weight data is 4 bits, and its starting column in the calculation array is identified as the 0th column of storage computing units. As shown in FIG. 4, the 0th column of weight data therefore occupies the 0th through 3rd columns of storage computing units.
  • the control module 3014 can be used to write various weight data stored in the memory into multiple storage calculation units according to the bit width information in the weight bit width configuration module 3013 .
  • the control module 3014 can also be used to generate a control signal to control the calculation module 3012 and the result processing module 3015 .
  • For example, when the control module 3014 determines that any bit of the mask value obtained by the bit width calculation module 3011 is valid, it generates a first control signal and a second control signal. The first control signal instructs the calculation module 3012 to multiply the same bit (single bit/low bit) of the multiple valid data by the bit of weight data stored in each storage calculation unit, and to input the resulting sums of products to the result processing module 3015.
  • The second control signal instructs the result processing module 3015 to perform, according to the bit width information, a weighted calculation on the sums of products of the multiple columns of storage calculation units corresponding to each type of weight data in the calculation array, to obtain multiple weighted results for the Nth bits respectively corresponding to the multiple valid data, where the lowest bit is the 0th bit and N is an integer greater than or equal to 0.
  • The control module 3014 may also generate a third control signal, which instructs the result processing module 3015 to perform a weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the multiple weighted results of each bit of the multiple valid data, to obtain a weighted result for each type of weight data.
  • the result processing module 3015 may be configured to execute corresponding actions according to the control signal after receiving the control signal sent by the control module 3014 .
  • For example, in response to the second control signal, the result processing module 3015 performs a weighted calculation, according to the bit width information, on the sums of products of the multiple columns of storage computing units corresponding to each type of weight data in the computing array, and obtains multiple weighted results for the Nth bits respectively corresponding to the multiple valid data.
  • In response to the third control signal, it performs a weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the multiple weighted results of each bit of the multiple valid data, to obtain a weighted result for each type of weight data.
  • the data exchange module 302 can be used to implement data exchange between various units inside the chip, for example, implement data exchange between the input and output module 303 and multiple data processing units 301 .
  • the input and output module 303 can be used to receive input data and weight data, and can also be used to output the final result obtained in the data processing unit 301 .
  • the input and output module 303 can interact with off-chip memory (stored with input data and weight data), receive the input data and weight data, and input the input data and weight data to the data processing unit 301 through the data exchange module 302 .
  • the final result obtained in the data processing unit 301 may also be output to an off-chip memory or an on-chip cache (not shown in FIG. 3 ), which is not limited in this application.
  • the embodiment of the present application provides a calculation method, which is applied to an integrated storage and calculation device.
  • Taking the integrated storage and calculation device being the chip 300 as an example, the chip 300 includes a bit width calculation module 3011, a calculation module 3012 and a result processing module 3015.
  • the calculation module includes a calculation array, and the calculation array includes a plurality of storage calculation units, and each storage calculation unit in the plurality of storage calculation units is used to store bits of weight data, and can refer to the description of the calculation array shown in FIG. 4 .
  • the method includes:
  • Step 501 Perform calculations on multiple input data to obtain multiple valid data.
  • Multiplying by a bit of the input data that is 0 always gives 0, so such a bit can be understood as invalid.
  • Multiplying by a bit of the input data that is 1 is a valid calculation, so the valid data of the input data can be understood as the data composed of the valid bits (the bits that are 1) of the input data.
  • the plurality of input data corresponds to the plurality of effective data one by one, the first input data among the plurality of input data corresponds to the first effective data among the plurality of effective data, and the bit width of the first input data is larger than that of the first effective data bit width.
  • the first input data may be any input data among a plurality of input data.
  • Since the result of performing neural network calculation on the input data is the same as that of performing it on the valid data of the input data, the accuracy of the calculation result can be guaranteed.
  • Since the bit width of the first input data is greater than the bit width of the first valid data, expanding the valid data requires fewer multiplication calculations than expanding the input data, which effectively reduces the number of calculations performed by the calculation module and reduces overhead.
  • bit width calculation module 3011 calculates multiple input data to obtain multiple valid data, and then inputs the multiple valid data to the calculation module 3012 .
  • For example, the bit width calculation module 3011 can obtain a plurality of input data from the input and output module 303, calculate the valid data of each input data, and input the resulting valid data to the calculation module 3012 for calculation.
  • step 501 includes: performing mask calculation on a plurality of input data to obtain a mask value, and determining a plurality of valid data according to valid bits of the mask value.
  • the bit width calculation module 3011 performs mask calculation on a plurality of input data to obtain a mask value, and determines a plurality of valid data according to valid bits of the mask value.
  • Since the valid data of the multiple input data must be determined from the multiple input data themselves, the method for calculating the valid data includes performing a mask calculation on the multiple input data.
  • Taking the mask calculation being a logical OR calculation as an example, the logical OR is performed on the multiple input data bit by bit, that is, on the same bit of the multiple input data in order from the highest bit to the lowest bit, to obtain a mask value. The valid data of each input data can then be determined according to the valid bits (the bits that are 1) of the mask value.
  • For example, the four 8-bit input data are 00001101, 00010100, 00001001 and 00000001.
  • A logical OR calculation is performed on the same bit of the four 8-bit input data. For example, the highest bit (bit 7) of all four input data is 0, so the logical OR result is 0; the lowest bit (bit 0) of the four input data is 1, 0, 1 and 1 respectively, so the logical OR result is 1.
  • The mask value is therefore 00011101.
  • The valid bits of the mask value are the 4th, 3rd, 2nd and 0th bits. Extracting the 4th, 3rd, 2nd and 0th bits of each input data gives the valid data of that input data, so the valid data of the four 8-bit input data are 0111, 1010, 0101 and 0001 respectively.
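  • The logical OR mask example above can be reproduced with a short Python sketch. It is only an illustration of the arithmetic, not the circuit of the bit width calculation module 3011; the function names `mask_value`, `valid_positions` and `valid_data` are hypothetical, and the input data are assumed to be unsigned.

```python
from functools import reduce
from operator import or_


def mask_value(inputs):
    """Bitwise OR of all input data (the mask value)."""
    return reduce(or_, inputs, 0)


def valid_positions(mask, width):
    """Bit positions that are 1 in the mask, highest position first."""
    return [p for p in range(width - 1, -1, -1) if (mask >> p) & 1]


def valid_data(value, positions):
    """Pack the bits of `value` at the given positions into a shorter word."""
    out = 0
    for p in positions:
        out = (out << 1) | ((value >> p) & 1)
    return out


# The four 8-bit input data from the example.
inputs = [0b00001101, 0b00010100, 0b00001001, 0b00000001]
mask = mask_value(inputs)
positions = valid_positions(mask, 8)
valid = [valid_data(x, positions) for x in inputs]

print(format(mask, "08b"))               # mask value
print(positions)                         # valid bit positions
print([format(v, "04b") for v in valid]) # valid data
```

  • Run on the four input data of the example, the sketch gives the mask value 00011101, the valid bit positions 4, 3, 2 and 0, and the valid data 0111, 1010, 0101 and 0001.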
  • After the bit width calculation module 3011 obtains the valid data of each input data, the multiple valid data are input to the calculation module 3012 bit by bit, so that the calculation module 3012 calculates, bit by bit, on the multiple valid data and the bits of the weight data stored in each storage calculation unit to obtain the calculation result of each column in the calculation array.
  • the calculation result of the calculation module 3012 on the multiple valid data is consistent with the calculation result on the multiple input data.
  • Taking the multiple valid data 0111, 1010, 0101 and 0001 shown in FIG. 6 as an example, the multiple valid data are input to the calculation module 3012 bit by bit in parallel, in order from the highest bit to the lowest bit.
  • That is, the highest bits 0, 1, 0 and 0 of the multiple valid data are first input in parallel into the calculation module 3012, and then the remaining bits are input in parallel in turn, so that the calculation module 3012 calculates on the multiple valid data bit by bit.
  • The bit width calculation module 3011 can also judge the valid bits of the multiple input data bit by bit (that is, calculate the mask value of the multiple input data bit by bit) and, when any bit is judged to be valid, input that bit of the multiple input data to the calculation module 3012 for calculation.
  • For example, the bit width calculation module 3011 judges the valid bits of the four input data bit by bit. When it reaches the 4th bit, it determines that the 4th bit is valid and inputs the 4th bit of the four input data to the calculation module 3012 for calculation, and so on. A bit judged to be invalid is not input to the calculation module 3012.
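  • The bit-by-bit variant described above can be sketched as a Python generator that scans the bit positions from highest to lowest and forwards a position only when at least one input data has a 1 there. This is an illustrative sketch under the assumption of unsigned input data; the name `stream_valid_bits` is hypothetical.

```python
def stream_valid_bits(inputs, width):
    """Yield (position, bits) for every position whose mask bit is 1.

    `bits` holds that bit of each input data, in input order. Positions
    where every input data has a 0 are skipped, i.e. never forwarded.
    """
    for p in range(width - 1, -1, -1):
        bits = [(x >> p) & 1 for x in inputs]
        if any(bits):  # logical OR of this bit across all inputs is 1
            yield p, bits


inputs = [0b00001101, 0b00010100, 0b00001001, 0b00000001]
events = list(stream_valid_bits(inputs, 8))
for position, bits in events:
    print(position, bits)
```

  • For the four input data of the example, only positions 4, 3, 2 and 0 are forwarded, so the calculation module would perform 4 calculations instead of 8.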
  • The bit width calculation module 3011 obtains multiple input data from the input and output module 303. Each time it obtains multiple input data, it calculates their valid data and inputs the resulting valid data to the computing module 3012.
  • The bit width of the valid data depends on the multiple input data obtained each time, and may be the same or different from one time to the next, so the bit width calculation module 3011 dynamically calculates the valid data of the multiple input data.
  • the mask calculation may also be other calculation methods, such as directly determining whether the high-order data of the mask is zero by determining the maximum value of multiple input data, which is not limited in this application.
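  • One possible reading of the maximum-value approach is sketched below in Python: the largest input data alone determines how many high-order bits are 0 in every input data. This is an assumption about how such a variant could work, not a description of the circuit; the inputs are assumed to be unsigned and non-empty, and `valid_width_from_max` is a hypothetical name. Unlike the logical OR mask, this variant only removes high-order zero bits, so it keeps bit 1 in the example even though no input data has a 1 there.

```python
def valid_width_from_max(inputs):
    """Smallest bit width that holds every input, found from the maximum alone."""
    return max(inputs).bit_length()


inputs = [0b00001101, 0b00010100, 0b00001001, 0b00000001]
width = valid_width_from_max(inputs)   # the maximum is 0b10100
high_zero_bits = 8 - width             # high-order mask bits known to be 0

# Keeping only the low `width` bits loses nothing.
truncated = [x & ((1 << width) - 1) for x in inputs]
print(width, high_zero_bits, [format(t, "05b") for t in truncated])
```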
  • When there is only one input data, the mask value is the input data itself, and the bit width calculation module 3011 can directly determine the valid data of the input data according to whether each bit of the input data is 1.
  • Depending on the device and circuit implementation, the bit width calculation module 3011 can also expand the calculated valid data into other low-bit forms before inputting them to the calculation module 3012, for example expanding the valid data into 2-bit pieces, which is not limited in this application.
  • Step 502 Obtain the calculation result of each column in the calculation array according to the bits of the plurality of valid data and weight data.
  • one column of calculation results is the sum of products calculated by the same bit of multiple valid data and one column of storage calculation units.
  • the calculation module 3012 calculates the calculation result of each column in the calculation array according to the multiple valid data and the bits of the weight data stored in each storage calculation unit, and inputs the calculation result of each column to the result processing module 3015.
  • the calculation module 3012 includes a calculation array, and the calculation array includes a plurality of storage calculation units.
  • One weight data is expanded into multiple single-bit/low-bit pieces that are stored in multiple storage calculation units.
  • The bits of weight data stored in each storage calculation unit can be understood as the part of the bits of a weight data held by that unit; the part can be a single bit or multiple bits.
  • the calculation module 3012 will perform multiplication calculation on the multiple valid data input by the bit width calculation module 3011 and the bits of the weight data stored in each storage calculation unit. Specifically, each valid data in the multiple valid data will be input to the calculation In different rows in the array, that is, each valid data corresponds to a row of storage computing units, and each valid data is multiplied by bits of weight data stored in each corresponding storage computing unit. After the calculation is completed, each column in the calculation array will correspond to a calculation result, and the calculation result of each column is the sum of the product of multiple valid data and the column, and the calculation module 3012 inputs the calculation result of each column into the result processing module 3015 .
  • Step 502 includes: when the calculation array receives the Nth bits respectively corresponding to the multiple valid data, the calculation array calculates the products of those Nth bits and the bits of the weight data.
  • N is an integer greater than or equal to 0.
  • In each calculation, the same single bit of the multiple valid data is calculated in parallel, that is, the Nth bits respectively corresponding to the multiple valid data are calculated in parallel. In other words, the calculation array performs one calculation each time it receives the Nth bits respectively corresponding to the multiple valid data.
  • FIG. 7 illustrates a calculation module 700 that includes a 4×8 calculation array 701. In this example the valid data of the input data and the weight data are both 4 bits wide, and each storage calculation unit works on 1 bit, that is, it stores 1 bit of weight data and multiplies it by 1 bit of input data. The valid data of the multiple input data are a1b1c1d1, a2b2c2d2, a3b3c3d3 and a4b4c4d4, and one column of weight data in the calculation array 701 is A1B1C1D1, A2B2C2D2, A3B3C3D3 and A4B4C4D4.
  • The 3rd bits (highest bits) respectively corresponding to the multiple valid data are a1, a2, a3 and a4. When the calculation array 701 receives a1, a2, a3 and a4, it inputs them into different rows of the array; specifically, each of a1, a2, a3 and a4 is input into every storage computing unit on its corresponding row.
  • a1 is multiplied by the bit of weight data stored in each storage computing unit on its row to obtain multiple product results, namely a1×A1, a1×B1, a1×C1, a1×D1 and so on.
  • a2, a3 and a4 are multiplied in the same way to obtain multiple product results.
  • Once a1, a2, a3 and a4 have been calculated, one calculation of the calculation array 701 is complete. A 4-bit valid data therefore requires the above calculation process to be performed 4 times before the whole valid data has been calculated.
  • The remaining 3 calculations process b1, b2, b3 and b4, then c1, c2, c3 and c4, and then d1, d2, d3 and d4.
  • the computing module further includes an accumulating circuit, and the accumulating circuit adds the products calculated by the same column of storage computing units in the computing array to obtain the sum of the products calculated by each column of storage computing units in the computing array.
  • The accumulation circuit accumulates the multiple results obtained by the calculation array. Specifically, it accumulates the product results calculated by the storage calculation units in the same column to obtain the calculation result of each column, that is, the sum of the products of each column of storage calculation units, and inputs these sums to the result processing module 3015.
  • For example, the accumulating circuit 702 accumulates the product results calculated by the storage calculation units in the same column of the calculation array 701.
  • After each calculation of the calculation array 701, the accumulation circuit 702 calculates the sum of the products obtained by each column of storage calculation units and inputs these sums to the result processing module 3015. A 4-bit valid data therefore requires the accumulation circuit 702 to input calculation results to the result processing module 3015 4 times.
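  • The per-bit calculation of the array and the column accumulation can be sketched in Python as follows. The sketch models only one 4-bit column of weight data (4 rows by 4 columns of 1-bit storage units). The weight values are hypothetical, the data are assumed to be unsigned, and storing the most significant weight bit in the first column is an assumption made for illustration; `weight_bits` and `array_step` are hypothetical names.

```python
def weight_bits(w, width):
    """Bits of weight w, most significant first (assumed column order)."""
    return [(w >> (width - 1 - j)) & 1 for j in range(width)]


def array_step(input_bits, array):
    """One calculation of the array.

    Row i multiplies its input bit by every stored weight bit in that row;
    the accumulation circuit then sums the products of each column.
    """
    products = [[b * cell for cell in row] for b, row in zip(input_bits, array)]
    return [sum(column) for column in zip(*products)]


valid = [0b0111, 0b1010, 0b0101, 0b0001]      # valid data from the example
weights = [0b1011, 0b0110, 0b1111, 0b0001]    # hypothetical 4-bit weights
array = [weight_bits(w, 4) for w in weights]  # one row per valid data

column_sums = {}
for n in range(3, -1, -1):                    # Nth bit, highest first
    bits = [(v >> n) & 1 for v in valid]
    column_sums[n] = array_step(bits, array)
    print(n, bits, column_sums[n])
```

  • Four passes through the loop correspond to the four calculations needed for 4-bit valid data; each pass yields one sum of products per column.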
  • the integrated storage and calculation device further includes a weight bit width configuration module, and the weight bit width configuration module stores bit width information of various weight data.
  • The weight data includes multiple types of weight data, and the weight bit width configuration module may be the weight bit width configuration module 3013 in FIG. 3.
  • the bit width information includes the bit width of each type of weight data and the identification of each type of weight data corresponding to the starting column in the calculation array.
  • One type of weight data can be understood as one column of weight data. For example, the 4×8 calculation array 701 in FIG. 7 may include multiple columns of weight data (multiple types of weight data), and at least two types of weight data among them have different bit widths.
  • For example, the 0th through 3rd columns of storage calculation units represent the 0th column of weight data of the calculation array 701; the bit width of the 0th column of weight data is 4 bits, and its starting column in the computing array 701 is identified as the 0th column of storage computing units.
  • The bit width information is shown in Table 1 below, which corresponds to the calculation array 701 shown in FIG. 7.
  • The bit width of the first type of weight data (column 0 weight data) is 4 bits and its starting column is identified as the column 0 storage computing units; that is, the column 0 through column 3 storage computing units represent the first type of weight data.
  • The bit width of the second type of weight data (column 1 weight data) is 2 bits and its starting column is identified as the column 4 storage computing units; that is, the column 4 and column 5 storage computing units represent the second type of weight data.
  • The bit width of the third type of weight data (column 2 weight data) is 2 bits and its starting column is identified as the column 6 storage computing units; that is, the column 6 and column 7 storage computing units represent the third type of weight data.
  • Table 1
  • Weight data identifier | Bit width | Starting column identifier
  • Column 0 weight data | 4 bits | Column 0 storage computing units
  • Column 1 weight data | 2 bits | Column 4 storage computing units
  • Column 2 weight data | 2 bits | Column 6 storage computing units
  • The weight bit width configuration module 3013 can store the bit width information of multiple types of weight data, at least two of which have different bit widths. A single calculation array can therefore hold weight data of multiple bit widths and support mixed-precision weight calculation, which effectively improves the calculation efficiency of the integrated storage and calculation device.
  • the integrated storage and calculation device further includes a control module, and the control module writes various weight data into multiple storage and calculation units according to the bit width information.
  • control module may be the control module 3014 in FIG. 3 .
  • the control module 3014 can write various weight data stored in the memory into multiple storage calculation units according to the bit width information in the weight bit width configuration module 3013 .
  • For example, the control module 3014 writes each bit of the column 0 weight data (A1B1C1D1, A2B2C2D2, A3B3C3D3 and A4B4C4D4) stored in the memory into the corresponding storage computing unit in the column 0 through column 3 storage computing units, and so on, until all types of weight data in the memory have been written into the storage computing units of the computing array 701 according to the bit width information shown in Table 1.
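  • Writing mixed-width weight data into the array according to the bit width information can be sketched in Python as follows. The bit widths and starting columns are those described for Table 1 (4 bits from column 0, 2 bits from column 4, 2 bits from column 6). The weight values are hypothetical, and placing the most significant bit in the starting column is an assumption; `BIT_WIDTH_INFO` and `write_weights` are hypothetical names.

```python
# (bit width, starting column) for each type of weight data, as in Table 1.
BIT_WIDTH_INFO = [(4, 0), (2, 4), (2, 6)]


def write_weights(weight_columns, info, n_rows, n_cols):
    """Spread each weight over `bits` adjacent 1-bit storage units in its row."""
    array = [[0] * n_cols for _ in range(n_rows)]
    for weights, (bits, start) in zip(weight_columns, info):
        for row, w in enumerate(weights):
            for j in range(bits):
                # Assumed order: most significant bit goes to the starting column.
                array[row][start + j] = (w >> (bits - 1 - j)) & 1
    return array


# Hypothetical weight values: one list per type of weight data, one entry per row.
weight_columns = [
    [0b1011, 0b0110, 0b1111, 0b0001],  # column 0 weight data, 4 bits
    [0b10, 0b01, 0b11, 0b00],          # column 1 weight data, 2 bits
    [0b01, 0b11, 0b00, 0b10],          # column 2 weight data, 2 bits
]
array = write_weights(weight_columns, BIT_WIDTH_INFO, n_rows=4, n_cols=8)
for row in array:
    print(row)
```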
  • control module determines valid bits of the mask value bit by bit, and generates a first control signal and a second control signal when any bit of the mask value is determined to be valid.
  • control module 3014 can generate a control signal to control the calculation module 3012 and the result processing module 3015 according to the mask value calculated by the bit width calculation module 3011 .
  • the bit width calculation module 3011 inputs the mask value into the control module 3014 bit by bit, and the control module 3014 determines whether each bit of the mask value is valid (that is, whether it is 1) bit by bit. When one bit is valid, the control module 3014 generates the first control signal and the second control signal. It can be understood that the control module 3014 generates the first control signal and the second control signal several times as there are several effective bits in the mask value.
  • The first control signal instructs the calculation module 3012 to calculate the sum of the products of each column of storage calculation units in the calculation array; that is, it instructs the calculation module 3012 to perform the calculation on the Nth bits respectively corresponding to the multiple valid data as shown in FIG. 7 and obtain the sum of the products of each column of storage calculation units.
  • The second control signal instructs the result processing module 3015 to perform, according to the bit width information, a weighted calculation on the sums of products of the multiple columns of storage calculation units corresponding to each type of weight data, to obtain multiple weighted results for the Nth bits respectively corresponding to the multiple valid data. Since the bit width information indicates which columns of storage computing units a type of weight data (one column of weight data) occupies in the computing array, the result processing module 3015 can determine the columns of storage computing units corresponding to each type of weight data.
  • The result processing module 3015 weights the sums of products of the columns corresponding to each type of weight data according to the bit weight of each weight bit. For example, for the column of storage computing units holding the lowest bit (the 0th bit) of the weight data, the sum of products is multiplied by 2^0 before accumulation; for the column holding the 2nd bit of the weight data, the sum of products is multiplied by 2^2 before accumulation. Multiplication by a power of 2 can be implemented in the circuit as a shift.
  • the calculation array includes several kinds of weight data (several columns of weight data), and the result processing module 3015 can obtain several weighted results after performing one weight calculation.
  • That is, after one weighted calculation the result processing module 3015 obtains multiple weighted results for the Nth bits respectively corresponding to the multiple valid data, and each of these weighted results corresponds to one type of weight data.
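  • The weighted calculation performed on one set of column sums can be sketched in Python as follows. The bit width information follows Table 1, the column sums are hypothetical values for a single calculation of an 8-column array, and the assumption that the starting column holds the most significant weight bit is made only for illustration; `weight_by_type` is a hypothetical name.

```python
# (bit width, starting column) for each type of weight data, as in Table 1.
BIT_WIDTH_INFO = [(4, 0), (2, 4), (2, 6)]


def weight_by_type(column_sums, info):
    """Combine the column sums of each type of weight data into one result.

    The column holding weight bit k contributes its sum shifted left by k,
    which is the multiplication by 2^k described above.
    """
    results = []
    for bits, start in info:
        acc = 0
        for j in range(bits):
            shift = bits - 1 - j  # assumed: starting column holds the top bit
            acc += column_sums[start + j] << shift
        results.append(acc)
    return results


# Hypothetical sums of products for the 8 columns after one calculation.
column_sums = [2, 1, 2, 2, 2, 1, 0, 1]
weighted = weight_by_type(column_sums, BIT_WIDTH_INFO)
print(weighted)
```

  • With three types of weight data in the array, one weighted calculation yields three weighted results, one per type.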
  • For example, the control module 800 shown in FIG. 8 includes a first comparator, which compares each bit input to the control module 800 with 1. If they are the same, the first control signal and the second control signal are generated; if they are not, the first control signal and the second control signal are not generated.
  • The bit width calculation module 3011 inputs the mask value into the control module 800 bit by bit, in order from the highest bit to the lowest bit. First, it inputs the highest bit (the 7th bit) 0 of the mask value; the first comparator finds that 0 differs from 1, so the bit is not a valid bit and the first control signal and the second control signal are not generated. By analogy, when the bit width calculation module 3011 inputs the 4th bit 1 of the mask value, the first comparator finds that 1 equals 1, so the bit is a valid bit and the first control signal and the second control signal are generated.
  • The first control signal generated by the control module 800 is input to the calculation module 3012 and instructs it to perform one calculation on the Nth bits respectively corresponding to the multiple valid data.
  • For example, the first control signal generated for the 4th bit of the mask value instructs the calculation module 3012 to perform one calculation on the highest bits (3rd bits) 0, 1, 0 and 0 of the multiple valid data and obtain the sum of the products of each column of storage calculation units in the calculation array, namely S3, S2, S1, S0 and so on shown in FIG. 7.
  • The second control signal generated by the control module 800 is input into the result processing module 3015 and instructs it to perform one weighted calculation on the multiple sums of products generated by the calculation module 3012.
  • For example, since Table 1 gives the bit width of the column 0 weight data as 4 bits and its starting column identifier as the column 0 storage computing units, the result processing module 3015 determines that the column 0 through column 3 storage computing units in the calculation array 701 represent the first type of weight data (column 0 weight data).
  • When the control module determines that the bit width of the mask value is equal to the bit width of the input data, it generates a third control signal.
  • the bit width of the input data is the same as the bit width of the mask value. Since the mask value is input into the control module 3014 bit by bit, when the control module 3014 determines that the bit width of the mask value is the same as the bit width of the input data, it can be determined that the input of the mask value is completed, thereby generating a third control signal. It can be understood that the control module 3014 outputs the third control signal after outputting the first control signal and the second control signal.
  • The third control signal instructs the result processing module 3015 to perform a weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the multiple weighted results of each bit of the multiple valid data, to obtain the final result. The final result includes the weighted result of each type of weight data.
  • the control module 800 further includes a counter and a second comparator. Every time a bit of the mask value is input, the counter will perform an operation of adding 1 to record the bit width of the mask value.
  • The second comparator compares the bit width of the mask value recorded in the counter with the bit width of the input data. If they are the same, the third control signal is generated; if not, the third control signal is not generated.
  • For example, the valid bits of the mask value are the 4th, 3rd, 2nd and 0th bits, and the bit weights corresponding to these valid bits are 2^4, 2^3, 2^2 and 2^0 respectively.
  • First, the bit width calculation module 3011 inputs the highest bit (the 7th bit) 0 of the mask value into the control module 800. The first comparator finds that 0 differs from 1, so the bit is not a valid bit and the first control signal and the second control signal are not generated.
  • The counter records a mask bit width of 1. The second comparator finds that the recorded bit width (1) differs from the bit width of the input data (8), so the third control signal is not generated.
  • By analogy, when the bit width calculation module 3011 inputs the lowest bit (the 0th bit) 1 of the mask value into the control module 800, the first comparator finds that 1 equals 1, so the bit is a valid bit and the first control signal and the second control signal are generated.
  • The counter now records a mask bit width of 8. The second comparator finds that the recorded bit width (8) equals the bit width of the input data (8), so the third control signal is generated.
  • The third control signal generated by the control module 800 is input into the result processing module 3015.
  • By this point the result processing module 3015 has received the second control signal 4 times, that is, it has performed 4 weighted calculations on the sums of products from the calculation module 3012, and each weighted calculation yields multiple weighted results (for example, the first weighted calculation yields sum0 and other weighted results).
  • The third control signal instructs the result processing module 3015 to perform a further weighted calculation, according to the bit weights corresponding to the valid bits of the mask value and the multiple weighted results obtained from the preceding weighted calculations, to obtain the final result.
  • The final result includes the weighted result of each type of weight data. It can be understood that the calculation module 700 can obtain three final results, which correspond to the weight data in the 0th column, the 1st column and the 2nd column respectively.
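The behaviour of the counter and the two comparators described above can be sketched in software as follows. This is a minimal illustrative model, not the circuit itself: the function name `control_signals` and the string labels for the three signals are hypothetical, and the mask value is assumed to arrive as a bit string, highest bit first.

```python
def control_signals(mask_bits: str):
    """Model the control module as mask bits arrive, highest bit first.

    Returns a list of (bit_position, signals) pairs, where signals lists
    the control signals generated when that mask bit is received.
    """
    width = len(mask_bits)   # bit width of the input data (assumed equal to the mask length)
    counter = 0              # number of mask bits received so far
    events = []
    for i, bit in enumerate(mask_bits):
        position = width - 1 - i
        signals = []
        if bit == "1":       # first comparator: the bit is valid
            signals += ["first", "second"]
        counter += 1         # the counter adds 1 for every mask bit
        if counter == width: # second comparator: the whole mask has been input
            signals.append("third")
        events.append((position, signals))
    return events


events = control_signals("00011101")
for position, signals in events:
    print(position, signals)
```

For the mask value 00011101, the first and second control signals are generated four times (bits 4, 3, 2 and 0), and the third control signal is generated only when the 0th bit arrives.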
  • Step 503: Perform weighted calculation on the calculation result of each column to obtain the final result.
  • Specifically, in step 503 the result processing module 3015 performs weighted calculation on the calculation result of each column to obtain the final result.
  • The calculation result of each column is the sum of the products of the storage calculation units in that column, which can be understood as the calculation results S3, S2, S1 and S0 in step 502 above.
  • The result processing module 3015 performs the weighted calculation on the calculation result of each column in two stages. It first performs a weighted calculation according to the bit weights of the weight data bits to obtain multiple sum values, and then performs a weighted calculation on those sum values according to the bit weights corresponding to the valid bits of the mask value to obtain the out value.
  • The out value is the final result. Reference may be made to the description of the above-mentioned control module 3014 (control module 800), which is not repeated here.
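The two-stage weighting performed by the result processing module can be sketched as follows. This is an illustrative model under stated assumptions: the function names are hypothetical, each storage calculation unit is assumed to hold 1 bit, and the column sums of one weight group are assumed to be listed with the highest weight bit first. The numbers in the example are made up for illustration (inputs 00011 and 00101 with 4-bit weights 0110 and 0011, so the expected result is 3×6 + 5×3 = 33).

```python
def weight_column_sums(column_sums):
    """Stage 1: combine the column sums of one weight group by weight-bit weight.

    column_sums is ordered from the highest weight bit to the lowest.
    """
    k = len(column_sums)
    return sum(s << (k - 1 - j) for j, s in enumerate(column_sums))


def final_result(per_bit_column_sums, valid_positions):
    """Stage 2: weight each stage-1 sum by the bit weight of its mask bit."""
    sums = [weight_column_sums(cs) for cs in per_bit_column_sums]
    return sum(s << p for s, p in zip(sums, valid_positions))


# Column sums obtained for the three valid mask bits (positions 2, 1 and 0).
per_bit_column_sums = [
    [0, 0, 1, 1],  # input bits at position 2
    [0, 1, 1, 0],  # input bits at position 1
    [0, 1, 2, 1],  # input bits at position 0
]
out = final_result(per_bit_column_sums, valid_positions=[2, 1, 0])
print(out)
```

Stage 1 corresponds to the second control signal (one sum value per valid bit), and stage 2 corresponds to the third control signal (the out value).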
  • The input data and the weight data may be unsigned numbers or signed numbers. For unsigned numbers, the calculation method can refer to the examples in the embodiments of this application; signed numbers can be handled by calculation methods such as differential calculation, which is not limited in this application.
  • The calculation method provided by the embodiments of this application can be applied to an integrated storage and calculation device, such as a chip.
  • The bit width calculation module performs calculation on multiple input data and inputs the resulting multiple valid data to the calculation module. The calculation module then obtains the calculation result of each column in the calculation array according to the multiple valid data and the bits of the weight data stored in each storage calculation unit, and inputs the calculation result of each column to the result processing module. Finally, the result processing module performs weighted calculation on the calculation result of each column to obtain the final result.
  • The bit width calculation of this application dynamically calculates the valid data of the input data, so that only the valid bits of the input data are calculated, which effectively reduces the number of calculations performed by the calculation module and reduces overhead.
  • In the prior art, the bit width of the weight data is fixed and mixed-precision calculation of weight data cannot be achieved, which results in low calculation efficiency.
  • In contrast, this application can use the bit width information of the multiple types of weight data stored in the weight bit width configuration module to deploy and calculate weight data of multiple bit widths in a single calculation array. This supports mixed-precision calculation of weight data and effectively improves the calculation efficiency of the integrated storage and calculation device.
  • The embodiments of this application provide a schematic flowchart of a calculation method. The following description takes as an example the case in which the bit width calculation module is the bit width calculation module 3011, the calculation module is the calculation module 3012, the weight bit width configuration module is the weight bit width configuration module 3013, the control module is the control module 3014, the result processing module is the result processing module 3015, the multiple input data are 00011, 00101 and 00010, the calculation array is a 3×3 calculation array, each storage calculation unit stores 1 bit, and only one type of weight data is stored in the weight bit width configuration module 3013.
  • The calculation process includes:
  • Step 1: The bit width calculation module 3011 performs calculation on the multiple input data, obtains the mask value of the multiple input data and the valid data corresponding to each input data, and inputs the multiple valid data to the calculation module 3012.
  • The multiple input data are 00011, 00101 and 00010. A mask calculation is performed on the multiple input data (taking a logical OR as an example), and the resulting mask value is 00111. The multiple valid data are therefore determined to be 011, 101 and 010, and they are input bit by bit to the calculation module.
  • For details, refer to the description of step 501 above, which is not repeated here.
  • Step 2: The control module 3014 writes the multiple types of weight data into the multiple storage calculation units according to the weight bit width configuration module 3013.
  • The control module 3014 writes the multiple weight data into the multiple storage calculation units according to the bit width information in the weight bit width configuration module 3013; see the calculation array shown in Figure 9. For details, refer to the description of the control module above, which is not repeated here.
  • Step 3: The control module 3014 generates the first control signal and the second control signal according to the valid bits of the mask value calculated by the bit width calculation module 3011.
  • The highest bit (the 4th bit) of the mask value, which is 0, is input into the control module 3014 first. The control module 3014 determines that this bit is not valid and does not generate the first control signal or the second control signal.
  • The 3rd bit of the mask value, which is 0, is input into the control module 3014. The control module 3014 determines that this bit is not valid and does not generate the first control signal or the second control signal.
  • The 2nd bit of the mask value, which is 1, is input into the control module 3014. The control module 3014 determines that this bit is a valid bit and generates the first control signal and the second control signal.
  • The 1st bit of the mask value, which is 1, is input into the control module 3014. The control module 3014 determines that this bit is a valid bit and generates the first control signal and the second control signal.
  • the column storage calculation unit and the second column storage calculation unit are input to the result processing module 3015 .
  • The 0th bit of the mask value, which is 1, is input into the control module 3014. The control module 3014 determines that this bit is a valid bit and generates the first control signal and the second control signal.
  • the column storage calculation unit and the second column storage calculation unit are input to the result processing module 3015.
  • Step 4: When the control module 3014 determines that the bit width of the mask value is equal to the bit width of the input data, it generates the third control signal.
  • When the control module 3014 determines that the bit width of the mask value is 5 bits, it generates the third control signal.
  • At this point the integrated storage and calculation device has completed the calculation of the multiple input data and the multiple weight data. It can be understood that steps 1 to 4 above only take as an example the case in which the calculation array holds one type of weight data (one column of weight data); in practice there may be multiple types of weight data.
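Steps 1 to 4 can be traced end to end with a short simulation. This is a sketch under stated assumptions: the three 3-bit weight values are hypothetical (the contents of the calculation array in Figure 9 are not reproduced here), each storage calculation unit holds 1 bit, and the variable names are chosen for illustration only.

```python
INPUT_WIDTH = 5
WEIGHT_WIDTH = 3

inputs = [0b00011, 0b00101, 0b00010]   # input data from the example above
weights = [0b101, 0b011, 0b110]        # hypothetical 3-bit weights, one per row

# 3x3 calculation array: row r holds the bits of weights[r], highest bit first.
array = [[(w >> (WEIGHT_WIDTH - 1 - c)) & 1 for c in range(WEIGHT_WIDTH)]
         for w in weights]

# Step 1: mask calculation (logical OR of all input data).
mask = 0
for x in inputs:
    mask |= x

array_computations = 0
out = 0
# Steps 3 and 4: scan the mask from the highest bit to the lowest bit.
for p in range(INPUT_WIDTH - 1, -1, -1):
    if not (mask >> p) & 1:
        continue                       # invalid bit: the array is not triggered
    bits = [(x >> p) & 1 for x in inputs]
    # First control signal: one array computation, one sum of products per column.
    column_sums = [sum(b * row[c] for b, row in zip(bits, array))
                   for c in range(WEIGHT_WIDTH)]
    array_computations += 1
    # Second control signal: weight the column sums by weight-bit weight.
    partial = sum(s << (WEIGHT_WIDTH - 1 - c) for c, s in enumerate(column_sums))
    # Third control signal (applied here incrementally): weight by mask-bit weight.
    out += partial << p

print(bin(mask), array_computations, out)
```

With these assumed weights the array is triggered 3 times instead of 5, and the result equals the direct dot product of the inputs and the weights.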
  • The calculation method provided by the embodiments of this application dynamically calculates the valid data of multiple input data and only calculates the valid bits of the input data, which effectively reduces the number of calculations performed by the calculation module and reduces overhead. It also supports mixed-precision calculation of weight data, which improves the calculation efficiency of the integrated storage and calculation device.
  • The calculation results obtained with the yolov3-tiny object detection model are shown in Table 2 (the data set is the COCO2017val data set).
  • Taking the number of bit operations and the number of array calculations of the prior art as 100%, when an 8-bit model is calculated with the integrated storage and calculation device of this application, the number of bit operations can be reduced to 81.38% of the prior art and the number of array calculations can be reduced to 78.31% of the prior art while the calculation accuracy is maintained.
  • the number of bit operations can be reduced to 69.14% of the prior art and the number of array calculations can be reduced to 72.23% of the prior art while the calculation accuracy is maintained. It can be seen that the method provided by the embodiments of this application effectively reduces the number of calculations, and that when the weight data is calculated with mixed precision the number of calculations is reduced further, which lowers the calculation cost and improves the calculation efficiency.
  • To implement the above functions, the above-mentioned integrated storage and calculation device includes hardware structures and/or software modules corresponding to each function.
  • In combination with the example units and algorithm steps described in the embodiments disclosed herein, the embodiments of this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is executed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be regarded as exceeding the scope of the embodiments of this application.
  • The functional modules of the above-mentioned integrated storage and calculation device can be divided according to the above method examples.
  • For example, each functional module can be divided to correspond to one function, or two or more functions can be integrated into one processing module.
  • The integrated modules can be implemented in the form of hardware or in the form of software functional modules. It should be noted that the division of modules in the embodiments of this application is schematic and is only a division by logical function; other divisions are possible in actual implementation.
  • The embodiments of this application disclose an integrated storage and calculation device 1000, which may be the chip 300 in the above embodiment.
  • The integrated storage and calculation device 1000 may include a processing module, a storage module and a communication module.
  • The processing module can be used to control and manage the actions of the integrated storage and calculation device 1000. For example, it can be used to support the device 1000 in executing the steps performed by the bit width calculation module 3011, the calculation module 3012, the weight bit width configuration module 3013, the control module 3014 and the result processing module 3015.
  • The storage module can be used to support the integrated storage and calculation device 1000 in storing program code and data, for example, input data and weight data.
  • The communication module can be used to support communication between the integrated storage and calculation device 1000 and other devices. For example, it can be used to receive multiple input data and weight data from external devices, and to output the final result obtained by the result processing module 3015 to external devices.
  • The unit modules in the integrated storage and calculation device 1000 include but are not limited to the processing module, storage module and communication module described above.
  • The processing module may be a processor or a controller. It can implement or execute the various illustrative logical blocks, modules and circuits described in connection with this disclosure.
  • The processor can also be a combination that implements computing functions, such as a combination of one or more microprocessors, a neural network processing unit (NPU), or a combination of a digital signal processor (DSP) and a microprocessor.
  • The storage module may be a memory.
  • The communication module may be a device that interacts with other external devices.
  • For example, the processing module is a processor 1001, the storage module may be a memory 1002, and the communication module may be called a communication interface 1003.
  • The integrated storage and calculation device 1000 provided in the embodiments of this application may be the chip 300 shown in FIG. 3.
  • The processor 1001, the memory 1002, the communication interface 1003 and so on may be connected together, for example, through a bus.
  • The embodiments of this application also provide an electronic device, including one or more processors and one or more memories.
  • The one or more memories are coupled to the one or more processors and are used to store computer program code. The computer program code includes computer instructions, and when the one or more processors execute the computer instructions, the electronic device performs the related method steps above to implement the calculation method in the above embodiment.
  • The embodiments of this application also provide an electronic device that includes one or more communication interfaces and one or more processors, where the communication interfaces and the processors are interconnected through lines. The processor receives computer instructions from the memory of the electronic device through the communication interface and executes them, so that the electronic device performs the related method steps above to implement the calculation method in the above embodiment.
  • The embodiments of this application also provide a computer-readable storage medium that stores computer program code. When the computer instructions run on a computer or a processor, the computer or the processor performs the calculation method in the above embodiment.
  • The embodiments of this application also provide a computer program product that includes computer instructions. When the computer instructions run on a computer or a processor, the computer or the processor performs the related steps above to implement the calculation method in the above embodiment.
  • The integrated storage and calculation device, electronic device, computer storage medium, computer program product and chip provided in this embodiment are all used to execute the corresponding method provided above. For the beneficial effects they can achieve, refer to the beneficial effects of the corresponding method provided above, which are not repeated here.
  • In the embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways.
  • The device embodiments described above are only illustrative.
  • For example, the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented.
  • In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or in other forms.
  • A unit described as a separate component may or may not be physically separate, and a component displayed as a unit may be one physical unit or multiple physical units; that is, it may be located in one place or distributed over multiple places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium.
  • Based on this understanding, the technical solution of the embodiments of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of this application.
  • The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Abstract

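Embodiments of this application provide an integrated storage and calculation device and a calculation method, relating to the field of chip technologies, for reducing calculation overhead and improving calculation efficiency during neural network calculation. The method includes: a bit width calculation module performs calculation on multiple input data to obtain multiple valid data and inputs the multiple valid data to a calculation module; the calculation module obtains the calculation result of each column in the calculation array according to the multiple valid data and the bits of the weight data stored in each storage calculation unit, and inputs the calculation result of each column to a result processing module; finally, the result processing module performs weighted calculation on the calculation result of each column to obtain a final result. The embodiments of this application are used in the process in which an integrated storage and calculation device performs calculation.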

Description

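Integrated storage and calculation device and calculation method
This application claims priority to Chinese Patent Application No. 202111599630.1, filed with the China National Intellectual Property Administration on December 24, 2021 and entitled "Integrated storage and calculation device and calculation method", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of chip technologies, and in particular to an integrated storage and calculation device and a calculation method.
Background
In recent years, neural networks (NN) have developed rapidly and are widely used in fields such as robotics, speech recognition, image recognition, natural language processing and expert systems. The core calculation of a neural network is matrix-vector multiplication, which is both computation-intensive and memory-access-intensive. When a general-purpose chip is used for neural network calculation, it has obvious shortcomings in power consumption, performance and size. Therefore, to improve the calculation efficiency of neural networks, dedicated chips (neural network accelerators) need to be customized for neural network calculation.
An integrated storage and calculation device retains the storage and read/write functions of the storage circuit itself and can also support multiply-accumulate operations in parallel, which reduces the amount of data movement and improves energy efficiency, providing an efficient solution for neural network accelerator design. When performing calculation, an integrated storage and calculation device usually needs to expand multi-bit data, according to the data bit width, into single-bit/low-bit (for example, 2-bit or 4-bit) data for calculation and then merge the calculation results. The number of expanded calculations is therefore large, which results in high overhead.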
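Summary
Embodiments of this application provide an integrated storage and calculation device and a calculation method, applied to an integrated storage and calculation device, which can reduce overhead and improve calculation efficiency during neural network calculation.
To achieve the above objective, the embodiments of this application adopt the following technical solutions:
In a first aspect, an embodiment of this application provides an integrated storage and calculation device, which includes a bit width calculation module, a calculation module and a result processing module. The calculation module includes a calculation array, the calculation array includes multiple storage calculation units, and the multiple storage calculation units are used to store weight data. The bit width calculation module is used to perform calculation on multiple input data to obtain multiple valid data and to input the multiple valid data to the calculation module. The multiple input data are in one-to-one correspondence with the multiple valid data, a first input data among the multiple input data corresponds to a first valid data among the multiple valid data, and the bit width of the first input data is greater than the bit width of the first valid data. The calculation module is used to obtain the calculation result of each column in the calculation array according to the multiple valid data and the bits of the weight data, and to input the calculation result of each column to the result processing module, where the calculation result of one column is the sum of the products calculated by one column of storage calculation units with the same bit of the multiple valid data. The result processing module is used to perform weighted calculation on the calculation result of each column to obtain a final result.
In the prior art, multi-bit input data is expanded according to the data bit width into multiple single-bit/low-bit input data for input and calculation, which causes too many expanded calculations and high overhead. In contrast, the method of this application can dynamically calculate the valid data of the input data so that only the valid bits of the input data are calculated, which effectively reduces the number of calculations performed by the calculation module, reduces calculation overhead, and improves the calculation efficiency of the integrated storage and calculation device.
In a possible design, the bit width calculation module is specifically used to perform mask calculation on the multiple input data to obtain a mask value, determine the multiple valid data according to the valid bits of the mask value, and input the multiple valid data to the calculation module bit by bit, so that the calculation module calculates the multiple valid data bit by bit. In this way, the bit width calculation module obtains the valid data of the input data through mask calculation and inputs the valid data to the calculation module bit by bit, which can greatly reduce the number of calculations performed by the calculation array.
In a possible design, when the calculation array receives the Nth bit of each of the multiple valid data, where N is an integer greater than or equal to 0, the calculation array is used to calculate the products of the Nth bits of the multiple valid data and the bits of the weight data. The calculation module further includes an accumulation circuit, which is used to add the products calculated by the storage calculation units in the same column of the calculation array to obtain the sum of products of each column of storage calculation units in the calculation array. In this way, the calculation module calculates the Nth bits of the multiple valid data each time, and the number of calculations corresponds to the bit width of the valid data. Because the bit width of the valid data is smaller than the bit width of the input data, the number of calculations performed by the calculation array is effectively reduced.
In a possible design, the weight data includes multiple types of weight data, and the integrated storage and calculation device further includes a weight bit width configuration module. The weight bit width configuration module is used to store bit width information of the multiple types of weight data. The bit width information includes the bit width of each type of weight data and the identifier of the starting column in the calculation array corresponding to each type of weight data, where at least two of the multiple types of weight data have different bit widths. In the prior art, the bit width of the weight data is fixed and mixed-precision calculation of weight data cannot be achieved, which results in low calculation efficiency. In contrast, this application can use the bit width information of the multiple types of weight data stored in the weight bit width configuration module to deploy and calculate weight data of multiple bit widths in a single calculation array, which supports mixed-precision calculation of weight data and effectively improves the calculation efficiency of the integrated storage and calculation device.
In a possible design, the integrated storage and calculation device further includes a control module, which is used to write the multiple types of weight data into the multiple storage calculation units according to the bit width information. In this way, the control module can deploy the weight data to each storage calculation unit in the calculation array according to the bit width information, so that a single calculation array holds weight data of multiple bit widths, which enables mixed-precision calculation of weight data and improves the calculation efficiency of the integrated storage and calculation device.
In a possible design, the control module is further used to determine the valid bits of the mask value bit by bit and, when any bit of the mask value is determined to be valid, to generate a first control signal and a second control signal. The first control signal instructs the calculation module to calculate the sum of products of each column of storage calculation units in the calculation array. The second control signal instructs the result processing module to perform, according to the bit width information, a weighted calculation on the sums of products of the multiple columns of storage calculation units corresponding to each type of weight data in the calculation array, to obtain multiple weighted results for the Nth bits of the multiple valid data, where each of the multiple weighted results corresponds to one type of weight data. In this way, the control module can generate control signals according to the valid bits of the mask value to control the calculation module and the result processing module. Because the number of valid bits of the mask value is the same as the bit width of the valid data and is usually smaller than the bit width of the input data, generating control signals according to the valid bits of the mask value reduces the number of calculations performed by the calculation module and reduces calculation overhead.
In a possible design, the control module is further used to generate a third control signal when it determines that the bit width of the mask value is equal to the bit width of the input data. The third control signal instructs the result processing module to perform a weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the multiple weighted results of each bit of the multiple valid data, to obtain the final result, and the final result includes the weighted result of each type of weight data. In this way, after the calculation module finishes calculating, the result processing module performs weighted calculation according to the bit width information and the bit weights of the valid bits of the mask value, which accurately converts the results of multiple calculations between single-bit valid data and multi-bit weight data into the result of calculating multi-bit input data with multi-bit weight data. The number of calculations and the overhead are effectively reduced while the calculation precision remains unchanged.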
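In a second aspect, an embodiment of this application provides a calculation method applied to an integrated storage and calculation device. The integrated storage and calculation device includes a calculation array, the calculation array includes multiple storage calculation units, and the multiple storage calculation units are used to store weight data. The method includes: performing calculation on multiple input data to obtain multiple valid data, where the multiple input data are in one-to-one correspondence with the multiple valid data, a first input data among the multiple input data corresponds to a first valid data among the multiple valid data, and the bit width of the first input data is greater than the bit width of the first valid data; obtaining the calculation result of each column in the calculation array according to the multiple valid data and the bits of the weight data, where the calculation result of one column is the sum of the products calculated by one column of storage calculation units with the same bit of the multiple valid data; and performing weighted calculation on the calculation result of each column to obtain a final result. For the beneficial effects of the second aspect, refer to the beneficial effects of the first aspect.
In a possible design, performing calculation on multiple input data to obtain multiple valid data includes: performing mask calculation on the multiple input data to obtain a mask value, and determining the multiple valid data according to the valid bits of the mask value. Obtaining the calculation result of each column in the calculation array according to the multiple valid data and the bits of the weight data includes: calculating the multiple valid data bit by bit with the bits of the weight data to obtain the calculation result of each column in the calculation array.
In a possible design, obtaining the calculation result of each column in the calculation array according to the multiple valid data and the bits of the weight data includes: when the calculation array receives the Nth bit of each of the multiple valid data, where N is an integer greater than or equal to 0, calculating the products of the Nth bits of the multiple valid data and the bits of the weight data, and adding the products calculated by the storage calculation units in the same column of the calculation array to obtain the sum of products of each column of storage calculation units in the calculation array.
In a possible design, the method further includes: storing bit width information of multiple types of weight data, where the bit width information includes the bit width of each type of weight data and the identifier of the starting column in the calculation array corresponding to each type of weight data, and at least two of the multiple types of weight data have different bit widths.
In a possible design, the weight data includes multiple types of weight data, and the method further includes: writing the multiple types of weight data into the multiple storage calculation units according to the bit width information.
In a possible design, the method further includes: determining the valid bits of the mask value bit by bit and, when any bit of the mask value is determined to be valid, generating a first control signal and a second control signal. The first control signal is used to calculate the sum of products of each column of storage calculation units in the calculation array. The second control signal is used to perform, according to the bit width information, a weighted calculation on the sums of products of the multiple columns of storage calculation units corresponding to each type of weight data in the calculation array, to obtain multiple weighted results for the Nth bits of the multiple valid data, where each of the multiple weighted results corresponds to one type of weight data.
In a possible design, the method further includes: generating a third control signal when it is determined that the bit width of the mask value is equal to the bit width of the input data. The third control signal is used to perform a weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the multiple weighted results of each bit of the multiple valid data, to obtain the final result, and the final result includes the weighted result of each type of weight data.
In a third aspect, a computer-readable storage medium stores computer instructions. When the computer instructions run on an electronic device, the electronic device is caused to perform the method according to the second aspect or any possible design of the second aspect.
In a fourth aspect, a computer program product is provided. When the computer program product runs on a computer, the electronic device is caused to perform the method according to the second aspect or any possible design of the second aspect.
For the beneficial effects of the other aspects above, refer to the description of the beneficial effects of the first aspect, which is not repeated here.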
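Brief Description of the Drawings
FIG. 1 is a schematic diagram of an analog calculation array;
FIG. 2 is a schematic diagram of a digital calculation array;
FIG. 3 is a schematic structural diagram of an integrated storage and calculation device according to an embodiment of this application;
FIG. 4 is a schematic diagram of a calculation array according to an embodiment of this application;
FIG. 5 is a schematic flowchart of a calculation method according to an embodiment of this application;
FIG. 6 is a schematic diagram of calculating valid data according to an embodiment of this application;
FIG. 7 is a schematic diagram of a calculation module according to an embodiment of this application;
FIG. 8 is a schematic diagram of a control module according to an embodiment of this application;
FIG. 9 is a schematic flowchart of a calculation method according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of an integrated storage and calculation device according to an embodiment of this application.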
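Detailed Description
For ease of understanding, explanations of some concepts related to the embodiments of this application are given below for reference:
Artificial neural network (ANN): also called a neural network or neural-like network, a mathematical or computational model that imitates the structure and function of a biological neural network (the central nervous system, for example the brain) and is used to estimate or approximate functions. A neural network consists of a large number of interconnected nodes (neurons). Each node represents a specific output function, called an excitation function or activation function, and each connection between two nodes represents a weighting value for the signal passing through that connection, called weight data.
Neural network accelerator: an application-specific integrated circuit (ASIC) chip suitable for artificial neural network inference or training, used to perform neural network calculation and improve the calculation efficiency of neural networks.
Integrated storage and calculation: algorithms are embedded in the memory so that operations are moved from the central processing unit (CPU) into the memory and calculation is performed within the storage calculation units (cells), which can greatly reduce data exchange time and the energy consumed by data access during calculation.
An integrated storage and calculation device can be implemented in two ways: by building the calculation array from analog devices (for example, resistive random-access memory, ReRAM), or by building the calculation array from digital devices (for example, static random-access memory, SRAM).
FIG. 1 is a schematic diagram of an analog calculation array built from analog devices. During neural network calculation, the analog devices can be understood as storage calculation units arranged in an array. Analog devices in the same row share one word line, and analog devices in the same column share one bit line. The conductance in an analog device can be understood as weight data and the voltage as input data, and the input voltage on the same word line is the same. The current output by each bit line represents the sum of the products of the conductances and voltages of the analog devices sharing that bit line (in the same column), that is, the sum of the products of that column of weight data and the input data. For example, in a 4×4 analog calculation array, the conductances in the first column are G1, G2, G3 and G4, that is, the weight data of the first column are G1, G2, G3 and G4; the input voltages of the rows are V1, V2, V3 and V4, that is, the input data are V1, V2, V3 and V4; and the input data are input in parallel. The current output by the first column is then I1 = G1×V1 + G2×V2 + G3×V3 + G4×V4, which represents the sum of the products of that column of weight data and the multiple input data.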
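The column current relation I1 = G1×V1 + G2×V2 + G3×V3 + G4×V4 can be checked with a small numerical sketch. The function name `column_currents` and the conductance and voltage values are hypothetical, chosen as integers only to keep the arithmetic exact; a real analog array works with continuous physical quantities.

```python
def column_currents(conductances, voltages):
    """Return one current per bit line (column).

    conductances[i][j] is the cell in row i, column j;
    voltages[i] is the input voltage applied to word line (row) i.
    """
    n_cols = len(conductances[0])
    return [sum(row[j] * v for row, v in zip(conductances, voltages))
            for j in range(n_cols)]


# Hypothetical 4x4 array: column 0 holds G1..G4 = 1, 2, 3, 4.
G = [[1, 0, 2, 1],
     [2, 1, 0, 3],
     [3, 1, 1, 0],
     [4, 2, 0, 1]]
V = [1, 0, 1, 1]          # V1..V4, applied to all cells of the same row

I = column_currents(G, V)
print(I)                  # I[0] is I1 for the first column
```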
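FIG. 2 is a schematic diagram of a digital calculation array built from digital devices. During neural network calculation, each storage calculation unit stores one weight data, and the input unit inputs input data to each storage calculation unit in the digital calculation array. Storage calculation units in the same row receive the same input data, the multiplication of the weight data and the input data is performed in the storage calculation unit, and the multiplication results in the same column are accumulated by a peripheral accumulation circuit to obtain the sum of the products of each column of weight data and the multiple input data.
Both implementations can input multiple input data in parallel on the rows and perform multiple multiply-accumulate calculations in parallel on the columns.
Data bit width: referred to as bits for short, it indicates the number of binary digits transmitted by the bus at one time. A bit is the smallest unit of data storage inside a computer. For example, 11010100 is an 8-digit binary number, that is, its bit width is 8 bits, and it can be called 8-bit data.
Calculation array (crossbar, XB): in this application, a calculation array built from storage calculation units. Each calculation array contains several rows and several columns.
Bit weight: the unit value corresponding to each fixed position in a number is called the bit weight. For a multi-digit number, the magnitude of the value represented by a "1" at a given position is called the bit weight of that position. For example, in a decimal number the bit weight of the 2nd digit from the right is 10 and that of the 3rd digit is 100, while in a binary number the bit weight of the 2nd digit from the right is 2 and that of the 3rd digit is 4. For a base-N number, the bit weight of the i-th digit from the right in the integer part is N^(i-1), and the bit weight of the j-th digit from the left in the fractional part is N^(-j).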
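The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings. In the description of the embodiments of this application, unless otherwise specified, "/" means "or"; for example, A/B can mean A or B. "And/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B can mean: A alone, both A and B, or B alone. In addition, in the description of this embodiment, unless otherwise specified, "multiple" means two or more.
Hereinafter, the terms "first" and "second" are used for description only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more of that feature. In the description of this embodiment, unless otherwise specified, "coupled" means that two or more circuit elements are connected directly or indirectly; for example, A coupled to B can mean that A is connected directly to B, or that A is connected to B through C.
At present, when a neural network accelerator uses an integrated storage and calculation device for calculation and the calculation array is an analog calculation array built from analog devices, the analog calculation array is limited by the precision of the analog devices and by the overhead of devices such as analog-to-digital converters (ADC) and digital-to-analog converters (DAC), so low-bit calculation is usually preferred. For example, the input data and the weight data both use 16 bits and each storage calculation unit uses 2 bits, that is, each storage calculation unit stores 2 bits of data, so a 16-bit weight data needs 8 storage calculation units for storage, which can be understood as 8 columns of storage calculation units representing one column of weight data. During neural network calculation, the 16-bit input data is represented as a 0/1 voltage sequence of length 16, and in each clock cycle, starting from the lowest bit, 1 bit of the input data is input in parallel for calculation. That is, the storage calculation units calculate once per clock cycle, each time calculating the product of 1 bit of input data and 2 bits of weight data, and 16 clock cycles are needed to complete the calculation of 16-bit input data and 16-bit weight data. After the storage calculation units complete one calculation in a clock cycle, each column of storage calculation units obtains one sum of products (the sum of the products obtained after the same single bit of the multiple input data is input in parallel). After the 16 clock cycles, each column of storage calculation units outputs the total of the 16 sums of products obtained in the 16 calculations. The 8 totals output by 8 consecutive columns of storage calculation units are merged by shift-and-add to obtain the sum of the products of each column of weight data and the multiple input data, which can be understood as I1 in FIG. 1.
When the calculation array is a digital calculation array built from digital devices, the digital calculation array also usually tends to perform single-bit/low-bit calculation, so multi-bit calculation has to be realized through multiple single-bit/low-bit calculations. For example, the input data and the weight data both use 4 bits and the storage calculation unit is a single-bit multiplier, that is, the storage calculation unit stores 1 bit of data, so a 4-bit weight data needs 4 storage calculation units for storage, which can be understood as 4 columns of storage calculation units representing one column of weight data. During neural network calculation, the input data is input bit by bit into the storage calculation units in the same row. Each time, a single bit of the input data is multiplied by all bits of the weight data, that is, by each of the 4 storage calculation units that hold one weight data. Each storage calculation unit calculates the product of 1 bit of input data and 1 bit of weight data, the product result is a 4-bit data (the product of the single input bit and the 4 storage calculation units), and this product result is output to the peripheral accumulation circuit. After each calculation, the peripheral accumulation circuit adds, within the same column of weight data, the product results obtained after the same single bit of the multiple input data is input in parallel, which yields 4 multiply-accumulate results corresponding to the 4 bits of the multiple input data. Finally, the peripheral accumulation circuit performs the corresponding shift-and-sum on the 4 multiply-accumulate results to obtain the sum of the products of one column of weight data and the multiple input data.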
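The fixed-width bit-serial scheme described for the digital calculation array can be sketched as follows. This is an illustrative software model of the prior-art behaviour, not of any specific circuit: the function name `bit_serial_dot` and the input and weight values are hypothetical, and inputs are fed starting from the lowest bit purely for convenience.

```python
def bit_serial_dot(inputs, weights, width=4):
    """Multiply-accumulate 4-bit inputs with 4-bit weights, one input bit per pass.

    Returns (sum of products, number of array passes). The number of passes
    always equals the input bit width, whatever the input values are.
    """
    accumulated = []                       # one multiply-accumulate result per input bit
    for n in range(width):
        acc = 0
        for x, w in zip(inputs, weights):
            x_bit = (x >> n) & 1
            # Each of the 4 cells multiplies the input bit by one weight bit.
            cell_products = [x_bit * ((w >> m) & 1) for m in range(width)]
            # The 4 cell outputs together form a 4-bit product result.
            acc += sum(p << m for m, p in enumerate(cell_products))
        accumulated.append(acc)
    # Peripheral circuit: shift each result by its input-bit weight and sum.
    total = sum(acc << n for n, acc in enumerate(accumulated))
    return total, len(accumulated)


total, passes = bit_serial_dot([3, 10, 7], [5, 9, 2])
print(total, passes)
```

The result matches the ordinary dot product 3×5 + 10×9 + 7×2, and the array is always used 4 times for 4-bit inputs.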
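It can be seen that when an integrated storage and calculation device is used for calculation, multi-bit input data usually has to be expanded according to the data bit width into multiple single-bit/low-bit input data for input and calculation. Because the bit width of the multi-bit input data is fixed, the number of expanded calculations is fixed: whether the value of the input data is large or small, the number of expanded calculations is the same. For example, if the 8-bit input data 00001010 is expanded into multiple single-bit input data for multiplication, 8 expanded calculations are needed, which multiply the single-bit input data 0, 0, 0, 0, 1, 0, 1 and 0 respectively. Because the result of multiplying at a bit position where the single-bit input data is 0 is 0, the calculations on the single-bit input data 0 among these 8 calculations can be regarded as invalid. In practice, most multi-bit data have small values and do not need many expanded calculations, so expanding multi-bit input data according to the data bit width into multiple single-bit/low-bit input data for input and calculation introduces redundant calculation and high overhead. In addition, when an integrated storage and calculation device is used as described above, the bit width of the weight data is also fixed; that is, whether the value of the weight data is large or small, the number of storage calculation units needed to deploy it on the calculation array is the same, which leads to low calculation efficiency.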
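The redundancy in the 00001010 example can be counted directly. The helper below is a hypothetical illustration for a single input data; it simply counts how many of the fixed-width passes would multiply by a 0 bit.

```python
def count_passes(x, width):
    """Return (total passes, passes on a 1 bit, passes on a 0 bit) for one input."""
    bits = [(x >> n) & 1 for n in range(width)]
    useful = sum(bits)
    return width, useful, width - useful


total_passes, useful_passes, redundant_passes = count_passes(0b00001010, 8)
print(total_passes, useful_passes, redundant_passes)
```

For 00001010, only 2 of the 8 expanded calculations contribute a non-zero product.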
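Therefore, this application proposes an integrated storage and calculation device, which can be understood as a chip, for example, a neural network accelerator. In the prior art, when an integrated storage and calculation device is used for neural network calculation, multi-bit input data is expanded according to the data bit width into multiple single-bit/low-bit input data for input and calculation, and the input data bit width and the weight data bit width are fixed, which leads to high calculation overhead and low calculation efficiency. To address this, when this application uses an integrated storage and calculation device for neural network calculation, the bit width calculation module performs calculation on multiple input data to obtain multiple valid data in one-to-one correspondence with the multiple input data and inputs the multiple valid data to the calculation module. The calculation module then obtains the calculation result of each column in the calculation array according to the multiple valid data and the bits of the weight data and inputs the calculation result of each column to the result processing module. Finally, the result processing module performs weighted calculation on the calculation result of each column to obtain the final result. This effectively reduces the number of expanded calculations of the calculation array, reduces calculation overhead, and improves calculation efficiency.
The integrated storage and calculation device proposed in the embodiments of this application can be applied to calculation scenarios, for example, neural network calculation. During neural network calculation, the integrated storage and calculation device performs calculation on multiple weight data of the neural network and multiple input data.
FIG. 3 shows a schematic structural diagram of an integrated storage and calculation device. The device may be a chip, illustrated in FIG. 3 as chip 300. The chip 300 includes a data processing unit (processing element, PE) 301, a data exchange module (switch) 302, an input/output module (TxRx) 303, and so on.
It can be understood that the structure illustrated in the embodiments of this application does not constitute a specific limitation on the chip 300. In other embodiments of this application, the chip 300 may include more or fewer components than shown, combine some components, split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The data processing unit 301 may include one or more data processing units, and one data processing unit includes multiple calculation engines. Some of the calculation engines are used to complete the multiply-accumulate calculation of the neural network. In the embodiments of this application, these calculation engines include the bit width calculation module 3011, the calculation module 3012, the weight bit width configuration module 3013, the control module 3014 and the result processing module 3015. The other calculation engines are used to complete neural network calculations such as activation, element-wise multiplication, element-wise addition and division.
The bit width calculation module 3011 can be used to calculate the valid data of the input data, for example, by performing a logical OR on multiple input data to obtain a mask value, determining the multiple valid data of the multiple input data according to the mask value, and inputting the calculated multiple valid data to the calculation module.
The calculation module 3012 includes a calculation array and an accumulation circuit. The calculation array includes multiple storage calculation units arranged in an array, and each storage calculation unit can be used to store bits of the weight data, for example, 1 bit, 2 bits or 4 bits of a multi-bit weight data. See the 8×8 calculation array shown in FIG. 4, which includes 8 columns of storage calculation units with 8 storage calculation units in each column. Taking as an example the case in which each storage calculation unit stores 1 bit of data and the weight data uses 4 bits, one 4-bit weight data needs 4 storage calculation units for storage, which can be understood as 4 columns of storage calculation units representing one column of weight data, and one column of weight data includes 8 4-bit weight data. The calculation array can be used to perform calculation on multiple valid data and multiple weight data, for example, by multiplying the same bit (single bit/low bits) of the multiple valid data by the bits of the weight data stored in each storage calculation unit to obtain multiple product results (in one calculation, the number of product results equals the number of storage calculation units in the calculation array), and inputting the multiple product results to the accumulation circuit.
The accumulation circuit can be used to accumulate the multiple product results output by the calculation array, for example, by accumulating the product results obtained by the storage calculation units in the same column to obtain the sum of products of each column of storage calculation units, and inputting the resulting sums of products to the result processing module 3015.
The weight bit width configuration module 3013 can be used to store the bit width information of multiple types of weight data. One column of weight data is one type of weight data, so the weight bit width configuration module 3013 can be understood as storing the bit width information of multiple columns of weight data. Weight data in the same column have the same bit width, and weight data in different columns may or may not have the same bit width. The bit width information includes the bit width of each type of weight data and the identifier of the starting column in the calculation array corresponding to each type of weight data, which can be understood as the bit width of each column of weight data and the identifier of its starting column in the calculation array. Taking the 8×8 calculation array shown in FIG. 4 as an example, the columns of storage calculation units from left to right are the 0th column, the 1st column, ..., and the 7th column. If, in the bit width information stored in the weight bit width configuration module 3013, the bit width of the 0th column of weight data is 4 bits and the identifier of its starting column in the calculation array is the 0th column of storage calculation units, then the 0th column of weight data, as shown in FIG. 4, occupies the 0th to the 3rd columns of storage calculation units.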
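The bit width information can be pictured as a small table that maps each type of weight data to a range of array columns. The sketch below is a hypothetical data layout, not the format used by the weight bit width configuration module 3013: the class `WidthEntry`, the helper names, and the mixed-precision entries (widths 4, 2 and 2) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WidthEntry:
    bit_width: int       # bit width of this type of weight data
    start_column: int    # identifier of its starting column in the array


def columns_for(entry, cell_bits=1):
    """Columns of storage calculation units occupied by one type of weight data."""
    n_cols = -(-entry.bit_width // cell_bits)   # ceiling division
    return list(range(entry.start_column, entry.start_column + n_cols))


def layout_is_valid(entries, array_columns, cell_bits=1):
    """Check that the entries fit in the array and do not overlap."""
    used = [c for e in entries for c in columns_for(e, cell_bits)]
    in_range = all(0 <= c < array_columns for c in used)
    return in_range and len(used) == len(set(used))


# The 4-bit weight column from FIG. 4 plus two assumed 2-bit weight columns.
config = [WidthEntry(4, 0), WidthEntry(2, 4), WidthEntry(2, 6)]
print([columns_for(e) for e in config])
print(layout_is_valid(config, array_columns=8))
```

With 1-bit storage calculation units, the 4-bit entry starting at column 0 occupies columns 0 to 3, matching the example above.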
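The control module 3014 can be used to write the multiple types of weight data stored in the memory into the multiple storage calculation units according to the bit width information in the weight bit width configuration module 3013. The control module 3014 can also be used to generate control signals to control the calculation module 3012 and the result processing module 3015. For example, when the control module 3014 determines that any bit of the mask value obtained by the bit width calculation module 3011 is valid, it generates a first control signal and a second control signal. The first control signal instructs the calculation module 3012 to multiply the same bit (single bit/low bits) of the multiple valid data by the bits of the weight data stored in each storage calculation unit and to input the resulting sums of products to the result processing module 3015. The second control signal instructs the result processing module 3015 to perform, according to the bit width information, a weighted calculation on the sums of products of the multiple columns of storage calculation units corresponding to each type of weight data in the calculation array, to obtain multiple weighted results for the Nth bits of the multiple valid data, where the lowest bit is the 0th bit and N is an integer greater than or equal to 0. When the control module 3014 determines that the bit width of the mask value is equal to the bit width of the input data, it can also generate a third control signal, which instructs the result processing module 3015 to perform a weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the multiple weighted results of each bit of the multiple valid data, to obtain the weighted result of each type of weight data.
The result processing module 3015 can be used to perform the corresponding action according to a control signal after receiving it from the control module 3014. For example, on receiving the second control signal, it performs, according to the bit width information, a weighted calculation on the sums of products of the multiple columns of storage calculation units corresponding to each type of weight data in the calculation array, to obtain multiple weighted results for the Nth bits of the multiple valid data. On receiving the third control signal, it performs a weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the multiple weighted results of each bit of the multiple valid data, to obtain the weighted result of each type of weight data.
The data exchange module 302 can be used to exchange data between the units inside the chip, for example, between the input/output module 303 and the multiple data processing units 301.
The input/output module 303 can be used to receive input data and weight data, and can also be used to output the final result obtained in the data processing unit 301. For example, the input/output module 303 can interact with a memory outside the chip (which stores the input data and the weight data), receive the input data and the weight data, and input them to the data processing unit 301 through the data exchange module 302. It can also output the final result obtained in the data processing unit 301 to the memory outside the chip or to a cache inside the chip (not shown in FIG. 3), which is not limited in this application.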
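Using the integrated storage and calculation device provided above, the calculation method proposed in this application for the device is described below with reference to the accompanying drawings. Taking the case in which the device is a chip as an example, the description covers how, during neural network calculation on the chip, the valid data of each of the multiple input data is calculated and the multiple valid data and the multiple weight data are then calculated.
As shown in FIG. 5, an embodiment of this application provides a calculation method applied to an integrated storage and calculation device. Taking the case in which the device is the chip 300 as an example, the chip 300 includes the bit width calculation module 3011, the calculation module 3012 and the result processing module 3015. The calculation module includes a calculation array, the calculation array includes multiple storage calculation units, and each of the multiple storage calculation units is used to store bits of the weight data; see the description of the calculation array shown in FIG. 4. The method includes:
Step 501: Perform calculation on multiple input data to obtain multiple valid data.
When the input data is expanded for multiplication, the result of multiplying at a bit position where the input data is 0 is 0, which can be understood as invalid. Multiplication at a bit position where the input data is 1 can be understood as valid, so the valid data of an input data can be understood as the data formed from the valid bits of that input data (the bits that are 1). The multiple input data are in one-to-one correspondence with the multiple valid data, a first input data among the multiple input data corresponds to a first valid data among the multiple valid data, and the bit width of the first input data is greater than the bit width of the first valid data. The first input data can be any one of the multiple input data. Because performing neural network calculation on an input data gives the same result as performing it on the valid data of that input data, the calculation result remains accurate. Moreover, because the bit width of the first input data is greater than the bit width of the first valid data, the number of expanded multiplications for the valid data is smaller than that for the input data, which effectively reduces the number of calculations performed by the calculation module and reduces overhead.
Specifically, in step 501 the bit width calculation module 3011 performs calculation on the multiple input data to obtain the multiple valid data and inputs the multiple valid data to the calculation module 3012.
For example, the bit width calculation module 3011 can obtain multiple input data from the input/output module 303, calculate the valid data of each of the multiple input data, and input the calculated multiple valid data to the calculation module 3012 for calculation.
In some optional embodiments, step 501 includes: performing mask calculation on the multiple input data to obtain a mask value, and determining the multiple valid data according to the valid bits of the mask value. Specifically, the bit width calculation module 3011 performs the mask calculation on the multiple input data to obtain the mask value and determines the multiple valid data according to the valid bits of the mask value.
The valid data of the multiple input data must be determined from the multiple input data together, and one way to calculate it is to perform a mask calculation on the multiple input data. Taking the mask calculation as a logical OR, the logical OR is performed on the multiple input data bit by bit, that is, on the same bit of the multiple input data in order from the highest bit to the lowest bit, to obtain one mask value. The valid data of each of the multiple input data can then be determined according to the valid bits of the mask value (the bits that are 1).
For example, as shown in FIG. 6, take four 8-bit input data: 00001101, 00010100, 00001001 and 00000001. The logical OR is performed on the same bit of the four 8-bit input data in order from the highest bit to the lowest bit. The highest bits (the 7th bits) of the four input data are all 0, so the logical OR result is 0; the lowest bits (the 0th bits) are 1, 0, 1 and 1, so the logical OR result is 1. After the logical OR has been performed bit by bit, the mask value is 00011101. The valid bits of this mask value are the 4th, 3rd, 2nd and 0th bits, and extracting the values at the 4th, 3rd, 2nd and 0th bits of each input data gives the valid data of that input data. The valid data of the four 8-bit input data are therefore 0111, 1010, 0101 and 0001.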
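The mask calculation and the extraction of valid data in the FIG. 6 example can be reproduced with a few lines of code. This is a minimal sketch that assumes the logical OR form of the mask calculation; the function names are hypothetical.

```python
def mask_value(inputs):
    """Logical OR of all input data, bit by bit."""
    mask = 0
    for x in inputs:
        mask |= x
    return mask


def valid_positions(mask, width):
    """Positions of the valid (1) bits of the mask, highest bit first."""
    return [p for p in range(width - 1, -1, -1) if (mask >> p) & 1]


def valid_data(x, positions):
    """Keep only the bits of x at the valid positions, as a bit string."""
    return "".join(str((x >> p) & 1) for p in positions)


inputs = [0b00001101, 0b00010100, 0b00001001, 0b00000001]
mask = mask_value(inputs)
positions = valid_positions(mask, width=8)
valid = [valid_data(x, positions) for x in inputs]
print(format(mask, "08b"), positions, valid)
```

The 8-bit inputs are reduced to 4-bit valid data, so the calculation array is used 4 times instead of 8 for this group of inputs.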
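After obtaining the valid data of each of the multiple input data, the bit-width calculation module 3011 inputs the multiple valid data bit by bit to the computing module 3012, so that the computing module 3012 computes the multiple valid data, bit by bit, with the weight-data bits stored in each storage-compute unit and obtains the calculation result of each column of the computing array. The result that the computing module 3012 obtains from the multiple valid data is the same as the result it would obtain from the multiple input data. For example, taking the valid data 0111, 1010, 0101, and 0001 of FIG. 6, the valid data are input to the computing module 3012 in parallel, one bit position at a time, from the most significant bit to the least significant bit. The most significant bits 0, 1, 0, and 0 of the valid data are input in parallel first, followed by the remaining bit positions in turn, so that the computing module 3012 computes the multiple valid data bit by bit.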
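In some embodiments, the bit-width calculation module 3011 may instead determine the valid bits of the multiple input data bit by bit (that is, compute the mask value of the input data one bit position at a time) and, whenever a bit position is determined to be valid, input that bit of the multiple input data to the computing module 3012 for computation. For example, with the four input data 00001101, 00010100, 00001001, and 00000001, the bit-width calculation module 3011 examines the bit positions one at a time. On reaching bit 4 it determines that bit 4 is valid and inputs bit 4 of the four input data to the computing module 3012, and so on. Bit positions determined to be invalid are not input to the computing module 3012.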
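It can be understood that the bit-width calculation module 3011 obtains input data from the input/output module 303 many times, multiple input data each time. Each time, it calculates the valid data of the multiple input data just obtained and inputs the resulting valid data to the computing module 3012. The bit width of the valid data depends on the multiple input data obtained each time, so the bit widths of the valid data calculated on different occasions may or may not be the same. The bit-width calculation module 3011 can therefore calculate the valid data of the multiple input data dynamically.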
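In some optional embodiments, the mask calculation may be performed in other ways, for example by determining the maximum of the multiple input data and using it to decide directly whether the high-order bits of the mask are zero. This application imposes no limitation in this respect. In addition, when there is only one input data, the mask value is that input data itself, and the bit-width calculation module 3011 can determine the valid data directly from whether each bit of the input data is 1.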
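The maximum-based alternative is only named above, not specified. One possible reading, sketched below in Python as an assumption, is that the bit length of the largest input bounds the highest bit that any input can set, so every higher bit of the mask is known to be zero. The function name `mask_from_max` is hypothetical, and the sketch assumes unsigned inputs.

```python
def mask_from_max(inputs):
    """Conservative mask: all bits up to the top bit of the largest input.

    Assumption: inputs are unsigned integers. Bits above the top bit of
    the maximum are zero in every input, so they can be skipped.
    """
    highest = max(inputs).bit_length()
    return (1 << highest) - 1


inputs = [0b00001101, 0b00010100, 0b00001001, 0b00000001]
coarse_mask = mask_from_max(inputs)

# The OR-based mask of FIG. 6, for comparison.
or_mask = 0
for x in inputs:
    or_mask |= x

print(format(coarse_mask, "08b"), format(or_mask, "08b"))
```

Under this reading the mask only trims leading zero positions: for the FIG. 6 inputs it keeps bit 1 as well, which the OR-based mask discards, so it saves fewer passes but is cheaper to derive.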
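In some optional embodiments, depending on the device and circuit implementation, the bit-width calculation module 3011 may expand the calculated valid data into other low-bit slices before inputting them to the computing module 3012, for example into 2-bit slices. This application imposes no limitation in this respect.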
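Step 502: Obtain the calculation result of each column of the computing array according to the multiple valid data and the bits of the weight data.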
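The calculation result of one column is the sum of the products of the same bit position of the multiple valid data with one column of storage-compute units. Specifically, in step 502 the computing module 3012 calculates the result of each column of the computing array from the multiple valid data and the weight-data bits stored in each storage-compute unit, and inputs the result of each column to the result processing module 3015. The computing module 3012 includes a computing array, and the computing array includes multiple storage-compute units. One weight data is expanded into multiple single-bit or low-bit pieces that are stored in multiple storage-compute units. The weight-data bits stored in a storage-compute unit can be understood as part of the bits of one weight data, and that part may be a single bit or several bits. The computing module 3012 multiplies the multiple valid data input by the bit-width calculation module 3011 by the weight-data bits stored in each storage-compute unit. Specifically, each valid data is input to a different row of the computing array, so that each valid data corresponds to one row of storage-compute units and is multiplied by the weight-data bits stored in each storage-compute unit of that row. When the computation ends, each column of the computing array has one calculation result, which is the sum of the products of the multiple valid data with that column. The computing module 3012 inputs the result of each column to the result processing module 3015.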
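In some optional embodiments, step 502 includes: when the computing array receives the N-th bits of the multiple valid data, the computing array calculates the products of those N-th bits with the bits of the weight data.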
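Here N is an integer greater than or equal to 0. The statement above that each valid data is multiplied by the weight-data bits stored in each corresponding storage-compute unit means that several computations are performed. In each computation, a single bit of each valid data is multiplied by the weight-data bits stored in each corresponding storage-compute unit, and the number of computations is determined by the bit width of the valid data. For example, 4-bit valid data requires four computations, each on one bit of the valid data. In each computation, the bits at the same position of the multiple valid data are computed in parallel, that is, the N-th bits of the multiple valid data are computed in parallel. In other words, the computing array performs one computation each time it receives the N-th bits of the multiple valid data.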
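For example, FIG. 7 shows a computing module 700 that includes a 4×8 computing array 701. Assume that the valid data of the input data and the weight data are both 4 bits wide, and that each storage-compute unit is 1 bit wide, that is, it stores 1 bit of weight data and multiplies it by 1 bit of input data. Let the valid data of the multiple input data be a1b1c1d1, a2b2c2d2, a3b3c3d3, and a4b4c4d4, and let one column of weight data in the computing array 701 be A1B1C1D1, A2B2C2D2, A3B3C3D3, and A4B4C4D4. Bit 3 (the most significant bit) of the valid data is a1, a2, a3, and a4 respectively. When the computing array 701 receives a1, a2, a3, and a4, it feeds them to different rows, specifically to every storage-compute unit in the corresponding row. Taking a1 as an example, a1 is multiplied by the weight-data bit stored in each storage-compute unit of its row, giving the products a1×A1, a1×B1, a1×C1, a1×D1, and so on. Likewise, a2, a3, and a4 are multiplied to give further products. When the computation on a1, a2, a3, and a4 is finished, one computation of the computing array 701 is complete. It can be understood that 4-bit valid data requires four such computations before the whole valid data has been processed: after a1, a2, a3, and a4, three further computations are performed, on b1, b2, b3, and b4, on c1, c2, c3, and c4, and on d1, d2, d3, and d4.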
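In some optional embodiments, the computing module further includes an accumulation circuit. The accumulation circuit adds the products calculated by the storage-compute units in the same column of the computing array, obtaining the sum of products of each column of storage-compute units.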
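After each computation of the computing array, the accumulation circuit accumulates the results produced by the array. Specifically, it accumulates the products calculated by the storage-compute units in the same column, obtaining the calculation result of each column, that is, the sum of products of each column of storage-compute units, and inputs these sums to the result processing module 3015.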
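For example, as shown in FIG. 7, after the computing array 701 finishes computing a1, a2, a3, and a4, the accumulation circuit 702 accumulates the products calculated by the storage-compute units in the same column of the computing array 701. Accumulating the products of storage-compute unit column 0 gives S3 = a1×A1 + a2×A2 + a3×A3 + a4×A4. Accumulating the products of column 1 gives S2 = a1×B1 + a2×B2 + a3×B3 + a4×B4. Accumulating the products of column 2 gives S1 = a1×C1 + a2×C2 + a3×C3 + a4×C4. Accumulating the products of column 3 gives S0 = a1×D1 + a2×D2 + a3×D3 + a4×D4, and so on for the remaining columns. The sums S3, S2, S1, S0, and so on are input to the result processing module 3015. It can be understood that each time the computing array 701 receives the N-th bits of the multiple valid data and completes one computation, the accumulation circuit 702 calculates the sum of products of each column of storage-compute units and inputs these sums to the result processing module 3015. For 4-bit valid data, the accumulation circuit 702 therefore inputs calculation results to the result processing module 3015 four times.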
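One pass of the array and accumulation circuit, as described for FIG. 7, can be sketched in Python as follows. The bit values are made up for illustration, since the application uses the symbols a1 to a4 and A1 to D4 rather than numbers, and the function name `column_sums` is hypothetical.

```python
def column_sums(input_bits, weight_bits):
    """Sum of products per column for one bit position of the valid data.

    input_bits[r]     : the N-th bit of the valid data fed to row r
    weight_bits[r][c] : the weight bit stored in the unit at row r, column c
    """
    n_cols = len(weight_bits[0])
    return [
        sum(bit * row[c] for bit, row in zip(input_bits, weight_bits))
        for c in range(n_cols)
    ]


# Assumed example values: one 4-bit weight per row, most significant bit
# in column 0 (the A bits), least significant bit in column 3 (the D bits).
weight_bits = [
    [1, 0, 1, 1],  # A1 B1 C1 D1
    [0, 1, 1, 0],  # A2 B2 C2 D2
    [1, 1, 0, 1],  # A3 B3 C3 D3
    [0, 0, 1, 1],  # A4 B4 C4 D4
]
a_bits = [1, 0, 1, 1]  # a1, a2, a3, a4

s3, s2, s1, s0 = column_sums(a_bits, weight_bits)
print(s3, s2, s1, s0)
```

Each call corresponds to one computation of the array, so 4-bit valid data would call `column_sums` four times, once per bit position.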
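In some optional embodiments, the compute-in-memory apparatus further includes a weight bit-width configuration module, which stores the bit-width information of multiple types of weight data.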
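Here the weight data includes multiple types of weight data, and the weight bit-width configuration module may be the weight bit-width configuration module 3013 in FIG. 3. The bit-width information includes the bit width of each type of weight data and the identifier of the start column of each type of weight data in the computing array. One type of weight data can be understood as one column of weight data. For example, in the 4×8 computing array 701 of FIG. 7, storage-compute unit columns 0 to 3 represent one column of weight data (one type of weight data). The computing array 701 may contain multiple columns of weight data (multiple types of weight data), and at least two of these types have different bit widths. In the computing array 701, storage-compute unit columns 0 to 3 represent weight data column 0, whose bit width is 4 bits and whose start-column identifier is storage-compute unit column 0.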
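For example, the bit-width information is shown in Table 1 below, which corresponds to the computing array 701 of FIG. 7. Table 1 contains three types of weight data: weight data column 0, weight data column 1, and weight data column 2. The first type (weight data column 0) has a bit width of 4 bits and its start-column identifier is storage-compute unit column 0, so storage-compute unit columns 0 to 3 represent the first type. The second type (weight data column 1) has a bit width of 2 bits and its start-column identifier is storage-compute unit column 4, so storage-compute unit columns 4 and 5 represent the second type. The third type (weight data column 2) has a bit width of 2 bits and its start-column identifier is storage-compute unit column 6, so storage-compute unit columns 6 and 7 represent the third type.
Table 1
Weight data identifier    Bit width    Start-column identifier
Weight data column 0      4 bits       Storage-compute unit column 0
Weight data column 1      2 bits       Storage-compute unit column 4
Weight data column 2      2 bits       Storage-compute unit column 6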
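As can be seen, the weight bit-width configuration module 3013 of this application can store the bit-width information of multiple types of weight data, at least two of which have different bit widths. A single computing array can therefore hold weight data of several bit widths and supports mixed-precision computation on weight data, which effectively improves the computing efficiency of the compute-in-memory apparatus.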
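In some optional embodiments, the compute-in-memory apparatus further includes a control module, which writes the multiple types of weight data into the multiple storage-compute units according to the bit-width information.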
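The control module may be the control module 3014 in FIG. 3. The control module 3014 can write the multiple types of weight data stored in the memory into the multiple storage-compute units according to the bit-width information in the weight bit-width configuration module 3013.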
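For example, taking the bit-width information of Table 1 and the computing array 701 of FIG. 7, the control module 3014 uses the bit width and start-column identifier of weight data column 0 in Table 1 to write each bit of weight data column 0 stored in the memory (A1B1C1D1, A2B2C2D2, A3B3C3D3, and A4B4C4D4) into the corresponding storage-compute unit in storage-compute unit columns 0 to 3. It continues in the same way until all the types of weight data in the memory have been written into the storage-compute units of the computing array 701 according to the bit-width information of Table 1.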
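The write procedure driven by Table 1 can be sketched in Python as below. The weight values are invented for illustration, and `WeightConfig` and `write_weights` are hypothetical names. Placing the most significant bit in the start column follows FIG. 7, where column 0 holds the A bits.

```python
from dataclasses import dataclass


@dataclass
class WeightConfig:
    bit_width: int  # bit width of this type of weight data
    start_col: int  # start column in the computing array


def write_weights(n_rows, n_cols, configs, weights):
    """Spread each weight over `bit_width` one-bit cells, MSB first."""
    array = [[0] * n_cols for _ in range(n_rows)]
    for cfg, column_values in zip(configs, weights):
        for r, w in enumerate(column_values):
            for k in range(cfg.bit_width):
                shift = cfg.bit_width - 1 - k
                array[r][cfg.start_col + k] = (w >> shift) & 1
    return array


# Bit-width information of Table 1.
table1 = [WeightConfig(4, 0), WeightConfig(2, 4), WeightConfig(2, 6)]

# Assumed example weights: one value per row for each type of weight data.
weights = [
    [0b1011, 0b0110, 0b1101, 0b0011],  # weight data column 0 (4 bits)
    [0b10, 0b01, 0b11, 0b00],          # weight data column 1 (2 bits)
    [0b01, 0b11, 0b00, 0b10],          # weight data column 2 (2 bits)
]

array = write_weights(4, 8, table1, weights)
for row in array:
    print(row)
```

A 4×8 array thus holds one 4-bit and two 2-bit weight types side by side, which is the mixed-precision layout described above.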
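In some optional embodiments, the control module determines the valid bits of the mask value bit by bit and, when it determines that any bit of the mask value is valid, generates a first control signal and a second control signal.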
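The control module 3014 can generate control signals from the mask value calculated by the bit-width calculation module 3011 in order to control the computing module 3012 and the result processing module 3015. Specifically, the bit-width calculation module 3011 inputs the mask value to the control module 3014 bit by bit, and the control module 3014 determines, bit by bit, whether each bit of the mask value is valid (that is, whether it is 1). When it determines that a bit of the mask value is valid, the control module 3014 generates the first control signal and the second control signal. It can be understood that the control module 3014 generates the first and second control signals as many times as there are valid bits in the mask value.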
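The first control signal instructs the computing module 3012 to calculate the sum of products of each column of storage-compute units in the computing array. It can be understood as instructing the computing module 3012 to perform once the computation shown in FIG. 7 on the N-th bits of the multiple valid data and to obtain the sum of products of each column of storage-compute units.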
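The second control signal instructs the result processing module 3015 to perform, according to the bit-width information, a weighted calculation on the sums of products of the multiple columns of storage-compute units that correspond to each type of weight data in the computing array, obtaining multiple weighted results for the N-th bits of the multiple valid data. Because the bit-width information shows which columns of storage-compute units correspond to one type of weight data (one column of weight data), the result processing module 3015 can use it to determine the columns of storage-compute units that belong to each type. The result processing module 3015 weights the sums of products of those columns according to the bit weights of the weight-data bits. For example, for the column of storage-compute units that holds the least significant bit (bit 0) of the weight data, the sum of products is multiplied by 2^0 before accumulation; for the column that holds bit 2 of the weight data, the sum of products is multiplied by 2^2 before accumulation. In a circuit, multiplication by a power of 2 can be implemented by shifting. It can be understood that the result processing module 3015 obtains, in one weighted calculation, as many weighted results as there are types of weight data (columns of weight data) in the computing array. When the computing array contains multiple types of weight data, the result processing module 3015, on receiving the second control signal, obtains multiple weighted results for the N-th bits of the multiple valid data, each weighted result corresponding to one type of weight data.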
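For example, the control module 800 shown in FIG. 8 includes a first comparator. The first comparator compares the bit input to the control module 800 with 1. If they are equal, the first control signal and the second control signal are generated; otherwise they are not generated.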
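Take the mask value 00011101 of FIG. 6 as an example. The bit-width calculation module 3011 inputs the mask value to the control module 800 bit by bit, from the most significant bit to the least significant bit. First, the bit-width calculation module 3011 inputs the most significant bit (bit 7) of the mask value, which is 0. The first comparator in the control module 800 finds that 0 differs from 1, so this bit is not a valid bit and the first and second control signals are not generated. The process continues in the same way. When the bit-width calculation module 3011 inputs bit 4 of the mask value, which is 1, the first comparator finds that 1 equals 1, so this bit is a valid bit and the first and second control signals are generated.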
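The first control signal generated by the control module 800 is input to the computing module 3012 and instructs it to perform one computation on the N-th bits of the multiple valid data. Accordingly, the first control signal generated for bit 4 of the mask value instructs the computing module 3012 to perform one computation on the most significant bits (bit 3) of the multiple valid data, namely 0, 1, 0, and 0, obtaining the sum of products of each column of storage-compute units in the computing array, that is, S3, S2, S1, S0, and so on in FIG. 7.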
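The second control signal generated by the control module 800 is input to the result processing module 3015 and instructs it to perform a weighted calculation on the sums of products produced by one computation of the computing module 3012. Taking the bit-width information of Table 1 and the computing module 700 of FIG. 7 as an example, the result processing module 3015 determines from Table 1 (weight data column 0 has a bit width of 4 bits and a start-column identifier of storage-compute unit column 0) that storage-compute unit columns 0 to 3 of the computing array 701 represent the first type of weight data (weight data column 0). It then weights the sums of products S3, S2, S1, and S0 of storage-compute unit columns 0 to 3 by the bit weights of the weight-data bits, obtaining one weighted result sum0 that corresponds to one type of weight data (weight data column 0): sum0 = S3×2^3 + S2×2^2 + S1×2^1 + S0×2^0. The other types are handled in the same way. It can be understood that, for the computing module 700, three weighted results are obtained, corresponding to weight data column 0, weight data column 1, and weight data column 2.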
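The weighting performed on the second control signal can be sketched in Python as follows, using shifts for the powers of 2. The column sums are assumed values and `weighted_results` is a hypothetical name. Each type of weight data is described by a (bit width, start column) pair taken from Table 1.

```python
def weighted_results(col_sums, configs):
    """One weighted result per type of weight data.

    configs is a list of (bit_width, start_col) pairs. The start column
    holds the most significant weight bit, as in FIG. 7.
    """
    results = []
    for bit_width, start_col in configs:
        acc = 0
        for k in range(bit_width):
            shift = bit_width - 1 - k  # bit weight 2**shift
            acc += col_sums[start_col + k] << shift
        results.append(acc)
    return results


table1 = [(4, 0), (2, 4), (2, 6)]

# Assumed sums of products for the eight columns after one array pass.
col_sums = [2, 1, 2, 3, 1, 2, 0, 3]

sums = weighted_results(col_sums, table1)
print(sums)
```

For weight data column 0 this evaluates S3×2^3 + S2×2^2 + S1×2^1 + S0×2^0 with S3 = 2, S2 = 1, S1 = 2, and S0 = 3.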
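In some optional embodiments, the control module generates a third control signal when it determines that the bit width of the mask value equals the bit width of the input data.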
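As the description of FIG. 6 shows, the bit width of the input data equals the bit width of the mask value. Because the mask value is input to the control module 3014 bit by bit, the control module 3014 knows that the whole mask value has been input once the number of mask bits received equals the bit width of the input data, and it then generates the third control signal. In other words, the control module 3014 outputs the third control signal after it has finished outputting the first and second control signals.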
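The third control signal instructs the result processing module 3015 to perform a weighted calculation based on the bit weights of the valid bits of the mask value and the multiple weighted results for each bit of the multiple valid data, obtaining the final result, which includes the weighted result of each type of weight data.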
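For example, as shown in FIG. 8, the control module 800 further includes a counter and a second comparator. Each time one bit of the mask value is input, the counter is incremented by 1, recording the bit width of the mask value received so far. The second comparator compares the mask bit width recorded in the counter with the bit width of the input data. If they are equal, the third control signal is generated; otherwise it is not generated.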
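Take the 8-bit input data and the mask value 00011101 of FIG. 6 as an example. The valid bits of the mask value are bit 4, bit 3, bit 2, and bit 0, and their bit weights are 2^4, 2^3, 2^2, and 2^0. The bit-width calculation module 3011 inputs the most significant bit (bit 7) of the mask value, which is 0, to the control module 800. The first comparator finds that 0 differs from 1, so this bit is not a valid bit and the first and second control signals are not generated. At the same time the counter records a mask bit width of 1; the second comparator finds that the recorded width (1) differs from the bit width of the input data (8), so the third control signal is not generated. The process continues in the same way. When the bit-width calculation module 3011 inputs the least significant bit (bit 0) of the mask value, which is 1, the first comparator finds that 1 equals 1, so this bit is a valid bit and the first and second control signals are generated. At the same time the counter records a mask bit width of 8; the second comparator finds that the recorded width (8) equals the bit width of the input data (8), so the third control signal is generated.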
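The behaviour of the first comparator, counter, and second comparator of FIG. 8 can be modelled as the following Python sketch. It is a behavioural approximation under stated assumptions, not the circuit itself: the signal labels "first", "second", and "third" and the function name `control_signals` are hypothetical.

```python
def control_signals(mask, input_width):
    """Return (bit position, signals) for each mask bit, MSB first."""
    counter = 0
    events = []
    for n in range(input_width - 1, -1, -1):
        bit = (mask >> n) & 1
        signals = []
        if bit == 1:                 # first comparator: is the bit valid?
            signals += ["first", "second"]
        counter += 1                 # one more mask bit received
        if counter == input_width:   # second comparator: mask complete?
            signals.append("third")
        events.append((n, signals))
    return events


events = control_signals(0b00011101, input_width=8)
for position, signals in events:
    print(position, signals)
```

For the mask 00011101 the first and second signals are raised at bits 4, 3, 2, and 0, and the third signal is raised once, together with the last mask bit.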
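The third control signal generated by the control module 800 is input to the result processing module 3015. By this time the result processing module 3015 has received the second control signal four times, that is, it has performed four weighted calculations on the sums of products from the computing module 3012, each yielding multiple weighted results (for example, the first weighted calculation yields sum0 and other weighted results). The third control signal instructs the result processing module 3015 to perform a further weighted calculation, based on the bit weights of the valid bits of the mask value and the weighted results of the earlier weighted calculations, to obtain the final result. Suppose that, for weight data column 0, the first weighted calculation yields sum0, the second yields sum1, the third yields sum2, and the fourth yields sum3. The weighted results sum0, sum1, sum2, and sum3 correspond to the bit weights 2^4, 2^3, 2^2, and 2^0 of the valid bits of the mask value respectively: the bit weight associated with a weighted result is the bit weight of the mask bit whose second control signal produced that weighted result. The result processing module 3015 performs the further weighted calculation and obtains the final result for weight data column 0: out0 = sum0×2^4 + sum1×2^3 + sum2×2^2 + sum3×2^0. The other types are handled in the same way, so the final result includes the weighted result of each type of weight data. It can be understood that, for the computing module 700, three final results are obtained, corresponding to weight data column 0, weight data column 1, and weight data column 2.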
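The final weighting triggered by the third control signal can be sketched in Python as below. The values of sum0 to sum3 are assumed for illustration and `final_result` is a hypothetical name. The key point is that each weighted result is scaled by the bit weight of the mask bit that produced it, not by its position in the compacted valid data.

```python
def final_result(weighted, mask, input_width):
    """Combine per-pass weighted results using the mask's bit weights."""
    positions = [n for n in range(input_width - 1, -1, -1) if (mask >> n) & 1]
    if len(positions) != len(weighted):
        raise ValueError("one weighted result is expected per valid mask bit")
    # Multiplication by 2**n is done with a left shift.
    return sum(s << n for s, n in zip(weighted, positions))


# Assumed values of sum0, sum1, sum2, sum3 for weight data column 0.
sums = [3, 1, 2, 5]
out0 = final_result(sums, mask=0b00011101, input_width=8)
print(out0)
```

Here out0 = 3×2^4 + 1×2^3 + 2×2^2 + 5×2^0; bit 1 of the mask is 0, so no term carries the weight 2^1.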
Step 503: Perform a weighted calculation on the calculation result of each column to obtain the final result.
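Specifically, in step 503 the result processing module 3015 performs the weighted calculation on the calculation result of each column to obtain the final result. The calculation result of each column is the sum of products of that column of storage-compute units, that is, the results S3, S2, S1, S0, and so on from step 502. The result processing module 3015 first weights these by the bit weights of the weight-data bits to obtain multiple sum values, and then weights the sum values by the bit weights of the valid bits of the mask value to obtain multiple out values, which form the final result. See the description of the control module 3014 (control module 800) above; details are not repeated here.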
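In some optional embodiments, the input data and the weight data include unsigned numbers and signed numbers. The computation for unsigned numbers follows the examples in the embodiments of this application. Signed numbers can be handled by methods such as two's-complement computation and differential computation. This application imposes no limitation in this respect.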
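The computing method provided in the embodiments of this application can thus be applied to a compute-in-memory apparatus such as a chip. During neural-network computation, the bit-width calculation module performs a calculation on multiple input data and inputs the resulting valid data to the computing module. The computing module then obtains the calculation result of each column of the computing array from the multiple valid data and the weight-data bits stored in each storage-compute unit, and inputs the result of each column to the result processing module. Finally, the result processing module performs a weighted calculation on the result of each column to obtain the final result. In the prior art, multi-bit input data is expanded according to its full bit width into multiple single-bit or low-bit input data for input and computation, which leads to too many expanded computations and a large overhead. By contrast, the bit-width calculation module of this application dynamically calculates the valid data of the input data, so that only the valid bits of the input data are computed, which effectively reduces the number of computations of the computing module and lowers the overhead. Furthermore, the prior art cannot perform mixed-precision computation on weight data, which results in low computing efficiency. This application uses the bit-width information of multiple types of weight data stored in the weight bit-width configuration module to deploy and compute weight data of several bit widths in a single computing array, thereby supporting mixed-precision computation on weight data and effectively improving the computing efficiency of the compute-in-memory apparatus.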
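Corresponding to the computing method of FIG. 5 and based on the structure of the compute-in-memory apparatus shown in FIG. 3, FIG. 9 shows a schematic flowchart of a computing method provided in an embodiment of this application. In this example the bit-width calculation module is the bit-width calculation module 3011, the computing module is the computing module 3012, the weight bit-width configuration module is the weight bit-width configuration module 3013, the control module is the control module 3014, and the result processing module is the result processing module 3015. The multiple input data are 00011, 00101, and 00010, the computing array is a 3×3 array, each storage-compute unit stores 1 bit, and the weight bit-width configuration module 3013 stores only one type of weight data. The computing flow includes: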
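Step 1: The bit-width calculation module 3011 performs a calculation on the multiple input data, obtains the mask value of the multiple input data and the valid data of each input data, and inputs the multiple valid data to the computing module 3012.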
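The multiple input data are 00011, 00101, and 00010. A mask calculation on them (a logical OR in this example) gives the mask value 00111, from which the valid data are determined to be 011, 101, and 010. The valid data 011, 101, and 010 are input to the computing module bit by bit. See the description of step 501 above; details are not repeated here.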
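Step 2: The control module 3014 writes the multiple types of weight data into the multiple storage-compute units according to the weight bit-width configuration module 3013.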
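The weight bit-width configuration module 3013 stores only one type of weight data, with a bit width of 3 bits and a start-column identifier of storage-compute unit column 0. Taking the weight data 101, 011, and 111 as an example, the control module 3014 writes them into the multiple storage-compute units according to the bit-width information in the weight bit-width configuration module 3013; see the computing array shown in FIG. 9. See the description of the control module above; details are not repeated here.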
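Step 3: The control module 3014 generates the first control signal and the second control signal according to the valid bits of the mask value calculated by the bit-width calculation module 3011.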
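The most significant bit (bit 4) of the mask value, which is 0, is input to the control module 3014 first. The control module 3014 determines that it is not a valid bit and does not generate the first and second control signals. Next, bit 3 of the mask value, which is 0, is input; the control module 3014 determines that it is not a valid bit and does not generate the first and second control signals. Then bit 2 of the mask value, which is 1, is input; the control module 3014 determines that it is a valid bit and generates the first and second control signals. The first control signal generated for bit 2 of the mask value causes the computing module 3012 to compute the most significant bits (bit 2) of the valid data, namely 0, 1, and 0, giving S0 = 0×1 + 1×0 + 0×1 = 0, S1 = 0×0 + 1×1 + 0×1 = 1, and S2 = 0×1 + 1×1 + 0×1 = 1 for storage-compute unit columns 0, 1, and 2 respectively, which are input to the result processing module 3015. The second control signal generated for bit 2 of the mask value causes the result processing module 3015 to weight S0, S1, and S2 by the bit weights of the weight-data bits, giving sum = 0×2^2 + 1×2^1 + 1×2^0 = 3. Next, bit 1 of the mask value, which is 1, is input; the control module 3014 determines that it is a valid bit and generates the first and second control signals. The first control signal generated for bit 1 of the mask value causes the computing module 3012 to compute bit 1 of the valid data, namely 1, 0, and 1, giving S0' = 1×1 + 0×0 + 1×1 = 2, S1' = 1×0 + 0×1 + 1×1 = 1, and S2' = 1×1 + 0×1 + 1×1 = 2 for storage-compute unit columns 0, 1, and 2 respectively, which are input to the result processing module 3015. The second control signal generated for bit 1 of the mask value causes the result processing module 3015 to weight S0', S1', and S2' by the bit weights of the weight-data bits, giving sum' = 2×2^2 + 1×2^1 + 2×2^0 = 12. Finally, bit 0 of the mask value, which is 1, is input; the control module 3014 determines that it is a valid bit and generates the first and second control signals. The first control signal generated for bit 0 of the mask value causes the computing module 3012 to compute bit 0 of the valid data, namely 1, 1, and 0, giving S0'' = 1×1 + 1×0 + 0×1 = 1, S1'' = 1×0 + 1×1 + 0×1 = 1, and S2'' = 1×1 + 1×1 + 0×1 = 2 for storage-compute unit columns 0, 1, and 2 respectively, which are input to the result processing module 3015. The second control signal generated for bit 0 of the mask value causes the result processing module 3015 to weight S0'', S1'', and S2'' by the bit weights of the weight-data bits, giving sum'' = 1×2^2 + 1×2^1 + 2×2^0 = 8. See the description of the control module 3014 above; details are not repeated here.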
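Step 4: The control module 3014 generates the third control signal when it determines that the bit width of the mask value equals the bit width of the input data.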
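When the control module 3014 determines that the bit width of the mask value is 5 bits, it generates the third control signal. The third control signal causes the result processing module 3015 to weight the values sum, sum', and sum'' calculated in step 3 by the bit weights of the valid bits of the mask value, giving out = sum×2^2 + sum'×2^1 + sum''×2^0 = 44, which is the final result. See the description of the control module 3014 above; details are not repeated here.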
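Steps 1 to 4 can be checked end to end with the short Python model below. It is a software sketch of the flow of FIG. 9 for a single type of weight data, not the hardware itself, and the function name `bit_serial_dot` is hypothetical. Unlike the described apparatus, which combines sum, sum', and sum'' when the third control signal arrives, the sketch accumulates the final result pass by pass; the arithmetic is the same.

```python
def bit_serial_dot(inputs, weights, input_width, weight_width):
    """Mask-skipping bit-serial dot product; returns (result, array passes)."""
    # Step 1: OR-based mask of the inputs.
    mask = 0
    for x in inputs:
        mask |= x
    # Step 2: one weight per row, most significant weight bit in column 0.
    array = [[(w >> (weight_width - 1 - c)) & 1 for c in range(weight_width)]
             for w in weights]
    out = 0
    passes = 0
    for n in range(input_width - 1, -1, -1):
        if not (mask >> n) & 1:
            continue  # invalid mask bit: no first or second control signal
        # Step 3a: one array pass on bit n of every input (column sums).
        bits = [(x >> n) & 1 for x in inputs]
        col_sums = [sum(b * row[c] for b, row in zip(bits, array))
                    for c in range(weight_width)]
        # Step 3b: weight the column sums by the weight-bit positions.
        weighted = sum(s << (weight_width - 1 - c)
                       for c, s in enumerate(col_sums))
        # Step 4: weight by the bit weight of this mask bit.
        out += weighted << n
        passes += 1
    return out, passes


inputs = [0b00011, 0b00101, 0b00010]
weights = [0b101, 0b011, 0b111]
result, passes = bit_serial_dot(inputs, weights, input_width=5, weight_width=3)
print(result, passes)
```

The model needs three array passes rather than five and returns 44, which equals the ordinary dot product 3×5 + 5×3 + 2×7.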
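At this point the compute-in-memory apparatus has completed the computation on the multiple input data and the multiple weight data. It can be understood that steps 1 to 4 use a computing array with only one type of weight data (one column of weight data) as an example; in practice there may be multiple types. The computing method provided in the embodiments of this application dynamically calculates the valid data of the multiple input data and computes only the valid bits of the input data, which effectively reduces the number of computations of the computing module and lowers the overhead. It also supports mixed-precision computation on weight data, improving the computing efficiency of the compute-in-memory apparatus.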
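Table 2 shows the results of applying the computing method provided in the embodiments of this application to the yolov3-tiny object-detection model (on the COCO2017val dataset). The baseline, taken as 100%, is the number of bit operations and the number of array computations of the prior art with an 8-bit model (8-bit weight data). With the 8-bit model on the compute-in-memory apparatus of this application, the number of bit operations falls to 81.38% of the prior-art value and the number of array computations falls to 78.31%, with computing accuracy preserved. With a 4/8-bit mixed model (weight data of both 4 bits and 8 bits) on the apparatus of this application, the number of bit operations falls to 69.14% of the prior-art value and the number of array computations falls to 72.23%, again with computing accuracy preserved. The method provided in the embodiments of this application thus effectively reduces the number of computations, and the reduction is larger still when the weight data is computed in mixed precision, so the computing overhead is reduced and the computing efficiency is improved.
Table 2
[Table 2 appears as an image in the original publication: Figure PCTCN2022141634-appb-000001]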
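It can be understood that, to implement the functions described above, the compute-in-memory apparatus includes hardware structures and/or software modules that perform each function. A person skilled in the art should readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the embodiments of this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments of this application.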
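In the embodiments of this application, the compute-in-memory apparatus may be divided into functional modules according to the method examples above. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division into modules in the embodiments of this application is illustrative and is merely a division by logical function; other divisions are possible in an actual implementation.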
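Where integrated units are used, as shown in FIG. 10, an embodiment of this application discloses a compute-in-memory apparatus 1000, which may be the chip 300 of the embodiments above. The compute-in-memory apparatus 1000 may include a processing module, a storage module, and a communication module. The processing module may be used to control and manage the actions of the compute-in-memory apparatus 1000, for example to support the apparatus in performing the steps performed by the bit-width calculation module 3011, the computing module 3012, the weight bit-width configuration module 3013, the control module 3014, and the result processing module 3015. The storage module may be used to support the compute-in-memory apparatus 1000 in storing program code and data, for example the input data and the weight data. The communication module may be used to support communication between the compute-in-memory apparatus 1000 and other devices, for example to receive multiple input data and weight data from an external device, or to output the final result obtained by the result processing module 3015 to an external device.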
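Of course, the unit modules in the compute-in-memory apparatus 1000 include but are not limited to the processing module, the storage module, and the communication module described above.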
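The processing module may be a processor or a controller. It can implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure of this application. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a neural network processing unit (NPU), a digital signal processor (DSP), and a microprocessor. The storage module may be a memory. The communication module may specifically be a device that interacts with other external devices.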
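For example, the processing module is a processor 1001, the storage module may be a memory 1002, and the communication module may be referred to as a communication interface 1003. The compute-in-memory apparatus 1000 provided in the embodiments of this application may be the chip 300 shown in FIG. 3. The processor 1001, the memory 1002, the communication interface 1003, and so on may be connected together, for example through a bus.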
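An embodiment of this application further provides an electronic device that includes one or more processors and one or more memories. The one or more memories are coupled to the one or more processors and are used to store computer program code, which includes computer instructions. When the one or more processors execute the computer instructions, the electronic device performs the related method steps above to implement the computing method of the embodiments above.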
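An embodiment of this application further provides an electronic device that includes one or more communication interfaces and one or more processors. The communication interfaces and the processors are interconnected by lines, and the processors receive computer instructions from a memory of the electronic device through the communication interfaces and execute them, so that the electronic device performs the related method steps above to implement the computing method of the embodiments above.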
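An embodiment of this application further provides a computer-readable storage medium that stores computer program code including computer instructions. When the computer instructions run on a computer or a processor, the computer or the processor performs the computing method of the embodiments above.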
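An embodiment of this application further provides a computer program product that includes computer instructions. When the computer instructions run on a computer or a processor, the computer or the processor performs the related steps above to implement the computing method performed by the electronic device in the embodiments above.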
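The compute-in-memory apparatus, electronic device, computer storage medium, computer program product, and chip provided in these embodiments are all used to perform the corresponding method provided above. For their beneficial effects, refer to the beneficial effects of the corresponding method provided above; details are not repeated here.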
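From the description of the implementations above, a person skilled in the art can understand that, for convenience and brevity of description, only the division into the functional modules above is used as an example. In practical applications, the functions may be allocated to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above.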
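In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into modules or units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of other forms.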
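The units described as separate components may or may not be physically separate, and a component shown as a unit may be one physical unit or several physical units; that is, it may be located in one place or distributed over several different places. Some or all of the units may be selected as actually needed to achieve the purpose of the solution of this embodiment.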
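In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.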
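If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application in essence, or the part that contributes to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of this application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.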
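The foregoing is merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the protection scope of this application. The protection scope of this application shall therefore be subject to the protection scope of the claims.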

Claims (15)

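  1. A compute-in-memory apparatus, characterized in that the compute-in-memory apparatus comprises a bit-width calculation module, a computing module, and a result processing module; the computing module comprises a computing array, the computing array comprises multiple storage-compute units, and the multiple storage-compute units are configured to store weight data;
    the bit-width calculation module is configured to perform a calculation on input data to obtain valid data, the bit width of the valid data being smaller than the bit width of the input data;
    the computing module is configured to compute the valid data with the weight data, the number of computations being determined by the bit width of the valid data;
    the result processing module is configured to perform a calculation on the calculation result of the computing module to obtain a final result.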
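  2. The compute-in-memory apparatus according to claim 1, characterized in that
    the bit-width calculation module is specifically configured to perform a mask calculation on the input data to obtain a mask value, and to determine the valid data according to the valid bits of the mask value;
    and to input the valid data bit by bit to the computing module, so that the computing module computes the valid data bit by bit.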
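  3. The compute-in-memory apparatus according to claim 2, characterized in that, when the computing array receives the N-th bits respectively corresponding to the valid data, where N is an integer greater than or equal to 0,
    the computing array is configured to calculate the products of the N-th bits respectively corresponding to the valid data and the bits of the weight data;
    the computing module further comprises an accumulation circuit;
    the accumulation circuit is configured to add the products calculated by the storage-compute units in the same column of the computing array, to obtain the sum of the products calculated by each column of storage-compute units in the computing array.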
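  4. The compute-in-memory apparatus according to claim 2 or 3, characterized in that the weight data comprises multiple types of weight data, and the compute-in-memory apparatus further comprises a weight bit-width configuration module;
    the weight bit-width configuration module is configured to store bit-width information of the multiple types of weight data, the bit-width information comprising the bit width of each type of weight data and the identifier of the start column of each type of weight data in the computing array, wherein at least two of the multiple types of weight data have different bit widths.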
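  5. The compute-in-memory apparatus according to claim 4, characterized in that the compute-in-memory apparatus further comprises a control module;
    the control module is configured to write the multiple types of weight data into the multiple storage-compute units according to the bit-width information.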
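  6. The compute-in-memory apparatus according to claim 5, characterized in that
    the control module is further configured to determine the valid bits of the mask value bit by bit and, when any bit of the mask value is determined to be valid, to generate a first control signal and a second control signal;
    the first control signal is used to instruct the computing module to compute the valid data with the weight data; the second control signal is used to instruct the result processing module to perform, according to the bit-width information, a weighted calculation on the calculation results of the multiple columns of storage-compute units corresponding to each type of weight data in the computing array, to obtain multiple weighted results of the N-th bits respectively corresponding to the multiple valid data, each of the multiple weighted results corresponding to one type of weight data.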
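  7. The compute-in-memory apparatus according to claim 6, characterized in that
    the control module is further configured to generate a third control signal when determining that the bit width of the mask value equals the bit width of the input data; the third control signal is used to instruct the result processing module to perform a weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the multiple weighted results of each bit of the multiple valid data, to obtain the final result, the final result comprising the weighted result of each type of weight data.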
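  8. A computing method, characterized in that the method is applied to a compute-in-memory apparatus, the compute-in-memory apparatus comprises a computing array, the computing array comprises multiple storage-compute units, and the multiple storage-compute units are configured to store weight data; the method comprises:
    performing a calculation on input data to obtain valid data, the bit width of the valid data being smaller than the bit width of the input data;
    computing the valid data with the weight data, the number of computations being determined by the bit width of the valid data;
    performing a calculation on the calculation result of the computing module to obtain a final result.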
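  9. The method according to claim 8, characterized in that the performing a calculation on input data to obtain valid data comprises:
    performing a mask calculation on the input data to obtain a mask value, and determining the valid data according to the valid bits of the mask value;
    the computing the valid data with the weight data comprises:
    computing the valid data bit by bit with the bits of the weight data to obtain a calculation result.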
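  10. The method according to claim 9, characterized in that the computing the valid data bit by bit with the bits of the weight data to obtain a calculation result comprises: when the computing array receives the N-th bits respectively corresponding to the valid data, where N is an integer greater than or equal to 0,
    calculating the products of the N-th bits respectively corresponding to the valid data and the bits of the weight data;
    adding the products calculated by the storage-compute units in the same column of the computing array, to obtain the calculation result calculated by each column of storage-compute units in the computing array.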
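  11. The method according to claim 9 or 10, characterized in that the weight data comprises multiple types of weight data, and the method further comprises:
    storing bit-width information of the multiple types of weight data, the bit-width information comprising the bit width of each type of weight data and the identifier of the start column of each type of weight data in the computing array, wherein at least two of the multiple types of weight data have different bit widths.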
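  12. The method according to claim 11, characterized in that the method further comprises:
    writing the multiple types of weight data into the multiple storage-compute units according to the bit-width information.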
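  13. The method according to claim 12, characterized in that the method further comprises:
    determining the valid bits of the mask value bit by bit and, when any bit of the mask value is determined to be valid, generating a first control signal and a second control signal;
    the first control signal is used to calculate the sum of the products calculated by each column of storage-compute units in the computing array; the second control signal is used to perform, according to the bit-width information, a weighted calculation on the calculation results of the multiple columns of storage-compute units corresponding to each type of weight data in the computing array, to obtain multiple weighted results of the N-th bits respectively corresponding to the multiple valid data, each of the multiple weighted results corresponding to one type of weight data.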
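  14. The method according to claim 13, characterized in that the method further comprises:
    generating a third control signal when determining that the bit width of the mask value equals the bit width of the input data; the third control signal is used to perform a weighted calculation according to the bit weights corresponding to the valid bits of the mask value and the multiple weighted results of each bit of the valid data, to obtain the final result, the final result comprising the weighted result of each type of weight data.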
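  15. A computer-readable storage medium, characterized in that it stores computer instructions which, when run on an electronic device, cause the electronic device to perform the method according to any one of claims 8 to 14.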
PCT/CN2022/141634 2021-12-24 2022-12-23 Compute-in-memory apparatus and computing method WO2023116923A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111599630.1A CN116362314A (zh) 2021-12-24 2021-12-24 Compute-in-memory apparatus and computing method
CN202111599630.1 2021-12-24

Publications (1)

Publication Number Publication Date
WO2023116923A1 (zh) 2023-06-29
