WO2023116314A1 - Neural network acceleration device and method, electronic device, and computer storage medium - Google Patents

Neural network acceleration device and method, electronic device, and computer storage medium

Info

Publication number
WO2023116314A1
Authority
WO
WIPO (PCT)
Prior art keywords
operator
calculation result
convolutional layer
memory
feature data
Prior art date
Application number
PCT/CN2022/133443
Other languages
English (en)
French (fr)
Inventor
祝叶华
孙炜
Original Assignee
哲库科技(上海)有限公司
Priority date
Filing date
Publication date
Application filed by 哲库科技(上海)有限公司
Publication of WO2023116314A1


Classifications

    • G: PHYSICS
      • G11: INFORMATION STORAGE
        • G11C: STATIC STORES
          • G11C 11/00: Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
            • G11C 11/54: Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using elements simulating biological cells, e.g. neuron
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 15/00: Digital computers in general; Data processing equipment in general
            • G06F 15/76: Architectures of general purpose stored program computers
              • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/0464: Convolutional networks [CNN, ConvNet]
                • G06N 3/048: Activation functions
              • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
                • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
                  • G06N 3/065: Analogue means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
          • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of in-memory computing, and in particular to a neural network acceleration device and method, an electronic device, and a computer storage medium.
  • neural networks have achieved remarkable success in practical applications such as image classification and object detection, but these achievements largely rely on complex neural network models with large numbers of parameters and heavy computation.
  • deploying these complex neural network models that require a large amount of calculation and data movement to a neural network accelerator based on the von Neumann architecture will cause the so-called memory wall (Memory Wall) problem, that is, the speed of data movement cannot keep up with data processing speed.
  • the embodiment of the present application provides a neural network acceleration device, the neural network acceleration device includes several computing units, each computing unit includes an in-memory computing array and a first operator module, and the first operator module includes several first-type operators; wherein,
  • An in-memory computing array is used to obtain the input feature data, and perform a convolution operation on the input feature data to obtain an initial calculation result;
  • the first operator module is used to perform an operator operation on the initial calculation result by the first type of operator to obtain an intermediate calculation result, and use the intermediate calculation result as the input feature data of the next calculation unit.
  • the embodiment of the present application provides a neural network acceleration method, which is applied to a neural network acceleration device.
  • the neural network acceleration device includes several computing units, and each computing unit includes an in-memory computing array and a first operator module; the method includes:
  • obtaining input feature data through the in-memory computing array, and performing a convolution operation on the input feature data to obtain an initial calculation result; performing an operator operation on the initial calculation result through a first-type operator in the first operator module to obtain an intermediate calculation result; and using the intermediate calculation result as the input feature data of the next computing unit until the processing of all of the several computing units is completed, so as to determine the target output result.
  • an embodiment of the present application provides a chip, and the chip includes the neural network acceleration device as described in the first aspect.
  • the embodiment of the present application provides an electronic device, the electronic device includes a memory and a processor; wherein,
  • memory for storing computer programs capable of running on the processor
  • a processor configured to execute the method as described in the second aspect when running the computer program.
  • an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method described in the second aspect is implemented.
  • Fig. 1 is a schematic diagram of the architecture of an artificial intelligence accelerator
  • FIG. 2 is a schematic diagram of the composition and structure of a neural network acceleration device provided in an embodiment of the present application
  • FIG. 3 is a schematic diagram of a basic structure of in-memory computing provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an in-memory computing array provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a computing unit provided in an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a neural network acceleration device provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the composition and structure of a neural network structure provided by the embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a neural network acceleration method provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a specific hardware structure of an electronic device provided in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the composition and structure of a chip provided by the embodiment of the present application.
  • FIG. 11 is a schematic diagram of a specific hardware structure of a chip provided by an embodiment of the present application.
  • the embodiment of the present application provides a neural network acceleration device.
  • the neural network acceleration device includes several computing units.
  • the computing unit includes an in-memory computing array and a first operator module, and the first operator module includes several operators of the first type; among them,
  • An in-memory computing array is used to obtain the input feature data, and perform a convolution operation on the input feature data to obtain an initial calculation result;
  • the first operator module is used to perform an operator operation on the initial calculation result by the first type of operator to obtain an intermediate calculation result, and use the intermediate calculation result as the input feature data of the next calculation unit.
  • the weight parameters corresponding to the target convolutional layer are pre-stored in the in-memory computing array; wherein,
  • the in-memory calculation array is used to perform a convolution operation on the input feature data according to the weight parameters after obtaining the input feature data corresponding to the target convolution layer to obtain an initial calculation result.
  • the in-memory computing array includes a digital-to-analog conversion module, a storage array, and an analog-to-digital conversion module; wherein,
  • a digital-to-analog conversion module configured to perform digital-to-analog conversion on the input feature data to obtain a first analog signal
  • the storage array is used to perform multiplication and accumulation calculation according to the weight parameter and the first analog signal to obtain the second analog signal;
  • the analog-to-digital conversion module is configured to perform analog-to-digital conversion on the second analog signal to obtain a target digital signal, and determine the target digital signal as an initial calculation result.
  • the computing unit is the i-th computing unit, and the in-memory computing array in the i-th computing unit pre-stores weight parameters corresponding to the i-th convolutional layer;
  • the in-memory computing array is used to obtain the input feature data corresponding to the i-th convolutional layer, and to perform a convolution operation on the input feature data corresponding to the i-th convolutional layer according to the weight parameters corresponding to the i-th convolutional layer, to obtain the initial calculation result of the i-th convolutional layer;
  • the first operator module is used to perform an operator operation on the initial calculation result of the i-th convolutional layer through the first-type operator to obtain the intermediate calculation result of the i-th convolutional layer, and to determine the intermediate calculation result of the i-th convolutional layer as the input feature data corresponding to the (i+1)-th convolutional layer;
  • i is an integer greater than zero and less than or equal to N; N represents the number of computing units, and N is an integer greater than zero.
  • the neural network acceleration device further includes a receiving unit; wherein,
  • the receiving unit is configured to receive the feature image, divide the feature image into at least one feature block, and sequentially read the at least one feature block into the computing unit.
  • the input feature data of the first computing unit is the first feature block
  • the intermediate calculation result output by the first computing unit is used as the input feature data of the next computing unit;
  • the next feature block is then used as the input feature data of the first computing unit, until the processing of all of the several computing units is completed.
  • the neural network acceleration device further includes a sending unit; wherein,
  • the sending unit is configured to send the obtained target output results to the outside after all the processing by the several computing units is completed.
  • the neural network acceleration device further includes a scheduling unit; wherein,
  • the scheduling unit is used for scheduling and arranging the several computing units, so as to realize the processing of the input feature data by the several computing units.
  • the scheduling unit is further configured to schedule the receiving unit and the sending unit, so as to schedule the receiving unit to process when receiving the characteristic image, or schedule the sending unit to send out after obtaining the target output result.
  • the neural network acceleration device further includes a digital signal processor; wherein,
  • the digital signal processor is used to process the initial calculation result to obtain the intermediate calculation result when the first type of operator cannot be used.
  • the first type of operator corresponds to an accelerated operation suitable for a dedicated digital circuit
  • the digital signal processor is used to process operations other than the first type of operator that are not suitable for a dedicated digital circuit
  • the first type of operator includes at least one of the following: an operator for performing a pooling operation, an operator for performing an activation function operation, and an operator for performing an addition operation.
  • the embodiment of the present application provides a neural network acceleration method, which is applied to a neural network acceleration device, and the neural network acceleration device includes several computing units, and each computing unit includes an in-memory computing array and a first operator module; the method includes:
  • obtaining input feature data through the in-memory computing array, and performing a convolution operation on the input feature data to obtain an initial calculation result; performing an operator operation on the initial calculation result through a first-type operator in the first operator module to obtain an intermediate calculation result; and using the intermediate calculation result as the input feature data of the next computing unit until the processing of all of the several computing units is completed, so as to determine the target output result.
  • the weight parameters corresponding to the target convolutional layer are pre-stored in the in-memory computing array; correspondingly, the input feature data is obtained through the in-memory computing array, and the convolution operation is performed on the input feature data to obtain the initial calculation result, which includes:
  • after the in-memory computing array acquires the input feature data corresponding to the target convolutional layer, the input feature data is convolved according to the weight parameters to obtain the initial calculation result.
  • the convolution operation is performed on the input feature data according to the weight parameters to obtain the initial calculation result, including: performing digital-to-analog conversion on the input feature data through the digital-to-analog conversion module to obtain a first analog signal; performing a multiply-accumulate calculation according to the weight parameters and the first analog signal through the storage array to obtain a second analog signal; and performing analog-to-digital conversion on the second analog signal through the analog-to-digital conversion module to obtain a target digital signal, where the target digital signal is determined as the initial calculation result.
  • the method further includes:
  • i is an integer greater than zero and less than or equal to N; N represents the number of operation units, and N is an integer greater than zero.
  • the method further includes:
  • the intermediate calculation result of the (i+1)-th convolutional layer is determined as the input feature data corresponding to the (i+2)-th convolutional layer, and is input into the (i+1)-th computing unit for related processing;
  • i is an integer greater than zero and less than or equal to N; N represents the number of operation units, and N is an integer greater than zero.
  • the method also includes:
  • the input feature data of the first computing unit is the first feature block
  • the intermediate calculation result output by the first computing unit is used as the input feature data of the next computing unit;
  • the next feature block is used as the input feature data of the first computing unit until all the processing of several computing units is completed.
  • the neural network acceleration device further includes a digital signal processor
  • the method further includes: when the first-type operator cannot be used, processing the initial calculation result by the digital signal processor to obtain an intermediate calculation result.
  • the first type of operator corresponds to an accelerated operation suitable for a dedicated digital circuit
  • the digital signal processor is used to process operations other than the first type of operator that are not suitable for a dedicated digital circuit
  • the first type of operator includes at least one of the following: an operator for performing a pooling operation, an operator for performing an activation function operation, and an operator for performing an addition operation.
  • an embodiment of the present application provides a chip, and the chip includes the neural network acceleration device as described in the first aspect.
  • the embodiment of the present application provides an electronic device, the electronic device includes a memory and a processor; wherein,
  • memory for storing computer programs capable of running on the processor
  • a processor configured to execute the method as described in the second aspect when running the computer program.
  • an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method described in the second aspect is implemented.
  • references to "some embodiments" describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and they can be combined with each other without conflict.
  • the terms "first", "second", and "third" in the embodiments of the present application are only used to distinguish similar objects and do not represent a specific ordering of the objects. Understandably, where permitted, the specific order or sequence of "first", "second", and "third" may be interchanged, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
  • in-memory computing is an emerging computing architecture, which is a technical solution proposed to solve the memory wall problem.
  • the computer system based on the von Neumann architecture divides the memory and the processor into two parts, and the overhead of the processor frequently accessing the memory forms a memory wall.
  • In-memory computing is to combine computing and storage into one, that is, to complete computing inside the memory, thereby reducing the frequency of processor access to memory.
  • in-memory computing has the characteristics of high parallelism and high energy efficiency. It is a better alternative for algorithms that require a large number of parallel matrix-vector multiplication operations, especially neural network algorithms.
  • for an artificial intelligence (AI) accelerator, the processing engine (PE), that is, the multiply-accumulate unit, is the core unit.
  • as the amount of computation increases, the storage resources that need to be invoked also increase, so the performance of the entire system is subject to the performance of the storage unit.
  • Fig. 1 shows a schematic architecture diagram of an artificial intelligence accelerator.
  • the data is moved from the memory to the processor, and then the PE array in the processor performs data calculation, and then writes the result back to the memory; wherein, the PE array includes several PEs. That is to say, for the current von Neumann architecture, its basic structure is an architecture in which the computing unit and the memory are separated. The computing unit reads data from the memory, and writes the result back to the memory after the calculation is completed.
  • the improvement of memory performance is relatively slow. Under the increasing algorithm requirements, data transfer has become the bottleneck of the system.
  • An embodiment of the present application provides a neural network acceleration device. The neural network acceleration device includes several computing units, each computing unit includes an in-memory computing array and a first operator module, and the first operator module includes several first-type operators; wherein the in-memory computing array is used to obtain input feature data and perform a convolution operation on the input feature data to obtain an initial calculation result; the first operator module is used to perform an operator operation on the initial calculation result through the first-type operator to obtain an intermediate calculation result, and to use the intermediate calculation result as the input feature data of the next computing unit.
  • the neural network acceleration device uses a chain structure, that is, the intermediate calculation result output by the current computing unit is used as the input feature data of the next computing unit, which gives the system good scalability; in addition, the device makes full use of the characteristics of the intelligent algorithm structure and the in-memory computing array, which can not only reduce the amount of data transmission between the processor and the memory and the cost of data movement, thereby reducing power consumption, but also use the in-memory computing array to reduce the complexity of calculation, thereby improving the overall performance of the system.
  • FIG. 2 shows a schematic structural diagram of a neural network acceleration device provided in an embodiment of the present application.
  • the neural network acceleration device 20 may include several computing units, each computing unit may include an in-memory computing array and a first operator module, and the first operator module includes several first-type operators; wherein,
  • An in-memory computing array is used to obtain the input feature data, and perform a convolution operation on the input feature data to obtain an initial calculation result;
  • the first operator module is used to perform an operator operation on the initial calculation result by the first type of operator to obtain an intermediate calculation result, and use the intermediate calculation result as the input feature data of the next calculation unit.
  • the neural network structures can be grouped based on the characteristics of the neural network structures (such as artificial intelligence networks).
  • the neural network structure can include several groups, where each group includes a convolutional layer and a non-convolutional operator; this algorithm structure is then mapped onto the hardware architecture so that each group corresponds to a computing unit in the hardware architecture.
  • the convolutional layer can implement the convolution operation based on the in-memory computing array
  • the non-convolution operator can implement the operator operation based on the first operator module.
  • the neural network acceleration device may include several computing units, and the intermediate calculation result output by the current computing unit is used as the input feature data of the next computing unit; that is, a chain structure is used, which makes it very convenient to expand the scale of the system.
  • the in-memory computing method has been proposed in recent years; that is to say, analog circuits are used directly in the storage unit to perform multiply-accumulate operations, without moving data out of the storage unit and then computing it with a digital-circuit-based computing engine.
  • This solution not only greatly reduces the amount of data transmission, but also saves a lot of multiplication and addition operations.
  • the basic operation is a matrix multiplication operation, specifically as shown in formula (1),
  • the black-filled cells are used to store the values of the weight parameters, and voltages are applied in the horizontal direction, where x1, x2, x3, and x4 characterize the magnitudes of the voltages; in the vertical direction, the analog value output by each black-filled cell can be expressed as the product of x and w, so the output of each column can be represented by y1, y2, y3, and y4, which matches the matrix multiplication result in formula (1) above.
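  • the formula referenced as formula (1) is not reproduced in this text; a plausible form, consistent with the 4-input, 4-column example described above (x_i denoting the applied voltages and w_ij the stored weights), is the following column-wise multiply-accumulate:

```latex
% Hedged reconstruction of formula (1): each column output y_j is the accumulation of the
% input voltages x_i weighted by the stored values w_{ij} (4x4 example from the description).
y_j = \sum_{i=1}^{4} x_i \, w_{ij}, \qquad j = 1, 2, 3, 4
```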
  • the weight parameters corresponding to the target convolutional layer are pre-stored in the memory calculation array;
  • the in-memory calculation array is used to perform a convolution operation on the input feature data according to the weight parameters after obtaining the input feature data corresponding to the target convolution layer to obtain an initial calculation result.
  • the current computing unit will perform the convolution operation of the target convolutional layer. Specifically, according to the in-memory computing array in the current computing unit, the convolution operation is performed on the weight parameters corresponding to the target convolutional layer and the input feature data corresponding to the target convolutional layer to obtain the initial calculation result; then, according to the first operator module in the current computing unit, an operator operation is performed on the initial calculation result to obtain the intermediate calculation result, and the intermediate calculation result continues to be used as the input feature data of the next computing unit, and so on, until the processing of all of the several computing units is completed.
  • FIG. 4 shows a schematic diagram of an architecture of an in-memory computing array provided by an embodiment of the present application.
  • the in-memory computing array 40 may include a digital-to-analog conversion (Digital-to-Analog Conversion, DAC) module 401, a storage array 402, and an analog-to-digital conversion (Analog-to-Digital Conversion, ADC) module 403; wherein,
  • a digital-to-analog conversion module 401 configured to perform digital-to-analog conversion on the input feature data to obtain a first analog signal
  • the storage array 402 is used to perform multiplication and accumulation calculation according to the weight parameter and the first analog signal to obtain the second analog signal;
  • the analog-to-digital conversion module 403 is configured to perform analog-to-digital conversion on the second analog signal to obtain a target digital signal, and determine the target digital signal as an initial calculation result.
  • the weight data in the embodiment of the present application does not need to be continuously loaded during execution; it only needs to be pre-loaded into the storage array of the in-memory computing array, the related components are used to perform the calculation on analog data, and finally the analog-to-digital conversion module 403 converts the result into a target digital signal for output.
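  • as a minimal behavioral sketch (not the patented circuit), the data path of one in-memory computing array described above can be modeled as follows; the class and method names are illustrative only, and idealized DAC/ADC quantization is assumed:

```python
import numpy as np

class InMemoryComputingArray:
    """Behavioral model of one in-memory computing array: DAC -> analog MAC -> ADC."""

    def __init__(self, weights, dac_levels=256, adc_levels=256):
        # Weight parameters of the target convolutional layer are pre-loaded into the
        # storage array once, before execution starts (they are not reloaded at run time).
        self.weights = np.asarray(weights, dtype=np.float64)
        self.dac_levels = dac_levels
        self.adc_levels = adc_levels

    def _dac(self, digital_in):
        # Digital-to-analog conversion: digital codes -> voltages (the "first analog signal").
        return np.asarray(digital_in, dtype=np.float64) / (self.dac_levels - 1)

    def _adc(self, analog_out):
        # Analog-to-digital conversion: quantize the accumulated signal (the "target digital signal").
        scaled = np.clip(analog_out, 0.0, 1.0) * (self.adc_levels - 1)
        return np.round(scaled).astype(np.int32)

    def compute(self, input_feature_data):
        first_analog = self._dac(input_feature_data)    # first analog signal
        second_analog = first_analog @ self.weights     # multiply-accumulate inside the array
        # Normalize before quantization so the idealized ADC stays in range (modeling choice).
        second_analog = second_analog / max(float(np.max(np.abs(second_analog))), 1e-9)
        return self._adc(second_analog)                 # initial calculation result
```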
  • FIG. 5 shows a schematic structural diagram of a computing unit provided in an embodiment of the present application.
  • the computing unit may include an in-memory computing array 40 and a first operator module 50; wherein the target digital signal obtained by the in-memory computing array 40 after analog-to-digital conversion may interact with the first operator module 50. That is to say, an artificial intelligence network needs not only the operation of the convolution operator; it also contains a large number of operators other than the convolutional layers, and data interaction between these operators is also required.
  • the first type of operator represents an accelerated operation suitable for a dedicated digital circuit
  • the first type of operator includes at least one of the following: an operator for performing a pooling operation, an operator for performing an activation function operation, and an operator for performing an addition operation.
  • the first operator module 50 may include an addition operator (Adder), an activation function operator (Activation) and a pooling operator (Pooling).
  • the neural network acceleration device 20 also includes a digital signal processor (Digital Signal Processor, DSP); wherein,
  • the digital signal processor is used to process the initial calculation result to obtain the intermediate calculation result when the first type of operator cannot be used.
  • the first type of operator corresponds to the accelerated operation applicable to special-purpose digital circuits
  • the digital signal processor is used to process operations, other than those of the first-type operators, that are not suitable for a dedicated digital circuit.
  • the digital signal processor mainly deals with situations where the first type of operator cannot be used, such as the more complex sigmoid activation function, tanh activation function, or softmax activation function.
  • the first operator module can also be called a fixed function (Fixed Function) module, which mainly uses dedicated digital circuits to perform accelerated calculations for addition operators, activation function operators, pooling operators, and the like; calculations that are not suitable for dedicated digital circuits are usually completed by digital signal processors (DSPs).
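  • the split between the fixed-function operators and the DSP fallback can be illustrated with the following sketch; it is a simplified software analogy under stated assumptions (toy 1-D operators, hypothetical names), not the dedicated digital circuit itself:

```python
import numpy as np

# Operators assumed simple enough for a dedicated digital circuit (fixed-function path).
FIXED_FUNCTION_OPS = {
    "add":     lambda x, bias=0: x + bias,                           # addition operator
    "relu":    lambda x: np.maximum(x, 0),                           # simple activation operator
    "maxpool": lambda x: np.asarray(x).reshape(-1, 2).max(axis=1),   # toy 1-D pooling, stride 2
}

# More complex activations, handled here by a software stand-in for the digital signal processor.
DSP_OPS = {
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=np.float64))),
    "tanh":    lambda x: np.tanh(np.asarray(x, dtype=np.float64)),
    "softmax": lambda x: (lambda e: e / e.sum())(np.exp(np.asarray(x, dtype=np.float64) - np.max(x))),
}

def run_operator(name, x):
    """Dispatch to the fixed-function path when possible, otherwise fall back to the DSP path."""
    if name in FIXED_FUNCTION_OPS:
        return FIXED_FUNCTION_OPS[name](x)
    if name in DSP_OPS:
        return DSP_OPS[name](x)
    raise ValueError(f"unsupported operator: {name}")
```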
  • as shown in FIG. 6, there may be four computing units, namely computing unit 1, computing unit 2, computing unit 3, and computing unit 4.
  • the computing unit 1 may include an in-memory computing array 1 and a first operator module 1.
  • the computing unit 2 may include an in-memory computing array 2 and a first operator module 2
  • the computing unit 3 may include an in-memory computing array 3 and a first operator module 3
  • the computing unit 4 may include an in-memory computing array 4 and a first operator module 4.
  • the in-memory computing array (for example, the in-memory computing array 1, the in-memory computing array 2, the in-memory computing array 3 or the in-memory computing array 4) includes a digital-to-analog conversion module, a storage array and The analog-to-digital conversion module, and the digital-to-analog conversion module and the analog-to-digital conversion module are respectively placed at the data input end and the data output end of the calculation array in the memory, because the calculation in the memory uses analog signals for processing;
  • the first operator module (for example, the first operator module 1, the first operator module 2, the first operator module 3, or the first operator module 4) covers the other operators commonly used in artificial intelligence algorithms, such as pooling, activation, and addition; the part implemented using a dedicated digital circuit can be called a fixed function; some accelerated operations in artificial intelligence algorithms that are not suitable for implementation by a dedicated digital circuit, such as the sigmoid activation function, the tanh activation function, or the softmax activation function, can be completed by the DSP.
  • the neural network acceleration device 20 may also include a receiving unit; wherein,
  • the receiving unit is configured to receive the feature image, divide the feature image into at least one feature block, and sequentially read the at least one feature block into the computing unit.
  • the input feature data of the first computing unit is the first feature block
  • after the intermediate calculation result output by the first computing unit is obtained, that intermediate calculation result is used as the input feature data of the next computing unit, and the next feature block is used as the input feature data of the first computing unit, until the processing of all of the several computing units is completed.
  • the input feature data of computing unit 1 is provided by the receiving unit; the output of computing unit 1 is used as the input of computing unit 2, the output of computing unit 2 is used as the input of computing unit 3, and the output of computing unit 3 is used as the input of computing unit 4, until the processing of all four computing units is completed and the target output result is obtained.
  • the digital signal processor can be used to assist in the processing.
  • the neural network acceleration device 20 may also include a sending unit and a scheduling unit;
  • the sending unit can be used to send the obtained target output result to the outside after the processing of all of the several computing units is completed;
  • the scheduling unit can be used to schedule and arrange the several computing units, so as to realize the processing of the input feature data by the several computing units;
  • the scheduling unit can also schedule the receiving unit and the sending unit, so as to schedule the receiving unit to perform processing when a feature image needs to be received, or schedule the sending unit to send the target output result out after it is obtained.
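  • under the same assumptions, the receive, schedule, chain, and send flow described above can be sketched as follows; ComputingUnit and accelerate are hypothetical names that reuse the behavioral classes sketched earlier:

```python
class ComputingUnit:
    """One computing unit: an in-memory computing array followed by a first operator module."""

    def __init__(self, array, operator_names):
        self.array = array                    # InMemoryComputingArray with pre-loaded weights
        self.operator_names = operator_names  # first-type operators applied after the convolution

    def process(self, input_feature_data):
        initial = self.array.compute(input_feature_data)   # convolution via in-memory MAC
        intermediate = initial
        for op in self.operator_names:                      # non-convolution operator operations
            intermediate = run_operator(op, intermediate)
        return intermediate                                  # becomes the next unit's input


def accelerate(feature_image, units, block_size):
    """Receive a feature image, split it into feature blocks, and stream each block through the chain."""
    blocks = [feature_image[i:i + block_size]                # receiving unit: tile the feature image
              for i in range(0, len(feature_image), block_size)]
    outputs = []
    for block in blocks:                                     # scheduling unit: feed blocks in sequence
        data = block
        for unit in units:                                   # chain: each unit's output feeds the next
            data = unit.process(data)
        outputs.append(data)                                 # sending unit: collect target output results
    return outputs
```

  • with four such units chained, this reproduces, at a purely functional level, the receive, compute, and send flow of FIG. 6; the block size and operator choices are arbitrary illustrations.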
  • the neural network structure (such as artificial intelligence network) can be grouped, that is, the neural network structure can include several groups; wherein, each group includes a convolutional layer and an operator layer, and in each group, the convolution layer implements the convolution operation based on the in-memory computing array, and the operator layer implements the operator operation based on the first operator module or digital signal processor.
  • FIG. 7 shows a schematic diagram of the composition and structure of a neural network structure provided by an embodiment of the present application.
  • the neural network structure can be divided into convolutional layer 0 (represented by Conv0), operator 0 (represented by FF0), convolutional layer 1 (represented by Conv1), operator 1 (represented by FF1) , convolutional layer 2 (represented by Conv2), operator 2 (represented by FF2), convolutional layer 3 (represented by Conv3), operator 3 (represented by FF3), etc.; among them, Conv0 and FF0 are a group, Conv1 and FF1 are one group, Conv2 and FF2 are one group, and Conv3 and FF3 are one group.
  • the computing unit is the i-th computing unit, and the in-memory computing array in the i-th computing unit pre-stores the weight parameters corresponding to the i-th convolutional layer;
  • the in-memory computing array is used to obtain the input feature data corresponding to the i-th convolutional layer, and to perform a convolution operation on the input feature data corresponding to the i-th convolutional layer according to the weight parameters corresponding to the i-th convolutional layer, to obtain the initial calculation result of the i-th convolutional layer;
  • the first operator module is used to perform an operator operation on the initial calculation result of the i-th convolutional layer through the first-type operator to obtain the intermediate calculation result of the i-th convolutional layer, and to determine the intermediate calculation result of the i-th convolutional layer as the input feature data corresponding to the (i+1)-th convolutional layer.
  • i is an integer greater than zero and less than or equal to N; N represents the number of operation units, and N is an integer greater than zero.
  • the computing unit is the i-th computing unit, and the in-memory computing array in the i-th computing unit pre-stores the i-th convolutional layer and the i+1-th convolutional layer corresponding weight parameters;
  • the in-memory computing array is used to obtain the input feature data corresponding to the i-th convolutional layer, and to perform a convolution operation on the input feature data corresponding to the i-th convolutional layer according to the weight parameters corresponding to the i-th convolutional layer, to obtain the initial calculation result of the i-th convolutional layer;
  • the first operator module is used to perform an operator operation on the initial calculation result of the i-th convolutional layer through the first-type operator to obtain the intermediate calculation result of the i-th convolutional layer, and to determine the intermediate calculation result of the i-th convolutional layer as the input feature data corresponding to the (i+1)-th convolutional layer, which is still input into the i-th computing unit for related processing.
  • since the weight parameters corresponding to the (i+1)-th convolutional layer are still pre-stored in the in-memory computing array of the i-th computing unit, the input feature data can still be input into the i-th computing unit for related processing; after the intermediate calculation result of the (i+1)-th convolutional layer is obtained by the i-th computing unit, the intermediate calculation result of the (i+1)-th convolutional layer is determined as the input feature data corresponding to the (i+2)-th convolutional layer; since the weight parameters corresponding to the (i+2)-th convolutional layer are pre-stored in the in-memory computing array of the (i+1)-th computing unit, at this time the input feature data corresponding to the (i+2)-th convolutional layer needs to be input into the (i+1)-th computing unit for related processing.
  • i is an integer greater than zero and less than or equal to N; N represents the number of computing units, and N is an integer greater than zero.
  • FIG. 7 shows a general structural diagram of a neural network structure.
  • the weight data used by the convolutional layer needs to be solidified into the memory computing array in advance, as shown in Figure 3, due to the large number of convolutional layers in the neural network structure, the operation of each convolutional layer contains a large number of weights data, and the total size of the in-memory computing array used to store weight data in the system is fixed, according to the neural network acceleration device 20 shown in Figure 6, four computing units are set here, and each computing unit includes an in-memory computing array and The first operator module; therefore, each in-memory computing array may store the parameters of one or more convolutional layers.
  • assuming that the weight parameters corresponding to Conv0 and Conv1 in FIG. 7 are pre-stored in the in-memory computing array 1 in FIG. 6: since the weight data has been loaded into the in-memory computing array 1 in advance, the feature image is segmented and then read into the in-memory computing array 1 in sequence; specifically, the data is converted into an analog signal by the digital-to-analog conversion module, the multiply-accumulated analog signal is obtained through the calculation of the storage array, and the result is then converted into a digital signal by the analog-to-digital conversion module and sent to the first operator module to perform the operation of the FF0 operator; the next layer to be calculated is Conv1, and the weight parameters of Conv1 are also pre-stored in the in-memory computing array 1, so in FIG. 6 the output of the FF0 module needs to be sent back to the in-memory computing array 1, and so on, until the input feature data has completely executed the operations of the first three layers (Conv0, FF0, Conv1).
  • each convolutional layer and operator layer is implemented based on a computing unit, each computing unit includes an in-memory computing array and a first operator module, and one computing unit corresponds to one dotted-line box in FIG. 6; one computing unit can perform the operations of multiple groups in the algorithm structure, and then pass the computing result to the next computing unit after completion.
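  • the grouping described for FIG. 6 and FIG. 7 can be expressed, purely for illustration, as a static mapping table; the names below are hypothetical and only show that one computing unit may serve several (convolutional layer, operator layer) groups:

```python
# Hypothetical mapping of (convolutional layer, operator layer) groups onto computing units,
# following the example in which in-memory computing array 1 holds both Conv0 and Conv1 weights.
GROUP_TO_UNIT = {
    ("Conv0", "FF0"): "computing_unit_1",   # FF0's output is fed back into array 1 for Conv1
    ("Conv1", "FF1"): "computing_unit_1",
    ("Conv2", "FF2"): "computing_unit_2",
    ("Conv3", "FF3"): "computing_unit_3",
}

def unit_for_group(conv_layer, operator_layer):
    """Look up which computing unit executes a given group."""
    return GROUP_TO_UNIT[(conv_layer, operator_layer)]
```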
  • This architecture fully combines the characteristics of artificial intelligence algorithm structure and in-memory computing array, which greatly reduces the amount of data transmission.
  • since the overall architecture uses a chain structure, it is very convenient to expand the system scale; the architecture is not limited to the four-level transmission architecture used for illustration in the embodiment of the present application.
  • the first operator module in the architecture shown in FIG. 6 may be any algorithm suitable for implementation by a dedicated acceleration circuit.
  • the grouping of functions in the artificial intelligence network may take various forms, and is not limited to the example shown in FIG. 7 .
  • the neural network acceleration device includes several computing units, each computing unit includes an in-memory computing array and a first operator module, and the first operator module includes several first-type operators; wherein, the in-memory computing array is used to obtain the input feature data and perform a convolution operation on the input feature data to obtain the initial calculation result; the first operator module is used to perform an operator operation on the initial calculation result through the first-type operator to obtain the intermediate calculation result, and to use the intermediate calculation result as the input feature data of the next computing unit.
  • the neural network acceleration device uses a chain structure, that is, the intermediate calculation result output by the current computing unit is used as the input feature data of the next computing unit, which gives the system good scalability; in addition, the device makes full use of the characteristics of the intelligent algorithm structure and the in-memory computing array, which can not only reduce the amount of data transmission between the processor and the memory and the cost of data movement, thereby reducing power consumption, but also use the in-memory computing array to reduce the complexity of calculation, thereby improving the overall performance of the system.
  • FIG. 8 shows a schematic flowchart of a neural network acceleration method provided in an embodiment of the present application. As shown in Figure 8, the method may include:
  • S801 Obtain input feature data through an in-memory calculation array, and perform a convolution operation on the input feature data to obtain an initial calculation result.
  • S802 Perform an operator operation on the initial calculation result by using a first-type operator in the first operator module to obtain an intermediate calculation result.
  • S803 Use the intermediate calculation result as the input characteristic data of the next operation unit until all the processing of several operation units is completed, and determine the target output result.
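  • using the behavioral sketches above (all names hypothetical), steps S801 to S803 correspond roughly to the following toy run, with identity weights and ReLU standing in for a real convolutional layer and first-type operator:

```python
import numpy as np

# Toy end-to-end run of S801-S803 with the earlier behavioral sketches (illustrative only):
# two chained computing units, 4x4 identity weights, ReLU as the sole first-type operator.
units = [
    ComputingUnit(InMemoryComputingArray(np.eye(4)), ["relu"]),
    ComputingUnit(InMemoryComputingArray(np.eye(4)), ["relu"]),
]
feature_image = np.arange(16)                        # flattened toy feature image
target_output = accelerate(feature_image, units, block_size=4)
print(target_output)                                 # one result per feature block
```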
  • the neural network acceleration device may include several computing units, and each computing unit includes an in-memory computing array and a first operator module; at the same time, the intermediate calculation result output by the current computing unit is used as the input feature data of the next computing unit, that is, a chain structure is used, so the system scale can be easily expanded.
  • the weight parameters corresponding to the target convolutional layer are pre-stored in the in-memory computing array; correspondingly, in some embodiments, for S801, acquiring the input feature data through the in-memory computing array and performing the convolution operation on the input feature data to obtain the initial calculation result may include:
  • after the in-memory computing array acquires the input feature data corresponding to the target convolutional layer, the input feature data is convolved according to the weight parameters to obtain the initial calculation result.
  • the performing convolution operation on the input feature data according to the weight parameters to obtain the initial calculation result may include:
  • the in-memory computing array may include a digital-to-analog conversion module, a storage array, and an analog-to-digital conversion module, and the digital-to-analog conversion module is located at the data input end of the in-memory computing array, and the analog-to-digital conversion module is located at the memory The data output terminal of the internal calculation array.
  • the digital-to-analog conversion module is used to perform digital-to-analog conversion on the input feature data to obtain the first analog signal;
  • the storage array is used to perform multiplication and accumulation calculations according to the weight parameter and the first analog signal to obtain the second analog signal;
  • the analog-to-digital conversion module is used to perform analog-to-digital conversion on the second analog signal to obtain a target digital signal, where the target digital signal is the initial calculation result, which is then sent to the first operator module for the operator operation.
  • the neural network acceleration device may also include a digital signal processor.
  • the method may further include: when the first type of operator cannot be used, processing the initial calculation result by a digital signal processor to obtain an intermediate calculation result.
  • the first type of operator corresponds to an accelerated operation applicable to a dedicated digital circuit, and the corresponding module can be called a fixed function (Fixed Function) module; the digital signal processor is used to process operations, other than those of the first-type operators, that are not applicable to a dedicated digital circuit; that is to say, operations that are not suitable for a dedicated digital circuit are usually completed by a digital signal processor (DSP).
  • the first type of operator may include at least one of the following: an operator for performing a pooling operation (i.e., a pooling operator), an operator for performing an activation function operation (i.e., an activation function operator), and an operator for performing an addition operation (i.e., an addition operator); the digital signal processor mainly handles the cases where the first-type operators cannot be used, such as the more complex sigmoid activation function, tanh activation function, or softmax activation function.
  • the activation function operators in the first type of operators do not include operators such as sigmoid activation function, tanh activation function, and softmax activation function.
  • the method may further include: receiving the feature image; dividing the feature image into at least one feature block, and sequentially reading the at least one feature block into the computing unit.
  • the input feature data of the first computing unit is the first feature block
  • the intermediate calculation result output by the first computing unit is used as the input feature data of the next computing unit;
  • the next feature block is used as the input feature data of the first computing unit until all the processing of several computing units is completed.
  • the input feature data of computing unit 1 is provided by the receiving unit; the output of computing unit 1 is used as the input of computing unit 2, the output of computing unit 2 is used as the input of computing unit 3, and the output of computing unit 3 is used as the input of computing unit 4, until the processing of all four computing units is completed and the target output result is obtained.
  • the digital signal processor can be used to assist in processing, which increases the versatility of the algorithm.
  • the neural network structure may include several groups; each group includes a convolutional layer and an operator layer, and in each group, the convolutional layer may realize the convolution operation based on the in-memory computing array, and the operator layer may realize the operator operation based on the first operator module or the digital signal processor.
  • the method may further include:
  • the intermediate calculation result of the (i+1)-th convolutional layer is determined as the input feature data corresponding to the (i+2)-th convolutional layer, and is input into the (i+1)-th computing unit for related processing.
  • if the weight parameters corresponding to the (i+1)-th convolutional layer are pre-stored in the in-memory computing array of the (i+1)-th computing unit, then the data can be input into the (i+1)-th computing unit for related processing; if the weight parameters corresponding to the (i+1)-th convolutional layer are still pre-stored in the in-memory computing array of the i-th computing unit, then the data can still be input into the i-th computing unit for related processing; after the intermediate calculation result of the (i+1)-th convolutional layer is obtained by the i-th computing unit, the intermediate calculation result of the (i+1)-th convolutional layer is determined as the input feature data corresponding to the (i+2)-th convolutional layer; since the weight parameters corresponding to the (i+2)-th convolutional layer are pre-stored in the in-memory computing array of the (i+1)-th computing unit, at this time the input feature data corresponding to the (i+2)-th convolutional layer needs to be input into the (i+1)-th computing unit for related processing.
  • the traditional von Neumann architecture is centered on the computing unit, and there is a large amount of data handling.
  • with the increasing complexity of artificial intelligence scenarios, the amount of data that the algorithm needs to process is increasing, and the performance improvement obtainable from the traditional architecture is getting smaller and smaller.
  • the technical solution of the embodiment of this application is based on a relatively mature in-memory computing solution, through which convolution operations can be realized; combined with the characteristics of non-convolution operators, the overall architecture can realize the functions of a general artificial intelligence network.
  • the weight parameters do not need to be loaded continuously during the execution process; they only need to be pre-loaded into the in-memory computing array.
  • since the overall architecture uses a chain structure, it is very convenient to expand the system scale; the architecture is not limited to the four-level transmission architecture used for illustration in the embodiment of the present application.
  • the first operator module in the architecture shown in FIG. 6 may be any operator suitable for implementation by a dedicated acceleration circuit.
  • the grouping of functions in the artificial intelligence network may take various forms, and is not limited to the examples in the embodiments of the present application.
  • This embodiment provides a neural network acceleration method, which is applied to the neural network acceleration device 20 described in the foregoing embodiments.
  • since the neural network acceleration device uses a chain structure, that is, the intermediate calculation result output by the current computing unit is used as the input feature data of the next computing unit, the scalability of the system is good; in addition, the characteristics of the intelligent algorithm structure and the in-memory computing array are fully utilized, which can not only reduce the amount of data transmission between the processor and the memory and the cost of data movement, thereby reducing power consumption, but also use the in-memory computing array to reduce the complexity of calculation, thereby improving the overall performance of the system.
  • the neural network acceleration device 20 described in the foregoing embodiments may be implemented in the form of hardware or in the form of software function modules. If it is implemented in the form of a software function module and is not sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the embodiment of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the method described in this embodiment.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
  • this embodiment provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by at least one processor, the neural network acceleration method described in any one of the preceding embodiments is implemented.
  • FIG. 9 shows a schematic diagram of a specific hardware structure of an electronic device provided by the embodiment of the present application.
  • the electronic device 90 may include a processor 901, and the processor 901 may call and run a computer program from a memory, so as to implement the neural network acceleration method described in any one of the foregoing embodiments.
  • the electronic device 90 may further include a memory 902 .
  • the processor 901 can call and run a computer program from the memory 902, so as to implement the neural network acceleration method described in any one of the foregoing embodiments.
  • the memory 902 may be an independent device independent of the processor 901 , or may be integrated in the processor 901 .
  • the electronic device 90 may further include a transceiver 903, and the processor 901 may control the transceiver 903 to communicate with other devices, specifically, to send information or data to other devices, or to receive information or data sent by other devices.
  • the transceiver 903 may include a transmitter and a receiver, and the transceiver 903 may further include an antenna, and the number of antennas may be one or more.
  • the electronic device 90 may specifically be a smart phone, a tablet computer, a palmtop computer, a notebook computer, a desktop computer, or another device described in the foregoing embodiments, or a device integrated with the neural network acceleration device 20 of any one of the foregoing embodiments.
  • the electronic device 90 can implement the corresponding processes described in the various methods of the embodiments of the present application, and for the sake of brevity, details are not repeated here.
  • FIG. 10 shows a schematic diagram of the composition and structure of a chip provided by an embodiment of the present application.
  • the chip 100 may include the neural network acceleration device 20 described in any one of the foregoing embodiments.
  • FIG. 11 shows a schematic diagram of a specific hardware structure of a chip provided by an embodiment of the present application.
  • the chip 100 may include a processor 1101 , and the processor 1101 may call and run a computer program from a memory, so as to implement the neural network acceleration method described in any one of the foregoing embodiments.
  • the chip 100 may further include a memory 1102 .
  • the processor 1101 can call and run a computer program from the memory 1102, so as to realize the neural network acceleration method described in any one of the foregoing embodiments.
  • the memory 1102 may be an independent device independent of the processor 1101 , or may be integrated in the processor 1101 .
  • the chip 100 may further include an input interface 1103 .
  • the processor 1101 can control the input interface 1103 to communicate with other devices or chips, specifically, can obtain information or data sent by other devices or chips.
  • the chip 100 may further include an output interface 1104 .
  • the processor 1101 can control the output interface 1104 to communicate with other devices or chips, specifically, can output information or data to other devices or chips.
  • the chip 100 can be applied to the electronic device described in the foregoing embodiments, and the chip can implement the corresponding processes described in the various methods of the embodiments of the present application, and for the sake of brevity, details are not repeated here.
  • the chip mentioned in the embodiments of the present application may also be called a system-level chip, a system chip, a chip system, or a system-on-chip, which is not limited herein.
  • the processor in the embodiment of the present application may be an integrated circuit chip, which has a signal processing capability.
  • each step of the above-mentioned method embodiments may be completed by an integrated logic circuit of hardware in a processor or instructions in the form of software.
  • the above-mentioned processor may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps of the above method in combination with its hardware.
  • the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • the non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory.
  • the volatile memory may be a random access memory (Random Access Memory, RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memories described herein are intended to include, without being limited to, these and any other suitable types of memory.
  • the embodiments described in this application may be implemented by hardware, software, firmware, middleware, microcode or a combination thereof.
  • the processing unit may be implemented in one or more application-specific integrated circuits (Application Specific Integrated Circuits, ASIC), digital signal processors (Digital Signal Processing, DSP), digital signal processing devices (DSP Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units used to perform the functions described in this application, or a combination thereof.
  • the techniques described herein can be implemented through modules (eg, procedures, functions, and so on) that perform the functions described herein.
  • Software codes can be stored in memory and executed by a processor. Memory can be implemented within the processor or external to the processor.
  • the neural network acceleration device includes several operation units, each operation unit includes an in-memory computing array and a first operator module, and the first operator module includes several first-type operators; the in-memory computing array is used to obtain input feature data and perform a convolution operation on the input feature data to obtain an initial calculation result; the first operator module is used to perform an operator operation on the initial calculation result through the first-type operators to obtain an intermediate calculation result, and to use the intermediate calculation result as the input feature data of the next operation unit.
  • the neural network acceleration device uses a chain structure, that is, the intermediate calculation result output by the current operation unit is used as the input feature data of the next operation unit, which gives the system good scalability; in addition, it makes full use of the characteristics of the intelligent algorithm structure and the in-memory computing array, which not only reduces the amount of data transferred between the processor and the memory and lowers the data-movement overhead, but also reduces the computational complexity by means of the in-memory computing array, thereby improving the overall performance of the system.
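As an illustration of this chained arrangement (not the claimed hardware itself), the following minimal Python sketch models each operation unit as a pre-loaded weight matrix standing in for the in-memory computing array plus a simple first-type operator, and chains four such units so that each unit's intermediate result becomes the next unit's input feature data; the layer sizes, the ReLU operator and the use of a plain matrix multiply in place of the analog convolution are assumptions made for the example.

    import numpy as np

    class OperationUnit:
        """One operation unit: a pre-loaded weight matrix (standing in for the
        in-memory computing array) plus a first-type operator."""
        def __init__(self, weights, operator):
            self.weights = weights      # assumed pre-stored in the CIM array
            self.operator = operator    # e.g. a ReLU-style activation

        def forward(self, features):
            initial = features @ self.weights   # convolution modelled as a matrix multiply
            return self.operator(initial)       # intermediate calculation result

    relu = lambda x: np.maximum(x, 0.0)
    units = [OperationUnit(np.random.randn(8, 8).astype(np.float32), relu)
             for _ in range(4)]                 # four chained units, as in the four-unit example

    x = np.random.randn(1, 8).astype(np.float32)  # input feature data
    for unit in units:                            # chain: unit i's output feeds unit i+1
        x = unit.forward(x)
    print(x.shape)                                # target output result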

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The present application discloses a neural network acceleration device, method, and apparatus, and a computer storage medium. The neural network acceleration device includes several operation units; each operation unit includes an in-memory computing array and a first operator module, and the first operator module includes several first-type operators. The in-memory computing array is used to obtain input feature data and perform a convolution operation on the input feature data to obtain an initial calculation result; the first operator module is used to perform an operator operation on the initial calculation result through the first-type operators to obtain an intermediate calculation result, and to use the intermediate calculation result as the input feature data of the next operation unit. In this way, not only can the amount of data transferred between the processor and the memory be reduced and the data-movement overhead be lowered, but the in-memory computing array can also be used to reduce the computational complexity, thereby improving the overall performance of the system.

Description

一种神经网络加速装置、方法、设备和计算机存储介质
相关申请的交叉引用
本申请要求在2021年12月23日提交中国专利局、申请号为202111592393.6、申请名称为“一种神经网络加速装置、方法、设备和计算机存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及存内计算技术领域,尤其涉及一种神经网络加速装置、方法、设备和计算机存储介质。
背景技术
近年来,神经网络在实际应用中取得了显著的成功,如图像分类和图标检测等,但这些成果在很大程度上依赖于具有大量参数和计算的复杂神经网络模型。目前,将这些需要大量计算和数据搬移的复杂神经网络模型,部署到基于冯.诺依曼架构的神经网络加速器上,将会出现所谓的存储墙(Memory Wall)问题,即数据搬移速度跟不上数据处理速度。
在冯.诺依曼架构中,虽然实现了计算单元和内存相分离,但是计算单元需要从内存中读取数据,然后再把计算结果写回到存储器。这样,即使增加再多的计算能力,由于读取数据速度的限制,整个系统的性能提升并不明显,而且大量的数据传输也将带来大量的功耗消耗。
发明内容
本申请的技术方案是这样实现的:
第一方面,本申请实施例提供了一种神经网络加速装置,该神经网络加速装置包括若干个运算单元,每一个运算单元包括存内计算阵列和第一算子模块,且第一算子模块中包括若干个第一类算子;其中,
存内计算阵列,用于获取输入特征数据,并对输入特征数据进行卷积操作,得到初始计算结果;
第一算子模块,用于通过第一类算子对初始计算结果进行算子操作,得到中间计算结果,并将中间计算结果作为下一个运算单元的输入特征数据。
第二方面,本申请实施例提供了一种神经网络加速方法,应用于神经网络加速装置,该神经网络加速装置包括若干个运算单元,且每一个运算单元包括存内计算阵列和第一算子模块;该方法包括:
通过存内计算阵列获取输入特征数据,并对输入特征数据进行卷积操作,得到初始计算结果;
通过第一算子模块内的第一类算子对初始计算结果进行算子操作,得到中间计算结果;
将中间计算结果作为下一个运算单元的输入特征数据,直至若干个运算单元全部处理完成,确定目标输出结果。
第三方面,本申请实施例提供了一种芯片,该芯片包括如第一方面所述的神经网络加速装置。
第四方面,本申请实施例提供了一种电子设备,该电子设备包括存储器和处理器;其 中,
存储器,用于存储能够在处理器上运行的计算机程序;
处理器,用于在运行计算机程序时,执行如第二方面所述的方法。
第五方面,本申请实施例提供了一种计算机可读存储介质,该计算机可读存储介质存储有计算机程序,计算机程序被处理器执行时实现如第二方面所述的方法。
附图说明
图1为一种人工智能加速器的架构示意图;
图2为本申请实施例提供的一种神经网络加速装置的组成结构示意图;
图3为本申请实施例提供的一种存内计算的基本结构示意图;
图4为本申请实施例提供的一种存内计算阵列的架构示意图;
图5为本申请实施例提供的一种运算单元的架构示意图;
图6为本申请实施例提供的一种神经网络加速装置的架构示意图;
图7为本申请实施例提供的一种神经网络结构的组成结构示意图;
图8为本申请实施例提供的一种神经网络加速方法的流程示意图;
图9为本申请实施例提供的一种电子设备的具体硬件结构示意图;
图10为本申请实施例提供的一种芯片的组成结构示意图;
图11为本申请实施例提供的一种芯片的具体硬件结构示意图。
具体实施方式
第一方面,本申请实施例提供了一种神经网络加速装置,神经网络加速装置包括若干个运算单元,运算单元包括存内计算阵列和第一算子模块,且第一算子模块中包括若干个第一类算子;其中,
存内计算阵列,用于获取输入特征数据,并对输入特征数据进行卷积操作,得到初始计算结果;
第一算子模块,用于通过第一类算子对初始计算结果进行算子操作,得到中间计算结果,并将中间计算结果作为下一个运算单元的输入特征数据。
在一些实施例中,存内计算阵列中预先存储有目标卷积层对应的权重参数;其中,
存内计算阵列,用于在获取到目标卷积层对应的输入特征数据后,根据权重参数对输入特征数据进行卷积操作,得到初始计算结果。
在一些实施例中,存内计算阵列包括数模转换模块、存储阵列和模数转换模块;其中,
数模转换模块,用于对输入特征数据进行数模转换,得到第一模拟信号;
存储阵列,用于根据权重参数和第一模拟信号进行乘累加计算,得到第二模拟信号;
模数转换模块,用于对第二模拟信号进行模数转换,得到目标数字信号,将目标数字信号确定为初始计算结果。
在一些实施例中,运算单元为第i个运算单元,且第i个运算单元中的存内计算阵列预先存储有第i卷积层对应的权重参数;其中,
存内计算阵列,用于获取第i卷积层对应的输入特征数据,并根据第i卷积层对应的权重参数对第i卷积层对应的输入特征数据进行卷积操作,得到第i卷积层的初始计算结果;
第一算子模块,用于通过第一类算子对第i卷积层的初始计算结果进行算子操作,得到第i卷积层的中间计算结果,并将第i卷积层的中间计算结果确定为第i+1卷积层对应的输入特征数据;
其中,i为大于零且小于或等于N的整数;N表示运算单元的个数,且N为大于零的 整数。
在一些实施例中,神经网络加速装置还包括接收单元;其中,
接收单元,用于接收特征图像,并将特征图像分割为至少一个特征块,以及按照顺序依次将至少一个特征块读入到运算单元中。
在一些实施例中,在若干个运算单元中,第一个运算单元的输入特征数据为第一特征块,在得到第一个运算单元输出的中间计算结果之后,将第一个运算单元输出的中间计算结果作为下一个运算单元的输入特征数据,并将下一个特征块作为第一个运算单元的输入特征数据,直至若干个运算单元全部处理完成。
在一些实施例中,神经网络加速装置还包括发送单元;其中,
发送单元,用于在若干个运算单元全部处理完成后,将所得到的目标输出结果向外发送。
在一些实施例中,神经网络加速装置还包括调度单元;其中,
调度单元,用于对若干个运算单元进行调度安排,以实现若干个运算单元对输入特征数据的处理。
在一些实施例中,调度单元,还用于对接收单元和发送单元进行调度安排,以实现在接收特征图像时调度接收单元进行处理,或者在得到目标输出结果之后调度发送单元进行向外发送。
在一些实施例中,神经网络加速装置还包括数字信号处理器;其中,
数字信号处理器,用于在无法使用第一类算子的情况下,对初始计算结果进行处理,得到中间计算结果。
在一些实施例中,第一类算子对应于适用于专用数字电路的加速运算,数字信号处理器用于处理除第一类算子之外的不适用于专用数字电路的运算;
第一类算子至少包括下述之一:用于执行池化操作的算子、用于执行激活函数操作的算子和用于执行加法操作的算子。
第二方面,本申请实施例提供了一种神经网络加速方法,其中,应用于神经网络加速装置,神经网络加速装置包括若干个运算单元,且每一个运算单元包括存内计算阵列和第一算子模块;该方法包括:
通过存内计算阵列获取输入特征数据,并对输入特征数据进行卷积操作,得到初始计算结果;
通过第一算子模块内的第一类算子对初始计算结果进行算子操作,得到中间计算结果;
将中间计算结果作为下一个运算单元的输入特征数据,直至若干个运算单元全部处理完成,确定目标输出结果。
在一些实施例中,存内计算阵列中预先存储有目标卷积层对应的权重参数;相应地,通过存内计算阵列获取输入特征数据,并对输入特征数据进行卷积操作,得到初始计算结果,包括:
在存内计算阵列获取到目标卷积层对应的输入特征数据后,根据权重参数对输入特征数据进行卷积操作,得到初始计算结果。
在一些实施例中,根据权重参数对输入特征数据进行卷积操作,得到初始计算结果,包括:
对输入特征数据进行数模转换,得到第一模拟信号;
根据权重参数和第一模拟信号进行乘累加计算,得到第二模拟信号;
对第二模拟信号进行模数转换,得到目标数字信号,并将目标数字信号确定为初始计算结果。
在一些实施例中,当第i个运算单元中的存内计算阵列预先存储有第i卷积层对应的权重参数时,该方法还包括:
通过存内计算阵列获取第i卷积层对应的输入特征数据,并根据第i卷积层对应的权重参数对第i卷积层对应的输入特征数据进行卷积操作,得到第i卷积层的初始计算结果;
通过第一算子模块内的第一类算子对第i卷积层的初始计算结果进行算子操作,得到第i卷积层的中间计算结果,并将第i卷积层的中间计算结果确定为第i+1卷积层对应的输入特征数据以及输入到第i+1个运算单元中进行相关处理;
其中,i为大于零且小于或等于N的整数;N表示运算单元的个数,且N为大于零的整数。
在一些实施例中,当第i个运算单元中的存内计算阵列预先存储有第i卷积层和第i+1层对应的权重参数时,该方法还包括:
通过存内计算阵列获取第i卷积层对应的输入特征数据,并根据第i卷积层对应的权重参数对第i卷积层对应的输入特征数据进行卷积操作,得到第i卷积层的初始计算结果;
通过第一算子模块内的第一类算子对第i卷积层的初始计算结果进行算子操作,得到第i卷积层的中间计算结果,并将第i卷积层的中间计算结果确定为第i+1卷积层对应的输入特征数据以及仍输入到第i个运算单元中进行相关处理;
在根据第i个运算单元得到第i+1卷积层的中间计算结果后,将第i+1卷积层的中间计算结果确定为第i+2卷积层对应的输入特征数据以及输入到第i+1个运算单元中进行相关处理;
其中,i为大于零且小于或等于N的整数;N表示运算单元的个数,且N为大于零的整数。
在一些实施例中,该方法还包括:
接收特征图像;
将特征图像分割为至少一个特征块,并按照顺序依次将至少一个特征块读入到运算单元中;
其中,在若干个运算单元中,第一个运算单元的输入特征数据为第一特征块,在得到第一个运算单元输出的中间计算结果之后,将第一个运算单元输出的中间计算结果作为下一个运算单元的输入特征数据,并将下一个特征块作为第一个运算单元的输入特征数据,直至若干个运算单元全部处理完成。
在一些实施例中,神经网络加速装置还包括数字信号处理器,该方法还包括:在无法使用第一类算子的情况下,通过数字信号处理器对初始计算结果进行处理,得到中间计算结果。
在一些实施例中,第一类算子对应于适用于专用数字电路的加速运算,数字信号处理器用于处理除第一类算子之外的不适用于专用数字电路的运算;
第一类算子至少包括下述之一:用于执行池化操作的算子、用于执行激活函数操作的算子和用于执行加法操作的算子。
第三方面,本申请实施例提供了一种芯片,该芯片包括如第一方面所述的神经网络加速装置。
第四方面,本申请实施例提供了一种电子设备,电子设备包括存储器和处理器;其中,
存储器,用于存储能够在处理器上运行的计算机程序;
处理器,用于在运行计算机程序时,执行如第二方面所述的方法。
第五方面,本申请实施例提供了一种计算机可读存储介质,计算机可读存储介质存储有计算机程序,计算机程序被处理器执行时实现如第二方面所述的方法。
为了能够更加详尽地了解本申请实施例的特点与技术内容,下面结合附图对本申请实施例的实现进行详细阐述,所附附图仅供参考说明之用,并非用来限定本申请实施例。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的,不是 旨在限制本申请。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。还需要指出,本申请实施例所涉及的术语“第一\第二\第三”仅是用于区别类似的对象,不代表针对对象的特定排序,可以理解地,“第一\第二\第三”在允许的情况下可以互换特定的顺序或先后次序,以使这里描述的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。
应理解,存内计算(In-Memory Computing,CIM)是目前新兴的一种计算架构,其是为了解决内存墙问题而提出的技术方案。其中,基于冯.诺依曼架构的计算机系统把存储器和处理器分割成了两个部分,而处理器频繁访问存储器的开销就形成了内存墙。存内计算就是把计算和存储合二为一,即在存储器的内部完成计算,从而实现减少处理器访问存储器的频率。相比于传统架构,存内计算具有高并行度、高能量效率的特定,对于需要大量并行矩阵向量乘法操作的算法,特别是神经网络算法,是一种更优的替代方案。
具体来讲,人工智能(Artificial Intelligence,AI)场景依赖的算法是一个庞大而复杂的网络结构,有很多参数需要存储,也需要完成大量的计算,这些计算中又会产生大量数据。在完成大量计算的过程中,一般来说,为了增大计算能力,应对更加复杂的处理场景,需要在处理引擎阵列中不断扩充其计算单元或称为处理单元(Process Engine,PE),例如乘累加单元是其中的核心单元,但是,随着计算单元的增多,需要调用的存储资源也在增大,然而整个系统的性能受制于存储单元的性能。整个算法的运算过程中,需要不断地从外部存储器中读入数据并把结果数据写回到存储器;使得在传输带宽一定的情况下,随着计算引擎运算能力的提升,每个计算单元能够使用存储器的带宽在逐渐减小,数据的传输能力成为AI芯片的瓶颈。
示例性地,图1示出了一种人工智能加速器的架构示意图。如图1所示,数据从存储器搬移到处理器中,然后由处理器中的PE阵列进行数据计算,再将结果写回到存储器;其中,PE阵列包括若干个PE。也就是说,对于目前的冯.诺伊曼架构,其基本结构是计算单元与存储器分离的架构,计算单元从存储器中读取数据,计算完成后再把结果写回到存储器。但近些年来,随着处理器性能不断地增长,存储器的性能提升相对来说缓慢,在日益增长的算法需求下,数据的搬运成为了系统的瓶颈,即使再增加计算能力,由于系统中读取数据速度的限制,整体性能的提升越发不明显。另外,除了性能上的限制以外,大量的数据传输也带来了大量的功耗消耗,在目前功耗要求越来越高的情况下,这也是一个需要亟需解决的问题。
本申请实施例提供了一种神经网络加速装置,该神经网络加速装置包括若干个运算单元,每一个运算单元包括存内计算阵列和第一算子模块,且第一算子模块中包括若干个第一类算子;其中,存内计算阵列,用于获取输入特征数据,并对输入特征数据进行卷积操作,得到初始计算结果;第一算子模块,用于通过第一类算子对初始计算结果进行算子操作,得到中间计算结果,并将中间计算结果作为下一个运算单元的输入特征数据。这样,该神经网络加速装置使用了链式结构,即当前运算单元输出的中间计算结果作为下一个运算单元的输入特征数据,使得系统规模的扩展性好;另外,充分利用了智能算法结构与存内计算阵列的特点,从而不仅可以减少处理器与存储器之间的数据传输量,降低数据搬运开销,进而降低功耗消耗;而且利用该存内计算阵列还能够降低计算的复杂度,进而提高了系统的整体性能。
下面将结合附图对本申请各实施例进行详细说明。
在本申请的一实施例中,参见图2,其示出了本申请实施例提供的一种神经网络加速装置的组成结构示意图。如图2所示,该神经网络加速装置20可以包括若干个运算单元,每一个运算单元可以包括存内计算阵列和第一算子模块,且第一算子模块中包括若干个第 一类算子;其中,
存内计算阵列,用于获取输入特征数据,并对输入特征数据进行卷积操作,得到初始计算结果;
第一算子模块,用于通过第一类算子对初始计算结果进行算子操作,得到中间计算结果,并将中间计算结果作为下一个运算单元的输入特征数据。
需要说明的是,在本申请实施例中,基于神经网络结构(如人工智能网络)的特点,可以对神经网络结构进行分组。具体地,神经网络结构可以包括若干个分组,其中,每一个分组包括卷积层和非卷积算子;如此,将这种算法结构映射到硬件架构中,使其与硬件架构中的运算单元相对应。在每一个分组中,卷积层可以是基于存内计算阵列来实现卷积操作的,非卷积算子可以是基于第一算子模块来实现算子操作的。
还需要说明的是,在本申请实施例中,神经网络加速装置可以包括若干个运算单元,而且当前运算单元输出的中间计算结果作为下一个运算单元的输入特征数据,即使用了链式结构,可以很方便对系统规模进行扩展。
可以理解的是,对于存内计算阵列而言,近些年虽然已经提出了存内计算的方式,也就是说,在存储单元内直接使用模拟电路进行乘法与累加的运算,无需把数据从存储单元搬运出来然后再使用基于数字电路的运算引擎计算,这种方案不仅大大减少了数据的传输量,而且省掉了大量的乘加运算。示例性地,人工智能的神经网络结构中,基本的运算是矩阵乘法运算,具体如式(1)所示,
    y_j = Σ_{i=1}^{4} x_i · w_{ij} ,    j = 1, 2, 3, 4        (1)
另外,对于使用传统冯.诺依曼架构实现的情况,可以借助于乘累加树来完成,其中包含有乘法器和加法器。而对于使用存内计算的方式,可以使用图3所示的存内计算基本结构进行简单示意。其中,黑色填充的单元用于存储权重参数的数值,在横向上施加电压,可以使用x 1,x 2,x 3,x 4来表征电压的大小;那么在纵向上,每一个黑色填充的单元所输出的模拟值,可以表示为x与w的乘积,那么每一列的输出可以使用y 1,y 2,y 3,y 4表示,其分别与上述式(1)中的矩阵乘法结果相匹配。
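The column-wise accumulation just described can be checked numerically; the short sketch below (NumPy, for illustration only, with arbitrary example values) computes each column output y_j as the sum of the products x_i · w_ij and verifies that this equals the matrix-vector product of equation (1).

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])   # input voltages x1..x4 applied along the rows
    W = np.random.rand(4, 4)              # W[i, j]: weight value stored in row i, column j

    # each column j sums the analog products of its cells, giving the column output y_j
    y_columns = np.array([sum(x[i] * W[i, j] for i in range(4)) for j in range(4)])

    # the same values expressed as the matrix-vector product of equation (1)
    assert np.allclose(y_columns, x @ W)
    print(y_columns)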
在本申请实施例中,为了避免权重数据在执行过程中连续的被加载,可以将其预先存储到存内计算阵列中。因此,在一些实施例中,存内计算阵列中预先存储有目标卷积层对应的权重参数;其中,
存内计算阵列,用于在获取到目标卷积层对应的输入特征数据后,根据权重参数对输入特征数据进行卷积操作,得到初始计算结果。
也就是说,如果当前运算单元中的存内计算阵列预先存储有目标卷积层对应的权重参数,那么当前运算单元将会对目标卷积层执行卷积操作。具体地,根据当前运算单元中的存内计算阵列,对目标卷积层对应的权重参数和目标卷积层对应的输入特征数据进行卷积操作,得到初始计算结果;然后根据当前运算单元中的第一计算模块,对初始计算结果进行算子操作,得到中间计算结果,继续将中间计算结果作为下一个运算单元的输入特征数据,依次类推,直至若干个运算单元全部处理完成。
还可以理解的是,对于存内计算阵列而言,参见图4,其示出了本申请实施例提供的一种存内计算阵列的架构示意图。如图4所示,该存内计算阵列40可以包括数模转换(Digital-to-Analog Conversion,DAC)模块401、存储阵列402和模数转换(Analog-to-Digital Conversion,ADC)模块403;其中,
数模转换模块401,用于对输入特征数据进行数模转换,得到第一模拟信号;
存储阵列402,用于根据权重参数和第一模拟信号进行乘累加计算,得到第二模拟信号;
模数转换模块403,用于对第二模拟信号进行模数转换,得到目标数字信号,将目标数字信号确定为初始计算结果。
需要说明的是,本申请实施例中的权重数据无需在执行过程中连续的被加载,只需要预先加载到存内计算阵列中的存储阵列中,利用相关元器件进行模拟数据计算,最后再通过模数转换模块403将其转换为目标数字信号进行输出。
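As a purely software illustration of the pipeline described above (digital input feature data → digital-to-analog conversion → multiply-accumulate in the storage array → analog-to-digital conversion → digital initial calculation result), the sketch below quantizes the inputs, performs the accumulation in floating point as a stand-in for the analog computation, and re-quantizes the result; the 8-bit resolution, the scaling and the array size are assumptions made for the example, not values taken from the application.

    import numpy as np

    def dac(codes, bits=8, full_scale=1.0):
        # digital-to-analog conversion: map integer codes to voltages in [0, full_scale]
        return codes.astype(np.float32) / (2 ** bits - 1) * full_scale

    def adc(voltages, bits=8):
        # analog-to-digital conversion: quantize voltages back to integer codes
        full_scale = max(float(voltages.max()), 1e-9)
        codes = np.round(voltages / full_scale * (2 ** bits - 1))
        return np.clip(codes, 0, 2 ** bits - 1).astype(np.int32)

    weights = np.random.rand(4, 4).astype(np.float32)   # assumed pre-stored in the storage array
    feature_codes = np.random.randint(0, 256, size=4)    # digital input feature data

    first_analog = dac(feature_codes)          # DAC module -> first analog signal
    second_analog = first_analog @ weights     # storage array multiply-accumulate -> second analog signal
    initial_result = adc(second_analog)        # ADC module -> target digital signal (initial calculation result)
    print(initial_result)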
示例性地,以其中一个运算单元为例,图5示出了本申请实施例提供的一种运算单元的架构示意图。如图5所示,该运算单元可以包括存内计算阵列40和第一算子模块50;其中,存储计算阵列40在经过模数转换后的目标数字信号可以与第一算子模块50进行交互。也就是说,对于人工智能网络而言,其不仅可以实现卷积算子的运算,而且人工智能网络中除了卷积层,还存在大量的其他算子,各个算子之间也需要进行数据的交互。
在本申请实施例中,第一类算子表示适用于专用数字电路的加速运算,而且第一类算子至少包括下述之一:用于执行池化操作的算子、用于执行激活函数操作的算子和用于执行加法操作的算子。也就是说,如图5所示,第一算子模块50中可以包括加法算子(Adder)、激活函数算子(Activation)和池化算子(Pooling)。
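The Adder, Activation and Pooling operators named here map to very simple tensor routines; an illustrative NumPy version of each is sketched below, where the choice of ReLU as the activation and of a non-overlapping 2x2 window for pooling are assumptions made for the example.

    import numpy as np

    def pooling_2x2_max(feature_map):
        # max pooling over non-overlapping 2x2 windows of an (H, W) feature map
        h, w = feature_map.shape[0] // 2 * 2, feature_map.shape[1] // 2 * 2
        clipped = feature_map[:h, :w]
        return clipped.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    def activation_relu(feature_map):
        # a simple activation function suited to a dedicated digital circuit
        return np.maximum(feature_map, 0.0)

    def adder(a, b):
        # element-wise addition, e.g. for merging two branches
        return a + b

    fmap = np.random.randn(4, 4)
    print(pooling_2x2_max(fmap).shape, activation_relu(fmap).min(), adder(fmap, fmap).shape)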
除此之外,对于人工智能网络中不适用于专用数字电路的加速运算,则不能够使用第一类算子进行算子操作。因此,在一些实施例中,神经网络加速装置20还包括数字信号处理器(Digital Signal Processor,DSP);其中,
数字信号处理器,用于在无法使用第一类算子的情况下,对初始计算结果进行处理,得到中间计算结果。
需要说明的是,在本申请实施例中,第一类算子对应于适用于专用数字电路的加速运算,数字信号处理器用于处理除所述第一类算子之外的不适用于专用数字电路的运算。也就是说,数字信号处理器主要是处理无法使用第一类算子的情况,例如比较复杂的sigmoid激活函数、tanh激活函数、或者softmax激活函数等。
还需要说明的是,在本申请实施例中,第一算子模块还可以称为固定函数(Fixed Function)模块,其主要使用加法算子、激活函数算子和池化算子等适用于专用数字电路进行加速运算;而对于不适用于专用数字电路的运算情况,这时候通常使用数字信号处理器即DSP来完成。
在这里,由于存内计算只能适用于矩阵乘法运算,所以对于人工智能网络来说,其可以实现卷积算子的运算,但是人工智能网络中除了卷积层,还存在大量的其他算子,各个算子之间还需要进行数据的交互,可以根据已有的CIM单元来构建出基于CIM的人工智能加速器,即本申请实施例所述的神经网络加速装置20,其基本架构如图6所示。在图6中,若干个运算单元可以为四个,即运算单元1、运算单元2、运算单元3和运算单元4,运算单元1中可以包括存内计算阵列1和第一算子模块1,运算单元2中可以包括存内计算阵列2和第一算子模块2,运算单元3中可以包括存内计算阵列3和第一算子模块3,运算单元4中可以包括存内计算阵列4和第一算子模块4;其中,存内计算阵列(例如,存内计算阵列1、存内计算阵列2、存内计算阵列3或者存内计算阵列4)包含了数模转换模块、存储阵列和模数转换模块,而数模转换模块和模数转换模块分别放置于存内计算阵列的数据输入端和数据输出端,原因在于存内计算利用模拟信号进行处理;第一算子模块(例如,第一算子模块1、第一算子模块2、第一算子模块3或者第一算子模块4)为人工智能算法中的其他常用算子,例如池化、激活函数、加法等适合于使用专用数字电路实现的部分,可称之为fixed function;而对于人工智能算法中的一些不适合专用数字电路实现的加速运算,例如sigmoid激活函数、tanh激活函数或者softmax激活函数等,其可以使用DSP来完成。
进一步地,在一些实施例中,在图6所示神经网络加速装置20的基础上,如图6所示,该神经网络加速装置20还可以包括接收单元;其中,
接收单元,用于接收特征图像,并将特征图像分割为至少一个特征块,以及按照顺序依次将至少一个特征块读入到运算单元中。
进一步地,在一些实施例中,在若干个运算单元中,第一个运算单元的输入特征数据为第一特征块,在得到第一个运算单元输出的中间计算结果之后,将第一个运算单元输出的中间计算结果作为下一个运算单元的输入特征数据,并将下一个特征块作为第一个运算单元的输入特征数据,直至若干个运算单元全部处理完成。
也就是说,结合图6,在这四个运算单元中,运算单元1的输入特征数据是由接收单元提供的;运算单元1的输出作为运算单元2的输入,运算单元2的输出作为运算单元3的输入,运算单元3的输出作为运算单元4的输入,直至这四个运算单元全部处理完成,得到目标输出结果。在该过程中,如果在人工智能算法中出现了第一算子模块中并未包含的算子,那么可以通过数字信号处理器进行协助处理。
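The block-wise pipelining described above can be pictured as a simple software schedule: the feature image is split into blocks, at every step each operation unit hands its intermediate result to the next unit, and unit 1 immediately takes in the next feature block, until all blocks have passed through all units. The sketch below models this schedule in Python; the four-unit depth, the block size and the dummy per-unit computation are assumptions for illustration.

    import numpy as np

    def split_into_blocks(feature_image, block_rows):
        # receiving unit: split the feature image into blocks and read them in order
        return [feature_image[i:i + block_rows]
                for i in range(0, feature_image.shape[0], block_rows)]

    def run_pipeline(blocks, units):
        stages = [None] * len(units)    # the block currently held by each operation unit
        outputs = []
        pending = list(blocks)
        while pending or any(stage is not None for stage in stages):
            # move data one stage forward, starting from the last unit
            for i in reversed(range(len(units))):
                if stages[i] is None:
                    continue
                result = units[i](stages[i])
                stages[i] = None
                if i + 1 < len(units):
                    stages[i + 1] = result       # intermediate result -> next unit's input
                else:
                    outputs.append(result)       # target output result, sent outwards
            if pending:
                stages[0] = pending.pop(0)       # the next feature block enters unit 1
        return outputs

    units = [lambda x, w=np.random.rand(4, 4): np.maximum(x @ w, 0.0) for _ in range(4)]
    feature_image = np.random.rand(16, 4)
    print(len(run_pipeline(split_into_blocks(feature_image, 4), units)))   # 4 blocks out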
还需要说明的是,在一些实施例中,在图6所示神经网络加速装置20的基础上,如图6所示,该神经网络加速装置20还可以包括发送单元和调度单元;其中,发送单元,可以用于在若干个运算单元全部处理完成后,将所得到的目标输出结果向外发送;调度单元,可以用于对若干个运算单元进行调度安排,以实现这若干个运算单元对输入特征数据的处理;另外,调度单元也可以是实现对接收单元和发送单元的调度,以便在需要接收特征图像时调度接收单元进行处理,或者在得到目标输出结果之后调度发送单元将其发送出去。
还可以理解的是,在本申请实施例中,可以对神经网络结构(如人工智能网络)进行分组,即该神经网络结构可以包括若干个分组;其中,每一个分组包括卷积层和算子层,且在每一个分组中,卷积层是基于存内计算阵列实现卷积操作的,算子层是基于第一算子模块或者数字信号处理器实现算子操作的。参见图7,其示出了本申请实施例提供的一种神经网络结构的组成结构示意图。如图7所示,该神经网络结构可以划分为卷积层0(用Conv0表示)、算子0(用FF0表示)、卷积层1(用Conv1表示)、算子1(用FF1表示)、卷积层2(用Conv2表示)、算子2(用FF2表示)、卷积层3(用Conv3表示)、算子3(用FF3表示)等等;其中,Conv0和FF0为一个分组,Conv1和FF1为一个分组,Conv2和FF2为一个分组,Conv3和FF3为一个分组。在这里,通常情况下,FF0、FF1、FF2和FF3等算子优先采用第一算子模块内的第一类算子进行算子操作;但是当不适用于第一类算子时,本申请实施例也可以通过数字信号处理器进行协助处理。
在一种可能的实施方式中,假定运算单元为第i个运算单元,且第i个运算单元中的存内计算阵列预先存储有第i卷积层对应的权重参数;其中,
存内计算阵列,用于获取第i卷积层对应的输入特征数据,并根据第i卷积层对应的权重参数对第i卷积层对应的输入特征数据进行卷积操作,得到第i卷积层的初始计算结果;
第一算子模块,用于通过第一类算子对第i卷积层的初始计算结果进行算子操作,得到第i卷积层的中间计算结果,并将第i卷积层的中间计算结果确定为第i+1卷积层对应的输入特征数据。
需要说明的是,在得到第i+1卷积层对应的输入特征数据后,由于第i+1卷积层对应的权重参数预先存储在第i+1个运算单元中的存内计算阵列,那么可以将其输入到第i+1个运算单元中进行相关处理。其中,i为大于零且小于或等于N的整数;N表示运算单元的个数,且N为大于零的整数。
在另一种可能的实施方式中,假定运算单元为第i个运算单元,且第i个运算单元中的存内计算阵列预先存储有第i卷积层和第i+1卷积层对应的权重参数;其中,
存内计算阵列,用于获取第i卷积层对应的输入特征数据,并根据第i卷积层对应的权重参数对第i卷积层对应的输入特征数据进行卷积操作,得到第i卷积层的初始计算结 果;
第一算子模块,用于通过第一类算子对第i卷积层的初始计算结果进行算子操作,得到第i卷积层的中间计算结果,并将第i卷积层的中间计算结果确定为第i+1卷积层对应的输入特征数据以及仍输入到第i个运算单元中进行相关处理。
需要说明的是,在得到第i+1卷积层对应的输入特征数据后,由于第i+1卷积层对应的权重参数仍然预先存储在第i个运算单元中的存内计算阵列,那么可以将其仍输入到第i个运算单元中进行相关处理;在根据第i个运算单元得到第i+1卷积层的中间计算结果后,将第i+1卷积层的中间计算结果确定为第i+2卷积层对应的输入特征数据;由于第i+2卷积层对应的权重参数预先存储在第i+1个运算单元中的存内计算阵列,这时候需要将第i+2卷积层对应的输入特征数据输入到第i+1个运算单元中进行相关处理。其中,i为大于零且小于或等于N的整数;N表示运算单元的个数,且N为大于零的整数。
具体来讲,如图7所示,其示出了一种神经网络结构的通用结构示意。其中,卷积层使用的权重数据需要提前固化到存内计算阵列中,如图3所示,由于神经网络结构中的卷积层数目较多,每一个卷积层的运算都包含大量的权重数据,而系统中用于存储权重数据的存内计算阵列的总大小固定,按照图6所示的神经网络加速装置20,这里设置有四个运算单元,每一个运算单元包括存内计算阵列和第一算子模块;所以每个存内计算阵列中可能存储1个或多个卷积层的参数。示例性地,假设图7中的Conv0和Conv1对应的权重参数预先存储到了图6中的存内计算阵列1中,由于权重数据已经提前加载到存内计算阵列1中,那么接下来需要把特征图像进行分割,然后按照顺序依次读入到存内计算阵列1中;具体可以是通过数模转换模块将其转换为模拟信号,通过存储阵列计算得到乘累加的模拟信号,再通过模数转换模块将其转换为数字信号送入第一算子模块中进行FF0算子的运算;接下来需要运算的是Conv1,而Conv1中的权重参数仍旧被预先存储到存内计算阵列1中,所以在图6中,FF0模块的输出需要继续送入到存内计算阵列1中,以此类推,直到输入的特征数据完全执行完成算子网络中的前三层(Conv0,FF0,Conv1);然后在将所得到的结果数据送入到存内计算阵列2中,而下一帧的特征数据继续送入到存内计算阵列1中做处理。如果在人工智能算法中出现了第一算子模块中并未包含的其他算子,这时候可以需要DSP进行协助处理;在四个运算单元全部处理完成之后,把最终的结果数据送回。
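The mapping just walked through, in which the weights of more than one convolutional layer can reside in the same in-memory computing array so that a feature block loops inside that operation unit for its first few layers before moving on to the next array, can be sketched in software as a per-unit program of (weights, operator) pairs; the layer sizes, the exact grouping and the ReLU operator below are assumptions for illustration, not the mapping required by the application.

    import numpy as np

    relu = lambda x: np.maximum(x, 0.0)

    # per-unit "program": ordered (weights, operator) pairs resident in that unit's CIM array
    unit_programs = {
        1: [(np.random.rand(8, 8), relu),    # Conv0 followed by FF0
            (np.random.rand(8, 8), relu)],   # Conv1 (weights also held in array 1) followed by FF1
        2: [(np.random.rand(8, 8), relu)],   # Conv2 followed by FF2
    }

    def run_unit(unit_id, features):
        # a feature block keeps looping inside the unit until every layer mapped to it is done
        for weights, operator in unit_programs[unit_id]:
            features = operator(features @ weights)
        return features

    block = np.random.rand(1, 8)
    out = run_unit(2, run_unit(1, block))    # unit 1's result is then sent into unit 2
    print(out.shape)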
也就是说,结合人工智能网络本身的特点,对人工智能网络进行分组,每一个分组中包含了卷积操作的卷积层与非卷积算子,并且把这种算法结构映射到如图6所示的硬件架构中,基于运算单元来实现每一卷积层和算子层的功能,每一个运算单元包含了存内计算阵列和第一算子模块,一个运算单元为图6中的一个虚线框,而且一个运算单元可以针对算法结构中的多个分组进行运算,当结束之后再把运算结果传入到下一个运算单元中。该架构充分结合了人工智能算法结构与存内计算阵列的特点,大大减小了数据的传输量。
除此之外,在本申请实施例中,由于整体架构使用了链式结构,可以很方便对系统规模进行扩展。并不限于本申请实施例用于说明的四级传输架构。另外,对于图6所示架构中的第一算子模块,可以是任意的适合专用加速电路实现的算法。此外,对人工智能网络中的功能分组可以有多种形式,并不局限于图7所示的示例。
本实施例提供了一种神经网络加速装置,该神经网络加速装置包括若干个运算单元,每一个运算单元包括存内计算阵列和第一算子模块,且第一算子模块中包括若干个第一类算子;其中,存内计算阵列,用于获取输入特征数据,并对输入特征数据进行卷积操作,得到初始计算结果;第一算子模块,用于通过第一类算子对初始计算结果进行算子操作,得到中间计算结果,并将中间计算结果作为下一个运算单元的输入特征数据。这样,该神经网络加速装置使用了链式结构,即当前运算单元输出的中间计算结果作为下一个运算单元的输入特征数据,使得系统规模的扩展性好;另外,充分利用了智能算法结构与存内计算阵列的特点,从而不仅可以减少处理器与存储器之间的数据传输量,降低数据搬运开销, 进而降低功耗消耗;而且利用该存内计算阵列还能够降低计算的复杂度,进而提高了系统的整体性能。
在本申请的另一实施例中,参见图8,其示出了本申请实施例提供的一种神经网络加速方法的流程示意图。如图8所示,该方法可以包括:
S801:通过存内计算阵列获取输入特征数据,并对输入特征数据进行卷积操作,得到初始计算结果。
S802:通过第一算子模块内的第一类算子对初始计算结果进行算子操作,得到中间计算结果。
S803:将中间计算结果作为下一个运算单元的输入特征数据,直至若干个运算单元全部处理完成,确定目标输出结果。
需要说明的是,本申请实施例应用于前述实施例所述的神经网络加速装置20,该神经网络加速装置可以包括若干个运算单元,而且每一个运算单元包括存内计算阵列和第一算子模块;同时当前运算单元输出的中间计算结果作为下一个运算单元的输入特征数据,即使用了链式结构,可以很方便对系统规模进行扩展。
在本申请实施例中,为了避免权重数据在执行过程中连续的被加载,可以将其预先存储到存内计算阵列中。也就是说,存内计算阵列中预先存储有目标卷积层对应的权重参数;相应地,在一些实施例中,对于S801来说,所述通过存内计算阵列获取输入特征数据,并对输入特征数据进行卷积操作,得到初始计算结果,可以包括:
在存内计算阵列获取到目标卷积层对应的输入特征数据后,根据权重参数对输入特征数据进行卷积操作,得到初始计算结果。
在一种具体的实施例中,所述根据权重参数对输入特征数据进行卷积操作,得到初始计算结果,可以包括:
对输入特征数据进行数模转换,得到第一模拟信号;
根据权重参数和第一模拟信号进行乘累加计算,得到第二模拟信号;
对第二模拟信号进行模数转换,得到目标数字信号,并将目标数字信号确定为初始计算结果。
需要说明的是,对于存内计算阵列而言,其可以包括数模转换模块、存储阵列和模数转换模块,而且数模转换模块位于存内计算阵列的数据输入端,模数转换模块位于存内计算阵列的数据输出端。
在这里,数模转换模块用于对输入特征数据进行数模转换,以得到第一模拟信号;存储阵列用于根据权重参数和第一模拟信号进行乘累加计算,以得到第二模拟信号;模数转换模块用于对第二模拟信号进行模数转换,以得到目标数字信号,这里的目标数字信号即为初始计算结果,然后发送给第一算子模块进行算子操作。
进一步地,在一些实施例中,神经网络加速装置还可以包括数字信号处理器。相应地,该方法还可以包括:在无法使用第一类算子的情况下,通过数字信号处理器对初始计算结果进行处理,得到中间计算结果。
需要说明的是,在本申请实施例中,第一类算子对应于适用于专用数字电路的加速运算,可以称为Fixed Function模块;数字信号处理器用于处理除所述第一类算子之外的不适用于专用数字电路的运算,也就是说,对于不适用于专用数字电路的运算情况,这时候通常使用数字信号处理器即DSP来完成。
还需要说明的是,第一类算子至少可以包括下述之一:用于执行池化操作的算子(即池化算子)、用于执行激活函数操作的算子(即激活函数算子)和用于执行加法操作的算子(即加法算子);数字信号处理器主要是处理无法使用第一类算子的情况,例如比较复杂的sigmoid激活函数、tanh激活函数、或者softmax激活函数等。需要注意的是,第一类算子中的激活函数算子并不包括sigmoid激活函数、tanh激活函数、softmax激活函数等算 子。
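The division of labour described here, with pooling/activation/addition handled by the first operator module and operators such as sigmoid, tanh or softmax handed to the DSP, can be sketched as a simple dispatch table; the concrete implementations below merely stand in for whatever the fixed-function circuits and the DSP would actually compute, and the operator names are illustrative.

    import numpy as np

    FIXED_FUNCTION_OPS = {
        "relu": lambda x: np.maximum(x, 0.0),            # activation suited to a dedicated circuit
        "pool": lambda x: x.reshape(-1, 2).max(axis=1),  # pairwise max pooling on a 1-D vector
        "add":  lambda x: x + x,                          # stand-in for an element-wise addition
    }

    def dsp_fallback(op_name, x):
        # operators that do not suit a dedicated digital circuit are handed to the DSP
        if op_name == "sigmoid":
            return 1.0 / (1.0 + np.exp(-x))
        if op_name == "softmax":
            e = np.exp(x - x.max())
            return e / e.sum()
        raise ValueError(f"unsupported operator: {op_name}")

    def apply_operator(op_name, x):
        if op_name in FIXED_FUNCTION_OPS:
            return FIXED_FUNCTION_OPS[op_name](x)   # first operator module
        return dsp_fallback(op_name, x)             # digital signal processor

    v = np.random.rand(8)
    print(apply_operator("pool", v).shape, float(apply_operator("softmax", v).sum()))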
进一步地,在一些实施例中,该方法还可以包括:接收特征图像;将特征图像分割为至少一个特征块,并按照顺序依次将至少一个特征块读入到运算单元中。
需要说明的是,在神经网络加速装置的若干个运算单元中,第一个运算单元的输入特征数据为第一特征块,在得到第一个运算单元输出的中间计算结果之后,将第一个运算单元输出的中间计算结果作为下一个运算单元的输入特征数据,并将下一个特征块作为第一个运算单元的输入特征数据,直至若干个运算单元全部处理完成。
也就是说,以图6为例,在这四个运算单元中,运算单元1的输入特征数据是由接收单元提供的;运算单元1的输出作为运算单元2的输入,运算单元2的输出作为运算单元3的输入,运算单元3的输出作为运算单元4的输入,直至这四个运算单元全部处理完成,得到目标输出结果。在该过程中,如果在人工智能算法中出现了第一算子模块中并未包含的算子,那么可以通过数字信号处理器进行协助处理,增加了算法的通用性。
还需要说明的是,在本申请实施例中,神经网络结构可以包括若干个分组;其中,每一个分组包括卷积层和算子层,且在每一个分组中,卷积层可以是基于存内计算阵列实现卷积操作的,算子层可以是基于第一算子模块或者数字信号处理器实现算子操作的。
在一种可能的实施方式中,当第i个运算单元中的存内计算阵列预先存储有第i卷积层对应的权重参数时,该方法还可以包括:
通过存内计算阵列获取第i卷积层对应的输入特征数据,并根据第i卷积层对应的权重参数对第i卷积层对应的输入特征数据进行卷积操作,得到第i卷积层的初始计算结果;
通过第一算子模块内的第一类算子对第i卷积层的初始计算结果进行算子操作,得到第i卷积层的中间计算结果,将第i卷积层的中间计算结果确定为第i+1卷积层对应的输入特征数据以及输入到第i+1个运算单元中进行相关处理。
在另一种可能的实施方式中,当第i个运算单元中的存内计算阵列预先存储有第i卷积层和第i+1层对应的权重参数时,该方法还可以包括:
通过存内计算阵列获取第i卷积层对应的输入特征数据,并根据第i卷积层对应的权重参数对第i卷积层对应的输入特征数据进行卷积操作,得到第i卷积层的初始计算结果;
通过第一算子模块内的第一类算子对第i卷积层的初始计算结果进行算子操作,得到第i卷积层的中间计算结果,将第i卷积层的中间计算结果确定为第i+1卷积层对应的输入特征数据以及仍输入到第i个运算单元中进行相关处理;
在根据第i个运算单元得到第i+1卷积层的中间计算结果后,将第i+1卷积层的中间计算结果确定为第i+2卷积层对应的输入特征数据以及输入到第i+1个运算单元中进行相关处理。
在这里,i为大于零且小于或等于N的整数;N表示所述运算单元的个数,且N为大于零的整数。
需要说明的是,在得到第i+1卷积层对应的输入特征数据后,如果第i+1卷积层对应的权重参数预先存储在第i+1个运算单元中的存内计算阵列,那么可以将其输入到第i+1个运算单元中进行相关处理;如果第i+1卷积层对应的权重参数仍然预先存储在第i个运算单元中的存内计算阵列,那么可以将其仍输入到第i个运算单元中进行相关处理;在根据第i个运算单元得到第i+1卷积层的中间计算结果后,将第i+1卷积层的中间计算结果确定为第i+2卷积层对应的输入特征数据;由于第i+2卷积层对应的权重参数预先存储在第i+1个运算单元中的存内计算阵列,这时候需要将第i+2卷积层对应的输入特征数据输入到第i+1个运算单元中进行相关处理,直至N个运算单元全部处理完成。
简言之,传统冯.诺依曼架构以计算单元为中心,存在大量的数据搬运。随着人工智能场景的复杂化,算法需要处理的数据量越来越多,基于传统架构进行性能提升的幅度越来越小,本申请实施例的技术方案是基于比较成熟的存内计算方案,可以实现卷积的运算, 并结合非卷积算子的特点,使得整体架构可以实现通用人工智能网络的功能,权重参数无需在执行过程中连续的被加载,只需要预先加载到存内计算存储单元中,然后利用元器件进行模拟数据计算,并可以通过数模转换模块与外部的非卷积类算子进行交互;另外,为了增加算法的通用性,本申请实施例还增加了一个DSP使得算子的实用性得到大大扩展。
除此之外,在本申请实施例中,由于整体架构使用了链式结构,可以很方便对系统规模进行扩展。并不限于本申请实施例用于说明的四级传输架构。另外,对于图6所示架构中的第一算子模块,可以是任意的适合专用加速电路实现的算子。此外,对人工智能网络中的功能分组可以有多种形式,并不局限于本申请实施例中的示例。
本实施例提供了一种神经网络加速方法,该方法应用于前述实施例所述的神经网络加速装置20。通过存内计算阵列获取输入特征数据,并对输入特征数据进行卷积操作,得到初始计算结果;通过第一算子模块内的第一类算子对初始计算结果进行算子操作,得到中间计算结果;将中间计算结果作为下一个运算单元的输入特征数据,直至若干个运算单元全部处理完成,确定目标输出结果。这样,由于神经网络加速装置使用了链式结构,即当前运算单元输出的中间计算结果作为下一个运算单元的输入特征数据,使得系统规模的扩展性好;另外,充分利用了智能算法结构与存内计算阵列的特点,从而不仅可以减少处理器与存储器之间的数据传输量,降低数据搬运开销,进而降低功耗消耗;而且利用该存内计算阵列还能够降低计算的复杂度,进而提高了系统的整体性能。
在本申请的又一实施例中,对于前述实施例所述的神经网络加速装置20,其既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。如果以软件功能模块的形式实现并非作为独立的产品进行销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或processor(处理器)执行本实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
因此,本实施例提供了一种计算机可读存储介质,该计算机可读存储介质存储有计算机程序,所述计算机程序被至少一个处理器执行时实现前述实施例中任一项所述的神经网络加速方法。
本申请的再一实施例中,基于前述神经网络加速装置20的组成及计算机可读存储介质,参见图9,其示出了本申请实施例提供的一种电子设备的具体硬件结构示意图。如图9所示,电子设备90可以包括处理器901,处理器901可以从存储器中调用并运行计算机程序,以实现前述实施例中任一项所述的神经网络加速方法。
可选地,如图9所示,电子设备90还可以包括存储器902。其中,处理器901可以从存储器902中调用并运行计算机程序,以实现前述实施例中任一项所述的神经网络加速方法。
其中,存储器902可以是独立于处理器901的一个单独的器件,也可以集成在处理器901中。
可选地,如图9所示,电子设备90还可以包括收发器903,处理器901可以控制该收发器903与其他设备进行通信,具体地,可以向其他设备发送信息或数据,或接收其他设备发送的信息或数据。
其中,收发器903可以包括发射机和接收机,收发器903还可以进一步包括天线,天线的数量可以为一个或多个。
可选地,电子设备90具体可为前述实施例所述的智能手机、平板电脑、掌上电脑、笔记本电脑、台式计算机等设备,或者集成有前述实施例中任一项所述神经网络加速装置 20的设备。这里,该电子设备90可以实现本申请实施例的各个方法中所述的相应流程,为了简洁,在此不再赘述。
本申请的再一实施例中,基于前述神经网络加速装置20的组成及计算机可读存储介质,在一种可能的示例中,参见图10,其示出了本申请实施例提供的一种芯片的组成结构示意图。如图10所示,芯片100可以包括前述实施例任一项所述的神经网络加速装置20。
在另一种可能的示例中,参见图11,其示出了本申请实施例提供的一种芯片的具体硬件结构示意图。如图11所示,芯片100可以包括处理器1101,处理器1101可以从存储器中调用并运行计算机程序,以实现前述实施例中任一项所述的神经网络加速方法。
可选地,如图11所示,芯片100还可以包括存储器1102。其中,处理器1101可以从存储器1102中调用并运行计算机程序,以实现前述实施例中任一项所述的神经网络加速方法。需要注意的是,存储器1102可以是独立于处理器1101的一个单独的器件,也可以集成在处理器1101中。
可选地,如图11所示,芯片100还可以包括输入接口1103。其中,处理器1101可以控制该输入接口1103与其他设备或芯片进行通信,具体地,可以获取其他设备或芯片发送的信息或数据。
可选地,如图11所示,芯片100还可以包括输出接口1104。其中,处理器1101可以控制该输出接口1104与其他设备或芯片进行通信,具体地,可以向其他设备或芯片输出信息或数据。
可选地,芯片100可应用于前述实施例所述的电子设备,并且该芯片可以实现本申请实施例的各个方法中所述的相应流程,为了简洁,在此不再赘述。
应理解,本申请实施例提到的芯片还可以称为系统级芯片,系统芯片,芯片系统或片上系统芯片等,这里不作任何限定。
需要说明的是,本申请实施例的处理器可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法的步骤。
还需要说明的是,本申请实施例中的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDRSDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步链动态随机存取存储器(Synchronous link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DRRAM)。 应注意,本申请描述的系统和方法的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
可以理解地,本申请描述的这些实施例可以用硬件、软件、固件、中间件、微码或其组合来实现。对于硬件实现,处理单元可以实现在一个或多个专用集成电路(Application Specific Integrated Circuits,ASIC)、数字信号处理器(Digital Signal Processing,DSP)、数字信号处理设备(DSP Device,DSPD)、可编程逻辑设备(Programmable Logic Device,PLD)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)、通用处理器、控制器、微控制器、微处理器、用于执行本申请所述功能的其它电子单元或其组合中。对于软件实现,可通过执行本申请所述功能的模块(例如过程、函数等)来实现本申请所述的技术。软件代码可存储在存储器中并通过处理器执行。存储器可以在处理器中或在处理器外部实现。
本领域普通技术人员可以意识到,结合本申请中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
需要说明的是,在本申请中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
本申请所提供的几个方法实施例中所揭露的方法,在不冲突的情况下可以任意组合,得到新的方法实施例。
本申请所提供的几个产品实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的产品实施例。
本申请所提供的几个方法或设备实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的方法实施例或设备实施例。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。
工业实用性
本申请实施例中,该神经网络加速装置包括若干个运算单元,每一个运算单元包括存内计算阵列和第一算子模块,且第一算子模块中包括若干个第一类算子;其中,存内计算阵列,用于获取输入特征数据,并对输入特征数据进行卷积操作,得到初始计算结果;第一算子模块,用于通过第一类算子对初始计算结果进行算子操作,得到中间计算结果,并将中间计算结果作为下一个运算单元的输入特征数据。这样,该神经网络加速装置使用了链式结构,即当前运算单元输出的中间计算结果作为下一个运算单元的输入特征数据,使得系统规模的扩展性好;另外,充分利用了智能算法结构与存内计算阵列的特点,从而不仅可以减少处理器与存储器之间的数据传输量,降低数据搬运开销,而且利用该存内计算阵列还能够降低计算的复杂度,进而提高了系统的整体性能。

Claims (20)

  1. 一种神经网络加速装置,所述神经网络加速装置包括若干个运算单元,所述运算单元包括存内计算阵列和第一算子模块,且所述第一算子模块中包括若干个第一类算子;其中,
    所述存内计算阵列,用于获取输入特征数据,并对所述输入特征数据进行卷积操作,得到初始计算结果;
    所述第一算子模块,用于通过所述第一类算子对所述初始计算结果进行算子操作,得到中间计算结果,并将所述中间计算结果作为下一个所述运算单元的输入特征数据。
  2. 根据权利要求1所述的神经网络加速装置,其中,所述存内计算阵列中预先存储有目标卷积层对应的权重参数;其中,
    所述存内计算阵列,用于在获取到所述目标卷积层对应的输入特征数据后,根据所述权重参数对所述输入特征数据进行卷积操作,得到所述初始计算结果。
  3. 根据权利要求2所述的神经网络加速装置,其中,所述存内计算阵列包括数模转换模块、存储阵列和模数转换模块;其中,
    所述数模转换模块,用于对所述输入特征数据进行数模转换,得到第一模拟信号;
    所述存储阵列,用于根据所述权重参数和所述第一模拟信号进行乘累加计算,得到第二模拟信号;
    所述模数转换模块,用于对所述第二模拟信号进行模数转换,得到目标数字信号,将所述目标数字信号确定为所述初始计算结果。
  4. 根据权利要求2所述的神经网络加速装置,其中,所述运算单元为第i个运算单元,且所述第i个运算单元中的存内计算阵列预先存储有第i卷积层对应的权重参数;其中,
    所述存内计算阵列,用于获取所述第i卷积层对应的输入特征数据,并根据所述第i卷积层对应的权重参数对所述第i卷积层对应的输入特征数据进行卷积操作,得到所述第i卷积层的初始计算结果;
    所述第一算子模块,用于通过所述第一类算子对所述第i卷积层的初始计算结果进行算子操作,得到所述第i卷积层的中间计算结果,并将所述第i卷积层的中间计算结果确定为第i+1卷积层对应的输入特征数据;
    其中,i为大于零且小于或等于N的整数;N表示所述运算单元的个数,且N为大于零的整数。
  5. 根据权利要求1所述的神经网络加速装置,其中,所述神经网络加速装置还包括接收单元;其中,
    所述接收单元,用于接收特征图像,并将所述特征图像分割为至少一个特征块,以及按照顺序依次将所述至少一个特征块读入到所述运算单元中。
  6. 根据权利要求5所述的神经网络加速装置,其中,
    在所述若干个运算单元中,第一个运算单元的输入特征数据为第一特征块,在得到第一个运算单元输出的中间计算结果之后,将所述第一个运算单元输出的中间计算结果作为下一个运算单元的输入特征数据,并将下一个特征块作为所述第一个运算单元的输入特征数据,直至所述若干个运算单元全部处理完成。
  7. 根据权利要求6所述的神经网络加速装置,其中,所述神经网络加速装置还包括发送单元;其中,
    所述发送单元,用于在所述若干个运算单元全部处理完成后,将所得到的目标输出结果向外发送。
  8. 根据权利要求1所述的神经网络加速装置,其中,所述神经网络加速装置还包括 数字信号处理器;其中,
    所述数字信号处理器,用于在无法使用所述第一类算子的情况下,对所述初始计算结果进行处理,得到所述中间计算结果。
  9. 根据权利要求8所述的神经网络加速装置,其中,所述第一类算子对应于适用于专用数字电路的加速运算,所述数字信号处理器用于处理除所述第一类算子之外的不适用于专用数字电路的运算;
    所述第一类算子至少包括下述之一:用于执行池化操作的算子、用于执行激活函数操作的算子和用于执行加法操作的算子。
  10. 一种神经网络加速方法,其中,应用于神经网络加速装置,所述神经网络加速装置包括若干个运算单元,且每一个运算单元包括存内计算阵列和第一算子模块;所述方法包括:
    通过所述存内计算阵列获取输入特征数据,并对所述输入特征数据进行卷积操作,得到初始计算结果;
    通过所述第一算子模块内的第一类算子对所述初始计算结果进行算子操作,得到中间计算结果;
    将所述中间计算结果作为下一个所述运算单元的输入特征数据,直至所述若干个运算单元全部处理完成,确定目标输出结果。
  11. 根据权利要求10所述的方法,其中,所述存内计算阵列中预先存储有目标卷积层对应的权重参数;
    相应地,所述通过所述存内计算阵列获取输入特征数据,并对所述输入特征数据进行卷积操作,得到初始计算结果,包括:
    在所述存内计算阵列获取到所述目标卷积层对应的输入特征数据后,根据所述权重参数对所述输入特征数据进行卷积操作,得到所述初始计算结果。
  12. 根据权利要求11所述的方法,其中,所述根据所述权重参数对所述输入特征数据进行卷积操作,得到所述初始计算结果,包括:
    对所述输入特征数据进行数模转换,得到第一模拟信号;
    根据所述权重参数和所述第一模拟信号进行乘累加计算,得到第二模拟信号;
    对所述第二模拟信号进行模数转换,得到目标数字信号,并将所述目标数字信号确定为所述初始计算结果。
  13. 根据权利要求11所述的方法,其中,当第i个运算单元中的存内计算阵列预先存储有第i卷积层对应的权重参数时,所述方法还包括:
    通过所述存内计算阵列获取所述第i卷积层对应的输入特征数据,并根据所述第i卷积层对应的权重参数对所述第i卷积层对应的输入特征数据进行卷积操作,得到所述第i卷积层的初始计算结果;
    通过所述第一算子模块内的第一类算子对所述第i卷积层的初始计算结果进行算子操作,得到所述第i卷积层的中间计算结果,并将所述第i卷积层的中间计算结果确定为第i+1卷积层对应的输入特征数据以及输入到第i+1个运算单元中进行相关处理;
    其中,i为大于零且小于或等于N的整数;N表示所述运算单元的个数,且N为大于零的整数。
  14. 根据权利要求11所述的方法,其中,当第i个运算单元中的存内计算阵列预先存储有第i卷积层和第i+1层对应的权重参数时,所述方法还包括:
    通过所述存内计算阵列获取所述第i卷积层对应的输入特征数据,并根据所述第i卷积层对应的权重参数对所述第i卷积层对应的输入特征数据进行卷积操作,得到所述第i卷积层的初始计算结果;
    通过所述第一算子模块内的第一类算子对所述第i卷积层的初始计算结果进行算子操 作,得到所述第i卷积层的中间计算结果,并将所述第i卷积层的中间计算结果确定为第i+1卷积层对应的输入特征数据以及仍输入到第i个运算单元中进行相关处理;
    在根据所述第i个运算单元得到第i+1卷积层的中间计算结果后,将所述第i+1卷积层的中间计算结果确定为第i+2卷积层对应的输入特征数据以及输入到第i+1个运算单元中进行相关处理;
    其中,i为大于零且小于或等于N的整数;N表示所述运算单元的个数,且N为大于零的整数。
  15. 根据权利要求10所述的方法,其中,所述方法还包括:
    接收特征图像;
    将所述特征图像分割为至少一个特征块,并按照顺序依次将所述至少一个特征块读入到所述运算单元中;
    其中,在所述若干个运算单元中,第一个运算单元的输入特征数据为第一特征块,在得到第一个运算单元输出的中间计算结果之后,将所述第一个运算单元输出的中间计算结果作为下一个运算单元的输入特征数据,并将下一个特征块作为所述第一个运算单元的输入特征数据,直至所述若干个运算单元全部处理完成。
  16. 根据权利要求10所述的方法,其中,所述神经网络加速装置还包括数字信号处理器,所述方法还包括:
    在无法使用所述第一类算子的情况下,通过所述数字信号处理器对所述初始计算结果进行处理,得到所述中间计算结果。
  17. 根据权利要求16所述的方法,其中,所述第一类算子对应于适用于专用数字电路的加速运算,所述数字信号处理器用于处理除所述第一类算子之外的不适用于专用数字电路的运算;
    所述第一类算子至少包括下述之一:用于执行池化操作的算子、用于执行激活函数操作的算子和用于执行加法操作的算子。
  18. 一种芯片,其中,所述芯片包括如权利要求1至9中任一项所述的神经网络加速装置。
  19. 一种电子设备,所述电子设备包括存储器和处理器;其中,
    所述存储器,用于存储能够在所述处理器上运行的计算机程序;
    所述处理器,用于在运行所述计算机程序时,执行如权利要求10至17中任一项所述的方法。
  20. 一种计算机可读存储介质,其中,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现如权利要求10至17中任一项所述的方法。
PCT/CN2022/133443 2021-12-23 2022-11-22 一种神经网络加速装置、方法、设备和计算机存储介质 WO2023116314A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111592393.6A CN116362312A (zh) 2021-12-23 2021-12-23 一种神经网络加速装置、方法、设备和计算机存储介质
CN202111592393.6 2021-12-23

Publications (1)

Publication Number Publication Date
WO2023116314A1 true WO2023116314A1 (zh) 2023-06-29

Family

ID=86901193

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/133443 WO2023116314A1 (zh) 2021-12-23 2022-11-22 一种神经网络加速装置、方法、设备和计算机存储介质

Country Status (2)

Country Link
CN (1) CN116362312A (zh)
WO (1) WO2023116314A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117057400A (zh) * 2023-10-13 2023-11-14 芯原科技(上海)有限公司 视觉图像处理器、神经网络处理器及图像卷积计算方法
CN117077726A (zh) * 2023-10-17 2023-11-17 之江实验室 一种生成存内计算神经网络模型的方法、装置及介质
CN118379605A (zh) * 2024-06-24 2024-07-23 之江实验室 一种图像识别大模型的部署方法、装置及存储介质

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881195B (zh) * 2023-09-04 2023-11-17 北京怀美科技有限公司 面向检测计算的芯片系统和面向检测计算的芯片方法
CN117348998A (zh) * 2023-12-04 2024-01-05 北京怀美科技有限公司 应用于检测计算的加速芯片架构及计算方法
CN117991984A (zh) * 2024-01-09 2024-05-07 广东高云半导体科技股份有限公司 一种数据缓存装置
CN117829149B (zh) * 2024-02-29 2024-05-31 苏州元脑智能科技有限公司 一种语言模型混合训练方法、装置、电子设备和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3671748A1 (en) * 2018-12-21 2020-06-24 IMEC vzw In-memory computing for machine learning
CN113159302A (zh) * 2020-12-15 2021-07-23 浙江大学 一种用于可重构神经网络处理器的路由结构
CN113222107A (zh) * 2021-03-09 2021-08-06 北京大学 数据处理方法、装置、设备及存储介质
CN113743600A (zh) * 2021-08-26 2021-12-03 南方科技大学 适用于多精度神经网络的存算一体架构脉动阵列设计方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3671748A1 (en) * 2018-12-21 2020-06-24 IMEC vzw In-memory computing for machine learning
CN113159302A (zh) * 2020-12-15 2021-07-23 浙江大学 一种用于可重构神经网络处理器的路由结构
CN113222107A (zh) * 2021-03-09 2021-08-06 北京大学 数据处理方法、装置、设备及存储介质
CN113743600A (zh) * 2021-08-26 2021-12-03 南方科技大学 适用于多精度神经网络的存算一体架构脉动阵列设计方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHU JIWU, MAO HAIYU, LI FEI, LIU ZHE: "Development of processing-in-memory", SCIENTIA SINICA INFORMATIONIS, vol. 51, no. 2, 1 February 2021 (2021-02-01), pages 173, XP093073765, ISSN: 1674-7267, DOI: 10.1360/SSI-2020-0037 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117057400A (zh) * 2023-10-13 2023-11-14 芯原科技(上海)有限公司 视觉图像处理器、神经网络处理器及图像卷积计算方法
CN117057400B (zh) * 2023-10-13 2023-12-26 芯原科技(上海)有限公司 视觉图像处理器、神经网络处理器及图像卷积计算方法
CN117077726A (zh) * 2023-10-17 2023-11-17 之江实验室 一种生成存内计算神经网络模型的方法、装置及介质
CN117077726B (zh) * 2023-10-17 2024-01-09 之江实验室 一种生成存内计算神经网络模型的方法、装置及介质
CN118379605A (zh) * 2024-06-24 2024-07-23 之江实验室 一种图像识别大模型的部署方法、装置及存储介质

Also Published As

Publication number Publication date
CN116362312A (zh) 2023-06-30

Similar Documents

Publication Publication Date Title
WO2023116314A1 (zh) 一种神经网络加速装置、方法、设备和计算机存储介质
CN108765247B (zh) 图像处理方法、装置、存储介质及设备
US11157592B2 (en) Hardware implementation of convolutional layer of deep neural network
CN109102065B (zh) 一种基于PSoC的卷积神经网络加速器
WO2020238843A1 (zh) 神经网络计算设备、方法以及计算设备
KR102530548B1 (ko) 신경망 프로세싱 유닛
CN109993293B (zh) 一种适用于堆叠式沙漏网络的深度学习加速器
CN111582465B (zh) 基于fpga的卷积神经网络加速处理系统、方法以及终端
US20200257500A1 (en) Memory device and computing device using the same
WO2023123648A1 (zh) 基于Cortex-M处理器的卷积神经网络加速方法、系统和介质
US20230117042A1 (en) Implementation of discrete fourier-related transforms in hardware
US20230376274A1 (en) Floating-point multiply-accumulate unit facilitating variable data precisions
WO2021158631A1 (en) Hybrid convolution operation
CN114600126A (zh) 一种卷积运算电路和卷积运算方法
WO2023109748A1 (zh) 一种神经网络的调整方法及相应装置
WO2023115814A1 (zh) Fpga硬件架构及其数据处理方法、存储介质
Zaynidinov et al. Comparative analysis of the architecture of dual-core blackfin digital signal processors
CN113128688B (zh) 通用型ai并行推理加速结构以及推理设备
Wang et al. Acceleration and implementation of convolutional neural network based on FPGA
GB2582868A (en) Hardware implementation of convolution layer of deep neural network
CN114897133A (zh) 一种通用可配置的Transformer硬件加速器及其实现方法
CN116432718A (zh) 一种数据处理方法、装置、设备以及可读存储介质
GB2608791A (en) Neural network comprising matrix multiplication
CN115081600A (zh) 执行Winograd卷积的变换单元、集成电路装置及板卡
EP3073387A1 (en) Controlling data flow between processors in a processing system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22909629

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE