CN116362312A - Neural network acceleration device, method, equipment and computer storage medium

Neural network acceleration device, method, equipment and computer storage medium

Info

Publication number
CN116362312A
CN116362312A (Application CN202111592393.6A)
Authority
CN
China
Prior art keywords
operator
calculation result
characteristic data
convolution layer
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111592393.6A
Other languages
Chinese (zh)
Inventor
祝叶华
孙炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zeku Technology Shanghai Corp Ltd
Original Assignee
Zeku Technology Shanghai Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zeku Technology Shanghai Corp Ltd filed Critical Zeku Technology Shanghai Corp Ltd
Priority to CN202111592393.6A priority Critical patent/CN116362312A/en
Priority to PCT/CN2022/133443 priority patent/WO2023116314A1/en
Publication of CN116362312A publication Critical patent/CN116362312A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G11: INFORMATION STORAGE
    • G11C: STATIC STORES
    • G11C 11/00: Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C 11/54: Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using elements simulating biological cells, e.g. neuron
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/065: Analogue means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a neural network acceleration device, method, equipment and computer storage medium. The neural network acceleration device comprises a plurality of operation units, each operation unit comprises an in-memory computing array and a first operator module, and the first operator module comprises a plurality of first operators. The in-memory computing array is configured to acquire input characteristic data and perform a convolution operation on the input characteristic data to obtain an initial calculation result. The first operator module is configured to perform an operator operation on the initial calculation result through a first type of operator to obtain an intermediate calculation result, and to use the intermediate calculation result as the input characteristic data of the next operation unit. In this way, the amount of data transferred between the processor and the memory can be reduced, the data handling overhead is lowered, the computational complexity can be reduced by the in-memory computing array, and the overall performance of the system is thereby improved.

Description

Neural network acceleration device, method, equipment and computer storage medium
Technical Field
The present disclosure relates to the field of in-memory computing technologies, and in particular, to a neural network acceleration device, a neural network acceleration method, an electronic device, and a computer storage medium.
Background
In recent years, neural networks have achieved remarkable success in practical applications such as image classification and object detection, but these results largely depend on complex neural network models with a large number of parameters and a large amount of computation. Deploying such models, which require massive computation and data movement, onto a neural network accelerator based on the von Neumann architecture leads to the so-called Memory Wall problem: the speed of data movement cannot keep pace with the speed of data processing.
In the von Neumann architecture, the compute unit and the memory are separate: the compute unit must read data from memory and write calculation results back to memory. As a result, even if more computing power is added, the performance improvement of the whole system is limited by the speed at which data can be read, and the large amount of data transfer also incurs significant power consumption.
Disclosure of Invention
The present application aims to provide a neural network acceleration device, a neural network acceleration method, related equipment and a computer storage medium.
To achieve the above purpose, the technical solutions of the present application are implemented as follows:
In a first aspect, an embodiment of the present application provides a neural network acceleration device, where the neural network acceleration device includes a plurality of operation units, each operation unit includes an in-memory computing array and a first operator module, and the first operator module includes a plurality of first operators; wherein:
The in-memory computing array is used for acquiring input characteristic data, and carrying out convolution operation on the input characteristic data to obtain an initial computing result;
the first operator module is used for performing operator operation on the initial calculation result through a first type of operators to obtain an intermediate calculation result, and taking the intermediate calculation result as input characteristic data of a next operation unit.
In a second aspect, an embodiment of the present application provides a neural network acceleration method, which is applied to a neural network acceleration device, where the neural network acceleration device includes a plurality of operation units, and each operation unit includes an in-memory computing array and a first operator module; the method comprises the following steps:
acquiring input characteristic data through an in-memory computing array, and performing convolution operation on the input characteristic data to obtain an initial computing result;
performing operator operation on the initial calculation result through a first type of operators in a first operator module to obtain an intermediate calculation result;
and taking the intermediate calculation result as the input characteristic data of the next operation unit until the processing of all of the plurality of operation units is completed, and determining a target output result.
In a third aspect, embodiments of the present application provide a chip including a neural network acceleration device as in the first aspect.
In a fourth aspect, embodiments of the present application provide an electronic device including a memory and a processor; wherein:
a memory for storing a computer program capable of running on the processor;
a processor for performing the method as in the second aspect when the computer program is run.
In a fifth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program which, when executed by a processor, implements a method as in the second aspect.
The embodiments of the present application provide a neural network acceleration device, method, equipment and computer storage medium. The neural network acceleration device comprises a plurality of operation units, each operation unit comprises an in-memory computing array and a first operator module, and the first operator module comprises a plurality of first operators. The in-memory computing array is configured to acquire input characteristic data and perform a convolution operation on the input characteristic data to obtain an initial calculation result. The first operator module is configured to perform an operator operation on the initial calculation result through a first type of operator to obtain an intermediate calculation result, and to use the intermediate calculation result as the input characteristic data of the next operation unit. In this way, the neural network acceleration device uses a chained structure, i.e. the intermediate calculation result output by the current operation unit serves as the input characteristic data of the next operation unit, so the system scale can be easily expanded. In addition, by fully exploiting the characteristics of the intelligent algorithm structure and of the in-memory computing array, the amount of data transferred between the processor and the memory can be reduced, the data handling overhead is lowered, the computational complexity can be reduced by the in-memory computing array, and the overall performance of the system is thereby improved.
Drawings
FIG. 1 is a schematic diagram of an architecture of an artificial intelligence accelerator;
fig. 2 is a schematic diagram of a composition structure of a neural network acceleration device according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a basic structure of in-memory computing according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an architecture of an in-memory computing array according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an architecture of an arithmetic unit according to an embodiment of the present disclosure;
fig. 6 is a schematic architecture diagram of a neural network acceleration device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a neural network structure according to an embodiment of the present application;
fig. 8 is a schematic flow chart of a neural network acceleration method according to an embodiment of the present application;
fig. 9 is a schematic diagram of a specific hardware structure of an electronic device according to an embodiment of the present application;
fig. 10 is a schematic diagram of a composition structure of a chip according to an embodiment of the present application;
fig. 11 is a schematic diagram of a specific hardware structure of a chip according to an embodiment of the present application.
Detailed Description
For a more complete understanding of the features and technical content of the embodiments of the present application, reference should be made to the following detailed description of the embodiments of the present application, taken in conjunction with the accompanying drawings, which are for purposes of illustration only and not intended to limit the embodiments of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict. It should also be noted that the term "first/second/third" in reference to the embodiments of the present application is used merely to distinguish similar objects and does not represent a specific ordering for the objects, it being understood that the "first/second/third" may be interchanged with a specific order or sequence, if allowed, to enable the embodiments of the present application described herein to be implemented in an order other than that illustrated or described herein.
It should be appreciated that In-Memory Computing (CIM) is an emerging computing architecture and a solution to the memory wall problem. A computer system based on the von Neumann architecture separates the memory and the processor, and the overhead of the processor frequently accessing the memory forms the memory wall. In-memory computation combines computation and storage, i.e. the computation is performed inside the memory, thereby reducing the frequency of processor accesses to the memory. Compared with the traditional architecture, in-memory computing offers high parallelism and high energy efficiency, and is a better alternative for algorithms that require a large number of parallel matrix-vector multiplication operations, in particular neural network algorithms.
In particular, the algorithms involved in artificial intelligence (AI) scenarios are large and complex network structures with many parameters to store and a large number of calculations to perform, which in turn generate a large amount of data. To increase computing power and cope with more complex processing scenarios, the computing units or processing elements (PE) in the processing engine array, of which the multiply-accumulate unit is the core, generally need to be continuously expanded; however, as the number of computing units grows, the memory resources that need to be accessed also grow, and the performance of the whole system becomes limited by the performance of the memory units. During the operation of the whole algorithm, data must be continuously read in from the external memory and result data must be written back to the memory; with a fixed transmission bandwidth, as the computing capability of the compute engine improves, the memory bandwidth available to each computing unit gradually decreases, and the data transfer capability becomes the bottleneck of the AI chip.
Illustratively, FIG. 1 shows a schematic architecture diagram of an artificial intelligence accelerator. As shown in FIG. 1, data is moved from the memory to the processor, the PE array in the processor performs the data calculation, and the result is then written back to the memory; the PE array comprises a plurality of PEs. That is, the current von Neumann architecture is fundamentally one in which the computing unit is separated from the memory: the computing unit reads data from the memory and writes the result back to the memory after the computation is completed. In recent years, however, as processor performance has continued to increase while memory performance has improved relatively slowly, data handling has become the bottleneck of the system under growing algorithmic demands; even if computing power is increased further, the overall performance improvement is not obvious because the system is limited by the speed of reading data. Besides limiting performance, the large amount of data transfer also brings a large amount of power consumption, and with power consumption requirements becoming increasingly strict, this problem needs to be solved.
The embodiments of the present application provide a neural network acceleration device comprising a plurality of operation units, where each operation unit comprises an in-memory computing array and a first operator module, and the first operator module comprises a plurality of first operators. The in-memory computing array is configured to acquire input characteristic data and perform a convolution operation on the input characteristic data to obtain an initial calculation result. The first operator module is configured to perform an operator operation on the initial calculation result through a first type of operator to obtain an intermediate calculation result, and to use the intermediate calculation result as the input characteristic data of the next operation unit. In this way, the neural network acceleration device uses a chained structure, i.e. the intermediate calculation result output by the current operation unit serves as the input characteristic data of the next operation unit, so the system scale can be easily expanded. In addition, by fully exploiting the characteristics of the intelligent algorithm structure and of the in-memory computing array, the amount of data transferred between the processor and the memory can be reduced, the data handling overhead is lowered, and the power consumption is further reduced; the in-memory computing array can also reduce the computational complexity, thereby improving the overall performance of the system.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
In an embodiment of the present application, referring to fig. 2, a schematic diagram of a composition structure of a neural network acceleration device according to an embodiment of the present application is shown. As shown in fig. 2, the neural network acceleration device 20 may include a plurality of operation units, where each operation unit may include an in-memory computing array and a first operator module, and the first operator module includes a plurality of first operators; wherein:
the in-memory computing array is used for acquiring input characteristic data, and carrying out convolution operation on the input characteristic data to obtain an initial computing result;
the first operator module is used for performing operator operation on the initial calculation result through a first type of operators to obtain an intermediate calculation result, and taking the intermediate calculation result as input characteristic data of a next operation unit.
It should be noted that, in the embodiment of the present application, a neural network structure (such as an artificial intelligence network) may be grouped based on its characteristics. In particular, the neural network structure may include several groups, where each group includes a convolution layer and a non-convolution operator; in this way, the algorithm structure is mapped onto the hardware architecture so that the groups correspond to the operation units in the hardware architecture. In each group, the convolution layer may implement the convolution operation based on the in-memory computing array, and the non-convolution operator may implement the operator operation based on the first operator module.
It should be further noted that, in the embodiment of the present application, the neural network acceleration device may include a plurality of operation units, and the intermediate calculation result output by the current operation unit is used as the input feature data of the next operation unit, that is, a chained structure is used, so that the system scale can be conveniently expanded.
It will be appreciated that in-memory computing arrays have been proposed in recent years: multiply-and-accumulate operations are performed directly by analog circuits inside the memory unit, without the need to carry data out of the memory unit and compute it with a digital-circuit-based compute engine. Illustratively, in an artificial intelligence neural network structure, the basic operation is matrix multiplication, as shown in formula (1):
y_j = Σ_i (x_i · w_ij),  j = 1, 2, 3, 4    (1)
in addition, for implementations using conventional von neumann architectures, this can be done with the help of a multiply-accumulate tree, which contains multipliers and adders. While for the manner in which in-memory computation is used, a simple illustration of the basic structure of in-memory computation shown in fig. 3 may be used. Wherein the black filled cells are used to store the values of weight parameters, and voltages are applied in the lateral direction, x can be used 1 ,x 2 ,x 3, x 4 To characterize the magnitude of the voltage; then in the longitudinal direction, the analog value output by each black filled cell can be expressed as the product of x and w, then eachThe output of a column may use y 1 ,y 2 ,y 3 ,y 4 This means that they are matched with the matrix multiplication results in the above formula (1), respectively.
In the embodiment of the present application, in order to avoid repeatedly loading weight data during execution, the weight data may be pre-stored in the in-memory computing array. Therefore, in some embodiments, the weight parameters corresponding to the target convolution layer are pre-stored in the in-memory computing array; wherein:
and the in-memory computing array is used for carrying out convolution operation on the input characteristic data according to the weight parameters after the input characteristic data corresponding to the target convolution layer is acquired, so as to obtain an initial computing result.
That is, if the in-memory computing array in the current operation unit pre-stores the weight parameters corresponding to the target convolution layer, the current operation unit performs the convolution operation for the target convolution layer. Specifically, the in-memory computing array in the current operation unit performs a convolution operation between the weight parameters corresponding to the target convolution layer and the input characteristic data corresponding to the target convolution layer to obtain an initial calculation result; then the first operator module in the current operation unit performs an operator operation on the initial calculation result to obtain an intermediate calculation result, the intermediate calculation result continues to be used as the input characteristic data of the next operation unit, and so on until the processing of all of the plurality of operation units is completed.
It is further understood that, for the in-memory computing array, reference is made to fig. 4, which shows a schematic architecture diagram of the in-memory computing array according to an embodiment of the present application. As shown in fig. 4, the in-memory computing array 40 may include a Digital-to-Analog Conversion (DAC) module 401, a storage array 402, and an Analog-to-Digital Conversion (ADC) module 403; wherein:
the digital-to-analog conversion module 401 is configured to perform digital-to-analog conversion on the input feature data to obtain a first analog signal;
a storage array 402, configured to perform multiply-accumulate computation according to the weight parameter and the first analog signal, to obtain a second analog signal;
the analog-to-digital conversion module 403 is configured to perform analog-to-digital conversion on the second analog signal to obtain a target digital signal, and determine the target digital signal as an initial calculation result.
It should be noted that, the weight data in the embodiment of the present application need not be continuously loaded in the execution process, but only needs to be preloaded into the storage array in the in-memory computing array, and analog data computation is performed by using the related components, and finally, the analog data is converted into the target digital signal by the analog-to-digital conversion module 403 for output.
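For illustration only, the data path of one in-memory computing array (DAC, storage array, ADC) described above can be modelled in software as follows; the bit widths, scaling factors and function names are assumptions made for the sketch and are not taken from this application.

```python
# Simplified model of one in-memory computing array: the DAC maps digital inputs to
# analog levels, the storage array multiply-accumulates them against preloaded weights,
# and the ADC converts the analog column outputs back into a digital initial result.
import numpy as np

def dac(x_digital, bits=8, full_scale=1.0):
    levels = 2 ** bits - 1
    return np.clip(x_digital, 0, levels) / levels * full_scale   # first analog signal

def analog_mac(x_analog, weights):
    return x_analog @ weights                                    # second analog signal

def adc(y_analog, bits=8, full_scale=4.0):
    levels = 2 ** bits - 1
    return np.round(np.clip(y_analog / full_scale, 0.0, 1.0) * levels).astype(int)

weights = np.random.default_rng(2).uniform(size=(4, 4))  # preloaded once, not reloaded at run time
x_in = np.array([12, 200, 45, 90])                        # input characteristic data (digital)
initial_result = adc(analog_mac(dac(x_in), weights))      # target digital signal
print(initial_result)
```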
By way of example, fig. 5 shows a schematic architecture of one of the operation units according to the embodiment of the present application. As shown in fig. 5, the operation unit may include an in-memory computing array 40 and a first operator module 50, where the target digital signal obtained after the analog-to-digital conversion of the in-memory computing array 40 can interact with the first operator module 50. That is, for an artificial intelligence network, not only must the convolution operators be realized; besides the convolution layers, a large number of other operators also exist in the network, and data must be exchanged between these operators.
In an embodiment of the present application, the first type of operator represents acceleration operations suitable for dedicated digital circuits, and the first type of operator includes at least one of the following: an operator for performing a pooling operation, an operator for performing an activation function operation, and an operator for performing an addition operation. That is, as shown in fig. 5, the first operator module 50 may include an addition operator (Adder), an activation function operator (Activation), and a pooling operator (Pooling).
In addition, acceleration operations in an artificial intelligence network that are not suitable for dedicated digital circuits cannot be handled by operators of the first type. Thus, in some embodiments, the neural network acceleration device 20 further includes a digital signal processor (Digital Signal Processor, DSP); wherein:
and the digital signal processor is used for processing the initial calculation result to obtain an intermediate calculation result under the condition that the first type operator cannot be used.
It should be noted that, in the embodiment of the present application, the first type of operator corresponds to an acceleration operation applicable to the special purpose digital circuit, and the digital signal processor is configured to process operations other than the first type of operator that are not applicable to the special purpose digital circuit. That is, the digital signal processor mainly handles the situation that the first type of operator cannot be used, such as a relatively complex sigmoid activation function, a tanh activation function, or a softmax activation function.
It should be further noted that, in the embodiment of the present application, the first operator module may also be referred to as a Fixed Function module; it mainly uses the addition operator, the activation function operator, the pooling operator and the like to perform acceleration operations in dedicated digital circuits, while operations that are not suitable for dedicated digital circuits are usually completed by a digital signal processor, i.e. the DSP.
Here, since in-memory computation can only be applied to matrix multiplication, it can realize the convolution operators of an artificial intelligence network; however, besides the convolution layers, a large number of other operators exist in the network, and data must be exchanged between these operators. An artificial intelligence accelerator based on CIM, i.e. the neural network acceleration device 20 in the embodiment of the present application, can therefore be constructed from existing CIM units, and its basic architecture is shown in fig. 6. In fig. 6, the number of operation units may be four, namely operation unit 1, operation unit 2, operation unit 3 and operation unit 4; operation unit 1 may include in-memory computing array 1 and first operator module 1, operation unit 2 may include in-memory computing array 2 and first operator module 2, operation unit 3 may include in-memory computing array 3 and first operator module 3, and operation unit 4 may include in-memory computing array 4 and first operator module 4. Each in-memory computing array (for example, in-memory computing array 1, 2, 3 or 4) includes a digital-to-analog conversion module, a storage array and an analog-to-digital conversion module; because in-memory computing processes analog signals, the digital-to-analog conversion module and the analog-to-digital conversion module are arranged at the data input end and the data output end of the in-memory computing array, respectively. The first operator module (e.g. first operator module 1, 2, 3 or 4), which may be referred to as a fixed-function module, implements the other common operators in artificial intelligence algorithms that are suitable for dedicated digital circuits, such as pooling, activation functions and addition; acceleration operations in artificial intelligence algorithms that are not suitable for dedicated digital circuits, such as the sigmoid, tanh or softmax activation functions, may be completed by the DSP.
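The dispatch between the fixed-function path and the DSP path implied by fig. 6 can be sketched as follows; this is a hedged illustration only, and the operator names and helper functions are hypothetical rather than part of the disclosed hardware.

```python
# Sketch of operator dispatch in one operation unit: operators suited to dedicated
# digital circuits run in the fixed-function (first operator) module; anything else,
# e.g. sigmoid or softmax, falls back to the DSP.
import numpy as np

FIXED_FUNCTION_OPS = {
    "add":   lambda x, y=0.0: x + y,                  # addition operator (Adder)
    "relu":  lambda x: np.maximum(x, 0.0),            # activation operator (Activation)
    "pool2": lambda x: x.reshape(-1, 2).max(axis=1),  # pooling operator (Pooling)
}

def dsp_execute(name, x):
    # DSP path for operators that are not implemented as dedicated circuits.
    if name == "sigmoid":
        return 1.0 / (1.0 + np.exp(-x))
    if name == "softmax":
        e = np.exp(x - x.max())
        return e / e.sum()
    raise NotImplementedError(name)

def apply_operator(name, x):
    op = FIXED_FUNCTION_OPS.get(name)
    return op(x) if op is not None else dsp_execute(name, x)

initial_result = np.array([-1.0, 0.5, 2.0, -0.2])
print(apply_operator("relu", initial_result))     # fixed-function path
print(apply_operator("softmax", initial_result))  # DSP-assisted path
```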
Further, in some embodiments, on the basis of the neural network acceleration device 20 shown in fig. 6, the neural network acceleration device 20 may further include a receiving unit; wherein:
and the receiving unit is used for receiving the characteristic image, dividing the characteristic image into at least one characteristic block and sequentially reading the at least one characteristic block into the operation unit.
Further, in some embodiments, among the plurality of operation units, the input feature data of the first operation unit is the first feature block; after the intermediate calculation result output by the first operation unit is obtained, it is used as the input feature data of the next operation unit, and the next feature block is used as the input feature data of the first operation unit, until all of the plurality of operation units complete their processing.
That is, in connection with fig. 6, among the four operation units, the input characteristic data of operation unit 1 is provided by the receiving unit; the output of operation unit 1 serves as the input of operation unit 2, the output of operation unit 2 serves as the input of operation unit 3, and the output of operation unit 3 serves as the input of operation unit 4, until all four operation units have completed their processing and the target output result is obtained. In this process, if an operator not included in the first operator module appears in the artificial intelligence algorithm, the digital signal processor may assist with the processing.
It should be further noted that, in some embodiments, on the basis of the neural network acceleration device 20 shown in fig. 6, the neural network acceleration device 20 may further include a sending unit and a scheduling unit. The sending unit can be used to send the obtained target output result outwards after the processing of all of the plurality of operation units is completed. The scheduling unit can be used to schedule the plurality of operation units so that they process the input characteristic data; in addition, the scheduling unit may also schedule the receiving unit and the sending unit, i.e. schedule the receiving unit when a characteristic image needs to be received, or schedule the sending unit to transmit the target output result once it has been obtained.
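A minimal software sketch of how the receiving unit might tile a characteristic image and how the scheduling unit might stream the tiles through the chained operation units is given below; the tiling scheme and all names are assumptions made for illustration.

```python
# Illustrative tiling and scheduling: the receiving unit splits the characteristic image
# into feature blocks, and the scheduler feeds each block through the chain of units.
import numpy as np

def split_into_blocks(image, block_rows):
    # Receiving unit: divide the characteristic image into row blocks read in sequence.
    return [image[r:r + block_rows] for r in range(0, image.shape[0], block_rows)]

def schedule(blocks, units):
    # Scheduling unit: while block k occupies unit 1, block k-1 has moved on to unit 2,
    # and so on; the hardware stages overlap, but they are shown sequentially here.
    results = []
    for block in blocks:
        data = block
        for unit in units:
            data = unit(data)        # intermediate result becomes the next unit's input
        results.append(data)
    return results

units = [lambda x, w=np.full((4, 4), 0.25): np.maximum(x @ w, 0.0) for _ in range(4)]
image = np.random.default_rng(3).uniform(size=(8, 4))
print(len(schedule(split_into_blocks(image, 2), units)))   # 4 feature blocks processed
```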
It is further understood that, in embodiments of the present application, a neural network structure (e.g. an artificial intelligence network) may be grouped, i.e. the neural network structure may include several groups, where each group includes a convolution layer and an operator layer; in each group, the convolution layer implements the convolution operation based on the in-memory computing array, and the operator layer implements the operator operation based on the first operator module or the digital signal processor. Referring to fig. 7, a schematic diagram of the composition structure of a neural network according to an embodiment of the present application is shown. As shown in fig. 7, the neural network structure may be divided into convolution layer 0 (represented by Conv0), operator 0 (represented by FF0), convolution layer 1 (represented by Conv1), operator 1 (represented by FF1), convolution layer 2 (represented by Conv2), operator 2 (represented by FF2), convolution layer 3 (represented by Conv3), operator 3 (represented by FF3), and so on; Conv0 and FF0 form one group, Conv1 and FF1 form one group, Conv2 and FF2 form one group, and Conv3 and FF3 form one group. Here, operators such as FF0, FF1, FF2 and FF3 normally use a first-type operator in the first operator module to perform the operator operation; when a first-type operator is not applicable, the embodiments of the present application may also have the digital signal processor assist in the processing.
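To make the grouping of fig. 7 concrete, the sketch below (illustrative names only, not part of the original disclosure) represents each group as a (convolution layer, operator) pair and assigns consecutive groups to operation units.

```python
# Each group pairs a convolution layer with the non-convolution operator that follows it;
# consecutive groups may share one operation unit when their weights fit into its array.
network = [
    ("Conv0", "FF0"),  # group 0
    ("Conv1", "FF1"),  # group 1
    ("Conv2", "FF2"),  # group 2
    ("Conv3", "FF3"),  # group 3
]

def map_groups_to_units(groups, groups_per_unit):
    # e.g. Conv0/FF0 and Conv1/FF1 both mapped to operation unit 0.
    return {group: idx // groups_per_unit for idx, group in enumerate(groups)}

print(map_groups_to_units(network, groups_per_unit=2))
# {('Conv0', 'FF0'): 0, ('Conv1', 'FF1'): 0, ('Conv2', 'FF2'): 1, ('Conv3', 'FF3'): 1}
```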
In one possible implementation manner, the computing unit is assumed to be an ith computing unit, and the in-memory computing array in the ith computing unit stores weight parameters corresponding to an ith convolution layer in advance; wherein:
the in-memory computing array is used for acquiring input characteristic data corresponding to the ith convolution layer, and carrying out convolution operation on the input characteristic data corresponding to the ith convolution layer according to the weight parameter corresponding to the ith convolution layer to obtain an initial computing result of the ith convolution layer;
the first operator module is used for performing operator operation on the initial calculation result of the ith convolution layer through a first operator, obtaining an intermediate calculation result of the ith convolution layer, and determining the intermediate calculation result of the ith convolution layer as input characteristic data corresponding to the (i+1) th convolution layer.
After the input feature data corresponding to the (i+1)th convolution layer is obtained, it may be input to the (i+1)th operation unit for the relevant processing, because the weight parameters corresponding to the (i+1)th convolution layer are pre-stored in the in-memory computing array of the (i+1)th operation unit. Wherein i is an integer greater than zero and less than or equal to N; N represents the number of operation units, and N is an integer greater than zero.
In another possible implementation manner, the computing unit is assumed to be an ith computing unit, and the in-memory computing array in the ith computing unit stores weight parameters corresponding to the ith convolution layer and the (i+1)th convolution layer in advance; wherein:
the in-memory computing array is used for acquiring input characteristic data corresponding to the ith convolution layer, and carrying out convolution operation on the input characteristic data corresponding to the ith convolution layer according to the weight parameter corresponding to the ith convolution layer to obtain an initial computing result of the ith convolution layer;
the first operator module is used for performing operator operation on the initial calculation result of the ith convolution layer through the first operator, obtaining an intermediate calculation result of the ith convolution layer, determining the intermediate calculation result of the ith convolution layer as input characteristic data corresponding to the (i+1) th convolution layer, and still inputting the input characteristic data into the ith operation unit for relevant processing.
After the input feature data corresponding to the (i+1)th convolution layer is obtained, since the weight parameters corresponding to the (i+1)th convolution layer are still pre-stored in the in-memory computing array of the ith operation unit, the data can still be input to the ith operation unit for the relevant processing. After the intermediate calculation result of the (i+1)th convolution layer is obtained by the ith operation unit, it is determined as the input characteristic data corresponding to the (i+2)th convolution layer; because the weight parameters corresponding to the (i+2)th convolution layer are pre-stored in the in-memory computing array of the (i+1)th operation unit, the input characteristic data corresponding to the (i+2)th convolution layer needs to be input to the (i+1)th operation unit for the relevant processing. Wherein i is an integer greater than zero and less than or equal to N; N represents the number of operation units, and N is an integer greater than zero.
Specifically, fig. 7 shows a general schematic of a neural network structure. The weight data used by the convolution layers need to be solidified into the in-memory computing array in advance, as shown in fig. 3. Because a neural network structure contains many convolution layers, the operation of each convolution layer involves a large amount of weight data, and the total size of the in-memory computing arrays used to store weight data in the system is fixed, the neural network acceleration device 20 shown in fig. 6 provides four operation units, each comprising an in-memory computing array and a first operator module, and the parameters of one or more convolution layers may be stored in each in-memory computing array. For example, assume that the weight parameters corresponding to Conv0 and Conv1 in fig. 7 are pre-stored in the in-memory computing array 1 in fig. 6. Since the weight data are already preloaded into the in-memory computing array 1, the feature image next needs to be segmented and the segments read into the in-memory computing array 1 in sequence; each segment is converted into an analog signal by the digital-to-analog conversion module, the storage array computes the multiply-accumulate analog signal, the analog-to-digital conversion module converts it into a digital signal, and the digital signal is sent to the first operator module to perform the FF0 operator operation. Next, Conv1 needs to be computed, and since the weight parameters of Conv1 are also pre-stored in the in-memory computing array 1, the output of the FF0 module in fig. 6 is fed back into the in-memory computing array 1, and so on until the input characteristic data has completely passed through the first three layers (Conv0, FF0, Conv1) of the operator network; the resulting data is then sent to the in-memory computing array 2, while the feature data of the next frame is sent to the in-memory computing array 1 for processing. If other operators not contained in the first operator module appear in the artificial intelligence algorithm, the DSP may be required to assist in the processing; after all four operation units have finished their processing, the final result data is returned.
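The routing rule implicit in this walkthrough (the output of a layer stays in the same operation unit if that unit's array also holds the next layer's weights, and otherwise moves to the next unit) can be expressed as a small sketch; the helper names and the two-layers-per-array assumption are illustrative only.

```python
# Routing sketch: with two convolution layers preloaded per in-memory array,
# Conv0's output stays in unit 0 (which also holds Conv1), while Conv1's output
# moves on to unit 1 (which holds Conv2 and Conv3).
def unit_hosting(layer_idx, layers_per_unit):
    return layer_idx // layers_per_unit      # which unit preloads this layer's weights

def next_unit(layer_idx, layers_per_unit):
    current = unit_hosting(layer_idx, layers_per_unit)
    following = unit_hosting(layer_idx + 1, layers_per_unit)
    return current if following == current else current + 1

print(next_unit(0, layers_per_unit=2))   # -> 0 (stay in unit 0 for Conv1)
print(next_unit(1, layers_per_unit=2))   # -> 1 (move to unit 1 for Conv2)
```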
That is, by exploiting the characteristics of the artificial intelligence network, the network is divided into groups, where each group includes a convolution layer and the non-convolution operators that follow it; this algorithm structure is mapped onto the hardware architecture shown in fig. 6, and the functions of each convolution layer and each operator layer are realized by the operation units, each of which includes an in-memory computing array and a first operator module (one operation unit corresponds to a dashed box in fig. 6). One operation unit can perform the operations of several groups in the algorithm structure, and when its operations are finished, the result is transferred to the next operation unit. This architecture fully combines the characteristics of the artificial intelligence algorithm structure and the in-memory computing array, and greatly reduces the amount of data transferred.
In addition, in the embodiment of the present application, since the whole architecture uses a chained structure, the system scale can be conveniently expanded; the four-stage transfer architecture used for illustration in the embodiments of the present application is not limiting. Moreover, the first operator module in the architecture shown in fig. 6 may use any operator suitable for implementation in a dedicated acceleration circuit. Finally, functional groupings in an artificial intelligence network may take many forms and are not limited to the example shown in fig. 7.
The present embodiment provides a neural network acceleration device comprising a plurality of operation units, where each operation unit comprises an in-memory computing array and a first operator module, and the first operator module comprises a plurality of first operators. The in-memory computing array is configured to acquire input characteristic data and perform a convolution operation on the input characteristic data to obtain an initial calculation result. The first operator module is configured to perform an operator operation on the initial calculation result through a first type of operator to obtain an intermediate calculation result, and to use the intermediate calculation result as the input characteristic data of the next operation unit. In this way, the neural network acceleration device uses a chained structure, i.e. the intermediate calculation result output by the current operation unit serves as the input characteristic data of the next operation unit, so the system scale can be easily expanded. In addition, by fully exploiting the characteristics of the intelligent algorithm structure and of the in-memory computing array, the amount of data transferred between the processor and the memory can be reduced, the data handling overhead is lowered, and the power consumption is further reduced; the in-memory computing array can also reduce the computational complexity, thereby improving the overall performance of the system.
In another embodiment of the present application, referring to fig. 8, a schematic flow chart of a neural network acceleration method provided in an embodiment of the present application is shown. As shown in fig. 8, the method may include:
s801: and acquiring input characteristic data through the in-memory computing array, and performing convolution operation on the input characteristic data to obtain an initial computing result.
S802: and performing operator operation on the initial calculation result through a first type of operators in the first operator module to obtain an intermediate calculation result.
S803: and taking the intermediate calculation result as input characteristic data of the next operation unit until all the processing of a plurality of operation units is completed, and determining a target output result.
It should be noted that the embodiment of the present application is applied to the neural network acceleration device 20 described in the foregoing embodiment, where the neural network acceleration device may include a plurality of operation units, and each operation unit includes an in-memory computing array and a first operator module; meanwhile, the intermediate calculation result output by the current operation unit is used as the input characteristic data of the next operation unit, namely, a chain structure is used, so that the system scale can be conveniently expanded.
In the embodiment of the present application, in order to avoid that weight data is continuously loaded in the execution process, the weight data may be pre-stored in an in-memory computing array. That is, the weight parameters corresponding to the target convolution layer are prestored in the in-memory computing array; accordingly, in some embodiments, for S801, the obtaining the input feature data through the in-memory computing array and performing a convolution operation on the input feature data to obtain an initial computing result may include:
After the in-memory computing array acquires the input characteristic data corresponding to the target convolution layer, carrying out convolution operation on the input characteristic data according to the weight parameters to obtain an initial computing result.
In a specific embodiment, the convolving the input feature data according to the weight parameter to obtain an initial calculation result may include:
performing digital-to-analog conversion on the input characteristic data to obtain a first analog signal;
performing multiply-accumulate calculation according to the weight parameters and the first analog signals to obtain second analog signals;
and carrying out analog-to-digital conversion on the second analog signal to obtain a target digital signal, and determining the target digital signal as an initial calculation result.
It should be noted that, for the in-memory computing array, the in-memory computing array may include a digital-to-analog conversion module, a storage array, and an analog-to-digital conversion module, where the digital-to-analog conversion module is located at a data input end of the in-memory computing array, and the analog-to-digital conversion module is located at a data output end of the in-memory computing array.
The digital-to-analog conversion module is used for carrying out digital-to-analog conversion on the input characteristic data so as to obtain a first analog signal; the storage array is used for carrying out multiply-accumulate calculation according to the weight parameters and the first analog signals so as to obtain second analog signals; the analog-to-digital conversion module is used for performing analog-to-digital conversion on the second analog signal to obtain a target digital signal, wherein the target digital signal is an initial calculation result and then is sent to the first operator module for operator operation.
Further, in some embodiments, the neural network acceleration device may also include a digital signal processor. Accordingly, the method may further comprise: and under the condition that the first type of operators cannot be used, the digital signal processor is used for processing the initial calculation result to obtain an intermediate calculation result.
It should be noted that, in the embodiment of the present application, the first type of operator corresponds to an acceleration operation applicable to the special digital circuit, and may be referred to as a Fixed Function module; the digital signal processor is used to process operations other than the first type of operator that are not applicable to the special digital circuit, that is, for the case of operations that are not applicable to the special digital circuit, this is usually done using a digital signal processor, i.e., DSP.
It should be further noted that the first type of operator may include at least one of the following: an operator for performing a pooling operation (i.e., a pooling operator), an operator for performing an activation function operation (i.e., an activation function operator), and an operator for performing an addition operation (i.e., an addition operator); the digital signal processor is mainly used for processing the situation that the first kind of operator cannot be used, such as a relatively complex sigmoid activation function, a tanh activation function, a softmax activation function, or the like. It should be noted that the activation function operators in the first type of operators do not include operators such as sigmoid activation function, tanh activation function, softmax activation function, and the like.
Further, in some embodiments, the method may further comprise: receiving a characteristic image; the feature image is divided into at least one feature block, and the at least one feature block is sequentially read into the arithmetic unit.
In the several operation units of the neural network accelerator, the input feature data of the first operation unit is a first feature block, after the intermediate calculation result output by the first operation unit is obtained, the intermediate calculation result output by the first operation unit is used as the input feature data of the next operation unit, and the next feature block is used as the input feature data of the first operation unit until all the several operation units are processed.
That is, taking fig. 6 as an example, of the four arithmetic units, the input characteristic data of the arithmetic unit 1 is provided by the receiving unit; the output of the operation unit 1 is used as the input of the operation unit 2, the output of the operation unit 2 is used as the input of the operation unit 3, and the output of the operation unit 3 is used as the input of the operation unit 4 until all the four operation units are processed, and a target output result is obtained. In the process, if operators which are not contained in the first operator module appear in the artificial intelligence algorithm, the operators can be assisted by a digital signal processor, so that the universality of the algorithm is improved.
It should also be noted that, in the embodiment of the present application, the neural network structure may include several groups; each group includes a convolution layer and an operator layer, and in each group, the convolution layer may be configured to implement the convolution operation based on the in-memory computing array, and the operator layer may be configured to implement the operator operation based on the first operator module or the digital signal processor.
In one possible implementation manner, when the weight parameter corresponding to the ith convolution layer is prestored in the in-memory computing array in the ith computing unit, the method may further include:
acquiring input characteristic data corresponding to an ith convolution layer through an in-memory computing array, and carrying out convolution operation on the input characteristic data corresponding to the ith convolution layer according to weight parameters corresponding to the ith convolution layer to obtain an initial computing result of the ith convolution layer;
performing operator operation on the initial calculation result of the ith convolution layer through a first operator in the first operator module to obtain an intermediate calculation result of the ith convolution layer, determining the intermediate calculation result of the ith convolution layer as input characteristic data corresponding to the (i+1) th convolution layer, and inputting the input characteristic data into the (i+1) th operation unit for relevant processing.
In another possible implementation manner, when the in-memory computing array in the ith computing unit stores weight parameters corresponding to the ith convolution layer and the (i+1) th layer in advance, the method may further include:
Acquiring input characteristic data corresponding to an ith convolution layer through an in-memory computing array, and carrying out convolution operation on the input characteristic data corresponding to the ith convolution layer according to weight parameters corresponding to the ith convolution layer to obtain an initial computing result of the ith convolution layer;
performing operator operation on the initial calculation result of the ith convolution layer through a first operator in the first operator module to obtain an intermediate calculation result of the ith convolution layer, determining the intermediate calculation result of the ith convolution layer as input characteristic data corresponding to the (i+1) th convolution layer, and still inputting the input characteristic data into an ith operation unit for relevant processing;
after the intermediate calculation result of the (i+1)th convolution layer is obtained by the ith operation unit, determining the intermediate calculation result of the (i+1)th convolution layer as the input characteristic data corresponding to the (i+2)th convolution layer, and inputting the input characteristic data into the (i+1)th operation unit for the relevant processing.
Where i is an integer greater than zero and less than or equal to N; n represents the number of the operation units, and N is an integer greater than zero.
After the input feature data corresponding to the (i+1)th convolution layer is obtained, if the weight parameters corresponding to the (i+1)th convolution layer are pre-stored in the in-memory computing array of the (i+1)th operation unit, the input feature data can be input into the (i+1)th operation unit for the relevant processing; if the weight parameters corresponding to the (i+1)th convolution layer are instead still pre-stored in the in-memory computing array of the ith operation unit, the input feature data can still be input into the ith operation unit for the relevant processing. After the intermediate calculation result of the (i+1)th convolution layer is obtained by the ith operation unit, it is determined as the input characteristic data corresponding to the (i+2)th convolution layer; because the weight parameters corresponding to the (i+2)th convolution layer are pre-stored in the in-memory computing array of the (i+1)th operation unit, the input characteristic data corresponding to the (i+2)th convolution layer needs to be input into the (i+1)th operation unit for the relevant processing, until all of the N operation units have completed their processing.
Briefly, conventional von Neumann architectures are centered on the computing unit and therefore involve a large amount of data movement. As artificial intelligence scenarios become more complex, the amount of data to be processed by the algorithms keeps growing, while the performance gains achievable on the traditional architecture become smaller and smaller. The technical scheme of the embodiments of the present application is based on a relatively mature in-memory computing scheme that can implement the convolution operation and, combined with the characteristics of non-convolution operators, allows the overall architecture to realise the functions of a general artificial intelligence network. Weight parameters do not need to be loaded repeatedly during execution; they only need to be preloaded into the storage units of the in-memory computing array, after which the computation is carried out on analog data by the array components, and the results can interact with the external non-convolution operators through the digital-to-analog and analog-to-digital conversion modules. In addition, in order to increase the generality of the algorithms, a DSP is further added in the embodiments of the present application, which greatly expands the range of operators that can be supported.
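As a purely numerical illustration of this preload-then-compute idea (and not a model of the actual analog circuitry), the sketch below treats the digital-to-analog and analog-to-digital conversions as uniform quantisation around a multiply-accumulate against preloaded weights; the function name, bit widths and scaling are assumptions.

```python
import numpy as np

def in_memory_mac(x_digital, weights, dac_bits=8, adc_bits=8):
    """Toy numeric model: the digital input is 'converted to analog' (modelled
    here as uniform quantisation), multiplied and accumulated against the
    weights preloaded in the storage array, and the accumulated result is
    converted back to a digital value for the downstream non-convolution
    operators."""
    # digital-to-analog conversion, modelled as quantisation to dac_bits levels
    x_max = float(np.max(np.abs(x_digital))) or 1.0
    levels = 2 ** dac_bits - 1
    x_analog = np.round(x_digital / x_max * levels) / levels * x_max

    # multiply-accumulate performed against the preloaded weight array
    y_analog = weights @ x_analog

    # analog-to-digital conversion of the accumulated result
    y_max = float(np.max(np.abs(y_analog))) or 1.0
    levels = 2 ** adc_bits - 1
    return np.round(y_analog / y_max * levels) / levels * y_max

# Example: a 2x3 weight array preloaded in the storage array, one input vector.
print(in_memory_mac(np.array([0.2, -0.5, 0.9]), np.ones((2, 3))))
```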
In addition, in the embodiments of the present application, since the whole architecture uses a chain structure, the system scale can be conveniently expanded. The four-stage transmission architecture used for illustration in the embodiments of the present application is merely an example and is not limiting. Moreover, for the first operator module in the architecture shown in fig. 6, any operator suitable for implementation by a dedicated acceleration circuit may be used. Furthermore, the functional grouping in an artificial intelligence network may take many forms and is not limited to the examples in the embodiments of the present application.
The present embodiment provides a neural network acceleration method, which is applied to the neural network acceleration device 20 described in the foregoing embodiments: input characteristic data is acquired through the in-memory computing array and a convolution operation is performed on it to obtain an initial calculation result; an operator operation is performed on the initial calculation result through a first type operator in the first operator module to obtain an intermediate calculation result; and the intermediate calculation result is used as the input characteristic data of the next operation unit, until all of the plurality of operation units have completed processing and a target output result is determined. In this way, the neural network acceleration device uses a chain structure, that is, the intermediate calculation result output by the current operation unit is used as the input characteristic data of the next operation unit, so the system scale is easy to expand; moreover, by fully exploiting the characteristics of the intelligent algorithm structure and of the in-memory computing array, the amount of data transferred between the processor and the memory can be reduced, the data movement overhead is lowered, and power consumption is further reduced; in addition, the in-memory computing array can also reduce the computational complexity, thereby improving the overall performance of the system.
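Because the chain can also pipeline consecutive characteristic blocks (the next block enters the first operation unit while earlier blocks move down the chain, as in the characteristic-block scheme described for the receiving unit), the overall dataflow can be sketched as below. This is an assumed, idealised schedule with hypothetical names (pipeline, step); it is not the actual control logic.

```python
from collections import deque

def pipeline(feature_blocks, num_units, step):
    """Toy cycle-by-cycle schedule of the chain: in each cycle every operation
    unit hands its intermediate result to the next unit, and the first unit
    takes the next characteristic block. step(u, x) stands in for
    'operation unit u processes data x'."""
    stages = [None] * num_units            # data currently held by each unit
    pending = deque(feature_blocks)
    outputs = []
    while pending or any(s is not None for s in stages):
        if stages[-1] is not None:         # the last unit has finished a block
            outputs.append(stages[-1])
        for u in range(num_units - 1, 0, -1):   # shift results down the chain
            stages[u] = step(u, stages[u - 1]) if stages[u - 1] is not None else None
        stages[0] = step(0, pending.popleft()) if pending else None
    return outputs

# Example: 4 characteristic blocks through 3 chained units; each "unit" simply
# tags the data it touched so the flow is visible.
print(pipeline(["b0", "b1", "b2", "b3"], 3, lambda u, x: f"{x}->u{u}"))
```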
In yet another embodiment of the present application, the neural network acceleration device 20 described in the foregoing embodiments may be implemented in the form of hardware or in the form of a software functional module. If implemented as a software functional module and not sold or used as a stand-alone product, it may be stored on a computer-readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the method described in the present embodiment. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Accordingly, the present embodiment provides a computer storage medium storing a computer program which, when executed by at least one processor, implements the neural network acceleration method of any one of the preceding embodiments.
In still another embodiment of the present application, based on the foregoing composition of the neural network acceleration device 20 and the computer storage medium, reference is made to fig. 9, which shows a schematic diagram of a specific hardware structure of an electronic device according to an embodiment of the present application. As shown in fig. 9, the electronic device 90 may include a processor 901, and the processor 901 may call and execute a computer program from a memory to implement the neural network acceleration method according to any one of the foregoing embodiments.
Optionally, as shown in fig. 9, the electronic device 90 may also include a memory 902. Wherein the processor 901 may call and run a computer program from the memory 902 to implement the neural network acceleration method of any of the foregoing embodiments.
The memory 902 may be a separate device independent of the processor 901, or may be integrated into the processor 901.
Optionally, as shown in fig. 9, the electronic device 90 may further include a transceiver 903, and the processor 901 may control the transceiver 903 to communicate with other devices, and in particular, may send information or data to other devices, or receive information or data sent by other devices.
The transceiver 903 may include a transmitter and a receiver, and the transceiver 903 may further include one or more antennas.
Alternatively, the electronic device 90 may be a smart phone, a tablet computer, a palmtop computer, a notebook computer, a desktop computer, or the like, or a device integrated with the neural network acceleration device 20 according to any one of the foregoing embodiments. Here, the electronic device 90 may implement the corresponding procedures described in the methods of the embodiments of the present application, which are not described herein for brevity.
In still another embodiment of the present application, based on the composition of the neural network acceleration device 20 and the computer storage medium, in a possible example, referring to fig. 10, a schematic diagram of the composition structure of a chip provided in an embodiment of the present application is shown. As shown in fig. 10, the chip 100 may include the neural network acceleration device 20 described in any of the previous embodiments.
In another possible example, referring to fig. 11, a specific hardware architecture diagram of a chip provided in an embodiment of the present application is shown. As shown in fig. 11, the chip 100 may include a processor 1101, and the processor 1101 may call and execute a computer program from a memory to implement the neural network acceleration method according to any of the foregoing embodiments.
Optionally, as shown in fig. 11, the chip 100 may further include a memory 1102. Wherein the processor 1101 may call and run a computer program from the memory 1102 to implement the neural network acceleration method according to any of the previous embodiments. It is noted that the memory 1102 may be a separate device from the processor 1101 or may be integrated in the processor 1101.
Optionally, as shown in fig. 11, the chip 100 may further comprise an input interface 1103. The processor 1101 may control the input interface 1103 to communicate with other devices or chips, and in particular, may acquire information or data sent by other devices or chips.
Optionally, as shown in fig. 11, the chip 100 may further include an output interface 1104. Wherein the processor 1101 may control the output interface 1104 to communicate with other devices or chips, and in particular may output information or data to the other devices or chips.
Alternatively, the chip 100 may be applied to the electronic device described in the foregoing embodiment, and the chip may implement the corresponding processes described in the methods of the embodiments of the present application, which are not described herein for brevity.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-a-chip, system-on-chip, chip system or system-on-chip chips, etc., and are not limited in any way herein.
It should be noted that the processor in the embodiments of the present application may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
It should also be noted that the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which is used as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to include, without being limited to, these and any other suitable types of memory.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application-specific integrated circuits (Application Specific Integrated Circuits, ASIC), digital signal processors (Digital Signal Processors, DSP), digital signal processing devices (Digital Signal Processing Devices, DSPD), programmable logic devices (Programmable Logic Device, PLD), field programmable gate arrays (Field-Programmable Gate Array, FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof. For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
It should be noted that, in this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are for description only and do not represent the relative merits of the embodiments.
The methods disclosed in the several method embodiments provided in the present application may be arbitrarily combined without collision to obtain a new method embodiment.
The features disclosed in the several product embodiments provided in the present application may be combined arbitrarily without conflict to obtain new product embodiments.
The features disclosed in the several method or apparatus embodiments provided in the present application may be arbitrarily combined without conflict to obtain new method embodiments or apparatus embodiments.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (19)

1. A neural network acceleration device, characterized by comprising a plurality of operation units, wherein each operation unit comprises an in-memory computing array and a first operator module, and the first operator module comprises a plurality of first type operators; wherein,
the in-memory computing array is used for acquiring input characteristic data and carrying out convolution operation on the input characteristic data to obtain an initial computing result;
the first operator module is configured to perform operator operation on the initial calculation result through the first type operator, obtain an intermediate calculation result, and use the intermediate calculation result as input feature data of a next operation unit.
2. The neural network acceleration apparatus of claim 1, wherein weight parameters corresponding to a target convolutional layer are pre-stored in the in-memory computing array; wherein,
and the in-memory computing array is used for carrying out convolution operation on the input characteristic data according to the weight parameters after the input characteristic data corresponding to the target convolution layer is acquired, so as to obtain the initial computing result.
3. The neural network acceleration apparatus of claim 2, wherein the in-memory computing array comprises a digital-to-analog conversion module, a storage array, and an analog-to-digital conversion module; wherein,
the digital-to-analog conversion module is used for carrying out digital-to-analog conversion on the input characteristic data to obtain a first analog signal;
the storage array is used for performing multiply-accumulate calculation according to the weight parameter and the first analog signal to obtain a second analog signal;
the analog-to-digital conversion module is used for performing analog-to-digital conversion on the second analog signal to obtain a target digital signal, and determining the target digital signal as the initial calculation result.
4. The neural network acceleration apparatus according to claim 2, wherein the operation unit is an ith operation unit, and the in-memory computing array in the ith operation unit stores weight parameters corresponding to an ith convolution layer in advance; wherein,
the in-memory computing array is used for acquiring the input characteristic data corresponding to the ith convolution layer, and carrying out convolution operation on the input characteristic data corresponding to the ith convolution layer according to the weight parameter corresponding to the ith convolution layer to obtain an initial computing result of the ith convolution layer;
the first operator module is configured to perform an operator operation on the initial calculation result of the ith convolution layer through the first type operator, obtain an intermediate calculation result of the ith convolution layer, and determine the intermediate calculation result of the ith convolution layer as input feature data corresponding to the (i+1)th convolution layer;
wherein i is an integer greater than zero and less than or equal to N; n represents the number of the operation units, and N is an integer greater than zero.
5. The neural network acceleration apparatus of claim 1, further comprising a receiving unit; wherein,
the receiving unit is used for receiving the characteristic image, dividing the characteristic image into at least one characteristic block and sequentially reading the at least one characteristic block into the operation unit.
6. The neural network acceleration apparatus of claim 5, wherein,
in the plurality of operation units, the input characteristic data of a first operation unit is a first characteristic block; after the intermediate calculation result output by the first operation unit is obtained, the intermediate calculation result output by the first operation unit is used as the input characteristic data of a next operation unit, and a next characteristic block is used as the input characteristic data of the first operation unit, until all of the plurality of operation units have completed processing.
7. The neural network acceleration apparatus of claim 1, further comprising a digital signal processor; wherein,
and the digital signal processor is used for processing the initial calculation result to obtain the intermediate calculation result under the condition that the first type operator cannot be used.
8. The neural network acceleration apparatus of claim 7, wherein the first type of operator corresponds to an acceleration operation applicable to a dedicated digital circuit, and the digital signal processor is configured to process operations other than the first type of operator that are not applicable to a dedicated digital circuit;
the first type of operators at least comprise one of the following: an operator for performing a pooling operation, an operator for performing an activation function operation, and an operator for performing an addition operation.
9. A neural network acceleration method, characterized by being applied to a neural network acceleration device, wherein the neural network acceleration device comprises a plurality of operation units, and each operation unit comprises an in-memory computing array and a first operator module; the method comprises the following steps:
acquiring input characteristic data through the in-memory computing array, and performing convolution operation on the input characteristic data to obtain an initial computing result;
performing operator operation on the initial calculation result through a first type operator in the first operator module to obtain an intermediate calculation result;
and taking the intermediate calculation result as input characteristic data of the next operation unit until all the operation units are processed, and determining a target output result.
10. The method of claim 9, wherein the in-memory computing array has weight parameters corresponding to a target convolutional layer stored in advance;
correspondingly, the obtaining the input feature data through the in-memory computing array, and performing convolution operation on the input feature data to obtain an initial computing result, including:
and after the in-memory computing array acquires the input characteristic data corresponding to the target convolution layer, carrying out convolution operation on the input characteristic data according to the weight parameters to obtain the initial computing result.
11. The method according to claim 10, wherein the convolving the input feature data according to the weight parameter to obtain the initial calculation result includes:
performing digital-to-analog conversion on the input characteristic data to obtain a first analog signal;
performing multiply-accumulate calculation according to the weight parameter and the first analog signal to obtain a second analog signal;
and carrying out analog-to-digital conversion on the second analog signal to obtain a target digital signal, and determining the target digital signal as the initial calculation result.
12. The method of claim 10, wherein when the in-memory computing array in the ith operation unit stores weight parameters corresponding to the ith convolution layer in advance, the method further comprises:
acquiring input characteristic data corresponding to the ith convolution layer through the in-memory computing array, and carrying out convolution operation on the input characteristic data corresponding to the ith convolution layer according to the weight parameter corresponding to the ith convolution layer to obtain an initial computing result of the ith convolution layer;
performing an operator operation on the initial calculation result of the ith convolution layer through a first type operator in the first operator module to obtain an intermediate calculation result of the ith convolution layer, determining the intermediate calculation result of the ith convolution layer as input characteristic data corresponding to the (i+1)th convolution layer, and inputting the input characteristic data into an (i+1)th operation unit for relevant processing;
Wherein i is an integer greater than zero and less than or equal to N; n represents the number of the operation units, and N is an integer greater than zero.
13. The method of claim 10, wherein when the in-memory computing array in the ith operation unit stores weight parameters corresponding to the ith convolution layer and the (i+1)th convolution layer in advance, the method further comprises:
acquiring input characteristic data corresponding to the ith convolution layer through the in-memory computing array, and carrying out convolution operation on the input characteristic data corresponding to the ith convolution layer according to the weight parameter corresponding to the ith convolution layer to obtain an initial computing result of the ith convolution layer;
performing an operator operation on the initial calculation result of the ith convolution layer through a first type operator in the first operator module to obtain an intermediate calculation result of the ith convolution layer, determining the intermediate calculation result of the ith convolution layer as input characteristic data corresponding to the (i+1)th convolution layer, and still inputting the input characteristic data into the ith operation unit for relevant processing;
after obtaining the intermediate calculation result of the (i+1)th convolution layer by the ith operation unit, determining the intermediate calculation result of the (i+1)th convolution layer as input characteristic data corresponding to the (i+2)th convolution layer and inputting the input characteristic data into the (i+1)th operation unit for relevant processing;
Wherein i is an integer greater than zero and less than or equal to N; n represents the number of the operation units, and N is an integer greater than zero.
14. The method according to claim 9, wherein the method further comprises:
receiving a characteristic image;
dividing the characteristic image into at least one characteristic block, and sequentially reading the at least one characteristic block into the operation unit according to the sequence;
and the input characteristic data of the first operation unit is a first characteristic block, after the intermediate calculation result output by the first operation unit is obtained, the intermediate calculation result output by the first operation unit is used as the input characteristic data of the next operation unit, and the next characteristic block is used as the input characteristic data of the first operation unit until all the processing of the plurality of operation units is completed.
15. The method of claim 9, wherein the neural network acceleration device further comprises a digital signal processor, the method further comprising:
and under the condition that the first type operator cannot be used, the digital signal processor processes the initial calculation result to obtain the intermediate calculation result.
16. The method of claim 15, wherein the first type of operator corresponds to an accelerated operation applicable to a dedicated digital circuit, and wherein the digital signal processor is configured to process operations other than the first type of operator that are not applicable to a dedicated digital circuit;
the first type of operators at least comprise one of the following: an operator for performing a pooling operation, an operator for performing an activation function operation, and an operator for performing an addition operation.
17. A chip comprising a neural network acceleration device according to any one of claims 1 to 8.
18. An electronic device comprising a memory and a processor; wherein,
the memory is used for storing a computer program capable of running on the processor;
the processor being configured to perform the method of any of claims 9 to 16 when the computer program is run.
19. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the method according to any of claims 9 to 16.
CN202111592393.6A 2021-12-23 2021-12-23 Neural network acceleration device, method, equipment and computer storage medium Pending CN116362312A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111592393.6A CN116362312A (en) 2021-12-23 2021-12-23 Neural network acceleration device, method, equipment and computer storage medium
PCT/CN2022/133443 WO2023116314A1 (en) 2021-12-23 2022-11-22 Neural network acceleration apparatus and method, and device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111592393.6A CN116362312A (en) 2021-12-23 2021-12-23 Neural network acceleration device, method, equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN116362312A true CN116362312A (en) 2023-06-30

Family

ID=86901193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111592393.6A Pending CN116362312A (en) 2021-12-23 2021-12-23 Neural network acceleration device, method, equipment and computer storage medium

Country Status (2)

Country Link
CN (1) CN116362312A (en)
WO (1) WO2023116314A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881195A (en) * 2023-09-04 2023-10-13 北京怀美科技有限公司 Chip system facing detection calculation and chip method facing detection calculation
CN117348998A (en) * 2023-12-04 2024-01-05 北京怀美科技有限公司 Acceleration chip architecture applied to detection and calculation method
CN117829149A (en) * 2024-02-29 2024-04-05 苏州元脑智能科技有限公司 Language model hybrid training method and device, electronic equipment and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117057400B (en) * 2023-10-13 2023-12-26 芯原科技(上海)有限公司 Visual image processor, neural network processor and image convolution calculation method
CN117077726B (en) * 2023-10-17 2024-01-09 之江实验室 Method, device and medium for generating in-memory computing neural network model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3671748A1 (en) * 2018-12-21 2020-06-24 IMEC vzw In-memory computing for machine learning
CN113159302B (en) * 2020-12-15 2022-07-19 浙江大学 Routing structure for reconfigurable neural network processor
CN113222107A (en) * 2021-03-09 2021-08-06 北京大学 Data processing method, device, equipment and storage medium
CN113743600B (en) * 2021-08-26 2022-11-11 南方科技大学 Storage and calculation integrated architecture pulse array design method suitable for multi-precision neural network

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881195A (en) * 2023-09-04 2023-10-13 北京怀美科技有限公司 Chip system facing detection calculation and chip method facing detection calculation
CN116881195B (en) * 2023-09-04 2023-11-17 北京怀美科技有限公司 Chip system facing detection calculation and chip method facing detection calculation
CN117348998A (en) * 2023-12-04 2024-01-05 北京怀美科技有限公司 Acceleration chip architecture applied to detection and calculation method
CN117829149A (en) * 2024-02-29 2024-04-05 苏州元脑智能科技有限公司 Language model hybrid training method and device, electronic equipment and storage medium
CN117829149B (en) * 2024-02-29 2024-05-31 苏州元脑智能科技有限公司 Language model hybrid training method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023116314A1 (en) 2023-06-29

Similar Documents

Publication Publication Date Title
CN116362312A (en) Neural network acceleration device, method, equipment and computer storage medium
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN110998570A (en) Hardware node having matrix vector unit with block floating point processing
US20210357735A1 (en) Split accumulator for convolutional neural network accelerator
CN110929865B (en) Network quantification method, service processing method and related product
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN113326930B (en) Data processing method, neural network training method, related device and equipment
CN110807513A (en) Convolutional neural network accelerator based on Winograd sparse algorithm
US11593628B2 (en) Dynamic variable bit width neural processor
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN111582465B (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
US20220207327A1 (en) Method for dividing processing capabilities of artificial intelligence between devices and servers in network environment
CN110991630A (en) Convolutional neural network processor for edge calculation
KR20190098671A (en) High speed processing method of neural network and apparatus using thereof
KR20220095533A (en) Neural network processing unit with Network Processor and convolution array
CN114005458A (en) Voice noise reduction method and system based on pipeline architecture and storage medium
US20230376733A1 (en) Convolutional neural network accelerator hardware
CN113537479A (en) Neural network circuit, edge device, and neural network operation method
WO2021081854A1 (en) Convolution operation circuit and convolution operation method
Zhan et al. Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems
CN111788567B (en) Data processing equipment and data processing method
WO2023115814A1 (en) Fpga hardware architecture, data processing method therefor and storage medium
CN110766136A (en) Compression method of sparse matrix and vector
Sawaguchi et al. Slightly-slacked dropout for improving neural network learning on FPGA
CN112561050A (en) Neural network model training method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination