CN116090530A - Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number - Google Patents


Info

Publication number
CN116090530A
CN116090530A
Authority
CN
China
Prior art keywords
data
convolution
convolution kernel
kernel size
parallel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310150887.1A
Other languages
Chinese (zh)
Inventor
窦思远
朱博源
杨冬立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Songke Intelligent Technology Co ltd
Original Assignee
Guangdong Songke Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Songke Intelligent Technology Co ltd filed Critical Guangdong Songke Intelligent Technology Co ltd
Priority to CN202310150887.1A priority Critical patent/CN116090530A/en
Publication of CN116090530A publication Critical patent/CN116090530A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a systolic array structure and method with configurable convolution kernel size and parallel calculation number, comprising the following steps: step one, fixing the weight data; step two, broadcasting and arranging the input feature map data; step three, convolution calculation; step four, judging the validity of the output data; step five, data storage. The invention adopts a systolic array structure with configurable convolution kernel size and parallel quantity, a systolic array multiply-accumulate pipeline structure, a systolic array window sliding and broadcasting structure, and a parallel output valid-signal structure. Data need not be fetched from memory into a buffer or cache, realizing hardware acceleration of the convolution layer; the systolic array structure with configurable convolution kernel size and parallel calculation number supports convolution calculations of different sizes and different degrees of parallelism, improves data reuse, reduces the consumption of hardware calculation units, and improves convolution calculation efficiency.

Description

Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a systolic array structure and method capable of configuring the convolution kernel size and the number of parallel calculations.
Background
With technological development, artificial intelligence is widely applied in various fields, and research on and application of convolutional neural networks have always been at its core. Neural networks can provide high-precision inference for a wide range of applications. As neural networks become deeper and more complex, demands on their calculation speed and hardware area increase. In a convolutional neural network, the convolution layers account for more than 80% of the computation, so to deploy an efficient network on a low-power SoC, hardware acceleration of the convolution layer is indispensable.
In a processor architecture, a large amount of data is stored in memory outside the processor. When the computing module needs to operate, the data must be fetched from memory into a buffer or cache and then sent to the computing module. In typical convolutional-layer calculation, the convolution operation time is much shorter than the data transfer time: a convolutional neural network needs a large amount of data, and the data access speed is far lower than the data processing speed.
A traditional convolution computing unit suffers from low data reuse, high hardware-unit consumption, and low convolution calculation efficiency. A traditional systolic array is optimized for a specific convolution kernel size and degree of parallelism, and the array structure must be redesigned whenever the network structure changes.
Disclosure of Invention
The present invention is directed to a systolic array structure and method capable of configuring the convolution kernel size and the number of parallel computations, so as to solve the above-mentioned problems in the prior art.
In order to achieve the above purpose, the present invention provides the following technical solution: a systolic ARRAY structure with configurable convolution kernel size and parallel calculation number comprises an IFMAP_RAM unit, a PE_ARRAY unit and a WEIGHT_RAM unit, wherein the IFMAP_RAM unit is connected to the PE_ARRAY unit through an outdata interface, the PE_ARRAY unit is connected to the WEIGHT_RAM unit through taps, and the PE_ARRAY unit connects to the outside through a done interface, an outsum_final interface and an ovalid interface.
Preferably, the pe_array unit includes a multiplier and a D flip-flop, and the multiplier is an 8-bit multiplier.
Preferably, the PE_ARRAY unit contains 7×15 PE units, and the supported configuration is one of: 21 parallel 2×2 convolution kernels, 10 parallel 3×3 convolution kernels, 3 parallel 5×5 convolution kernels, or 2 parallel 7×7 convolution kernels.
Preferably, in the PE_ARRAY unit, when the convolution kernel size is 7 the maximum parallelism is 2; when the kernel size is 5 the maximum parallelism is 3; when the kernel size is 3 the maximum parallelism is 10; and when the kernel size is 2 the maximum parallelism is 21.
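The maximum-parallelism figures above are consistent with simply tiling whole k×k kernels into the 7×15 PE array; the following sketch illustrates that relationship (the tiling model is an assumption for illustration, not stated explicitly in the patent):

```python
# Hypothetical tiling model: a k x k kernel occupies a k x k block of PEs,
# and parallel kernels are packed as non-overlapping blocks in the 7 x 15 array.
ROWS, COLS = 7, 15

def max_parallelism(kernel_size: int) -> int:
    """Number of k x k kernel tiles that fit in the PE array."""
    return (ROWS // kernel_size) * (COLS // kernel_size)

# Matches the stated figures: kernel 2 -> 21, 3 -> 10, 5 -> 3, 7 -> 2
for k, expected in [(2, 21), (3, 10), (5, 3), (7, 2)]:
    assert max_parallelism(k) == expected, (k, max_parallelism(k))
```

Under this model the parallelism figures fall out of the array geometry alone, which is consistent with the claim that only configuration signals, not the array structure, change between kernel sizes.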
The method for a systolic array with configurable convolution kernel size and parallel calculation number comprises the following steps: step one, fixing the weight data; step two, broadcasting and arranging the input feature map data; step three, convolution calculation; step four, judging the validity of the output data; step five, data storage;
in step one, the convolution kernel data in RAM is fetched and, according to the convolution kernel size signal and the parallel quantity signal, sent to a sliding window. The sliding window is connected to the top row of PE units of the systolic array; each cycle, the sliding window and the PE rows pass the data downward, until the convolution-kernel sliding enable signal is set to zero, at which point the different weight data are fixed in the systolic array units;
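Step one is weight-stationary loading: weights shift down the array one row per cycle until every row holds its kernel values. A behavioral sketch in Python, with invented names for illustration (the real design uses the sliding-enable signal to stop loading):

```python
def load_weights(weight_stream, rows=7):
    """Shift weight rows downward one row per cycle, as the sliding window
    feeds the top PE row; loading stops when the stream is exhausted
    (modeling the sliding-enable signal going to zero)."""
    array = [None] * rows              # array[0] is the top PE row
    for w in weight_stream:
        # each cycle, every occupied row passes its data down one row
        for r in range(rows - 1, 0, -1):
            array[r] = array[r - 1]
        array[0] = w                   # top row takes new data from the window
    return array

# After 3 cycles of a 3-row kernel the first-fed row has sunk to the bottom.
loaded = load_weights(["w_row2", "w_row1", "w_row0"], rows=3)
assert loaded == ["w_row0", "w_row1", "w_row2"]
```

Note the stream must be fed bottom row first so the weights land in order, which is why the loading order matters in a weight-stationary design.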
in step two, once the weight data from step one is fixed, the input feature map data in the ping-pong buffer is fetched and transmitted to the sliding window according to the input feature map size signal, the convolution kernel size signal and the parallel quantity signal, and the sliding-window output data is arranged. If the parallel quantity is greater than 1, the data must be broadcast and connected to the leftmost column of the systolic array; the sliding window and PE units then pass the data one step to the right each cycle, until the rightmost PE unit that needs to calculate receives its input feature map data and calculation begins;
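Two consequences of this step can be sketched directly: the window output is replicated once per parallel kernel, and the rightmost PE only starts after the data has marched across the columns. A minimal model (the grouping scheme and the one-column-per-cycle latency are illustrative assumptions):

```python
def broadcast_to_groups(window_data, parallel_num):
    """Replicate one sliding-window output for each parallel kernel group
    feeding the leftmost column of the systolic array."""
    if parallel_num <= 1:
        return [list(window_data)]     # no broadcast needed
    return [list(window_data) for _ in range(parallel_num)]

def cycles_until_start(num_columns):
    """Data enters the leftmost column and moves one column per cycle, so the
    rightmost active PE sees its first input after num_columns - 1 cycles,
    at which point the operation-enable signal is raised."""
    return num_columns - 1

groups = broadcast_to_groups([1, 2, 3], parallel_num=3)
assert len(groups) == 3 and all(g == [1, 2, 3] for g in groups)
assert cycles_until_start(15) == 14    # full 15-column array
```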
in step three, when the input feature map data reaches the rightmost PE unit from step two, the operation enable signal is pulled high and the PE units begin convolving the fixed weight data with the input feature map data. Multiplication results are produced as the input feature map data flows from left to right, and the multiplication results belonging to the same convolution kernel are then accumulated pairwise according to the convolution kernel size signal: if the convolution kernel size is 2×2, two cycles are needed to obtain the accumulation result; if 3×3, four cycles; if 5×5, six cycles; and if 7×7, also six cycles;
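These cycle counts follow from pairwise (adder-tree) accumulation of the k² partial products: 4 products reduce in 2 rounds, 9 in 4, 49 in 6. (Pure pairwise reduction of 25 products takes 5 rounds; the six cycles the patent states for 5×5 presumably include an extra pipeline stage.) A sketch of the round count:

```python
def reduction_rounds(n_products: int) -> int:
    """Rounds of pairwise accumulation needed to sum n partial products:
    each round halves the count (rounding up, odd element carried forward)."""
    rounds = 0
    while n_products > 1:
        n_products = (n_products + 1) // 2
        rounds += 1
    return rounds

assert reduction_rounds(2 * 2) == 2   # 2x2 kernel: two cycles
assert reduction_rounds(3 * 3) == 4   # 3x3 kernel: four cycles
assert reduction_rounds(7 * 7) == 6   # 7x7 kernel: six cycles
assert reduction_rounds(5 * 5) == 5   # 5x5: five by pure tree depth
```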
in step four, when the accumulation in step three completes, the data is output through the output port. Not all output-port data is valid at this time, so the valid outputs must be identified from the output-data valid flag bits and the parallel quantity, with the flags set according to the completion time of each column's accumulation;
in step five, after steps one through four are completed, the output buffer stores data according to the systolic array's output data, the out_enable signal, and the parallel quantity.
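Taken together, steps one through five compute an ordinary valid-mode 2D convolution for each parallel kernel. A plain-Python reference model of the result the array must produce (the functional specification, not the hardware dataflow):

```python
def conv2d_valid(ifmap, kernel):
    """Reference 2D convolution (valid padding, stride 1) that a
    weight-stationary systolic array implementation must reproduce."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(ifmap) - kh + 1
    ow = len(ifmap[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(
                ifmap[i + u][j + v] * kernel[u][v]
                for u in range(kh) for v in range(kw)
            )
    return out

ifmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]          # a 2x2 kernel (max parallelism 21)
assert conv2d_valid(ifmap, kernel) == [[6, 8], [12, 14]]
```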
Preferably, in step two, if the convolution kernel size is 2 and the number of parallel convolutions is 21, the systolic array is arranged as shown in the bottom-right diagram of fig. 3. When the input feature map flows into the systolic array, the convolution operations of the color-coded PEs in rows 1 and 2 start in the 2nd cycle, those in rows 3 and 4 start in the 4th cycle, and those in rows 5 and 6 start in the 6th cycle; the operations in rows 1 and 2 complete at the 5th-from-last cycle, those in rows 3 and 4 at the 3rd-from-last cycle, and those in rows 5 and 6 at the last cycle.
Preferably, in step four, whether an output signal is valid is marked jointly by ovalid[2:0], kernel_size[1:0] and kernel_num[4:0].
Compared with the prior art, the invention has the following beneficial effects: the invention adopts a systolic array structure with configurable convolution kernel size and parallel quantity, a systolic array multiply-accumulate pipeline structure, a systolic array window sliding and broadcasting structure, and a parallel output valid-signal structure. Data need not be fetched from memory into a buffer or cache, which aids hardware acceleration of the convolution layer; the systolic array structure with configurable convolution kernel size and parallel calculation number supports convolution calculations of different sizes and degrees of parallelism, further improves data reuse, reduces the consumption of hardware calculation units, and improves convolution calculation efficiency.
Drawings
FIG. 1 is a block diagram of a system architecture of the present invention;
FIG. 2 is a diagram of a systolic array structure of the present invention;
FIG. 3 is a diagram showing the output of the systolic array to the output feature map cache;
FIG. 4 is a flow chart of the method of the present invention;
in the figure: 1. IFMAP_RAM unit; 2. PE_ARRAY unit; 3. WEIGHT_RAM unit.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to figs. 1-3, an embodiment of the present invention is provided: a systolic ARRAY structure with configurable convolution kernel size and parallel calculation number comprises an IFMAP_RAM unit 1, a PE_ARRAY unit 2 and a WEIGHT_RAM unit 3. The IFMAP_RAM unit 1 is connected to the PE_ARRAY unit 2 through an outdata interface, the PE_ARRAY unit 2 is connected to the WEIGHT_RAM unit 3 through taps, and the PE_ARRAY unit 2 connects to the outside through a done interface, an outsum_final interface and an ovalid interface. The PE_ARRAY unit 2 comprises a multiplier and a D flip-flop, the multiplier being an 8-bit multiplier. The PE_ARRAY unit 2 contains 7×15 PE units; the supported configuration is one of 21 parallel 2×2 convolution kernels, 10 parallel 3×3 convolution kernels, 3 parallel 5×5 convolution kernels, or 2 parallel 7×7 convolution kernels. When the convolution kernel size is 7 the maximum parallelism is 2; when it is 5 the maximum parallelism is 3; when it is 3 the maximum parallelism is 10; and when it is 2 the maximum parallelism is 21.
Referring to fig. 4, an embodiment of the present invention is provided: the method for a systolic array with configurable convolution kernel size and parallel calculation number comprises the following steps: step one, fixing the weight data; step two, broadcasting and arranging the input feature map data; step three, convolution calculation; step four, judging the validity of the output data; step five, data storage;
in step one, the convolution kernel data in RAM is fetched and, according to the convolution kernel size signal and the parallel quantity signal, sent to a sliding window. The sliding window is connected to the top row of PE units of the systolic array; each cycle, the sliding window and the PE rows pass the data downward, until the convolution-kernel sliding enable signal is set to zero, at which point the different weight data are fixed in the systolic array units;
in step two, once the weight data from step one is fixed, the input feature map data in the ping-pong buffer is fetched and transmitted to the sliding window according to the input feature map size signal, the convolution kernel size signal and the parallel quantity signal, and the sliding-window output data is arranged. If the parallel quantity is greater than 1, the data must be broadcast and connected to the leftmost column of the systolic array; the sliding window and PE units then pass the data one step to the right each cycle, until the rightmost PE unit that needs to calculate receives its input feature map data and calculation begins. If the convolution kernel size is 2 and the number of parallel convolutions is 21, the systolic array is arranged as shown in the bottom-right diagram of fig. 3: when the input feature map flows into the systolic array, the convolution operations of the color-coded PEs in rows 1 and 2 start in the 2nd cycle, those in rows 3 and 4 start in the 4th cycle, and those in rows 5 and 6 start in the 6th cycle; the operations in rows 1 and 2 complete at the 5th-from-last cycle, those in rows 3 and 4 at the 3rd-from-last cycle, and those in rows 5 and 6 at the last cycle;
in step three, when the input feature map data reaches the rightmost PE unit from step two, the operation enable signal is pulled high and the PE units begin convolving the fixed weight data with the input feature map data. Multiplication results are produced as the input feature map data flows from left to right, and the multiplication results belonging to the same convolution kernel are then accumulated pairwise according to the convolution kernel size signal: if the convolution kernel size is 2×2, two cycles are needed to obtain the accumulation result; if 3×3, four cycles; if 5×5, six cycles; and if 7×7, also six cycles;
in step four, when the accumulation in step three completes, the data is output through the output port. Not all output-port data is valid at this time, so the valid outputs are identified from the output-data valid flag bits and the parallel quantity, with the flags set according to the completion time of each column's accumulation; whether an output signal is valid is marked jointly by ovalid[2:0], kernel_size[1:0] and kernel_num[4:0];
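The three flags fit in a single 10-bit status word. The packing below is an illustrative assumption: the patent specifies only the field widths ovalid[2:0], kernel_size[1:0] and kernel_num[4:0], not bit positions, and the 2-bit kernel_size and 5-bit kernel_num widths are just enough for four kernel sizes (2/3/5/7) and up to 21 parallel kernels:

```python
# Hypothetical packing of the output-valid status fields into one word.
# Field widths come from the patent; bit positions are an assumption.
def pack_status(ovalid: int, kernel_size: int, kernel_num: int) -> int:
    assert 0 <= ovalid < 8           # ovalid[2:0]
    assert 0 <= kernel_size < 4      # kernel_size[1:0]: encodes sizes 2/3/5/7
    assert 0 <= kernel_num < 32      # kernel_num[4:0]: up to 21 parallel kernels
    return (ovalid << 7) | (kernel_size << 5) | kernel_num

def unpack_status(word: int):
    return (word >> 7) & 0x7, (word >> 5) & 0x3, word & 0x1F

w = pack_status(ovalid=5, kernel_size=3, kernel_num=21)
assert unpack_status(w) == (5, 3, 21)
```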
in step five, after steps one through four are completed, the output buffer stores data according to the systolic array's output data, the out_enable signal, and the parallel quantity.
In summary, the advantages of the invention are that it adopts a systolic array structure with configurable convolution kernel size and parallel quantity, a systolic array multiply-accumulate pipeline structure, a systolic array window sliding and broadcasting structure, and a parallel output valid-signal structure; the systolic array structure with configurable convolution kernel size and parallel calculation number supports convolution calculations of different sizes and degrees of parallelism, further improves data reuse, reduces the consumption of hardware calculation units, and improves convolution calculation efficiency.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (7)

1. A systolic ARRAY structure with configurable convolution kernel size and parallel calculation count, comprising an IFMAP_RAM unit (1), a PE_ARRAY unit (2) and a WEIGHT_RAM unit (3), characterized in that: the IFMAP_RAM unit (1) establishes a data connection with the PE_ARRAY unit (2) through an outdata interface, the PE_ARRAY unit (2) establishes a data connection with the WEIGHT_RAM unit (3) through taps, and the PE_ARRAY unit (2) establishes a data connection with the outside through a done interface, an outsum_final interface and an ovalid interface.
2. The systolic array structure with configurable convolution kernel size and parallel calculation count according to claim 1, characterized in that: the PE_ARRAY unit (2) comprises a multiplier and a D flip-flop, the multiplier being an 8-bit multiplier.
3. The systolic array structure with configurable convolution kernel size and parallel calculation count according to claim 1, characterized in that: the PE_ARRAY unit (2) contains 7×15 PE units, and the supported configuration is one of 21 parallel 2×2 convolution kernels, 10 parallel 3×3 convolution kernels, 3 parallel 5×5 convolution kernels, or 2 parallel 7×7 convolution kernels.
4. The systolic array structure with configurable convolution kernel size and parallel calculation count according to claim 1, characterized in that: in the PE_ARRAY unit (2), when the convolution kernel size is 7 the maximum parallelism is 2; when it is 5 the maximum parallelism is 3; when it is 3 the maximum parallelism is 10; and when it is 2 the maximum parallelism is 21.
5. A method for a systolic array with configurable convolution kernel size and parallel calculation number, comprising: step one, fixing the weight data; step two, broadcasting and arranging the input feature map data; step three, convolution calculation; step four, judging the validity of the output data; step five, data storage; characterized in that:
in step one, the convolution kernel data in RAM is fetched and, according to the convolution kernel size signal and the parallel quantity signal, sent to a sliding window. The sliding window is connected to the top row of PE units of the systolic array; each cycle, the sliding window and the PE rows pass the data downward, until the convolution-kernel sliding enable signal is set to zero, at which point the different weight data are fixed in the systolic array units;
in step two, once the weight data from step one is fixed, the input feature map data in the ping-pong buffer is fetched and transmitted to the sliding window according to the input feature map size signal, the convolution kernel size signal and the parallel quantity signal, and the sliding-window output data is arranged. If the parallel quantity is greater than 1, the data must be broadcast and connected to the leftmost column of the systolic array; the sliding window and PE units then pass the data one step to the right each cycle, until the rightmost PE unit that needs to calculate receives its input feature map data and calculation begins;
in step three, when the input feature map data reaches the rightmost PE unit from step two, the operation enable signal is pulled high and the PE units begin convolving the fixed weight data with the input feature map data. Multiplication results are produced as the input feature map data flows from left to right, and the multiplication results belonging to the same convolution kernel are then accumulated pairwise according to the convolution kernel size signal: if the convolution kernel size is 2×2, two cycles are needed to obtain the accumulation result; if 3×3, four cycles; if 5×5, six cycles; and if 7×7, also six cycles;
in step four, when the accumulation in step three completes, the data is output through the output port. Not all output-port data is valid at this time, so the valid outputs must be identified from the output-data valid flag bits and the parallel quantity, with the flags set according to the completion time of each column's accumulation;
in step five, after steps one through four are completed, the output buffer stores data according to the systolic array's output data, the out_enable signal, and the parallel quantity.
6. The method with configurable convolution kernel size and parallel calculation number according to claim 5, characterized in that: in step two, if the convolution kernel size is 2 and the number of parallel convolutions is 21, the systolic array is arranged as shown in the bottom-right diagram of fig. 3. When the input feature map flows into the systolic array, the convolution operations of the color-coded PEs in rows 1 and 2 start in the 2nd cycle, those in rows 3 and 4 start in the 4th cycle, and those in rows 5 and 6 start in the 6th cycle; the operations in rows 1 and 2 complete at the 5th-from-last cycle, those in rows 3 and 4 at the 3rd-from-last cycle, and those in rows 5 and 6 at the last cycle.
7. The method with configurable convolution kernel size and parallel calculation number according to claim 5, characterized in that: in step four, whether the output signal is valid is marked jointly by ovalid[2:0], kernel_size[1:0] and kernel_num[4:0].
CN202310150887.1A 2023-02-22 2023-02-22 Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number Pending CN116090530A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310150887.1A CN116090530A (en) 2023-02-22 2023-02-22 Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310150887.1A CN116090530A (en) 2023-02-22 2023-02-22 Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number

Publications (1)

Publication Number Publication Date
CN116090530A 2023-05-09

Family

ID=86199178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310150887.1A Pending CN116090530A (en) 2023-02-22 2023-02-22 Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number

Country Status (1)

Country Link
CN (1) CN116090530A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116980277A (en) * 2023-09-18 2023-10-31 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN116980277B (en) * 2023-09-18 2024-01-12 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN108681984B (en) Acceleration circuit of 3*3 convolution algorithm
CN106875011B (en) Hardware architecture of binary weight convolution neural network accelerator and calculation flow thereof
CN109284817B (en) Deep separable convolutional neural network processing architecture/method/system and medium
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN108629406B (en) Arithmetic device for convolutional neural network
CN112836813B (en) Reconfigurable pulse array system for mixed-precision neural network calculation
CN109284824A (en) A kind of device for being used to accelerate the operation of convolution sum pond based on Reconfiguration Technologies
CN116090530A (en) Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
CN112905530A (en) On-chip architecture, pooled computational accelerator array, unit and control method
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
Xiao et al. FPGA-based scalable and highly concurrent convolutional neural network acceleration
CN113222129B (en) Convolution operation processing unit and system based on multi-level cache cyclic utilization
CN107368459B (en) Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication
CN111610963B (en) Chip structure and multiply-add calculation engine thereof
CN201111042Y (en) Two-dimension wavelet transform integrate circuit structure
CN111222090B (en) Convolution calculation module, neural network processor, chip and electronic equipment
CN115293978A (en) Convolution operation circuit and method, image processing apparatus
KR20240035999A (en) Hybrid machine learning architecture using neural processing units and compute-in-memory processing elements
Yang et al. A Parallel Processing CNN Accelerator on Embedded Devices Based on Optimized MobileNet
CN113592067B (en) Configurable convolution calculation circuit for convolution neural network
CN113869507B (en) Neural network accelerator convolution calculation device and method based on pulse array
CN112230884B (en) Target detection hardware accelerator and acceleration method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination