CN117289897A - Adaptive dual-frequency multiply-add array for a neural network accelerator

Adaptive dual-frequency multiply-add array for a neural network accelerator

Info

Publication number: CN117289897A (application CN202311576340.4A; granted as CN117289897B)
Authority: CN (China)
Prior art keywords: calculation, weight, input image, module, frequency
Legal status: Granted; active
Other languages: Chinese (zh)
Inventors: 张�浩, 汪粲星, 李泽钜
Original and current assignee: Nanjing Magnichip Microelectronics Co., Ltd.
Application filed by Nanjing Magnichip Microelectronics Co., Ltd.; priority to CN202311576340.4A
Filing date: 2023-11-24; publication of CN117289897A: 2023-12-26; grant of CN117289897B: 2024-04-02

Classifications

    • G06F7/5443 Sum of products (evaluation of functions by calculation)
    • G06N3/063 Physical realisation of neural networks, i.e. hardware implementation, using electronic means
    • G06T1/20 Processor architectures; processor configuration, e.g. pipelining
    • G06T1/60 Memory management
    • G06T2200/28 Indexing scheme for image data processing involving image processing hardware
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an adaptive dual-frequency multiply-add array for a neural network accelerator. It introduces a dual-stationary dataflow scheduling strategy (output-stationary plus partially weight-stationary), so that weight data can be reused further, the bandwidth pressure on the weight SRAM is relieved, the array frequency can exceed the system frequency, and higher computational energy efficiency is achieved. The adaptive dual-frequency array greatly reduces repeated memory reads and writes: the frequency of the computing array is adjusted dynamically through the input image and weight buffer and the frequency adjustment module, while a state machine controls the computation of the whole array, greatly increasing array throughput with logic that is clear, simple, and effective. The design is highly modular, with clear dependencies between modules, a simple structure, and good feasibility; it achieves a better pressure balance between memory and the computing array and improves the energy efficiency of the array.

Description

Adaptive dual-frequency multiply-add array for a neural network accelerator
Technical Field
The invention relates to an adaptive dual-frequency multiply-add array for a neural network accelerator, and belongs to the technical field of neural networks.
Background
In recent years, neural networks have been widely used in fields such as image recognition, speech recognition, and natural language processing. With the rapid development of neural network technology, the limited frequency of the computing array and the limited memory bandwidth in a neural network accelerator constrain its operating efficiency and resource utilization.
Unlike conventional computer architectures, neural network computing arrays typically use a large number of parallel computing units to process input data, and their operating speed is limited by the hardware design and the way the data streams are scheduled. Meanwhile, the storage units that hold the weights, biases, and other parameters of the neural network are usually implemented with caches or memory, whose read/write speed is limited by device physics and hardware design, so their bandwidth is bounded. Training and inference require a large number of parameter reads and writes, and if the bandwidth of the storage unit is insufficient, the computation of the neural network slows down. Moreover, as neural networks keep growing in size, the bandwidth bottleneck of the storage unit becomes increasingly significant.
In neural network accelerators, data in a conventional dataflow moves between computing units in a fixed pattern. Typically, the computing units demand more data than the storage units can deliver, creating a load imbalance between computation and storage: the actual working frequency of the computing array falls far below its theoretical maximum, resource redundancy in the front-end design becomes severe, and chip efficiency drops.
To solve this problem, researchers have proposed many optimization methods, such as rearranging the dataflow, optimizing memory access patterns, distributed computing, and in-memory/near-memory computing. Dataflow rearrangement and memory-access optimization work in the spatial dimension: they usually adopt larger input image and weight buffers and place as much data as possible in faster storage media, so that the storage bandwidth satisfies the amount of data the array needs per computation; or they adopt a better dataflow so that the array achieves higher utilization, relieving the frequency-induced computational bottleneck through the spatial utilization of the array. Distributed computing and in-memory/near-memory computing optimize in the time dimension, shortening the data read-in path as much as possible by placing computing units directly next to storage units so that the array can run at a higher frequency. However, these schemes yield suboptimal solutions: optimizing the spatial dimension often degrades the time dimension, so the array cannot reach its theoretical top speed, while optimizing the time dimension neglects the utilization of the array in space.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide an adaptive dual-frequency multiply-add array for a neural network accelerator that improves on the output-stationary dataflow to form a dual-stationary dataflow scheduling strategy, applicable to any neural network accelerator that supports an output-stationary dataflow, thereby further raising the frequency of the computing array and reducing the bottleneck caused by the bandwidth of the weight memory.
The invention adopts the following technical scheme for solving the technical problems:
An adaptive dual-frequency multiply-add array for a neural network accelerator comprises a state machine control module, an address generation module, a data multiplexing calculation module, an input image and weight buffer, a frequency adjustment module, a calculation module, an output data buffer, and an output data shaping module. The address generation module, the data multiplexing calculation module, the input image and weight buffer, and the frequency adjustment module are each connected to the state machine control module; the data multiplexing calculation module, the frequency adjustment module, the calculation module, the output data buffer, and the output data shaping module are connected in sequence; the data multiplexing calculation module is also connected to the input image and weight buffer; the input image and weight buffer is connected to the calculation module; and the address generation module is connected to the data multiplexing calculation module.
The state machine control module receives external instructions, reads the current neural network parameters, decodes them, and sends the decoded parameters to the address generation module and the data multiplexing calculation module.
The address generation module receives the neural network parameters sent by the state machine control module, generates the storage addresses of the original input image of the neural network, receives the pre-trained weights and writes them to the corresponding locations, calculates the input image addresses and weight addresses currently participating in the calculation, generates the corresponding read/write signals, and reads the data into the data multiplexing calculation module.
The data multiplexing calculation module receives the neural network parameters sent by the state machine control module and, in combination with the hardware parameters of the neural network, calculates the data multiplexing counts, i.e., the numbers of times the input image and the weights currently participating in the calculation must be reused; it sends these input image and weight multiplexing counts to the frequency adjustment module and writes the input image and weights currently participating in the calculation into the input image and weight buffer.
The frequency adjustment module receives the input image and weight multiplexing counts sent by the data multiplexing calculation module and, upon receiving the frequency configuration instruction sent by the state machine control module, dynamically adjusts the frequency of the calculation module; after the frequency adjustment is completed, it sends an adjustment-completion flag to the state machine control module, which, upon receiving the flag, directs the input image and weight buffer to send the input image and weights currently participating in the calculation to the calculation module.
The calculation module receives the input image and weights currently participating in the calculation, completes the two-dimensional convolution or fully connected calculation, and outputs the result to the output data buffer; the result is output after being shaped by the output data shaping module.
As a preferred scheme of the invention, the working states of the state machine control module comprise an idle mode and a working mode, where the working mode includes a neural network parameter reading stage, an input image and weight loading stage, a frequency configuration stage, and a calculation stage; the state machine control module toggles between the idle mode and the working mode.
As a preferred scheme of the invention, the address generation module comprises an input image read-address unit, an input image write-address unit, a weight-data read-address unit, and a weight-data write-address unit. The input image read-address unit calculates the input image addresses currently participating in the calculation; the input image write-address unit generates the storage addresses of the original input image of the neural network; the weight-data read-address unit calculates the weight addresses currently participating in the calculation; and the weight-data write-address unit receives the pre-trained weights and writes them to the corresponding locations.
As a preferred scheme of the invention, the data multiplexing calculation module comprises a neural network parameter receiving unit, a multiplexing-count calculation unit, an input image and weight shaping unit, and an input image and weight writing unit, where the multiplexing-count calculation unit comprises an input image multiplexing-count calculation unit and a weight multiplexing-count calculation unit.
The neural network parameter receiving unit receives the neural network parameters sent by the state machine control module and decodes them to obtain the horizontal stride, vertical stride, convolution kernel height, convolution kernel width, input image height, input image width, number of input image channels, and number of output image channels of the neural network. The input image multiplexing-count calculation unit and the weight multiplexing-count calculation unit respectively calculate, in combination with the hardware parameters of the neural network, the multiplexing counts required by the input image and the weights currently participating in the calculation. The input image and weight shaping unit receives the results of the two multiplexing-count calculation units, calculates the specific addresses at which the input image and weights participating in the calculation should be written into the input image and weight buffer, and sends these addresses to the input image and weight writing unit. The input image and weight writing unit judges whether the input image and weight buffer is in a writable state and, when it is, writes the input image and weights to the specific addresses in the buffer.
As a preferred scheme of the invention, the frequency adjustment module dynamically adjusts the frequency of the calculation module, specifically as follows:
If the single-pass parallelism of the calculation module cannot cover the whole input image and the convolution kernel size and stride produce overlap between convolution regions, that is, the multiplexing counts required by the input image and weights currently participating in the calculation are both nonzero, the frequency of the calculation module is adjusted to twice the frequency of the input image and weight buffer. If the single-pass parallelism of the calculation module can cover the whole input image, or the kernel size and stride leave adjacent convolution regions without overlap, that is, the multiplexing count required by the input image or the weights currently participating in the calculation is 0, the frequency of the calculation module is kept consistent with the frequency of the input image and weight buffer.
As a preferred scheme of the invention, the calculation module comprises a three-dimensional vector processing array composed of 4 rows and 16 columns of intelligent computing vector processing units; through this array the calculation module performs vector multiply-add operations in parallel.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
1. The computing array and the on-chip memory operate at different frequencies, greatly relieving the computational bottleneck caused by on-chip storage and improving chip performance and energy efficiency, while balancing the utilization of the array in the time dimension without sacrificing the spatial utilization of the neural network accelerator.
2. The hardware parameters are highly adjustable and general: the hardware size can be fully tuned to the complexity of the actual application scenario without affecting functionality.
3. The design is highly modular, with clear dependencies between the different modules and a simple scheduling scheme, reducing the area overhead of the control module.
Drawings
FIG. 1 is a block diagram of the adaptive dual-frequency multiply-add array for a neural network accelerator according to the invention;
FIG. 2 shows the operating states of the state machine control module of the invention;
FIG. 3 is a block diagram of the data multiplexing calculation module and the frequency adjustment module of the invention;
FIG. 4 is a schematic diagram of the four-row, sixteen-column array.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.
The invention provides an adaptive dual-frequency multiply-add array for neural networks that determines the dataflow scheduling scheme of the current network by analyzing the current input characteristics and automatically selects the frequency relationship between the computing array and the other modules, reducing the limit that storage in the neural network imposes on the frequency of the computing array and thereby further improving its performance.
As shown in FIG. 1, the adaptive dual-frequency multiply-add array for a neural network accelerator comprises a state machine control module, an address generation module, a data multiplexing calculation module, an input image and weight buffer, a frequency adjustment module, a calculation module, an output data buffer, and an output data shaping module.
The state machine control module receives and decodes instructions input from outside; the instruction content specifically comprises the base address of the input image, the base address of the weight data, and the neural network parameters. The state machine control module is connected to all the remaining modules. The address generation module receives instructions from the state machine control module, calculates the addresses of the neural network weight parameters and of the input image, and generates read/write signals. The data multiplexing calculation module receives instructions from the state machine and calculates how many more times the current input image and weights will be used; the input image and weight buffer stores the input image and weights prepared by the data multiplexing calculation module, and the state machine decides when they are taken out. The frequency adjustment module receives signals from the state machine control module and the data multiplexing calculation module and dynamically adjusts the frequency of the calculation module. The calculation module performs two-dimensional convolution and fully connected calculations on the input image and weights received from the input image and weight buffer and, under the control of the state machine control module, selects whether to output data to the output data buffer; the output data shaping module further arranges the data and outputs it to subsequent modules for further operations.
As shown in fig. 2, the operating states of the state machine control module are divided into an idle mode and a working mode. The working mode comprises a neural network parameter reading stage, an input image and weight loading stage, a frequency configuration stage, and a calculation stage; the calculation stage comprises a calculation-in-progress phase and a calculation-idle phase. The frequency configuration stage comprises a neural network parameter pre-reading phase, a multiplexing-degree calculation phase, and a frequency configuration phase. The calculation stage covers convolution calculation and fully connected calculation; the idle mode of the state machine control module is a wait-for-command state.
The state machine control module toggles between the idle mode and the working mode. The idle mode is the standby state of the state machine control module; the working mode is its state while a two-dimensional convolution or fully connected calculation is being performed.
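For illustration, this mode and stage structure can be modeled as a small state machine in software. The following Python sketch uses the state names from the description, while the transition signals (cmd_pending, freq_ready, layer_done) are assumed stand-ins for the actual control signals, which the text does not name:

```python
from enum import Enum, auto

class CtrlState(Enum):
    """States of the state machine control module, as named in the description."""
    IDLE = auto()         # standby, waiting for an external command
    READ_PARAMS = auto()  # neural network parameter reading stage
    LOAD_DATA = auto()    # input image and weight loading stage
    CONFIG_FREQ = auto()  # frequency configuration stage
    COMPUTE = auto()      # calculation stage (convolution / fully connected)

def next_state(state: CtrlState, cmd_pending: bool,
               freq_ready: bool, layer_done: bool) -> CtrlState:
    """A plausible transition function; the exact trigger signals are assumptions."""
    if state is CtrlState.IDLE:
        return CtrlState.READ_PARAMS if cmd_pending else CtrlState.IDLE
    if state is CtrlState.READ_PARAMS:
        return CtrlState.LOAD_DATA
    if state is CtrlState.LOAD_DATA:
        return CtrlState.CONFIG_FREQ
    if state is CtrlState.CONFIG_FREQ:
        # wait for the frequency adjustment module's completion flag
        return CtrlState.COMPUTE if freq_ready else CtrlState.CONFIG_FREQ
    # COMPUTE: return to IDLE once the current layer finishes
    return CtrlState.IDLE if layer_done else CtrlState.COMPUTE
```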
The address generation module is divided, according to the objects it serves and preset logical relations, into an input image read-address unit, an input image write-address unit, a weight-data read-address unit, and a weight-data write-address unit. These subunits first obtain the base addresses of the input image and weight data and the network parameters from the state machine control module, then perform address conversion by calculation and read the currently required data from the input image and weight buffer.
The subunits of the address generation module function as follows. Each receives the base address and the various neural network parameters given by the state machine control module, namely the length, width, and number of channels of the input image. The input image read-address unit uses these parameters to calculate the addresses of the input image data that must participate in each calculation. The input image write-address unit generates the addresses where the original input image is stored. The weight-data read-address unit generates the weight addresses needed for each calculation. The weight-data write-address unit receives the trained weights and writes them to the corresponding locations. When the state machine control module is in the data and weight preloading stage or the calculation stage of the working mode, the write units stand by and the read units work; when it enters the calculation-ending stage, the write units work and the read units stand by. These units cooperate to manage the access addresses for the state machine control module, avoiding mismatches between data fetch and storage during calculation.
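As an illustration of the address conversion performed by the read-address units, the following sketch computes the read addresses of the pixels feeding one convolution window. It assumes a row-major, single-channel layout addressed in pixel units, which the text does not specify; the function name is hypothetical:

```python
def input_read_addresses(base, img_h, img_w, kernel_h, kernel_w,
                         stride_h, stride_w, out_row, out_col):
    """Read addresses of the input pixels feeding one convolution window.

    A sketch only: assumes a row-major layout for a single channel,
    addressed in units of one pixel, starting from the base address
    delivered by the state machine control module.
    """
    row0 = out_row * stride_h   # top-left corner of the window
    col0 = out_col * stride_w
    return [base + (row0 + kr) * img_w + (col0 + kc)
            for kr in range(kernel_h) for kc in range(kernel_w)]

# e.g. the 3x3 window at output position (1, 2) of an 8x8 image, stride 1:
# input_read_addresses(0, 8, 8, 3, 3, 1, 1, 1, 2)
```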
The data multiplexing calculation module and the frequency adjustment module, shown in fig. 3, are the core configuration modules that realize dynamic frequency adjustment in the invention, i.e., the modules that directly adjust the frequency of the calculation module.
Since the data multiplexing calculation and the decisions of the frequency adjustment module depend on the size of the computing array, the array itself is described first; it is depicted in fig. 4.
Here the computing array under the calculation module is set to 4 rows and 16 columns; each point in the array is a neuron. Each intelligent computing vector processing unit consists of a number of multipliers of a given bit width; here each unit is set to consist of 16 multipliers of 8-bit width, meaning that one neuron can directly process 16 pieces of 8-bit data. The 16 products are summed by an adder tree to obtain a partial sum, which is stored inside the unit to await output or the next addition.
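As a concrete illustration, the following sketch models one cycle of such a unit in Python with NumPy; the function name and the 32-bit accumulation are assumptions, since the text does not specify the accumulator width:

```python
import numpy as np

def vector_pe_step(acc: int, pixels: np.ndarray, weights: np.ndarray) -> int:
    """One cycle of an intelligent computing vector processing unit (a model).

    16 signed 8-bit pixel/weight pairs are multiplied in parallel, the 16
    products are reduced by an adder tree, and the result is accumulated
    into the unit's partial sum, which awaits output or the next addition.
    The int32 accumulator here is an assumption about the hardware width.
    """
    assert pixels.shape == weights.shape == (16,)
    products = pixels.astype(np.int32) * weights.astype(np.int32)  # 16 multipliers
    return acc + int(products.sum())                               # adder tree + accumulate

# 16 int8 inputs, e.g. one pixel position across 16 input channels:
pixels  = np.random.randint(-128, 128, 16, dtype=np.int8)
weights = np.random.randint(-128, 128, 16, dtype=np.int8)
partial_sum = vector_pe_step(0, pixels, weights)
```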
The data multiplexing calculation module comprises a neural network parameter receiving unit, a multiplexing-count calculation unit, an input image and weight shaping unit, and an input image and weight writing unit; the multiplexing-count calculation unit comprises an input image multiplexing-count calculation unit and a weight multiplexing-count calculation unit.
The data multiplexing calculation module calculates the multiplexing counts of the current input image and weights. The neural network parameter receiving unit decodes the neural network instruction sent by the state machine to obtain the corresponding parameters, specifically: the horizontal stride, vertical stride, convolution kernel height, convolution kernel width, input image height, input image width, number of input image channels, and number of output image channels of the neural network. After this data is decoded, the multiplexing-count calculation unit calculates, from the current array size and the data taken from the input image and weight buffer, how many times the current input image and weight data will be reused. The input image and weight shaping unit receives the result of the multiplexing-count calculation unit and calculates the specific addresses at which the current input image and weights should be written into the input image and weight buffer; these addresses are received by the input image and weight writing unit, which judges whether the buffer is currently writable and, if so, writes the data into it.
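The text does not give the exact multiplexing-count formulas, so the following sketch only illustrates the two conditions it describes: input reuse when convolution windows overlap and the image exceeds one pass of the array, and weight reuse when the same weights serve several batches of convolution regions. All formulas and names here are illustrative assumptions, not the patented calculation:

```python
def reuse_counts(img_h, img_w, k_h, k_w, s_h, s_w, array_regions=4):
    """Rough model of the multiplexing-count calculation (assumed formulas).

    Input-image reuse exists when adjacent convolution windows overlap
    (kernel larger than stride) and one pass of the array cannot cover
    the whole image; weight reuse exists when the resident weights serve
    more convolution regions than the array processes at once.
    """
    out_h = (img_h - k_h) // s_h + 1        # output rows (no padding assumed)
    out_w = (img_w - k_w) // s_w + 1        # output columns
    regions = out_h * out_w                 # total convolution regions

    windows_overlap = k_h > s_h or k_w > s_w
    image_covered = regions <= array_regions

    # extra times a fetched input pixel is reused across overlapping windows
    input_reuse = 0 if (image_covered or not windows_overlap) \
        else (k_h // s_h) * (k_w // s_w) - 1
    # extra times the resident weights are reused across region batches
    weight_reuse = max(0, -(-regions // array_regions) - 1)  # ceil div - 1
    return input_reuse, weight_reuse
```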
The input image and weight buffer consists of two autonomously designed register files, each able to write four data and read four data simultaneously, plus several single-depth, multi-byte-wide registers. The two register files form a ping-pong structure, which hides the time the computing array would otherwise spend waiting for a register file to load data. The single-depth, multi-byte-wide registers are each responsible for buffering a weight. Unlike the input buffers, these registers trade a larger area for a maximum clock equal to that of the overall system, so reads from them are not subject to a lower maximum-frequency limit.
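A minimal software model of the ping-pong structure is sketched below; the class, method names, and unbounded depth are placeholders, not the autonomously designed register file itself:

```python
class PingPongBuffer:
    """Two banks alternate roles: the computing array reads one bank
    while the other is being refilled, hiding the load latency."""

    def __init__(self):
        self.banks = [[], []]
        self.write_bank = 0  # index of the bank currently being filled

    def write4(self, four_words):
        """Write four data words at once into the filling bank."""
        self.banks[self.write_bank].extend(four_words)

    def swap(self):
        """Flip roles once the filling bank has been loaded."""
        self.write_bank ^= 1
        self.banks[self.write_bank].clear()

    def read4(self, offset):
        """Read four data words from the bank facing the computing array."""
        bank = self.banks[self.write_bank ^ 1]
        return bank[offset:offset + 4]
```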
The frequency adjustment module receives the calculated multiplexing counts from the data multiplexing calculation module and dynamically adjusts the frequency for the current layer. It has two selectable gears. If the current input image is large enough that the single-pass parallelism of the calculation module cannot cover the whole image, and the convolution kernel size and stride produce overlap between convolution regions, the data does not need to be replaced frequently: data fetched from the input image and weight buffer can be used by the calculation module many times, so the computing array can run at twice the frequency of the buffer. If the current input image is small enough that the single-pass parallelism of the calculation module already covers it, or the kernel size and stride leave adjacent convolution regions without overlap, the data must be replaced frequently, and the computing array runs at the same frequency as the input image and weight buffer so that the array receives valid data every cycle.
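The two gears translate directly into a small selection rule; the following sketch restates it (the function name and the MHz unit are illustrative):

```python
def select_array_clock(input_reuse: int, weight_reuse: int,
                       buffer_freq_mhz: float) -> float:
    """The two gears of the frequency adjustment module: when both the
    input image and the weights will be reused (both counts nonzero), a
    fetched operand feeds more than one cycle of computation, so the
    computing array can run at twice the buffer frequency; otherwise it
    must stay at the buffer frequency so every cycle sees valid data."""
    if input_reuse != 0 and weight_reuse != 0:
        return 2.0 * buffer_freq_mhz
    return buffer_freq_mhz
```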
The calculation module comprises a three-dimensional vector processing array built from intelligent computing vector processing units. Each unit contains a number of basic multipliers and an adder tree; both the basic bit width of the multipliers and their number are configurable. In the invention, the basic internal bit width of each intelligent computing vector processing unit is set to 8 and the number of multipliers to 16, so 16 multiply-adds on 8-bit data can be processed simultaneously. With this configuration, a unit processes the multiplications of 16 input channels at a time, sums the 16 products through the adder tree, and stores the result in the data buffer inside the unit.
When the calculation module performs convolution or fully connected calculation, each intelligent computing vector processing unit represents an actual neuron, and the units can be arranged differently according to the scale and characteristics of the network to form three-dimensional vector processing arrays of different sizes. The hardware circuit of the invention takes the intelligent computing vector processing unit as the basic unit and is provided with a three-dimensional vector processing array of 4 rows and 16 columns, supporting at most 4 different convolution regions and 16 different output channels.
The calculation module performs basic two-dimensional neural network convolution by invoking the three-dimensional vector processing array. Each intelligent computing vector processing unit handles the multiplication of two elementary vectors, and the array as a whole realizes a large number of parallel vector multiply-add operations. Results produced by the array are placed in the output data buffer to await addition with the next results; the output data buffer receives signals from the state machine, which decides when its contents are output.
After the output data shaping module receives the signals output by the output data buffer, it further shapes the data, ensuring that the signals are output to subsequent modules in the correct order and that consecutive data remain consistent.
The specific working process of the adaptive dual-frequency multiply-add array for a neural network accelerator provided by the invention is as follows (an illustrative software sketch of the whole flow is given after the steps):
101: The state machine control module reads the neural network parameters, decodes them, and sends them to the address generation module and the data multiplexing calculation module.
102: After receiving the network parameters, the address generation module first generates the storage addresses of the trained weights and the addresses where the original input image is stored. It then calculates the input image data participating in the current calculation, generates the corresponding read/write addresses and signals, reads the data into the data multiplexing calculation module, and likewise generates the read/write addresses and signals of the weight data participating in the current calculation, ready for the next operation.
103: After obtaining the neural network parameter information, the data multiplexing calculation module calculates the data multiplexing counts in combination with the current hardware parameters of the neural network. The multiplexing counts required by the input image and by the weights are determined separately and sent to the frequency adjustment module, and the corresponding input image and weights are written into the input image and weight buffer by the data multiplexing calculation module.
104: After receiving the instruction from the state machine and the input image and weight multiplexing counts calculated by the data multiplexing calculation module, the frequency adjustment module selects the appropriate clock through its internal clock selector: when both the data and weight multiplexing counts are nonzero, the clock frequency of the calculation module is set to twice that of the input image and weight buffer; when either of them is 0, the computing array keeps the same frequency as the input image and weight buffer.
105: After a short wait the clock stabilizes, and the frequency adjustment module returns a completion flag to the state machine; the input image and weights are then read from the input image and weight buffer into the computing array, which completes the corresponding convolution or fully connected calculation.
106: After the calculation is completed, the resulting data is output to the output data buffer and passed on to subsequent modules through the output data shaping module.
107: Steps 101-106 are repeated to complete the calculation of one layer of the neural network.
108: After the calculation of the complete neural network is finished, the array enters a standby state and waits for the next neural network calculation.
The invention provides a novel adaptive dual-frequency multiply-add array that further improves on the output-stationary dataflow, forming a dual-stationary dataflow scheduling strategy applicable to any neural network accelerator that supports an output-stationary dataflow, thereby further raising the frequency of the computing array and reducing the bottleneck caused by the bandwidth of the weight memory. The hardware realizes good modularity and a highly reusable design, keeping the inter-module dependencies and the complexity of the scheduling algorithm well under control in the adaptive configurable dual-frequency array; compared with the direct inter-module communication of traditional dataflows and the hardware sizes their scheduling algorithms require, the invention has good extensibility and is prepared for emerging network structures and hardware scales.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereto, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the present invention.

Claims (6)

1. An adaptive dual-frequency multiply-add array for a neural network accelerator, characterized by comprising a state machine control module, an address generation module, a data multiplexing calculation module, an input image and weight buffer, a frequency adjustment module, a calculation module, an output data buffer, and an output data shaping module; wherein the address generation module, the data multiplexing calculation module, the input image and weight buffer, and the frequency adjustment module are each connected to the state machine control module; the data multiplexing calculation module, the frequency adjustment module, the calculation module, the output data buffer, and the output data shaping module are connected in sequence; the data multiplexing calculation module is also connected to the input image and weight buffer; the input image and weight buffer is connected to the calculation module; and the address generation module is connected to the data multiplexing calculation module;
the state machine control module is configured to receive external instructions, read the current neural network parameters, decode them, and send the decoded parameters to the address generation module and the data multiplexing calculation module;
the address generation module is configured to receive the neural network parameters sent by the state machine control module, generate the storage addresses of the original input image of the neural network, receive the pre-trained weights and write them to the corresponding locations, calculate the input image addresses and weight addresses currently participating in the calculation, generate the corresponding read/write signals, and read the data into the data multiplexing calculation module;
the data multiplexing calculation module is configured to receive the neural network parameters sent by the state machine control module and, in combination with the hardware parameters of the neural network, calculate the data multiplexing counts, including the multiplexing counts required by the input image and weights currently participating in the calculation, send the input image and weight multiplexing counts to the frequency adjustment module, and write the input image and weights currently participating in the calculation into the input image and weight buffer;
the frequency adjustment module is configured to receive the input image and weight multiplexing counts sent by the data multiplexing calculation module, dynamically adjust the frequency of the calculation module after receiving the frequency configuration instruction sent by the state machine control module, and send an adjustment-completion flag to the state machine control module once the frequency adjustment is completed, whereupon the state machine control module directs the input image and weight buffer to send the input image and weights currently participating in the calculation to the calculation module;
the calculation module is configured to receive the input image and weights currently participating in the calculation, complete the two-dimensional convolution or fully connected calculation, and output the result to the output data buffer, the result being output after being shaped by the output data shaping module.
2. The adaptive dual-frequency multiply-add array for a neural network accelerator of claim 1, wherein the working states of the state machine control module comprise an idle mode and a working mode, the working mode comprising a neural network parameter reading stage, an input image and weight loading stage, a frequency configuration stage, and a calculation stage; the state machine control module toggles between the idle mode and the working mode.
3. The adaptive dual-frequency multiply-add array for a neural network accelerator of claim 1, wherein the address generation module comprises an input image read-address unit, an input image write-address unit, a weight-data read-address unit, and a weight-data write-address unit; the input image read-address unit calculates the input image addresses currently participating in the calculation; the input image write-address unit generates the storage addresses of the original input image of the neural network; the weight-data read-address unit calculates the weight addresses currently participating in the calculation; and the weight-data write-address unit receives the pre-trained weights and writes them to the corresponding locations.
4. The adaptive dual-frequency multiply-add array for a neural network accelerator of claim 1, wherein the data multiplexing calculation module comprises a neural network parameter receiving unit, a multiplexing-count calculation unit, an input image and weight shaping unit, and an input image and weight writing unit, the multiplexing-count calculation unit comprising an input image multiplexing-count calculation unit and a weight multiplexing-count calculation unit;
the neural network parameter receiving unit receives the neural network parameters sent by the state machine control module and decodes them to obtain the horizontal stride, vertical stride, convolution kernel height, convolution kernel width, input image height, input image width, number of input image channels, and number of output image channels of the neural network; the input image multiplexing-count calculation unit and the weight multiplexing-count calculation unit respectively calculate, in combination with the hardware parameters of the neural network, the multiplexing counts required by the input image and the weights currently participating in the calculation; the input image and weight shaping unit receives the results calculated by the two multiplexing-count calculation units, calculates the specific addresses at which the input image and weights participating in the calculation should be written into the input image and weight buffer, and sends these addresses to the input image and weight writing unit; and the input image and weight writing unit judges whether the input image and weight buffer is in a writable state and, when it is, writes the input image and weights to the specific addresses in the buffer.
5. The adaptive dual-frequency multiply-add array for a neural network accelerator of claim 1, wherein the frequency adjustment module dynamically adjusts the frequency of the calculation module, specifically as follows:
if the single-pass parallelism of the calculation module cannot cover the whole input image and the convolution kernel size and stride produce overlap between convolution regions, that is, the multiplexing counts required by the input image and weights currently participating in the calculation are both nonzero, the frequency of the calculation module is adjusted to twice the frequency of the input image and weight buffer; if the single-pass parallelism of the calculation module can cover the whole input image, or the kernel size and stride leave adjacent convolution regions without overlap, that is, the multiplexing count required by the input image or the weights currently participating in the calculation is 0, the frequency of the calculation module is kept consistent with the frequency of the input image and weight buffer.
6. The adaptive dual-frequency multiply-add array for a neural network accelerator of claim 1, wherein the calculation module comprises a three-dimensional vector processing array composed of 4 rows and 16 columns of intelligent computing vector processing units, through which the calculation module performs vector multiply-add operations in parallel.
CN202311576340.4A (priority 2023-11-24, filed 2023-11-24) Adaptive dual-frequency multiply-add array for a neural network accelerator; Active; granted as CN117289897B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311576340.4A | 2023-11-24 | 2023-11-24 | Adaptive dual-frequency multiply-add array for a neural network accelerator

Publications (2)

Publication Number | Publication Date
CN117289897A (en) | 2023-12-26
CN117289897B (en) | 2024-04-02

Family

ID=89241004

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311576340.4A (Active, granted as CN117289897B) | Adaptive dual-frequency multiply-add array for a neural network accelerator | 2023-11-24 | 2023-11-24

Country Status (1)

Country | Link
CN (1) | CN117289897B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
KR20210064044A * | 2019-11-25 | 2021-06-02 | 울산과학기술원 | Apparatus and method for performing artificial neural network inference in mobile terminal
US20220188613A1 * | 2020-12-15 | 2022-06-16 | The George Washington University | SGCNAX: a scalable graph convolutional neural network accelerator with workload balancing
CN116702851A * | 2023-06-27 | 2023-09-05 | 中科南京智能技术研究院 | Systolic array unit and systolic array structure suitable for weight-multiplexing neural networks
CN116840821A * | 2023-09-01 | 2023-10-03 | 无锡市海鹰加科海洋技术有限责任公司 | Dual-frequency sounding control system based on data analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
周超; 王跃科; 乔纯捷; 戴卫华: "Adaptive notch interference suppression for complex signals in GNSS receivers" (全球导航卫星系统接收机的复信号自适应陷波干扰抑制), Journal of National University of Defense Technology (国防科技大学学报), no. 05 *
王娟; 杜雄; 周雒维: "Influence of dual-frequency Buck converter parameters on system performance" (双频Buck变换器参数对系统性能的影响), Power Supply World (电源世界), no. 05 *

Also Published As

Publication number | Publication date
CN117289897B (en) | 2024-04-02

Similar Documents

Publication Publication Date Title
US20210224125A1 (en) Operation Accelerator, Processing Method, and Related Device
CN110348574B (en) ZYNQ-based universal convolutional neural network acceleration structure and design method
CN101331464A (en) Storage region allocation system, method, and control device
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN110738308B (en) Neural network accelerator
CN111832718B (en) Chip architecture
CN113220630B (en) Reconfigurable array optimization method and automatic optimization method for hardware accelerator
KR102545658B1 (en) Apparatus and Method for Convolutional Neural Network Quantization Inference
CN111931909B (en) Lightweight convolutional neural network reconfigurable deployment method based on FPGA
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
CN111353586A (en) System for realizing CNN acceleration based on FPGA
JP2023178385A (en) Information processing apparatus and information processing method
CN113065643A (en) Apparatus and method for performing multi-task convolutional neural network prediction
CN113157638B (en) Low-power-consumption in-memory calculation processor and processing operation method
CN117289897B (en) Self-adaptive double-frequency multiply-add array suitable for neural network accelerator
Voss et al. Convolutional neural networks on dataflow engines
CN114912596A (en) Sparse convolution neural network-oriented multi-chip system and method thereof
Xiao et al. FGPA: Fine-grained pipelined acceleration for depthwise separable CNN in resource constraint scenarios
CN111191780A (en) Average value pooling accumulation circuit, device and method
CN115719088B (en) Intermediate cache scheduling circuit device supporting in-memory CNN
US11886973B2 (en) Neural processing unit including variable internal memory
CN115965067B (en) Neural network accelerator for ReRAM
CN115312095B (en) In-memory computation running water multiplication and addition circuit supporting internal data updating
US20230385622A1 (en) Neural processing unit including internal memory having scalable bandwidth and driving method thereof
US20230033179A1 (en) Accumulator and processing-in-memory (pim) device including the accumulator

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant