CN113762480B - Time sequence processing accelerator based on one-dimensional convolutional neural network - Google Patents


Info

Publication number
CN113762480B
CN113762480B (application CN202111065987.1A)
Authority
CN
China
Prior art keywords
processing module
data
register
row
reg
Prior art date
Legal status
Active
Application number
CN202111065987.1A
Other languages
Chinese (zh)
Other versions
CN113762480A (en
Inventor
刘冬生
朱令松
陆家昊
胡昂
魏来
成轩
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202111065987.1A
Publication of CN113762480A (application publication)
Application granted
Publication of CN113762480B (granted publication)


Classifications

    • G06N 3/045 (neural network architectures; combinations of networks)
    • G06F 7/50 (adding; subtracting)
    • G06F 7/523 (multiplying only)
    • G06N 3/061 (physical realisation of neural networks using biological neurons)
    • G06N 5/04 (inference or reasoning models)
    • Y02D 10/00 (energy efficient computing, e.g. low power processors, power management or thermal management)


Abstract

The invention discloses a time-series processing accelerator based on a one-dimensional convolutional neural network, belonging to the fields of artificial intelligence and integrated-circuit design. The accelerator comprises an input processing module containing N rows of register groups, where the first-row register group holds N registers and each subsequent row holds one register fewer. Under the cooperative control of a global control module, each item of inference data enters through register reg_1N, flows transversely through the first-row register group, and is output through register reg_11; the data in the first-row register group also flow longitudinally through the lower register rows and are output through registers reg_nn, n = 2, 3, ..., N. A convolution operation array performs convolution and activation on the data output by the input processing module; a pooling processing module pools the activation results and outputs them; and a fully connected processing module performs fully connected addition on the activation results and outputs them. The accelerator multiplexes the inference data, reduces the amount of inference-data movement, and improves the operating efficiency and configurability of the network.

Description

Time sequence processing accelerator based on one-dimensional convolutional neural network
Technical Field
The invention belongs to the fields of artificial intelligence and integrated-circuit design, and particularly relates to a time-series processing accelerator based on a one-dimensional convolutional neural network.
Background
Convolutional neural networks (CNNs) offer high prediction accuracy, good classification performance, and wide tolerance to data sets, and in recent years have been widely applied to recognition and classification tasks such as image recognition, speech recognition, text recognition, and information classification. As convolutional neural networks grow in scale and depth, they bring massive parallel multiply-add operations and massive data-movement operations, which have become major factors limiting the operating efficiency of convolutional neural networks.
Time-series problems such as electrocardiogram (ECG) classification and electroencephalogram (EEG) recognition can be handled well by one-dimensional convolutional neural network model algorithms. These tasks typically require continuous reading, computation, and output over long periods, placing high demands on the operating efficiency and power consumption of a neural-network accelerator. However, the computing power of existing embedded central-processor devices falls far short of supporting such workloads; against this background, convolutional neural network accelerators based on hardware circuits have emerged.
Accelerating convolutional neural network operations with hardware circuits is a current research direction. Existing hardware-based convolutional neural network accelerators are designed for two-dimensional convolutional neural networks; when applied directly to one-dimensional workloads such as ECG and EMG processing, they cannot fully exploit their high parallelism and efficiency.
Disclosure of Invention
In view of the shortcomings of the prior art and the need for improvement, the invention provides a time-series processing accelerator based on a one-dimensional convolutional neural network, aiming to improve the operating efficiency, configurability, and energy-efficiency ratio of one-dimensional convolutional neural networks.
To achieve the above purpose, the invention provides a time-series processing accelerator based on a one-dimensional convolutional neural network, comprising an input processing module, a convolution operation array, a pooling processing module, a fully connected processing module, and a global control module. The input processing module comprises N rows of register groups: the first-row register group holds N registers, each subsequent row holds one register fewer, the registers in the first-row register group are connected in sequence, and the register groups are connected column by column between rows, where N is the size of a convolution kernel in the convolution operation array. Under the cooperative control of the global control module, each item of inference data enters through register reg_1N, flows transversely through the first-row register group, and is output through register reg_11; the data in the first-row register group flow longitudinally through the lower register rows and are output through registers reg_nn, n = 2, 3, ..., N, where register reg_ij denotes the register in column N-j+1 of the i-th row register group. The convolution operation array performs convolution and activation on the data output by the input processing module; the pooling processing module pools the activation results and outputs them; and the fully connected processing module performs fully connected addition on the activation results and outputs them.
Still further, each convolution kernel in the time-series processing accelerator comprises N multipliers and N+1 adders. The two inputs of the i-th multiplier are the data in register reg_ii and the corresponding convolution-kernel weight, i = 1, 2, ..., N. The two inputs of the first adder are the convolution-kernel bias and the partial convolution sum at the corresponding position from the previous round's input feature map; the two inputs of the k-th adder are the output of the (k-1)-th adder and the output of the (k-1)-th multiplier, k = 2, 3, ..., N+1; and the output of the (N+1)-th adder is the partial convolution sum at the corresponding position of this round's input feature map.
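As a reading aid (not part of the claim language), the adder chain can be summarized in one update. Here S_t^(r) denotes the partial sum at position t after round r, b the convolution-kernel bias, and w_i the i-th kernel weight; pairing register reg_ii with weight w_i follows the embodiment of fig. 3 and is an interpretation, not claim text:

```latex
S_t^{(r)} \;=\; \underbrace{S_t^{(r-1)} + b}_{\text{adder } 1}
\;+\; \sum_{i=1}^{N} \underbrace{w_i \cdot \mathrm{reg}_{ii}}_{\text{adder } i+1}
```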
Still further, the accelerator comprises a first multiplexer. The inputs of the first multiplexer are connected to the outputs of the second through N-th adders; its FcMode port receives a mode-control signal s from the global control module and gates out the output of the corresponding s-th adder, s = 2, 3, ..., N. When the first multiplexer receives the mode-control signal s, the convolution kernel operates in fully-connected-layer mode; otherwise it operates in convolution mode.
Still further, the accelerator comprises an inference data storage module for storing the inference data originally input to the time-series processing accelerator and the inference data output by the convolution operation array, the pooling processing module, and the fully connected processing module, and for outputting the stored inference data to the input processing module.
Further, the inference data storage module comprises several partitions for storing different types of inference data. The input processing module further comprises a second multiplexer, which selects either the inference data in the corresponding partition or a zero-padding sequence according to the current layer state of the time-series processing accelerator and feeds it into the input processing module.
Still further, the accelerator comprises a convolution-kernel weight storage module and a convolution-kernel bias storage module, which store, respectively, the convolution-kernel weights and biases used in each layer's convolution operations.
Still further, the accelerator comprises a reset port, an enable port, and an output port. When the enable port receives a continuous enable signal and the reset port receives a reset signal lasting two or more clock cycles, the time-series processing accelerator performs one-dimensional convolutional neural network inference; after inference completes, the output port outputs a high-level pulse to indicate completion.
Furthermore, two signal lines connect the global control module and the pooling processing module; through them the global control module outputs a max-pooling-layer flag or a global-average-pooling-layer flag to the pooling processing module, controlling its pooling operation mode.
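As an illustrative behavioural sketch (the flag names below are assumptions, not taken from the patent), the two flag lines select between the two pooling modes like this:

```python
def pool(window, max_flag, gap_flag):
    """Pooling-module sketch: the global controller raises exactly one of
    two flag lines to pick max pooling or global average pooling.
    Flag names are hypothetical."""
    if max_flag:
        return max(window)                 # max-pooling layer
    if gap_flag:
        return sum(window) / len(window)   # global-average-pooling layer
    raise ValueError("no pooling mode selected")

print(pool([1, 5, 3, 2], max_flag=True, gap_flag=False))   # 5
print(pool([1, 5, 3, 2], max_flag=False, gap_flag=True))   # 2.75
```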
In general, the technical solutions conceived by the present invention can achieve the following beneficial effects:
(1) A new register-group structure for the input processing module supports transverse and longitudinal data flow in parallel, forming a new data-stream processing mode that generates a pipelined data stream for the convolution operation units. This stream keeps the multiply and add resources in the convolution multiply-add units busy simultaneously, markedly reducing the need for extra adder circuits; at the same time it multiplexes the inference data, reducing the overhead of inference-data movement;
(2) In the convolution operation array, the addition of the previous round's partial sums and the convolution-kernel bias proceeds concurrently with the current round's convolution, so the operation of the whole convolution kernel requires only one extra adder, further reducing the adder count;
(3) A multiplexer connected to the output of each multiply-add unit in the convolution kernel lets the convolution operation array also perform the multiplications of a fully connected layer, by selecting and outputting the result of the corresponding multiply-add unit, avoiding the resource overhead of a separate multiplication array.
Drawings
FIG. 1 is a schematic diagram of a time-series processing accelerator based on a one-dimensional convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a schematic circuit diagram of an input processing module according to an embodiment of the present invention;
FIG. 3 is a schematic circuit diagram of a convolution operation array according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the inference workflow of a time-series processing accelerator based on a one-dimensional convolutional neural network according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and embodiments, in order to make its objects, technical solutions, and advantages more apparent. It should be understood that the specific embodiments described here are illustrative only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with one another as long as they do not conflict.
In the present invention, the terms "first," "second," and the like in the description and in the drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Fig. 1 is a schematic diagram of a time-series processing accelerator based on a one-dimensional convolutional neural network according to an embodiment of the present invention. Referring to fig. 1, a time-series processing accelerator based on a one-dimensional convolutional neural network in this embodiment will be described in detail with reference to fig. 2 to 4.
Referring to fig. 1, the time-series processing accelerator based on the one-dimensional convolutional neural network comprises an input processing module, a convolution operation array, a pooling processing module, a fully connected processing module, and a global control module. The input processing module comprises N rows of register groups: the first-row register group holds N registers, each subsequent row holds one register fewer, the registers in the first-row register group are connected in sequence, and the register groups are connected column by column between rows, where N is the size of a convolution kernel in the convolution operation array, as shown in fig. 2.
Under the cooperative control of the global control module, the input processing module, convolution operation array, pooling processing module, and fully connected processing module operate as follows. Each item of inference data enters through register reg_1N, flows transversely through the first-row register group, and is output through register reg_11; the data in the first-row register group flow longitudinally through the lower register rows and are output through registers reg_nn, n = 2, 3, ..., N, where register reg_ij is the register in column N-j+1 of the i-th row register group. The convolution operation array performs convolution and activation on the data output by the input processing module, the pooling processing module pools the activation results and outputs them, and the fully connected processing module performs fully connected addition on the activation results and outputs them.
In accordance with an embodiment of the present invention, the time-series processing accelerator further comprises an inference data storage module, as shown in fig. 1. It stores the inference data originally input to the accelerator and the inference data output by the convolution operation array, the pooling processing module, and the fully connected processing module, and outputs the stored inference data to the input processing module, which processes it and passes it on. Further, the inference data storage module may be divided into several partitions for storing different types of inference data. For example, it may be divided into two partitions, PixelRAM1 and PixelRAM2, which store, respectively, the original inference data input to the time-series processing accelerator and the intermediate inference data produced during processing.
In the present embodiment, the circuit structure of the input processing module is described with N = 5 (convolution kernel size 5×1). Referring to fig. 2, reg_11 to reg_15, reg_22 to reg_25, reg_33 to reg_35, reg_44 to reg_45, and reg_55 together form the data-flow register groups. The input processing module further comprises a second multiplexer; port PixelInMode carries its input-data chip-select signal, sent by the global control module and determined by the current layer state of the accelerator, so that the second multiplexer selects either the inference data in the corresponding partition or a zero-padding sequence (the PADDING input mode) and feeds it into the register groups, completing chip selection between inference-data input and PADDING filling. Port MulAddEn is the enable signal that opens and closes the longitudinal data flow in the register groups, i.e. the shift-up operation. Ports PixelIn1 and PixelIn2 are inference-data read ports; ports PixelW1 to PixelW5 are inference data-stream output ports, connected respectively to the inference-data input ports MAC1 to MAC5 of the same 5×1 convolution kernel.
Specifically, when a round of input feature map begins, the second multiplexer performs chip selection according to PixelInMode and loads the selected datum into reg_15. On the next clock cycle, the datum in reg_15 shifts left into reg_14, and the next datum from the second multiplexer enters reg_15; on the following cycle, reg_14 shifts left into reg_13, reg_15 shifts left into reg_14, and on the next cycle new data from the second multiplexer again enter reg_15. This leftward data movement repeats until reg_11 to reg_15 are all filled.
Once reg_11 to reg_15 are filled, the global control module asserts the MulAddEn enable: the data in reg_11 are output through PixelW1, the data in reg_12 to reg_15 move up into reg_22 to reg_25, and at the same time reg_12 to reg_15 continue shifting left. On the next clock cycle, reg_11 to reg_15 move data as above, the data in reg_22 are output through PixelW2, and the data in reg_23 to reg_25 move up into reg_33 to reg_35. On the following cycle, reg_11 to reg_15 and reg_22 to reg_25 move data as above, the data in reg_33 are output through PixelW3, and the data in reg_34 and reg_35 move up into reg_44 and reg_45. On the next cycle, reg_11 to reg_15, reg_22 to reg_25, and reg_33 to reg_35 move data as above, the data in reg_44 are output through PixelW4, and the data in reg_45 move up into reg_55. On the cycle after that, reg_11 to reg_15, reg_22 to reg_25, reg_33 to reg_35, and reg_44 to reg_45 move data as above and the data in reg_55 are output through PixelW5. Subsequent clock cycles repeat this pattern, continuously feeding the pipelined inference data stream into the convolution operation array.
When the last datum of this round's input feature map has been output from reg_55 through the PixelW5 port, the global control module deasserts MulAddEn, closing the upward movement in the register groups and completing the data-stream processing of this round's input feature map. Meanwhile, the inference data of the next round's input feature map can continue to fill reg_11 to reg_15 by left shifting, and the formation of the input pipelined data stream repeats in this way.
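The fill, shift-left, and shift-up behaviour described above can be checked with a short cycle-level sketch. This is a behavioural model under one reading of figs. 2 and 3, not the patent's circuit; names such as `step` and `ports`, and the feeding of `None` bubbles after the last datum, are illustrative. Applying the chained multiply-add of the convolution kernel to the skewed port streams confirms that they reproduce a stride-1 sliding-window convolution:

```python
N = 5                                   # kernel size, as in the embodiment

def step(rows, new_val):
    """One MulAddEn clock cycle of the triangular register file:
    each lower row captures the pre-shift contents of positions 2..N of
    the row above it, then row 1 shifts left and takes new_val at reg_1N."""
    nxt = [rows[i - 1][1:] for i in range(1, N)]   # vertical move-up
    nxt.insert(0, rows[0][1:] + [new_val])         # horizontal shift-left
    return nxt

x = list(range(10))                     # inference data x0..x9
w = [1, 2, 3, 4, 5]                     # kernel weights Weight1..Weight5
bias = 7

# Fill phase: only the first row shifts left (MulAddEn not yet asserted).
rows = [[None] * (N - i) for i in range(N)]
for v in x[:N]:
    rows[0] = rows[0][1:] + [v]

# Streaming phase: record what each PixelW_k port (rows[k][0]) emits.
ports = [[] for _ in range(N)]
for nxt in x[N:] + [None] * N:          # None = bubble after the last datum
    for k in range(N):
        if rows[k][0] is not None:
            ports[k].append(rows[k][0])
    rows = step(rows, nxt)

# Port k emits the input delayed by k cycles, so the chained MACs compute
# y[m] = bias + sum_k w[k] * x[m + k]: a stride-1 "valid" convolution.
y = [bias + sum(w[k] * ports[k][m] for k in range(N))
     for m in range(len(x) - N + 1)]
print(y)   # [47, 62, 77, 92, 107, 122]
```

The skew between the port streams is exactly what lets MAC1 to MAC5 fire on successive cycles while every port still delivers one datum per cycle in steady state.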
The time-series processing accelerator further comprises a convolution-kernel weight storage module and a convolution-kernel bias storage module, as shown in fig. 1, which store, respectively, the convolution-kernel weights and biases used in each layer's convolution operations.
Each convolution kernel in the time-series processing accelerator comprises N multipliers and N+1 adders: the two inputs of the i-th multiplier are the data in register reg_ii and the corresponding convolution-kernel weight, i = 1, 2, ..., N; the two inputs of the first adder are the convolution-kernel bias and the partial convolution sum at the corresponding position from the previous round's input feature map; the two inputs of the k-th adder are the output of the (k-1)-th adder and the output of the (k-1)-th multiplier, k = 2, 3, ..., N+1; and the output of the (N+1)-th adder is the partial convolution sum at the corresponding position of this round's input feature map, as shown in fig. 3.
Referring to fig. 3 and taking N = 5 as an example, the convolution operation array involves the multiply-add units MAC1 to MAC5, a partial-sum storage FIFO, the pre-adder PartSumADD, the inference-data input ports PixelW1 to PixelW5, the input ports MulAddEn (multiply-add enable) and PartSumFIFO (partial-sum FIFO read), and the output ports FcDone (fully-connected-layer multiply output) and PartSumDone (partial-sum output).
Specifically, when the accelerator operates, the convolution-kernel weight and bias storage modules read out the Weight and Bias parameters needed for this round's input feature map. When the first inference datum is fed into MAC1 through PixelW1, the global control module asserts MulAddEn: MAC1 starts working, sending the weight Weight1 and the inference datum from PixelW1 into its multiplier, while the previous round's partial sum and the convolution-kernel bias are sent into PartSumADD for addition. On the next clock cycle, the MAC1 product and the PartSumADD sum are sent to the MAC1 adder, while the inference datum from PixelW2 and Weight2 are read into the MAC2 multiplier. On the following cycle, the MAC2 product and the MAC1 sum go to the MAC2 adder, while the datum from PixelW3 and Weight3 enter the MAC3 multiplier; then the MAC3 product and the MAC2 sum go to the MAC3 adder, while the datum from PixelW4 and Weight4 enter the MAC4 multiplier; then the MAC4 product and the MAC3 sum go to the MAC4 adder, while the datum from PixelW5 and Weight5 enter the MAC5 multiplier; and on the next cycle the MAC5 product and the MAC4 sum are added in the MAC5 adder. In this way MAC1 to MAC5 implement the one-dimensional convolution multiply-add operation as an efficient pipeline: the utilization of the MAC1 to MAC5 multiply-add units reaches 100%, and the operation of the whole convolution kernel requires only one additional adder, greatly reducing adder usage.
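The round-by-round accumulation through PartSumADD and the partial-sum FIFO can be sketched as follows for a multi-channel one-dimensional convolution. This is a behavioural model, not the patent's circuit; in particular, adding the bias only on the first round is an assumption made here so the bias is not counted once per channel:

```python
def conv_round(psums, xs, w, bias, first_round):
    """One round of the convolution array over one input channel.
    psums: partial sums read back from the partial-sum FIFO.
    PartSumADD merges the stored partial sum with the kernel bias
    (bias only on the first round -- an assumption); MAC_1..MAC_N then
    chain their products onto the running sum."""
    out = []
    for t, psum in enumerate(psums):
        acc = psum + (bias if first_round else 0)   # pre-adder PartSumADD
        for k in range(len(w)):                     # adder inside MAC_(k+1)
            acc += w[k] * xs[t + k]
        out.append(acc)                             # written back to the FIFO
    return out

# Two input channels, kernel size 3, four samples per channel.
xs_by_ch = [[1, 2, 3, 4], [5, 6, 7, 8]]
w_by_ch = [[1, 1, 1], [2, 0, 1]]
bias = 10

psums = [0, 0]                                      # FIFO initially zeroed
for ch, (xs, w) in enumerate(zip(xs_by_ch, w_by_ch)):
    psums = conv_round(psums, xs, w, bias, first_round=(ch == 0))
print(psums)   # [33, 39]
```

Because each round only adds onto the stored partial sums, the same N-wide multiply-add chain serves any number of input channels.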
According to an embodiment of the present invention, the time-series processing accelerator further includes a first multiplexer. The inputs of the first multiplexer are connected to the outputs of the second through N-th adders; the FcMode port receives the mode-control signal s from the global control module and gates out the output of the corresponding s-th adder, s = 2, 3, ..., N. When the first multiplexer receives the mode-control signal s, the convolution kernel operates in fully-connected-layer mode; otherwise it operates in convolution mode, as shown in fig. 3.
The circuit structure of the convolution operation array in this embodiment can be multiplexed as the operation array of a fully connected layer, with the fully connected layer performed as a 1×1 convolution. Take a fully connected layer with 4 input neurons as an example: if the current layer is a fully connected layer, the global control module asserts FcLayer, and the mode-control signal s on the FcMode port is synchronously set to 4, the number of input-layer neurons. During fully-connected-layer operation, MAC1 to MAC4 work exactly as in convolution; the difference is that when the MAC4 addition completes, the datum is sent directly through Fc4Done to FcMUX. Because the mode-control signal s received by FcMUX is 4, FcMUX gates Fc4Done and outputs the datum to the subsequent processing module through the FcDone port. This circuit structure lets the convolution operation array be reused as the fully connected operation array.
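The gating just described amounts to tapping the adder chain after the s-th multiply-add. A minimal sketch follows; the function name `mac_chain_outputs` and the way the tap index maps to FcMode are illustrative assumptions:

```python
def mac_chain_outputs(acts, w):
    """Running outputs of the chained adders in MAC_1..MAC_N; tap s-1
    (zero-based) is what FcMUX gates out when FcMode = s."""
    taps, acc = [], 0
    for a, wi in zip(acts, w):
        acc += a * wi                  # one multiply-add stage per neuron
        taps.append(acc)
    return taps

# Fully connected layer with 4 input neurons on a 5-wide kernel:
acts = [1, 2, 3, 4]            # activations of the 4 input neurons
w = [10, 20, 30, 40, 99]       # the fifth weight never reaches the output
fc_out = mac_chain_outputs(acts, w)[4 - 1]   # FcMode s = 4 gates Fc4Done
print(fc_out)   # 300
```

Selecting an earlier tap simply shortens the dot product, which is why one mode-control signal suffices to reuse the convolution hardware for fully connected layers of different widths.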
The ports of each module in the time-series processing accelerator based on the one-dimensional convolutional neural network, the connections between ports, and the data flow in this embodiment are described below.
The time-series processing accelerator has three I/O ports: two input ports, rst and ModelStart, and one output port, ModelDone. rst is the accelerator's reset port, ModelStart is its enable port, and ModelDone is the inference-completion flag. When the enable port ModelStart receives a continuous enable signal and the reset port rst receives a reset signal lasting two or more clock cycles, all registers in the accelerator are reset; after reset completes, the time-series processing accelerator performs one-dimensional convolutional neural network inference, and when inference completes, the output port ModelDone outputs a high-level pulse to indicate completion.
The global control module is connected to the storage module through eight groups of signal lines, all driven from the global control module to the storage module. The eight groups are: the 1-bit-wide inference-data read controls PixelRd1/PixelRd2, the 1-bit-wide inference-data write controls PixelWr1/PixelWr2, the 1-bit-wide convolution kernel weight read control WeightRd, the 1-bit-wide convolution kernel bias read control BiasRd, the 4-bit-wide convolution kernel bias read address BiasAddr, the 8-bit-wide convolution kernel weight read address WeightAddr, the 12-bit-wide inference-data read address PixelRdAddr, and the 12-bit-wide inference-data write address PixelWrAddr. They carry the access addresses and access-control signals to the inference-data storage, the convolution kernel weight storage, and the convolution kernel bias storage.
The global control module is connected to the input processing module through two groups of signal lines, both driven from the global control module to the input processing module: the 1-bit-wide data-stream processing enable MulAddEn, which controls the opening and closing of the input processing module, and the 2-bit-wide read chip select PixelInMode, which selects the source of the inference-data read. The storage module is connected to the input processing module through two groups of 64-bit-wide signal lines, both driven from the storage module to the input processing module: the inference-data outputs PixelIn1 and PixelIn2, which move the inference data from memory into the input processing module.
The global control module is connected to the convolution operation array through two groups of 1-bit-wide signal lines, both driven from the global control module to the convolution operation array: the multiply-add unit enable MulAddEn and the activation-function circuit enable ActEn, which respectively turn on the convolution array circuit and the activation-function circuit. The input processing module is connected to the convolution operation array through five groups of 64-bit-wide signal lines, the data-stream-processed inputs PixelW1 to PixelW5, driven from the input processing module to the convolution operation array and used to feed the inference data that has been formed into a data stream into the convolution operation array.
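The data-stream feed just described can be pictured as a sliding window over the inference data. The sketch below is an assumption for illustration (the function pixel_windows is hypothetical, not part of the patent), modeling PixelW1 to PixelW5 as one five-word window presented to the convolution operation array per cycle.

```python
def pixel_windows(sequence, kernel_size=5):
    """Yield successive kernel_size-wide windows (PixelW1..PixelW5) per cycle."""
    for start in range(len(sequence) - kernel_size + 1):
        yield tuple(sequence[start:start + kernel_size])

# Seven inference-data words produce three successive five-word windows.
windows = list(pixel_windows([10, 11, 12, 13, 14, 15, 16]))
```

Each yielded tuple corresponds to one cycle's worth of inputs on the PixelW1 to PixelW5 ports.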
The global control module is connected to the pooling processing module through two groups of signal lines, both driven from the global control module to the pooling processing module: the 1-bit-wide global average pooling layer flag GAP and the 1-bit-wide max pooling layer flag MaxPool, which inform the pooling processing module of the current pooling mode and thereby control its pooling operation. The convolution operation array is connected to the pooling processing module through one group of signal lines, which carries the activated partial convolution sum PartSumDone to the pooling processing module. The pooling processing module is connected to the storage module through two groups of 64-bit-wide signal lines, driven from the pooling processing module to the storage module: the convolution layer operation results ConvOut and PoolOut, which move the pooled results back to the inference-data storage.
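The pooling-mode control described above amounts to selecting between global average pooling and max pooling with the two 1-bit flags. A minimal sketch, assuming a hypothetical helper pool rather than the patented circuit:

```python
def pool(values, gap_flag, maxpool_flag):
    """Apply the pooling mode selected by the 1-bit GAP / MaxPool flags."""
    if gap_flag:                      # global average pooling over the feature map
        return sum(values) / len(values)
    if maxpool_flag:                  # max pooling over the window
        return max(values)
    raise ValueError("no pooling mode selected")

avg = pool([1, 2, 3, 4], gap_flag=True, maxpool_flag=False)   # 2.5
mx = pool([1, 2, 3, 4], gap_flag=False, maxpool_flag=True)    # 4
```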
The global control module is connected to the fully connected processing module through two groups of 1-bit-wide signal lines, both driven from the global control module to the fully connected processing module: the fully connected layer flag Fc and the fully connected layer adder enable FcAddEn, which respectively enable the fully connected processing module and the fully connected layer adder. The fully connected processing module is connected to the storage module through one group of 64-bit-wide signal lines, the fully connected layer output FcOut, driven from the fully connected processing module to the storage module and responsible for moving the fully connected layer operation results back to the inference-data storage.
Referring to fig. 4, the time-series processing accelerator in this embodiment operates as follows. The global control module cycles through the states and sends control-signal enables and address signals to the corresponding functional modules. The inference data read from the inference-data storage is processed by the input processing module, the convolution operation array, the pooling processing module, or the fully connected processing module, and is finally written back to the other group of inference-data storage; once a neural network layer completes, the input and output storages are exchanged, and the cycle repeats until all the neural network layers have been processed. The final result is then output to the designated location, and a high-level pulse is emitted at the output port ModelDone to indicate that the inference is complete.
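The layer-by-layer flow just described is a ping-pong scheme: each layer reads from one inference-data store and writes into the other, and the two are swapped between layers. A minimal software sketch, with run_model and the identity "layers" as hypothetical stand-ins for the real hardware modules:

```python
def run_model(input_data, layers):
    # Two inference-data stores used in ping-pong fashion: each layer reads
    # from one store, writes its results into the other, then the roles swap.
    buf_read = list(input_data)
    for layer in layers:              # global control module cycles the states
        buf_write = [layer(x) for x in buf_read]
        buf_read = buf_write          # exchange input and output stores
    model_done = True                 # ModelDone: high-level pulse when finished
    return buf_read, model_done

result, done = run_model([1, 2, 3], [lambda x: x + 1, lambda x: x * 2])
# result == [4, 6, 8], done is True
```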
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (6)

1. A time-series processing accelerator based on a one-dimensional convolutional neural network, characterized by comprising an input processing module, a convolution operation array, a pooling processing module, a fully connected processing module, and a global control module;
the input processing module comprises N rows of register groups; the first-row register group contains N registers, and the number of registers decreases by one with each successive row; the registers in the first-row register group are connected in sequence, and the registers in each column are connected in sequence across the row register groups, N being the size of the convolution kernel in the convolution operation array;
under the cooperative control of the global control module, each item of inference data enters through register reg1N, flows transversely through the first-row register group, and is output through register reg11; the data in the first-row register group also flows vertically down into each row register group and is output through register regnn, n = 2, 3, …, N, where register regij denotes the register in the (N-j+1)-th column of the i-th row register group;
the convolution operation array is configured to perform convolution operation and activation on the data output by the input processing module, the pooling processing module is configured to pool the activation results and output them, and the fully connected processing module is configured to perform fully connected addition operations on the activation results and output them;
the convolution kernel in the time-series processing accelerator includes:
N multipliers, wherein the two inputs of the i-th multiplier are the register regii and the corresponding convolution kernel weight, i = 1, 2, …, N;
N+1 adders, wherein the two inputs of the first adder are respectively the convolution kernel bias and the partial convolution sum at the corresponding position of the previous round's input feature map; the two inputs of the k-th adder are respectively the output of the (k-1)-th adder and the output of the (k-1)-th multiplier, k = 2, 3, …, N+1; and the output of the (N+1)-th adder is the partial convolution sum at the corresponding position of the current round's input feature map;
the time-series processing accelerator further comprises a first multiplexer, the inputs of which are connected to the outputs of the second through N-th adders; its FcMode port receives the mode control signal s from the global control module, and it outputs the output of the s-th adder corresponding to the mode control signal s, s = 2, 3, …, N;
when the first multiplexer receives the mode control signal s, the convolution kernel operates in fully connected layer mode; otherwise, the convolution kernel operates in convolution mode.
2. The time-series processing accelerator based on a one-dimensional convolutional neural network according to claim 1, further comprising:
an inference data storage module, configured to store the inference data originally input into the time-series processing accelerator and the inference data output by the convolution operation array, the pooling processing module, and the fully connected processing module, and to output the stored inference data to the input processing module.
3. The time-series processing accelerator based on a one-dimensional convolutional neural network according to claim 2, wherein the inference data storage module comprises a plurality of partitions for storing different types of inference data, respectively;
the input processing module further comprises a second multiplexer, which selects the inference data in the corresponding partition or a zero-padding sequence according to the current layer state of the time-series processing accelerator and sends it to the input processing module.
4. The time series processing accelerator based on one-dimensional convolutional neural network of claim 1, further comprising a convolutional kernel weight storage module and a convolutional kernel bias storage module for storing the convolutional kernel weights and the convolutional kernel biases in the convolutional operations of each layer, respectively.
5. The one-dimensional convolutional neural network-based time-series processing accelerator of claim 1, further comprising a reset port, an enable port, and an output port; when the enabling port receives continuous enabling signals and the resetting port receives resetting enabling signals with more than two clock cycles, the time sequence processing accelerator conducts one-dimensional convolutional neural network reasoning, and after the reasoning is completed, the output port outputs high-level pulse to represent completion.
6. The time-series processing accelerator based on the one-dimensional convolutional neural network according to claim 1, wherein two signal lines are connected between the global control module and the pooling processing module, and the global control module outputs a maximum pooling layer flag or a global average pooling layer flag to the pooling processing module through the two signal lines so as to control a pooling operation mode of the pooling processing module.
CN202111065987.1A 2021-09-10 2021-09-10 Time sequence processing accelerator based on one-dimensional convolutional neural network Active CN113762480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111065987.1A CN113762480B (en) 2021-09-10 2021-09-10 Time sequence processing accelerator based on one-dimensional convolutional neural network

Publications (2)

Publication Number Publication Date
CN113762480A CN113762480A (en) 2021-12-07
CN113762480B true CN113762480B (en) 2024-03-19

Family

ID=78795041


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114781629B (en) * 2022-04-06 2024-03-05 合肥工业大学 Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method

Citations (3)

Publication number Priority date Publication date Assignee Title
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN110263925A (en) * 2019-06-04 2019-09-20 电子科技大学 A kind of hardware-accelerated realization framework of the convolutional neural networks forward prediction based on FPGA

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN109460817B (en) * 2018-09-11 2021-08-03 华中科技大学 Convolutional neural network on-chip learning system based on nonvolatile memory



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant