CN111753962A - Adder, multiplier, convolution layer structure, processor and accelerator - Google Patents


Info

Publication number
CN111753962A
Authority
CN
China
Prior art keywords
adder
neural network
clock
programmable device
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010594416.6A
Other languages
Chinese (zh)
Other versions
CN111753962B (en)
Inventor
唐波
贺龙龙
张耀辉
林志杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Original Assignee
Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd filed Critical Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Priority claimed from application CN202010594416.6A
Publication of CN111753962A
Application granted
Publication of granted patent CN111753962B
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an adder, a multiplier, a convolutional layer structure, a processor and an accelerator. The programmable-device-based neural network adder comprises: a plurality of gated clock modules, each of which performs a logical AND on a received enable signal and clock signal and generates a gated clock signal from the result; and a plurality of adders cascaded in a pipeline structure to form multiple adder stages, each stage connected one-to-one with its corresponding gated clock module. The output of each adder stage is connected both to the input of the next stage and to the enable terminal of the next stage's gated clock module, so that the gated clock module of the next stage receives the enable signal produced by the previous stage and, through the gated clock signal it generates, controls the next stage to update its data with the values received at its input. Implementing the invention reduces energy consumption.

Description

Adder, multiplier, convolution layer structure, processor and accelerator
Technical Field
The invention relates to the technical field of neural networks, in particular to an adder, a multiplier, a convolutional layer structure, a processor and an accelerator.
Background
The convolutional neural network is one of the most representative neural networks in deep learning and has driven many breakthroughs in image analysis and processing. In autonomous driving, deep learning is widely applied to target recognition, image feature extraction and classification, scene recognition, and the like. The inner algorithms of a convolutional neural network run in a computing-platform domain controller whose power budget is constrained by the vehicle's power supply, so when autonomous driving is implemented with convolutional neural networks, power consumption and computational capability are difficult to reconcile. A hardware device is therefore urgently needed that preserves computational capability and algorithm accuracy without impairing the vehicle's driving range.
Disclosure of Invention
In view of this, embodiments of the present invention provide an adder, a multiplier, a convolutional layer structure, a processor, and an accelerator to solve the prior-art problem that power consumption and computational capability are difficult to reconcile.
According to a first aspect, an embodiment of the present invention provides a programmable-device-based neural network adder, comprising: a plurality of gated clock modules, each performing a logical AND on a received enable signal and clock signal and generating a gated clock signal from the result; and a plurality of adders cascaded in a pipeline structure to form multiple adder stages, each stage connected one-to-one with its corresponding gated clock module. The output of each adder stage is connected both to the input of the next stage and to the enable terminal of the next stage's gated clock module, so that the next stage's gated clock module receives the enable signal produced by the previous stage and, through the gated clock signal it generates, controls the next stage to update its data with the values received at its input.
Optionally, each adder includes at least one multi-bit full adder formed by serially cascading a plurality of one-bit full adders. Each one-bit full adder is connected one-to-one to a corresponding register in which it stores its calculation result, and the clock terminal of each register is connected to the output of the corresponding gated clock module so that the register releases the calculation result according to the received gated clock signal.
According to a second aspect, an embodiment of the present invention provides a programmable-device-based neural network adder, comprising: a plurality of gated clock modules, each performing a logical AND on a received enable signal and clock signal and generating a gated clock signal from the result; and a plurality of adders cascaded in a pipeline structure to form multiple adder stages, each stage connected one-to-one with its corresponding gated clock module, the output of each stage being connected both to the input of the next stage and to the enable terminal of the next stage's gated clock module, so that the next stage's gated clock module receives the enable signal produced by the previous stage and, through the gated clock signal it generates, controls the next stage to update its data with the values received at its input.
According to a third aspect, an embodiment of the present invention provides a programmable-device-based neural network multiplier, comprising: a first gated clock module, which performs a logical AND on a received first enable signal and first clock signal and generates a first gated clock signal from the result; a reset module, whose clock terminal is connected to the first gated clock module and which performs a reset operation according to the received first gated clock signal; a second gated clock module, whose enable terminal is connected to the output of the reset module and which receives the second enable signal sent by the reset module, performs a logical AND on it and a second clock signal, and generates a second gated clock signal from the result; and a multiplier, whose clock terminal is connected to the second gated clock module and whose input is connected to the output of the reset module, and which performs a data update operation according to the second gated clock signal received at its clock terminal and the data received at its input.
According to a fourth aspect, an embodiment of the present invention provides a programmable-device-based neural network convolutional layer structure, comprising: a third gated clock module, which performs a logical AND on a received third enable signal and third clock signal and generates a third gated clock signal from the result; a plurality of programmable-device-based neural network multipliers according to the third aspect, the clock terminal of each being connected to the output of the third gated clock module so as to perform a data update operation on the data received at its input according to the third gated clock signal received at its clock terminal; a plurality of fourth gated clock modules, whose enable terminals are connected to the outputs of the corresponding programmable-device-based neural network multipliers and which receive a fourth enable signal, perform a logical AND on it and the fourth clock signal, and generate a fourth gated clock signal from the result; and a plurality of programmable-device-based neural network adders according to the first aspect, whose clock terminals are connected to the outputs of the corresponding fourth gated clock modules so as to perform a data update operation according to the received fourth gated clock signal.
Optionally, the programmable-device-based neural network convolutional layer structure further includes a plurality of data storage modules, each comprising a data register module and a weight register module, for storing the data and weights required by the convolutional neural network's calculations.
Optionally, each data storage module further includes a cache memory connected to the data register module and the weight register module for caching the data and weights required by the convolutional neural network's calculations.
Optionally, the cache memory comprises 4 memory blocks.
Optionally, the programmable-device-based neural network convolutional layer structure further includes: a fifth gated clock module, whose enable terminal is connected to the output of any adder that performs a data update operation and which receives a fifth enable signal, performs a logical AND on it and the fifth clock signal, and generates a fifth gated clock signal from the result; and an accumulator, whose clock terminal is connected to the output of the fifth gated clock module and which performs a data update operation according to the received fifth gated clock signal.
According to a fifth aspect, an embodiment of the present invention provides a programmable-device-based neural network processor, comprising the programmable-device-based neural network convolutional layer structure according to the fourth aspect or any optional implementation thereof.
According to a sixth aspect, an embodiment of the present invention provides a neural network accelerator based on a programmable device, including: a plurality of programmable device based neural network processors as described in the fifth aspect.
The invention has the following advantages:
1. In the programmable-device-based neural network adder provided by this embodiment, the full adders are connected in a pipelined cascade with a constraint between successive stages: the enable signal of each full adder is driven by the preceding one, so a stage updates its data only after the preceding stage has finished updating. This avoids the useless data updates that occur at every clock trigger in the prior art and reduces energy consumption.
2. The programmable-device-based neural network multiplier provided by this embodiment connects a reset module, a multiplier, a first gated clock module and a second gated clock module, with a constraint between the reset module and the multiplier: the multiplier's second gated clock signal is controlled by the reset module's output, so the multiplier updates its data only after the reset module has finished resetting. This avoids the useless data updates that occur at every clock trigger in the prior art and reduces the multiplier's power consumption.
3. The programmable-device-based neural network convolutional layer structure provided by this embodiment connects the above neural network multiplier, the above neural network adder, a third gated clock module and a fourth gated clock module, with a constraint between the multiplier and the adder: the adder is allowed to update its data only after the multiplier has finished updating. This avoids the useless data updates that occur at every clock trigger in the prior art and reduces the convolutional layer structure's power consumption.
4. The programmable-device-based neural network convolutional layer structure provided by this embodiment builds its cache memory from several small memory blocks. In most prior-art structures the entire cache is activated to complete each read or write, yet reading or writing half (or fewer) of the memory blocks consumes less power than activating one large monolithic block. This embodiment therefore further reduces the convolutional layer structure's power consumption through memory partitioning.
5. In the programmable-device-based neural network processor provided by this embodiment, the low power consumption of the internal convolutional layer structure lowers the power consumption of the processor as a whole.
6. The programmable-device-based neural network accelerator provided by this embodiment comprises a plurality of the programmable-device-based neural network processors of the above embodiments, which lowers the accelerator's power consumption.
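The memory-partitioning idea in point 4 above can be sketched as a simple address decode (a hypothetical model; the bank size and function names are illustrative, not from the patent — the patent only states that the cache is split into small blocks so each access activates a fraction of the memory):

```python
BANK_SIZE = 256  # words per bank (hypothetical size)
N_BANKS = 4      # the patent's example uses 4 memory blocks

def bank_select(addr):
    # High address bits choose the bank; low bits index within it, so each
    # access activates only one quarter-sized block while the other
    # N_BANKS - 1 banks stay idle and burn no dynamic power.
    bank, offset = divmod(addr, BANK_SIZE)
    assert bank < N_BANKS, "address out of range"
    return bank, offset
```

For example, address 600 falls in bank 2 at offset 88, so only that one small block toggles for the access.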
Drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing them are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a block diagram of a programmable device based neural network adder according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating operation of a programmable device based neural network adder according to an embodiment of the present invention;
FIG. 3 shows a multi-bit full adder architecture diagram of an embodiment of the invention;
FIG. 4 is a diagram illustrating an internal connection structure of a programmable device based neural network adder, according to an embodiment of the present invention;
FIG. 5 shows a block diagram of a programmable device based neural network multiplier of an embodiment of the present invention;
FIG. 6 shows a block diagram of a programmable device based neural network convolutional layer structure of an embodiment of the present invention;
FIG. 7 illustrates a cache memory partitioning diagram of an embodiment of the present invention;
FIG. 8 is a schematic diagram of a programmable device based neural network processor according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a neural network accelerator based on a programmable device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "coupled" are to be construed broadly: for example, as a fixed, removable, or integral connection; mechanical or electrical; direct, or indirect through an intermediate medium, or an internal communication between two elements; wireless or wired. The specific meanings of the above terms in the present invention can be understood by those skilled in the art on a case-by-case basis.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
This embodiment provides a programmable-device-based neural network adder, as shown in fig. 1, comprising: a plurality of gated clock modules 101, each performing a logical AND on a received enable signal and clock signal and generating a gated clock signal from the result;
and a plurality of adders 102 cascaded in a pipeline structure to form multiple adder stages, each stage 102 connected one-to-one with its corresponding gated clock module 101. The output of each adder stage 102 is connected both to the input of the next stage 102 and to the enable terminal of the next stage's gated clock module 101, so that the next stage's gated clock module 101 receives the enable signal produced by the previous stage and, through the gated clock signal it generates, controls the next stage 102 to update its data with the values received at its input.
Illustratively, the gated clock module 101 may consist of a latch and a logical AND gate: the enable signal drives the latch's input, the clock signal is connected to the latch's clock terminal and also serves as one input of the AND gate, and the gated clock signal is the AND of the clock signal and the latch's output. That is, the gated clock signal is active when the clock signal is at a rising edge while the enable signal is high. The enable signal can be triggered by other devices connected to the programmable-device-based neural network full adder of this embodiment, such as multipliers or adders; the clock signal may be sent by a controller, such as a CPU, that directs the full adder's operation.
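The latch-plus-AND gating cell just described can be modelled behaviorally (a minimal sketch; the class and method names are illustrative, not from the patent). The latch samples the enable only while the clock is low, so an enable that changes during the high phase cannot truncate a clock pulse:

```python
class ClockGate:
    """Behavioral model of a latch-based clock-gating cell."""

    def __init__(self):
        self.latched_en = 0  # output of the level-sensitive latch

    def step(self, clk, en):
        # The latch is transparent while the clock is low, opaque while high.
        if clk == 0:
            self.latched_en = en
        # Gated clock = clock AND latched enable.
        return clk & self.latched_en
```

Driving it with a clock sequence shows the behavior: a pulse passes only if the enable was high when the clock last went low, which is exactly the glitch-free gating condition.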
In this embodiment, the addition of 16 channels of 32-bit data is taken as an example. The adder's operation flow is shown in fig. 2; the 16 inputs in fig. 2 correspond to inputs IN0, IN1 … IN15 of Adder0 in fig. 1. Adder0 integrates 8 32-bit full adders, and by analogy the four adder stages 102 in fig. 1 (Adder0, Adder1, Adder2 and Adder3) integrate 8 32-bit, 4 33-bit, 2 34-bit and 1 35-bit full adders respectively. This embodiment does not limit the number of adders, the number of full adders integrated in each adder, or their bit widths; those skilled in the art can choose them as needed.
The adders 102 are cascaded in a pipeline to form multiple adder stages; the output of each stage 102 is connected to the enable terminal of the next stage's gated clock module 101, which generates a gated clock signal from that enable signal and the clock signal to control the registers in the corresponding full adders to update their data. Taking the 4 adders of fig. 1 as an example, Adder0, Adder1, Adder2 and Adder3 are each connected to a gated clock module. When Adder0 finishes updating its data and outputs a result, an enable signal is input to the gated clock module connected to Adder1, which then lets Adder1 update according to that enable signal and the clock signal; Adder2 and Adder3 work in the same way in turn.
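The enable chain between pipeline stages can be sketched at cycle level (a hypothetical model; `run_pipeline` and the "+1 per stage" stand-in for an addition stage are illustrative, not from the patent). Each stage's register is clocked only when the previous stage reports a valid result, so idle stages see no clock edges:

```python
def run_pipeline(inputs, n_stages=4):
    regs = [None] * n_stages    # pipeline registers (one per adder stage)
    valid = [False] * n_stages  # doubles as the enable for the next stage's gate
    outputs = []
    for cycle in range(len(inputs) + n_stages):
        new_val = inputs[cycle] if cycle < len(inputs) else None
        # Update stages back-to-front so each sees last cycle's values.
        for k in range(n_stages - 1, -1, -1):
            en = valid[k - 1] if k > 0 else (new_val is not None)
            if en:  # gated clock fires only when the previous stage is done
                src = regs[k - 1] if k > 0 else new_val
                regs[k] = src + 1  # stand-in for one addition stage
                valid[k] = True
            else:
                valid[k] = False   # bubble: this stage's register is not clocked
        if valid[-1]:
            outputs.append(regs[-1])
    return outputs

# run_pipeline([0, 10]) -> [4, 14]: each value passes through 4 "+1" stages.
```

Without the gating, every register would be clocked every cycle; here a stage toggles only while useful data is flowing through it, which is the power saving the patent claims.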
In the programmable-device-based neural network adder provided by this embodiment, the full adders are connected in a pipelined cascade with a constraint between successive stages: the enable signal of each full adder is driven by the preceding one, so a stage updates its data only after the preceding stage has finished updating. This avoids the useless data updates that occur at every clock trigger in the prior art and reduces energy consumption.
As an optional implementation of this embodiment, as shown in fig. 3, each adder includes at least one multi-bit full adder formed by serially cascading a plurality of one-bit full adders 201. Each one-bit full adder 201 is connected one-to-one to a corresponding register 202 in which it stores its calculation result, and the clock terminal of each register 202 is connected to the output of the corresponding gated clock module so that the register releases the calculation result according to the received gated clock signal.
Illustratively, this embodiment uses 36 one-bit full adders 201 to form a 36-bit full adder, and assumes the inputs of the programmable-device-based neural network full adder are A[35:0] and B[35:0], i.e. A and B are to be added. FA0, FA1 … FA35 in fig. 3 then denote the 36 one-bit full adders 201 that compute the 36 result bits of A + B. C denotes the carry of a two-bit addition (C0, for example, is the carry of A0 + B0) and passes the carry information to the next higher one-bit full adder 201. S denotes the sum of a two-bit addition (S0, for example, is the sum of A0 + B0), and each result is stored in a register 202, e.g. D0 stores S0; register 202 may be a D flip-flop. Note that in this embodiment D0-D35 store S0-S35 respectively, and D36 stores the highest carry C35.
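The ripple-carry structure just described can be written out directly (a sketch under the names used above; the function names are mine). Each one-bit full adder produces a sum bit S and a carry C that ripples to the next higher adder, with the final carry landing in the extra register D36:

```python
def full_adder_1bit(a, b, cin):
    # One FA cell: sum = a XOR b XOR cin, carry = majority(a, b, cin).
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a, b, width=36):
    carry = 0
    s_bits = []
    for i in range(width):  # FA0 .. FA(width-1), carries chained in series
        s, carry = full_adder_1bit((a >> i) & 1, (b >> i) & 1, carry)
        s_bits.append(s)
    # D0..D(width-1) hold the sum bits; D(width) holds the final carry.
    return sum(bit << i for i, bit in enumerate(s_bits)) | (carry << width)
```

Adding 2**36 - 1 and 1, for instance, ripples a carry through all 36 cells and leaves only the top carry bit set, which is the case the D36 register exists for.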
The registers D0-D36 update their stored data only when the clock signal and the enable signal satisfy the update condition simultaneously, outputting the currently stored data and storing the result of the next addition. For example, if and only if the clock signal is at a rising edge and the enable signal is high do the registers D0-D36 output the stored operation result and store the new one, updating their contents.
The multi-bit full adders integrated in an adder 102 are connected to the gated clock module 101 as shown in fig. 4: one gated clock module 101 controls the register updates of all multi-bit full adders in the same stage. When the clock signal is at a rising edge and the enable signal is high, the gated clock signal becomes active and triggers all the full adders simultaneously, letting the registers in the multi-bit full adders update their data.
In the programmable-device-based neural network full adder above, a gated clock module is attached to the adder and controls its data updates. With the gated clock module, the adder's output no longer depends on the clock signal alone but is also controlled by the enable signal: the adder outputs a result only when the enable signal and the clock signal satisfy their conditions simultaneously, instead of needlessly updating the register contents on every rising clock edge as a conventional register does. This reduces the full adder's wasted energy.
The embodiment provides a neural network multiplier based on a programmable device, as shown in fig. 5, including:
a first gated clock module 301, which performs a logical AND on the received first enable signal and first clock signal and generates a first gated clock signal from the result;
a reset module 302, whose clock terminal is connected to the first gated clock module, for performing a reset operation according to the received first gated clock signal;
a second gated clock module 303, whose enable terminal is connected to the output of the reset module, for receiving the second enable signal sent by the reset module, performing a logical AND on it and the second clock signal, and generating a second gated clock signal from the result;
and a multiplier 304, whose clock terminal is connected to the second gated clock module and whose input is connected to the output of the reset module, for performing a data update operation according to the second gated clock signal received at its clock terminal and the data received at its input.
Illustratively, the enable signal of the first gated clock module 301 may be triggered by other devices connected to the reset module 302, such as a multiplier, an adder or a memory; the clock signal may be sent by a controller, such as a CPU, that directs the devices' operation. The enable terminal of the second gated clock module 303 is connected to the output of the reset module 302: when the reset module 302 completes its reset operation, the enable terminal of the second gated clock module 303 receives the enable signal, and the module generates the second gated clock signal from that enable signal and the clock signal to let the multiplier 304 update its data. In this embodiment the multiplier 304 internally uses a 32 × 16-bit register to store the operation result, and the register's clock terminal is connected to the gated clock module, reducing the register's total dynamic power consumption.
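The reset-then-multiply ordering can be sketched as a small state model (hypothetical; the class, the `first_en` flag, and the integer stand-in for the 32 × 16-bit product register are mine, not the patent's). The point is that the product register can only be clocked after the reset module has signalled completion:

```python
class GatedMultiplier:
    """Behavioral sketch: reset module gates the multiplier's clock."""

    def __init__(self):
        self.reset_done = False  # output of the reset module 302
        self.result = None       # product register (modelled as an int)

    def cycle(self, a, b, first_en):
        # First gated clock: the reset module fires only when its gate is enabled.
        if first_en and not self.reset_done:
            self.result = 0      # reset operation clears the product register
            self.reset_done = True
            return self.result
        # Second gated clock: enabled by the reset module's "done" output,
        # so the data update happens only after the reset has completed.
        if self.reset_done:
            self.result = a * b
        return self.result
```

Before the reset fires, calls leave the register untouched; the first enabled cycle resets it to 0, and only subsequent cycles perform the multiply.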
The programmable-device-based neural network multiplier provided by this embodiment connects a reset module, a multiplier, a first gated clock module and a second gated clock module, with a constraint between the reset module and the multiplier: the multiplier's second gated clock signal is controlled by the reset module's output, so the multiplier updates its data only after the reset module has finished resetting. This avoids the useless data updates that occur at every clock trigger in the prior art and reduces the multiplier's power consumption.
The present embodiment provides a neural network convolutional layer structure based on a programmable device, as shown in fig. 6, including:
a third gated clock module 401, configured to perform a logical and operation according to the received third enable signal and the third clock signal, and generate a third gated clock signal according to an operation result;
a plurality of programmable device-based neural network multipliers 402 in the above embodiments, wherein a clock terminal of each programmable device-based neural network multiplier is connected to an output terminal of the third gated clock module, and is configured to perform a data update operation on data received by an input terminal according to a third gated clock signal received by the clock terminal;
a plurality of fourth gated clock modules 403, the enable terminals of which are respectively connected to the output terminals of the corresponding programmable device-based neural network multipliers, and are configured to receive a fourth enable signal, perform a logical and operation according to the fourth clock signal and the corresponding fourth enable signal, and generate a fourth gated clock signal according to the operation result;
a plurality of programmable device-based neural network adders 404 as described in the above embodiments, wherein the clock terminals of the programmable device-based neural network adders are respectively connected to the output terminals of the corresponding fourth gated clock modules, and are configured to perform a data update operation according to the received fourth gated clock signal.
For example, the clock signals of the third and fourth gated clock modules may be sent by the controller that directs the neural network's convolution operation. The third enable signal may be sent by an upstream device connected to the programmable device-based neural network multiplier 402, for example a data memory placed in front of it; after the data memory outputs data, it sends the third enable signal to the third gated clock module 401. When the third clock signal received by the third gated clock module 401 is on a rising edge and the third enable signal is at a high level, the third gated clock module 401 issues the third gated clock signal, controlling all programmable device-based neural network multipliers 402 connected to it to update their data simultaneously.
The enable terminal of the fourth gated clock module 403 is connected to the output terminal of the programmable device-based neural network multiplier 402. When the multiplier 402 completes its data update, it sends a valid fourth enable signal to the fourth gated clock module 403; the fourth gated clock module 403 then issues the fourth gated clock signal from the fourth enable signal and the fourth clock signal, activating the programmable device-based neural network adder 404 to update its data.
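The multiplier-to-adder handshake can be sketched behaviorally: the adder stage is clocked only after the multiplier stage signals completion. This is a minimal Python sketch of the dataflow, assuming a simple dot-product convolution step; the function name `convolution_stage` is illustrative:

```python
def convolution_stage(data, weights):
    """Behavioral sketch of the multiplier -> adder chain: the adder's
    gated clock is enabled only after the multiplier stage is done."""
    # Multiplier stage: elementwise products of data and weights.
    products = [d * w for d, w in zip(data, weights)]
    multiplier_done = True  # models the fourth enable signal being asserted

    # Adder stage: updates only when the multiplier stage is done
    # (i.e., its gated clock is asserted).
    if multiplier_done:
        return sum(products)
    return None

assert convolution_stage([1, 2, 3], [4, 5, 6]) == 32   # 4 + 10 + 18
```

In hardware this sequencing is what prevents the adders from toggling on cycles where no valid products exist.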
The programmable device-based neural network convolutional layer structure provided by this embodiment is formed by connecting a programmable device-based neural network multiplier, a programmable device-based neural network adder, a third gated clock module, and a fourth gated clock module. A constraint relation exists between the multiplier and the adder: the adder is allowed to update data only after the multiplier has completed its own data update. This avoids the useless data updates at every clock trigger instant found in the prior art and reduces the power consumption of the programmable device-based neural network convolutional layer structure.
As an optional implementation manner of this embodiment, the neural network convolutional layer structure based on a programmable device further includes: the data storage modules comprise a data register module and a weight register module, and the data storage modules are used for storing data and weights required by the calculation of the convolutional neural network.
As an optional implementation manner of this embodiment, as shown in fig. 7, the data storage module further includes: and the cache memory is connected to the data register module and the weight register module and is used for caching data and weight required by the calculation of the convolutional neural network.
Illustratively, since the storage capacities of the data register module and the weight register module are very small, data must first be buffered in a cache memory with a larger storage capacity before being sent to the data register module and the weight register module. The cache memory in the convolutional layer is optimized with a memory-splitting technique: using a decoder and multiplexers, the whole cache memory is divided into several smaller memories to reduce power consumption. Fig. 7 shows a 512-byte memory block divided into four 128-byte blocks, each controlled by the decoder and a multiplexer. The original cache memory address is Address[8:0]. The upper two bits Address[8:7] are fed to the decoder, whose outputs serve as the enable signals en0–en3 of the four small memory blocks: when Address[8:7] = 2'b00, memory block 0 is enabled (i.e., memory block 0 is used for storage); when Address[8:7] = 2'b01, memory block 1 is enabled; when Address[8:7] = 2'b10, memory block 2 is enabled; and when Address[8:7] = 2'b11, memory block 3 is enabled. The low seven bits Address[6:0] serve as the data address within the small memory block; that is, which small memory block is strobed is determined by the upper two bits Address[8:7] of the original memory block address.
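The decoder described above can be modeled as a simple address split. This Python sketch mirrors the 512-byte cache divided into four 128-byte blocks; the function name `decode_bank` is illustrative:

```python
def decode_bank(address: int):
    """Split a 9-bit address Address[8:0] into (bank index, in-bank offset),
    mirroring the decoder for a 512-byte cache split into 4 x 128-byte blocks."""
    assert 0 <= address < 512           # Address[8:0] covers 512 bytes
    bank = (address >> 7) & 0b11        # Address[8:7] selects en0..en3
    offset = address & 0x7F             # Address[6:0] addresses within the bank
    return bank, offset

assert decode_bank(0) == (0, 0)         # Address[8:7] == 2'b00 -> block 0
assert decode_bank(130) == (1, 2)       # 130 = 0b0_1000_0010 -> block 1, offset 2
assert decode_bank(511) == (3, 127)     # last byte of block 3
```

Only the selected bank is enabled per access, which is the source of the power saving: the other three banks never see the access.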
In most prior-art memory structures, data reads and writes use the whole cache memory. In practice, however, reading and writing a memory block half the size or smaller consumes less power than accessing the whole large block, so this embodiment further reduces the power consumption of the programmable device-based neural network convolutional layer structure through the memory-partitioning technique.
As an optional implementation manner of this embodiment, the cache memory includes 4 memory blocks.
Illustratively, determining the optimal number and size of the memory blocks required preliminary study. In theory, the same cache memory could be divided into a larger number of smaller blocks, such as 8, 16, 32, or 64. In actual measurement, however, when the number of blocks exceeds 4, fragmentation of the cache memory becomes severe, data read/write throughput drops sharply, and convolution performance suffers. With 2 or 4 blocks, the cache performance (data read/write throughput) shows no obvious change, while 4 blocks give lower power consumption; therefore, in this embodiment of the invention, the memory is divided into 4 small blocks without affecting convolution performance. Measurement confirms that, even for the same number of read/write accesses, each operation is executed in only one small memory block rather than a full-size memory, which reduces power consumption.
As an optional implementation of this embodiment, the programmable device-based neural network convolutional layer structure, as shown in fig. 6, further includes:
a fifth gated clock module 405, an enable terminal of which is connected to an output terminal of any adder that performs data update operations, and configured to receive a fifth enable signal, perform a logical and operation according to the fifth clock signal and the fifth enable signal, and generate a fifth gated clock signal according to an operation result;
and an accumulator 406, a clock terminal of which is respectively connected to the output terminal of the fifth gated clock module, for performing a data update operation according to the received fifth gated clock signal.
Illustratively, the accumulator 406 sums the calculation results of the adders in all convolution structures to obtain the final convolution result. The fifth clock signal may be sent by the controller that directs the neural network's convolution operation; the fifth enable signal is sent by any of the programmable device-based neural network adders 404: when an adder 404 completes its data update, it sends a valid fifth enable signal to the fifth gated clock module 405. The accumulator 406 receives the fifth gated clock signal from the fifth gated clock module 405 and performs the corresponding data update operation accordingly.
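The accumulator's gated update can be sketched in the same behavioral style: a result is accumulated only when the producing adder asserts its enable. The class name `Accumulator` is illustrative, not the patent's implementation:

```python
class Accumulator:
    """Behavioral sketch of accumulator 406: accumulates an adder's result
    only when that adder asserts its (fifth) enable signal."""
    def __init__(self):
        self.total = 0

    def update(self, adder_result: int, enable: int):
        # Gated clock behavior: update only on cycles with a valid result.
        if enable:
            self.total += adder_result

acc = Accumulator()
# (result, enable) pairs; 99 arrives with enable low and is ignored.
for result, done in [(10, 1), (99, 0), (5, 1)]:
    acc.update(result, done)
assert acc.total == 15
```

Because invalid adder outputs never clock the accumulator, no switching power is spent on them.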
The embodiment provides a neural network processor based on a programmable device, as shown in fig. 8, including: any of the programmable device based neural network convolutional layer structures as described in the above embodiments.
Illustratively, in addition to any of the programmable device-based neural network convolutional layer structures of the above embodiments, the programmable device-based neural network processor includes a ReLU layer and a pooling layer. The input to the convolutional layer structure comes from an input buffer; the data in the input buffer is convolved, the convolution result is output to the ReLU layer, and the ReLU layer's result is output to the pooling layer. A clock signal is sent from the controller to direct the convolutional layer to perform the convolution operation.
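The convolution → ReLU → pooling dataflow through the processor can be sketched as three chained functions. This is a minimal behavioral sketch in Python; the function names and the 1-D sliding-window convolution are illustrative assumptions, not the patent's implementation:

```python
def conv1d(data, kernel):
    """Valid 1-D sliding-window convolution over the input buffer."""
    k = len(kernel)
    return [sum(d * w for d, w in zip(data[i:i + k], kernel))
            for i in range(len(data) - k + 1)]

def relu(xs):
    """ReLU layer: clamp negative activations to zero."""
    return [max(0, x) for x in xs]

def max_pool(xs, size=2):
    """Non-overlapping max pooling over the ReLU output."""
    return [max(xs[i:i + size]) for i in range(0, len(xs), size)]

# Dataflow: input buffer -> convolutional layer -> ReLU layer -> pooling layer
out = max_pool(relu(conv1d([1, -2, 3, -4, 5], [1, 1])), size=2)
assert out == [1, 1]
```

Each stage consumes the previous stage's output, matching the buffer-to-layer pipeline described above.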
In the neural network processor based on the programmable device provided by this embodiment, because the power consumption of the internal convolutional layer structure is low, the power consumption of the whole neural network processor based on the programmable device is reduced.
The embodiment provides a neural network accelerator based on a programmable device, as shown in fig. 9, including: a number of the programmable device based neural network processors of the above embodiments.
Illustratively, the programmable device-based neural network accelerator includes: a state controller, a DMA controller, an input buffer, an output buffer, and a plurality of the programmable device-based neural network processors of the above embodiments. The state controller is connected to an external central processing unit and controls the plurality of programmable device-based neural network processors to perform convolution operations. The DMA controller is connected to the external central processing unit and to an external memory; it directly accesses data in the external memory and sends convolution results back to the external memory. The input buffer is connected to the state controller and the DMA controller, respectively, and caches the data fed into the neural network processors. The output buffer is connected to the neural network processors and the DMA controller, respectively, and caches the data output by the neural network processors.
The neural network accelerator based on the programmable device provided by the embodiment comprises: the plurality of programmable device based neural network processors in the above embodiments reduce the power consumption of the programmable device based neural network accelerator.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively list all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (10)

1. A programmable device based neural network adder, comprising:
the gated clock modules are used for executing logic AND operation according to the received enable signals and clock signals and generating gated clock signals according to operation results;
the system comprises a plurality of adders, wherein the adders are cascaded according to a pipeline structure to form a multi-stage adder, each stage of adder is respectively connected with the corresponding gated clock modules one by one, and the output end of the previous stage of adder is respectively connected with the input end of the next stage of adder and the enabling end of the gated clock module corresponding to the next stage of adder, so that the gated clock module corresponding to the next stage of adder receives the enabling signal sent by the previous stage of adder through the enabling end and controls the next stage of adder to perform data updating operation according to the data received by the input end through the generated gated clock signal.
2. The programmable device-based neural network adder according to claim 1, wherein each adder includes at least one multi-bit full adder, the multi-bit full adder is formed by serially cascading a plurality of one-bit full adders, each one-bit full adder is connected to a corresponding register one by one for storing a calculation result into the corresponding register, and a clock terminal of each register is connected to an output terminal of the corresponding clock gating module for sending the calculation result according to the received clock gating signal.
3. A programmable device based neural network multiplier, comprising:
the first gated clock module is used for executing logic AND operation according to the received first enable signal and the first clock signal and generating a first gated clock signal according to an operation result;
the reset module is connected with the first gated clock module at a clock end and used for executing reset operation according to the received first gated clock signal;
the enabling end of the second gated clock module is connected with the output end of the reset module and used for receiving a second enabling signal sent by the reset module, executing logic AND operation according to the second enabling signal and the second clock signal and generating a second gated clock signal according to an operation result;
and the clock end of the multiplier is connected with the second gated clock module, and the input end of the multiplier is connected with the output end of the reset module and used for executing data updating operation according to the second gated clock signal received by the clock end and the data received by the input end.
4. A neural network convolutional layer structure based on a programmable device, comprising:
the third gated clock module is used for executing logic AND operation according to the received third enable signal and the third clock signal and generating a third gated clock signal according to an operation result;
a plurality of programmable device-based neural network multipliers as claimed in claim 3, wherein the clock terminal of each of the programmable device-based neural network multipliers is connected to the output terminal of the third clock gating module, and is configured to perform a data update operation on the data received at the input terminal according to the third clock gating signal received at the clock terminal;
a plurality of fourth gate-controlled clock modules, wherein enable ends of the fourth gate-controlled clock modules are respectively connected with output ends of the corresponding programmable device-based neural network multipliers, and are used for receiving a fourth enable signal, executing logical and operation according to the fourth clock signal and the corresponding fourth enable signal, and generating a fourth gate-controlled clock signal according to an operation result;
a plurality of programmable device-based neural network adders as claimed in claim 2, wherein the clock terminals of the programmable device-based neural network adders are respectively connected to the output terminals of the corresponding fourth gated clock modules for performing data update operations according to the received fourth gated clock signals.
5. The programmable device-based neural network convolutional layer structure of claim 4, further comprising:
the data storage modules comprise a data register module and a weight register module, and the data storage modules are used for storing data and weights required by the calculation of the convolutional neural network.
6. The programmable device-based neural network convolutional layer structure of claim 5, wherein the data storage module further comprises: and the cache memory is connected to the data register module and the weight register module and is used for caching data and weight required by the calculation of the convolutional neural network.
7. The programmable device-based neural network convolutional layer structure of claim 6, wherein the cache memory comprises 4 memory blocks.
8. The programmable device-based neural network convolutional layer structure of claim 4, further comprising:
a fifth gated clock module, an enable terminal of which is connected to an output terminal of any adder that performs data update operations, and configured to receive a fifth enable signal, perform a logical and operation according to the fifth clock signal and the fifth enable signal, and generate a fifth gated clock signal according to an operation result;
and the clock end of the accumulator is respectively connected with the output end of the fifth gating clock module and is used for executing data updating operation according to the received fifth gating clock signal.
9. A programmable device based neural network processor, comprising: the programmable device based neural network convolutional layer structure of any of claims 4-8.
10. A programmable device based neural network accelerator, comprising: a plurality of programmable device based neural network processors as claimed in claim 9.
CN202010594416.6A 2020-06-24 2020-06-24 Adder, multiplier, convolution layer structure, processor and accelerator Active CN111753962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010594416.6A CN111753962B (en) 2020-06-24 2020-06-24 Adder, multiplier, convolution layer structure, processor and accelerator

Publications (2)

Publication Number Publication Date
CN111753962A true CN111753962A (en) 2020-10-09
CN111753962B CN111753962B (en) 2023-07-11

Family

ID=72677388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010594416.6A Active CN111753962B (en) 2020-06-24 2020-06-24 Adder, multiplier, convolution layer structure, processor and accelerator

Country Status (1)

Country Link
CN (1) CN111753962B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05289850A (en) * 1992-04-14 1993-11-05 Sumitomo Electric Ind Ltd Multiplier
CN101686041A (en) * 2008-09-27 2010-03-31 深圳市芯海科技有限公司 Gated clock circuit and gated clock signal generation method
CN102487272A (en) * 2010-12-01 2012-06-06 Arm有限公司 Integrated circuit, clock gate control circuit and method
CN104090737A (en) * 2014-07-04 2014-10-08 东南大学 Improved partial parallel architecture multiplying unit and processing method thereof
CN105512724A (en) * 2015-12-01 2016-04-20 中国科学院计算技术研究所 Adder device, data accumulation method, and data processing device
CN106055026A (en) * 2016-07-20 2016-10-26 深圳市博巨兴实业发展有限公司 Real time clock unit in microcontroller SOC (System On Chip)
CN106528046A (en) * 2016-11-02 2017-03-22 上海集成电路研发中心有限公司 Long bit width time sequence accumulation multiplying unit
CN107844826A (en) * 2017-10-30 2018-03-27 中国科学院计算技术研究所 Neural-network processing unit and the processing system comprising the processing unit
CN108133263A (en) * 2016-12-01 2018-06-08 上海兆芯集成电路有限公司 Neural network unit
CN108133267A (en) * 2016-12-01 2018-06-08 上海兆芯集成电路有限公司 With the processor that can be used as most rear class cache tile or the memory array of neural network cell memory operation
CN108171317A (en) * 2017-11-27 2018-06-15 北京时代民芯科技有限公司 A kind of data-reusing convolutional neural networks accelerator based on SOC
CN109828744A (en) * 2019-01-18 2019-05-31 东北师范大学 A kind of configurable floating point vector multiplication IP kernel based on FPGA

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112462845A (en) * 2020-11-25 2021-03-09 海光信息技术股份有限公司 Data transmission clock control circuit, method and processor
CN113642278A (en) * 2021-07-15 2021-11-12 加弘科技咨询(上海)有限公司 Power consumption generation system and method of programmable logic device
CN113642278B (en) * 2021-07-15 2023-12-12 加弘科技咨询(上海)有限公司 Power consumption generation system and method of programmable logic device
CN113271086A (en) * 2021-07-19 2021-08-17 深圳英集芯科技股份有限公司 Clock burr-free switching circuit, chip and electronic equipment

Also Published As

Publication number Publication date
CN111753962B (en) 2023-07-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant