CN111753962B - Adder, multiplier, convolution layer structure, processor and accelerator

Info

Publication number
CN111753962B
Authority
CN
China
Prior art keywords
neural network
gating clock
programmable device
adders
module
Prior art date
Legal status
Active
Application number
CN202010594416.6A
Other languages
Chinese (zh)
Other versions
CN111753962A
Inventor
唐波
贺龙龙
张耀辉
林志杰
Current Assignee
Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Original Assignee
Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Priority date
Filing date
Publication date
Application filed by Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd filed Critical Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Priority to CN202010594416.6A
Publication of CN111753962A
Application granted
Publication of CN111753962B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an adder, a multiplier, a convolution layer structure, a processor and an accelerator. The neural network adder based on a programmable device comprises: gating clock modules, which perform a logical AND operation on the received enable signal and clock signal and generate a gating clock signal from the operation result; and adders cascaded in a pipeline structure to form multiple adder stages, each stage connected one-to-one with a corresponding gating clock module. The output terminal of each upper-stage adder is connected to the input terminal of the lower-stage adder and to the enable terminal of the gating clock module corresponding to the lower-stage adder, so that this gating clock module receives the enable signal sent by the upper-stage adder through its enable terminal and, by means of the generated gating clock signal, controls the lower-stage adder to perform a data update operation on the data received at its input terminal. Implementing the invention reduces energy consumption.

Description

Adder, multiplier, convolution layer structure, processor and accelerator
Technical Field
The invention relates to the technical field of neural networks, in particular to an adder, a multiplier, a convolution layer structure, a processor and an accelerator.
Background
Convolutional neural networks are among the most representative neural networks in the field of deep learning and have achieved a number of breakthroughs in image analysis and processing. In the field of automatic driving, deep learning is widely applied to target recognition, image feature extraction and classification, scene recognition and the like. Because the convolutional neural network algorithms run in a computing-platform domain controller whose power budget is limited by the vehicle power supply, realizing automatic driving with convolutional neural networks faces a conflict in which power consumption and computing power are difficult to reconcile. It is therefore highly desirable to provide a hardware device that lowers power consumption, so as not to affect the endurance of the vehicle, without reducing the computing power or the accuracy of the algorithm.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide an adder, a multiplier, a convolution layer structure, a processor and an accelerator, so as to solve the problem in the prior art that power consumption and computing power are difficult to reconcile.
According to a first aspect, an embodiment of the present invention provides a neural network adder based on a programmable device, including: a plurality of gating clock modules, each configured to perform a logical AND operation on a received enable signal and clock signal and to generate a gating clock signal from the operation result; and a plurality of adders cascaded in a pipeline structure to form multiple adder stages, each stage connected one-to-one with a corresponding gating clock module. The output terminal of an upper-stage adder is connected to the input terminal of the lower-stage adder and to the enable terminal of the gating clock module corresponding to the lower-stage adder, so that this module receives the enable signal sent by the upper-stage adder through its enable terminal and, by means of the generated gating clock signal, controls the lower-stage adder to perform a data update operation on the data received at its input terminal.
Optionally, each adder includes at least one multi-bit full adder, where the multi-bit full adder is formed by serially cascading a plurality of one-bit full adders. Each one-bit full adder is connected one-to-one with a corresponding register and stores its calculation result in that register, and the clock terminal of each register is connected with the output terminal of the corresponding gating clock module, so as to send out the calculation result according to the received gating clock signal.
According to a second aspect, an embodiment of the present invention provides a neural network multiplier based on a programmable device, including: the first gating clock module is used for executing logical AND operation according to the received first enabling signal and the first clock signal and generating a first gating clock signal according to an operation result; the clock end of the reset module is connected with the first gating clock module and is used for executing reset operation according to the received first gating clock signal; the enabling end of the second gating clock module is connected with the output end of the reset module and is used for receiving a second enabling signal sent by the reset module, performing logical AND operation according to the second enabling signal and the second clock signal and generating a second gating clock signal according to an operation result; the clock end of the multiplier is connected with the second gating clock module, and the input end of the multiplier is connected with the output end of the reset module and is used for executing data updating operation according to the second gating clock signal received by the clock end and the data received by the input end.
According to a third aspect, an embodiment of the present invention provides a neural network convolutional layer structure based on a programmable device, including: the third gating clock module is used for executing logical AND operation according to the received third enabling signal and the third clock signal and generating a third gating clock signal according to an operation result; a plurality of programmable device-based neural network multipliers according to the second aspect, wherein each of the clock terminals of the programmable device-based neural network multipliers is connected to an output terminal of the third gating clock module, respectively, and is configured to perform a data update operation on data received by the input terminal according to the third gating clock signal received by the clock terminal; the enabling ends of the fourth gating clock modules are respectively connected with the corresponding output ends of the neural network multipliers based on the programmable devices and are used for receiving fourth enabling signals, performing logical AND operation according to the fourth clock signals and the corresponding fourth enabling signals and generating fourth gating clock signals according to operation results; and a plurality of programmable device-based neural network adders according to the first aspect, wherein clock terminals of the programmable device-based neural network adders are respectively connected with output terminals of the corresponding fourth gating clock modules, and are used for performing data updating operation according to the received fourth gating clock signals.
Optionally, the neural network convolution layer structure based on the programmable device further comprises: the data storage modules comprise a data register module and a weight register module, and are used for storing data and weights required by convolutional neural network calculation.
Optionally, in the neural network convolutional layer structure based on the programmable device, the data storage module further comprises: a cache memory connected to the data register module and the weight register module, for caching the data and weights required by the convolutional neural network calculation.
Optionally, the cache memory contains 4 memory blocks.
Optionally, the neural network convolutional layer structure based on the programmable device further comprises: the enabling end of the fifth gating clock module is connected with the output end of any adder for executing data updating operation and is used for receiving a fifth enabling signal, executing logical AND operation according to the fifth clock signal and the fifth enabling signal and generating a fifth gating clock signal according to an operation result; and the clock end of the accumulator is respectively connected with the output end of the fifth gating clock module and is used for executing data updating operation according to the received fifth gating clock signal.
According to a fourth aspect, an embodiment of the present invention provides a neural network processor based on a programmable device, including: the neural network convolutional layer structure based on a programmable device as described in the third aspect or any implementation manner thereof.
According to a fifth aspect, an embodiment of the present invention provides a neural network accelerator based on a programmable device, including: a plurality of programmable device-based neural network processors as recited in the fourth aspect.
The implementation of the invention has the advantages that:
1. The neural network adder based on the programmable device provided by this embodiment connects a plurality of full adders in a pipelined cascade structure with constraint relations between the stages: the enable signal of the next-stage full adder is controlled by the previous stage, and only when the current stage has finished its data update is the next-stage full adder triggered to update its data. This avoids the useless data update at every clock trigger moment in the prior art and reduces energy consumption.
2. The neural network multiplier based on the programmable device provided by this embodiment is formed by connecting a reset module, a multiplier, a first gating clock module and a second gating clock module. A constraint relation exists between the reset module and the multiplier: the second gating clock signal of the multiplier is controlled by the output of the reset module, so the multiplier is triggered to update its data only when the reset module has completed the reset. This prevents the useless data update at every clock trigger moment in the prior art and reduces the power consumption of the multiplier.
3. The neural network convolution layer structure based on the programmable device provided by this embodiment is formed by connecting the programmable-device-based neural network multipliers, the programmable-device-based neural network adders, the third gating clock module and the fourth gating clock modules. A constraint relation exists between the multipliers and the adders: only when a programmable-device-based neural network multiplier completes its data update is the corresponding programmable-device-based neural network adder triggered to update its data. This prevents the useless data update at every clock trigger moment in the prior art and reduces the power consumption of the convolution layer structure.
4. In the prior art, most memory block structures use the whole cache memory to complete every data read and write, but in fact reading and writing in half or fewer of the memory blocks consumes less power than operating on the whole large memory. The memory splitting adopted in this embodiment therefore further reduces power consumption.
5. The power consumption of the neural network processor based on the programmable device is reduced due to the low power consumption of the internal convolution layer structure.
6. The neural network accelerator based on the programmable device provided in this embodiment includes a plurality of the programmable-device-based neural network processors of the above embodiments, which reduces the power consumption of the accelerator.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 shows a block diagram of a programmable device based neural network adder in an embodiment of the invention;
FIG. 2 illustrates a flow chart of the operation of a programmable device based neural network adder according to an embodiment of the present invention;
FIG. 3 illustrates a block diagram of a multi-bit full adder of an embodiment of the present invention;
FIG. 4 illustrates an internal connection block diagram of a programmable device-based neural network adder according to an embodiment of the invention;
FIG. 5 shows a block diagram of a programmable device-based neural network multiplier according to an embodiment of the present invention;
FIG. 6 shows a block diagram of a neural network convolutional layer structure based on a programmable device in accordance with an embodiment of the present invention;
FIG. 7 is a diagram illustrating cache memory partitioning according to an embodiment of the present invention;
FIG. 8 shows a schematic diagram of a programmable device-based neural network processor, according to an embodiment of the invention;
fig. 9 shows a schematic structural diagram of a neural network accelerator based on a programmable device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments, which can be obtained by those skilled in the art based on the embodiments of the invention without making any inventive effort, fall within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; the two components can be directly connected or indirectly connected through an intermediate medium, or can be communicated inside the two components, or can be connected wirelessly or in a wired way. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The present embodiment provides a neural network adder based on a programmable device, as shown in fig. 1, including: a plurality of gating clock modules 101, configured to perform a logical AND operation on the received enable signal and clock signal and to generate a gating clock signal from the operation result;
the adders 102 are cascaded according to a pipeline structure to form a plurality of levels of adders, each level of adder 102 is respectively connected with a corresponding gating clock module 101 one by one, the output end of the upper level of adder 102 is respectively connected with the input end of the lower level of adder 102 and the enabling end of the gating clock module 101 corresponding to the lower level of adder, so that the gating clock module 101 corresponding to the lower level of adder 102 receives an enabling signal sent by the upper level of adder through the enabling end, and the lower level of adder 102 is controlled to execute data updating operation according to data received by the input end through the generated gating clock signal.
Illustratively, the gating clock module 101 may be composed of a latch and a logical AND gate: the enable signal is the input signal of the latch, the clock signal is connected to the clock terminal of the latch and also serves as one input of the AND gate, and the gating clock signal is the result of the logical AND of the clock signal and the latch output. That is, the gating clock signal is a valid signal when the clock signal is at a rising edge while the enable signal is at a high level. The enable signal can be triggered by other devices connected with the programmable-device-based neural network adder of this embodiment, such as multipliers and adders; the clock signal may be sent by a controller, such as a CPU, that controls the operations performed by the adder of this embodiment.
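As a rough behavioral sketch only (Python, with invented names; not the patent's circuit), the latch-plus-AND-gate arrangement can be modeled as follows: the latch is transparent while the clock is low, so the enable value sampled before a rising edge decides whether the gated clock pulses.

    class ClockGate:
        """Behavioral model of a latch-based clock gating cell."""
        def __init__(self):
            self.latched_enable = False  # output of the level-sensitive latch

        def tick(self, clk: bool, enable: bool) -> bool:
            if not clk:                          # latch is transparent while clk is low
                self.latched_enable = enable
            return clk and self.latched_enable   # AND gate: gated clock output

    gate = ClockGate()
    assert gate.tick(clk=False, enable=True) is False  # latch captures the enable
    assert gate.tick(clk=True, enable=False) is True   # gated clock pulses high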
In this embodiment, taking the addition of 16 ways of 32-bit data as an example, the adder operation flow is shown in fig. 2. The 16 inputs IN0, IN1 … IN15 of Adder0 in fig. 2 correspond to the 16 inputs IN0, IN1 … IN15 of Adder0 in fig. 1, and 8 32-bit full adders are integrated in Adder0. In total, 8 32-bit full adders, 4 33-bit full adders, 2 34-bit full adders and 1 35-bit full adder are integrated in the four stages of adders 102 (Adder0, Adder1, Adder2 and Adder3) in fig. 1. The number of adders, the number of full adders integrated in each adder, and the number of processing bits are not limited in this embodiment and can be determined by those skilled in the art as needed.
The adders 102 are cascaded in a pipeline structure to form a multi-stage adder. The output terminal of the upper-stage adder 102 is connected with the enable terminal of the gating clock module 101 corresponding to the lower-stage adder 102, and that gating clock module generates a gating clock signal from the enable signal and the clock signal; the gating clock signal controls the registers in the corresponding full adders to update their data. Taking the 4 adders shown in fig. 1 as an example, the adders Adder0, Adder1, Adder2 and Adder3 are each gated by their own gating clock module. When Adder0 completes its data update, it outputs its result and at the same time inputs an enable signal to the gating clock module connected with Adder1, which controls Adder1 to update its data according to the enable signal and the clock signal; the working processes of the remaining adders Adder2 and Adder3 follow by analogy.
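The stage-by-stage enable chaining can likewise be sketched behaviorally. The following Python model (structure and names are illustrative assumptions, not the patent's RTL) shows how each stage updates only when the previous stage holds valid data, so idle stages keep their registers unchanged.

    def adder_tree_step(regs, inputs=None):
        """regs[k] holds the registered outputs of Adder k (8, 4, 2, 1 values)."""
        for k in range(3, 0, -1):            # update later stages first
            src = regs[k - 1]
            if src is not None:              # enable signal from the previous stage
                regs[k] = [src[2*i] + src[2*i + 1] for i in range(len(src) // 2)]
            # else: the gated clock stays silent and stage k keeps its old data
        if inputs is not None:               # Adder0 consumes the 16-way input
            regs[0] = [inputs[2*i] + inputs[2*i + 1] for i in range(8)]
        return regs

    regs = [None, None, None, None]
    regs = adder_tree_step(regs, list(range(16)))  # feed one 16-way input vector
    for _ in range(3):                             # let the pipeline drain
        regs = adder_tree_step(regs)
    print(regs[3])  # [120] == sum(range(16)) once the pipeline has filled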
The neural network adder based on the programmable device provided by this embodiment connects a plurality of full adders in a pipelined cascade structure with constraint relations between the stages: the enable signal of the next-stage full adder is controlled by the previous stage, and only when the current stage has finished its data update is the next-stage full adder triggered to update its data. This avoids the useless data update at every clock trigger moment in the prior art and reduces energy consumption.
As an alternative implementation manner of this embodiment, as shown in fig. 3, each adder includes at least one multi-bit full adder, where the multi-bit full adder is formed by serially cascading a plurality of one-bit full adders 201. Each one-bit full adder 201 is connected one-to-one with a corresponding register 202 and stores its calculation result in that register 202, and the clock terminal of each register 202 is connected to the output terminal of the corresponding gating clock module, so as to send out the calculation result according to the received gating clock signal.
Illustratively, this embodiment takes 36 one-bit full adders 201 forming a 36-bit full adder as an example, and assumes that the inputs of the programmable-device-based neural network adder of this embodiment are A[35:0] and B[35:0], i.e. A and B need to be added to obtain the addition result. Then FA0, FA1 … FA35 in fig. 3 represent the 36 one-bit full adders 201, each calculating one of the 36 bit positions of A and B. C represents the carry of a two-bit addition; for example, C0 represents the carry of A0+B0 and passes the carry information to the next higher one-bit full adder 201. S represents a two-bit sum; for example, S0 represents the sum of A0 and B0, and the addition result is stored in the register 202, e.g. D0 stores S0, where the register 202 may be a D flip-flop. In this embodiment, D0 to D35 store S0 to S35, respectively, and D36 stores the carry information C35 of the most significant bit.
The data stored in the registers D0 to D36 can be controlled to be updated only when the clock signal and the enable signal satisfy the update condition at the same time, so as to output the data stored this time and store the result of the next addition operation. For example, the data in registers D0-D36 outputs the stored operation result if and only if the clock signal is on a rising edge and the enable signal is at a high level, and stores a new operation result, updating the stored data.
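For illustration, here is a minimal Python model of the ripple-carry behavior described above (assuming unsigned 36-bit operands; the function names are invented):

    def full_adder(a, b, cin):
        s = a ^ b ^ cin                          # sum bit S
        cout = (a & b) | (a & cin) | (b & cin)   # carry bit C
        return s, cout

    def ripple_add_36(a, b):
        """Returns the 37 register bits D0..D36 for A[35:0] + B[35:0]."""
        regs, carry = [], 0
        for i in range(36):                      # FA0 .. FA35 in series
            s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
            regs.append(s)                       # D_i stores S_i
        regs.append(carry)                       # D36 stores the final carry C35
        return regs

    bits = ripple_add_36(0xABCDEF123, 0x123456789)
    assert sum(bit << i for i, bit in enumerate(bits)) == 0xABCDEF123 + 0x123456789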
The multi-bit full adder integrated in adder 102 is connected to gating clock module 101 in a manner as shown in fig. 4, and updating of the register stored data in each multi-bit full adder in the same hierarchy is uniformly controlled by one gating clock module 101. When the clock signal is in a rising edge state and the enabling signal is in a high level, the gating clock signal is an effective signal, and meanwhile, each full adder is triggered to control a register in the multi-bit full adder to update data.
In the neural network adder based on the programmable device of this embodiment, a gating clock module is arranged for the adder and data updates are controlled through it. With the gating clock module added, the result output of the adder no longer depends on the clock signal alone but is also controlled by the enable signal: the adder outputs its result only when the enable signal and the clock signal satisfy the conditions at the same time, and the registers no longer perform an unnecessary content update at every rising clock edge as a traditional register does, thereby reducing the unnecessary energy consumption of the adder.
The present embodiment provides a neural network multiplier based on a programmable device, as shown in fig. 5, including:
the first gating clock module 301 is configured to perform a logical and operation according to the received first enable signal and the first clock signal and generate a first gating clock signal according to an operation result;
the reset module 302, the clock end of the reset module is connected with the first gating clock module, and is used for executing the reset operation according to the received first gating clock signal;
the second gating clock module 303, the enabling end of the second gating clock module is connected with the output end of the reset module, and is used for receiving the second enabling signal sent by the reset module, performing logical AND operation according to the second enabling signal and the second clock signal, and generating a second gating clock signal according to the operation result;
and the clock end of the multiplier 304 is connected with the second gating clock module, and the input end of the multiplier is connected with the output end of the reset module and is used for executing data updating operation according to the second gating clock signal received by the clock end and the data received by the input end.
Illustratively, the enable signal of the first gating clock module 301 may be triggered by other devices connected to the reset module 302, which may be multipliers, adders, memories, etc.; the clock signal may be sent by a controller, such as a CPU, that controls the operations of the various devices. The enabling end of the second gating clock module 303 is connected to the output end of the reset module 302, and when the reset module 302 completes the reset operation, the enabling end of the second gating clock module 303 receives the enabling signal, and the second gating clock module 303 generates a second gating clock signal according to the enabling signal and the clock signal, so as to control the multiplier 304 to update data. In this embodiment, the multiplier 304 stores the operation result by using a 32×16 bit register, and the clock end of the register is connected to the gating clock module, so as to reduce the total dynamic power consumption of the register.
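A minimal behavioral sketch of this reset-then-multiply enable chain (Python, with illustrative names; the real design gates register clocks rather than calling functions):

    class GatedMultiplier:
        """Reset -> enable -> multiply chain; one update per valid gated clock."""
        def __init__(self):
            self.reset_done = False    # second enable, driven by the reset module
            self.product = None        # models the 32x16-bit result register

        def cycle(self, en1, a, b):
            if self.reset_done:        # second gated clock fires
                self.product = a * b   # data update operation
            if en1:                    # first gated clock fires
                self.reset_done = True # reset completes, raising the second enable
            return self.product

    m = GatedMultiplier()
    m.cycle(True, 3, 5)                # reset module runs; no multiply yet
    assert m.cycle(False, 3, 5) == 15  # multiplier updates one cycle later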
The neural network multiplier based on the programmable device provided by this embodiment is formed by connecting a reset module, a multiplier, a first gating clock module and a second gating clock module. A constraint relation exists between the reset module and the multiplier: the second gating clock signal of the multiplier is controlled by the output of the reset module, so the multiplier is triggered to update its data only when the reset module has completed the reset. This prevents the useless data update at every clock trigger moment in the prior art and reduces the power consumption of the multiplier.
The present embodiment provides a neural network convolutional layer structure based on a programmable device, as shown in fig. 6, including:
the third gating clock module 401 is configured to perform a logical and operation according to the received third enable signal and the third clock signal and generate a third gating clock signal according to an operation result;
the clock end of each programmable device-based neural network multiplier 402 in the above embodiments is connected to the output end of the third gating clock module, and is configured to perform a data update operation on data received by the input end according to the third gating clock signal received by the clock end;
the enabling ends of the fourth gating clock modules 403 are respectively connected with the output ends of the corresponding neural network multipliers based on the programmable devices and are used for receiving fourth enabling signals, performing logical AND operation according to the fourth clock signals and the corresponding fourth enabling signals and generating fourth gating clock signals according to operation results;
the plurality of programmable device-based neural network adders 404 according to the above embodiments are respectively connected to the output terminals of the corresponding fourth gating clock modules at clock terminals of the programmable device-based neural network adders, and are configured to perform a data update operation according to the received fourth gating clock signal.
For example, the clock signals of the third gating clock module and the fourth gating clock module may be sent by a controller that controls the neural network to perform convolution operation, and the third enable signal may be sent by a previous device connected to the programmable device-based neural network multiplier 402, for example, a data memory is connected before the programmable device-based neural network multiplier 402, and then after the data memory outputs data, the third enable signal is sent to the third gating clock module 401. When the third clock signal received by the third gating clock module 401 is at a rising edge and the third enabling signal is at a high level, the third gating clock module 401 sends out the third gating clock signal to control all the neural network multipliers 402 connected with the third gating clock module and based on the programmable devices to update data simultaneously.
The enabling end of the fourth gating clock module 403 is connected to the output end of the neural network multiplier 402 based on the programmable device, and when the neural network multiplier 402 based on the programmable device completes data updating, an effective fourth enabling signal is sent to the fourth gating clock module 403, and the fourth gating clock module 403 sends a fourth gating clock signal according to the fourth enabling signal and the fourth clock signal, so that the neural network adder 404 based on the programmable device is activated to update data.
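The multiplier-to-adder chaining just described can be sketched as follows (Python; an assumed simplification in which one shared enable fires all multipliers and the adders consume the products one cycle later):

    def conv_layer_cycle(state, en3, operand_pairs):
        """One clock of the multiplier/adder chain; state carries the registers."""
        if state["products"] is not None:          # fourth enables: multipliers done
            p = state["products"]
            state["sums"] = [p[2*i] + p[2*i + 1] for i in range(len(p) // 2)]
        if en3:                                    # third gated clock fires
            state["products"] = [a * b for a, b in operand_pairs]
        return state

    state = {"products": None, "sums": None}
    state = conv_layer_cycle(state, True, [(1, 2), (3, 4), (5, 6), (7, 8)])
    state = conv_layer_cycle(state, False, [])     # adders fire a cycle later
    print(state["sums"])  # [14, 86]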
The neural network convolution layer structure based on the programmable device provided by this embodiment is formed by connecting the programmable-device-based neural network multipliers, the programmable-device-based neural network adders, the third gating clock module and the fourth gating clock modules. A constraint relation exists between the multipliers and the adders: only when a programmable-device-based neural network multiplier completes its data update is the corresponding programmable-device-based neural network adder triggered to update its data. This prevents the useless data update at every clock trigger moment in the prior art and reduces the power consumption of the convolution layer structure.
As an alternative implementation manner of this embodiment, the neural network convolutional layer structure based on the programmable device further includes: the data storage modules comprise a data register module and a weight register module, and are used for storing data and weights required by convolutional neural network calculation.
As an alternative implementation of this embodiment, as shown in fig. 7, the data storage module further includes: a cache memory, connected to the data register module and the weight register module, for caching the data and weights required by the convolutional neural network calculation.
Illustratively, since the data register module and the weight register module have very small storage capacities, data must first be cached in a cache memory with a larger capacity before being input into them. The cache memory in the convolution layer is optimized by a memory splitting technique: using a decoder and multiplexers, the overall cache memory is divided into several smaller memories to reduce power consumption. Fig. 7 shows the division of a 512-byte memory block into 4 blocks of 128 bytes each, with each 128-byte memory block controlled through a decoder and a multiplexer. Of the original cache memory address Address[8:0], the upper two bits Address[8:7] are decoded by the decoder, and the decoded outputs serve as the enable signals en0 to en3 of the four small memory blocks: when Address[8:7] = 2'b00, memory block 0 is enabled, i.e. storage uses memory block 0; when Address[8:7] = 2'b01, memory block 1 is enabled; when Address[8:7] = 2'b10, memory block 2 is enabled; and when Address[8:7] = 2'b11, memory block 3 is enabled. The low 7 bits Address[6:0] serve as the data address within the small memory block, i.e. which small memory block is strobed is determined by the upper two bits Address[8:7] of the original memory block address.
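A sketch of this address split (Python; the bank and function names are illustrative, and the byte-wide data path is an assumption):

    BANKS = [bytearray(128) for _ in range(4)]   # four 128-byte memory blocks

    def split_address(addr):
        assert 0 <= addr < 512
        bank = (addr >> 7) & 0b11    # Address[8:7] -> decoder -> en0..en3
        offset = addr & 0x7F         # Address[6:0] -> address inside the bank
        return bank, offset

    def mem_write(addr, value):
        bank, offset = split_address(addr)
        BANKS[bank][offset] = value  # only the selected bank is enabled

    def mem_read(addr):
        bank, offset = split_address(addr)
        return BANKS[bank][offset]

    mem_write(0x1A5, 42)             # Address[8:7] == 2'b11 -> memory block 3
    assert mem_read(0x1A5) == 42 and BANKS[3][0x25] == 42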
In the prior art, most memory block structures use the whole cache memory to complete every data read and write, but in fact reading and writing in half or fewer of the memory blocks consumes less power than operating on the whole large memory; this embodiment therefore further reduces the power consumption of the programmable-device-based neural network convolution layer structure through the memory splitting technique.
As an alternative implementation manner of this embodiment, the cache memory includes 4 memory blocks.
Illustratively, determining the optimal number and size of memory blocks required a number of preliminary studies. In theory, the same cache memory can be divided into more, smaller blocks, such as 8, 16, 32 or 64 blocks. In actual measurement, however, when the number of blocks exceeds 4, the cache memory becomes severely fragmented, the data read/write throughput drops sharply, and the convolution operation performance suffers. When the cache memory is divided into 2 or 4 blocks, the performance, i.e. the data read/write throughput, shows no obvious change, and the 4-block division has the lower power consumption. Therefore, in the embodiment of the invention, the memory is divided into 4 small blocks without affecting the convolution operation performance. Actual measurement verified that, for the same number of read/write accesses, each operation is performed in only one of the small memory blocks rather than in the full-size memory, thereby reducing power consumption.
As an alternative implementation of this embodiment, as shown in fig. 6, the neural network convolutional layer structure based on the programmable device further includes:
the fifth gating clock module 405, an enable end of which is connected to an output end of any adder that performs a data update operation, and is configured to receive a fifth enable signal, perform a logical and operation according to the fifth clock signal and the fifth enable signal, and generate a fifth gating clock signal according to an operation result;
and the clock end of the accumulator 406 is respectively connected with the output end of the fifth gating clock module and is used for executing data updating operation according to the received fifth gating clock signal.
Illustratively, the accumulator 406 is configured to accumulate the calculation results of the adders in all convolution structures to obtain the final convolution result. The fifth clock signal may be sent by the controller that controls the neural network to perform the convolution operation, and the fifth enable signal is sent by any programmable-device-based neural network adder 404: after the adder 404 completes a data update, it sends a valid fifth enable signal to the fifth gating clock module 405. The accumulator 406 receives the fifth gating clock signal sent by the fifth gating clock module 405 and performs the corresponding data update operation according to it.
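A short behavioral sketch of the accumulator's gated update (Python, illustrative names):

    class GatedAccumulator:
        """Accumulates only when an adder signals that a new result is ready."""
        def __init__(self):
            self.total = 0

        def cycle(self, adder_done, adder_out):
            if adder_done:               # fifth gated clock fires
                self.total += adder_out  # data update operation
            return self.total

    acc = GatedAccumulator()
    for result in [14, 86]:
        acc.cycle(True, result)
    assert acc.total == 100              # final accumulated convolution result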
The present embodiment provides a neural network processor based on a programmable device, as shown in fig. 8, including: a neural network convolutional layer structure based on a programmable device as in any of the above embodiments.
Illustratively, the programmable device-based neural network processor includes, in addition to the programmable device-based neural network convolutional layer structure of any of the above embodiments, a ReLU layer and a pooling layer. The input of the convolutional layer structure comes from an input buffer; the buffered data is convolved, the convolution result is output to the ReLU layer, and the ReLU result is output to the pooling layer. The clock signal is sent by a controller, which controls the convolution layer to execute the convolution operation.
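As an illustration of the convolution, ReLU and pooling flow (Python; the 1-D data, kernel and 2-point max pooling are invented toy values, not the patent's configuration):

    def conv1d(data, weights):
        k = len(weights)
        return [sum(data[i + j] * weights[j] for j in range(k))
                for i in range(len(data) - k + 1)]

    def relu(xs):
        return [max(0, x) for x in xs]

    def max_pool(xs, window=2):
        return [max(xs[i:i + window]) for i in range(0, len(xs), window)]

    # convolution result -> ReLU layer -> pooling layer
    feature = max_pool(relu(conv1d([1, -2, 3, -4, 5, -6], [1, 0, -1])))
    print(feature)  # [2, 2]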
The power consumption of the neural network processor based on the programmable device is reduced due to the low power consumption of the internal convolution layer structure.
The present embodiment provides a neural network accelerator based on a programmable device, as shown in fig. 9, including: the programmable device-based neural network processor of the plurality of above embodiments.
Illustratively, a programmable device-based neural network accelerator includes: a state controller, a DMA controller, an input buffer, an output buffer, and a plurality of the programmable device-based neural network processors of the above embodiments. The state controller is connected with the external central processing unit and controls the programmable device-based neural network processors to perform convolution operations; the DMA controller is connected with the external central processing unit and the external memory, directly accesses data in the external memory, and sends the convolution calculation results to the external memory; the input buffer is connected with the state controller and the DMA controller, respectively, and caches the data input into the programmable device-based neural network processors; and the output buffer is connected with the programmable device-based neural network processors and the DMA controller, respectively, and caches the data output by the processors.
The neural network accelerator based on the programmable device provided in this embodiment includes a plurality of the programmable device-based neural network processors of the above embodiments, which reduces the power consumption of the accelerator.
It is apparent that the above embodiments are merely examples given for clarity of illustration and are not limiting. Other variations or modifications in different forms can be made by those of ordinary skill in the art on the basis of the above description; it is neither necessary nor possible to exhaust all embodiments here. Obvious variations or modifications derived therefrom remain within the protection scope of the invention.

Claims (9)

1. A programmable device-based neural network adder, comprising:
the gating clock modules are used for performing logical AND operation according to the received enabling signals and clock signals and generating gating clock signals according to operation results;
the adders are cascaded according to a pipeline structure to form multi-stage adders, each stage of adders are respectively connected with corresponding gating clock modules one by one, the output end of each upper stage of adders is respectively connected with the input end of each lower stage of adders and the enabling end of the corresponding gating clock module of each lower stage of adders, so that the corresponding gating clock module of each lower stage of adders receives enabling signals sent by the corresponding upper stage of adders through the enabling end, and the corresponding lower stage of adders are controlled by the generated gating clock signals to execute data updating operation according to data received by the input ends;
each adder comprises at least one multi-bit full adder, the multi-bit full adder is formed by serial cascading of a plurality of one-bit full adders, each one-bit full adder is connected with a corresponding register one by one and used for storing a calculation result to the corresponding register, and the clock end of each register is connected with the output end of the corresponding gating clock module and used for sending the calculation result according to the received gating clock signal.
2. A programmable device-based neural network multiplier, comprising:
the first gating clock module is used for executing logical AND operation according to the received first enabling signal and the first clock signal and generating a first gating clock signal according to an operation result;
the clock end of the reset module is connected with the first gating clock module and is used for executing reset operation according to the received first gating clock signal;
the enabling end of the second gating clock module is connected with the output end of the reset module and is used for receiving a second enabling signal sent by the reset module, performing logical AND operation according to the second enabling signal and the second clock signal and generating a second gating clock signal according to an operation result;
the clock end of the multiplier is connected with the second gating clock module, and the input end of the multiplier is connected with the output end of the reset module and is used for executing data updating operation according to the second gating clock signal received by the clock end and the data received by the input end.
3. A neural network convolutional layer structure based on a programmable device, comprising:
the third gating clock module is used for executing logical AND operation according to the received third enabling signal and the third clock signal and generating a third gating clock signal according to an operation result;
the programmable device-based neural network multiplier of claim 2, wherein each of the programmable device-based neural network multipliers has a clock terminal connected to an output terminal of the third gating clock module, respectively, for performing a data update operation on data received at the input terminal according to the third gating clock signal received at the clock terminal;
the enabling ends of the fourth gating clock modules are respectively connected with the corresponding output ends of the neural network multipliers based on the programmable devices and are used for receiving fourth enabling signals, performing logical AND operation according to the fourth clock signals and the corresponding fourth enabling signals and generating fourth gating clock signals according to operation results;
the programmable device-based neural network adder of claim 1, wherein the clock terminals of the programmable device-based neural network adder are respectively connected with the output terminals of the corresponding fourth gating clock module, and are used for performing data updating operation according to the received fourth gating clock signal.
4. The programmable device-based neural network convolutional layer structure of claim 3, further comprising:
the data storage modules comprise a data register module and a weight register module, and are used for storing data and weights required by convolutional neural network calculation.
5. The programmable device-based neural network convolutional layer structure of claim 4, wherein the data storage module further comprises: and the cache memory is connected to the data register module and the weight register module and is used for caching data and weights required by calculation of the convolutional neural network.
6. The programmable device based neural network convolutional layer structure of claim 5, wherein the cache memory comprises 4 memory blocks.
7. The programmable device-based neural network convolutional layer structure of claim 3, further comprising:
the enabling end of the fifth gating clock module is connected with the output end of any adder for executing data updating operation and is used for receiving a fifth enabling signal, executing logical AND operation according to the fifth clock signal and the fifth enabling signal and generating a fifth gating clock signal according to an operation result;
and the clock end of the accumulator is respectively connected with the output end of the fifth gating clock module and is used for executing data updating operation according to the received fifth gating clock signal.
8. A programmable device-based neural network processor, comprising: a programmable device based neural network convolutional layer structure as defined in any one of claims 3-7.
9. A programmable device-based neural network accelerator, comprising: a plurality of programmable device-based neural network processors as recited in claim 8.
CN202010594416.6A 2020-06-24 2020-06-24 Adder, multiplier, convolution layer structure, processor and accelerator Active CN111753962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010594416.6A CN111753962B (en) 2020-06-24 2020-06-24 Adder, multiplier, convolution layer structure, processor and accelerator

Publications (2)

Publication Number Publication Date
CN111753962A 2020-10-09
CN111753962B 2023-07-11

Family

ID=72677388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010594416.6A Active CN111753962B (en) 2020-06-24 2020-06-24 Adder, multiplier, convolution layer structure, processor and accelerator

Country Status (1)

Country Link
CN (1) CN111753962B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112462845A (en) * 2020-11-25 2021-03-09 海光信息技术股份有限公司 Data transmission clock control circuit, method and processor
CN113642278B (en) * 2021-07-15 2023-12-12 加弘科技咨询(上海)有限公司 Power consumption generation system and method of programmable logic device
CN113271086A (en) * 2021-07-19 2021-08-17 深圳英集芯科技股份有限公司 Clock burr-free switching circuit, chip and electronic equipment

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05289850A (en) * 1992-04-14 1993-11-05 Sumitomo Electric Ind Ltd Multiplier
CN101686041A (en) * 2008-09-27 2010-03-31 深圳市芯海科技有限公司 Gated clock circuit and gated clock signal generation method
CN102487272A (en) * 2010-12-01 2012-06-06 Arm有限公司 Integrated circuit, clock gate control circuit and method
CN104090737A (en) * 2014-07-04 2014-10-08 东南大学 Improved partial parallel architecture multiplying unit and processing method thereof
CN105512724A (en) * 2015-12-01 2016-04-20 中国科学院计算技术研究所 Adder device, data accumulation method, and data processing device
CN106055026A (en) * 2016-07-20 2016-10-26 深圳市博巨兴实业发展有限公司 Real time clock unit in microcontroller SOC (System On Chip)
CN106528046A (en) * 2016-11-02 2017-03-22 上海集成电路研发中心有限公司 Long bit width time sequence accumulation multiplying unit
CN108133267A (en) * 2016-12-01 2018-06-08 上海兆芯集成电路有限公司 With the processor that can be used as most rear class cache tile or the memory array of neural network cell memory operation
CN108133263A (en) * 2016-12-01 2018-06-08 上海兆芯集成电路有限公司 Neural network unit
CN107844826A (en) * 2017-10-30 2018-03-27 中国科学院计算技术研究所 Neural-network processing unit and the processing system comprising the processing unit
CN108171317A (en) * 2017-11-27 2018-06-15 北京时代民芯科技有限公司 A kind of data-reusing convolutional neural networks accelerator based on SOC
CN109828744A (en) * 2019-01-18 2019-05-31 东北师范大学 A kind of configurable floating point vector multiplication IP kernel based on FPGA

Also Published As

Publication number Publication date
CN111753962A (en) 2020-10-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant