CN115033204A - High-energy-efficiency approximate multiplier with reconfigurable precision and bit width - Google Patents


Info

Publication number
CN115033204A
Authority
CN
China
Prior art keywords
approximate
multiplier
bits
precision
stage
Legal status
Pending
Application number
CN202210564410.3A
Other languages
Chinese (zh)
Inventor
刘波
张人元
沈桥
王学涛
徐子航
蔡浩
杨军
Current Assignee
Southeast University
Original Assignee
Southeast University
Application filed by Southeast University
Priority to CN202210564410.3A
Publication of CN115033204A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an energy-efficient approximate multiplier with reconfigurable precision and bit width. An n × n low-order approximate multiplier is generated by Cartesian genetic programming with a penalty coefficient introduced into the fitness function, and it is combined with an n × n exact multiplier and spliced into a 2n × 2n high-order approximate multiplier. An approximate line is then introduced in the approximate accumulation stage to make the precision reconfigurable, and the errors of the approximate multiplication and the approximate addition are made to compensate each other to further improve accuracy. The circuit comprises a first-stage approximate multiplication circuit built from the approximate multiplier generated by Cartesian genetic programming and a second-stage addition circuit based on a lower-part OR-gate adder with an approximate line, providing energy-efficient, high-precision, low-delay multiplication with reconfigurable bit width for neural network applications.

Description

High-energy-efficiency approximate multiplier with reconfigurable precision and bit width
Technical Field
The invention relates to dynamically configurable approximate multiplication, and in particular to an energy-efficient approximate multiplier with reconfigurable precision and bit width and a method for implementing it.
Background
In recent years, with the spread of low-power, AI-based Internet of Things systems and wearable devices, the amount of data that hardware must process on limited battery power has risen sharply, and the design and optimization of energy-efficient computing units has received increasing attention.
Neural networks, among the most representative algorithms in fields such as speech recognition and image segmentation, involve large numbers of complex, parallel multiplications and consume substantial computing resources. Because neural networks are inherently error-tolerant, approximate computing, which sacrifices some precision to lower energy consumption and raise efficiency while still meeting accuracy requirements, has become a popular way to deploy them. Designing energy-efficient, reconfigurable approximate multipliers for the differing requirements of neural network applications is therefore an active research topic in approximate computing.
Besides the widely used manual design approach of simplifying Boolean expressions, approximate circuits can also be generated automatically with Cartesian genetic programming, and extensive experiments show that, in actual neural network applications, approximate multipliers generated this way outperform manually designed ones. However, when this approach is used to generate high-order multipliers, the errors it introduces tend to become hard to control. In addition, a multiplier contains a large number of additions, yet few previous studies have combined Cartesian genetic programming with approximate adders. The invention therefore provides an energy-efficient approximate multiplier with reconfigurable precision and bit width that combines an approximate multiplier generated automatically by Cartesian genetic programming with a lower-part OR-gate approximate adder, achieves reconfigurable precision through mutual error compensation between the two parts, achieves reconfigurable bit width by splicing low-order multipliers, and offers an energy-efficient solution for the multiplications in convolutional neural networks.
Disclosure of Invention
Technical problem: the invention provides an energy-efficient approximate multiplier with reconfigurable precision and bit width, and a method for implementing it, which complete the computation with low power consumption and high speed at the cost of limited precision, thereby optimizing the large number of multiplications in convolution operations.
Technical solution: the disclosed energy-efficient approximate multiplier with reconfigurable precision and bit width comprises an approximate multiplier generation module based on Cartesian genetic programming, a first-stage approximate multiplication circuit based on mixed-precision multipliers, and a second-stage approximate addition circuit based on a lower-part OR-gate adder with a configurable approximate line;
the approximate multiplier generation module based on Cartesian genetic programming generates the approximate multiplier used to build the high-order multiplier: specific exact multiplier structures form the initial population, the best individuals are selected according to a fitness function, the next generation is produced from them by point mutation, and evaluation, selection and mutation are repeated until the maximum number of iterations is reached or the target condition is met, yielding an n × n approximate multiplier;
the first-stage approximate multiplication circuit generates the partial-term products of the inputs: each of the two 2n-bit unsigned binary operands is split into a low n-bit part and a high n-bit part, the high × high product is computed with an n × n exact multiplier, the low × low and low × high products are computed with the n × n approximate multiplier, and the four resulting 2n-bit partial-term products are output in sequence to the second-stage approximate addition circuit;
the second-stage approximate addition circuit approximately adds the partial-term products from the preceding stage: the four products are first arranged from low order to high order and an approximate line is set; the bits below the approximate line are added with OR gates and their carries are discarded, while the bits above the line are added with exact full adders. The accumulated result of the second stage is the final output of the designed 2n × 2n unsigned approximate multiplier.
Wherein:
in the Cartesian genetic programming, different exact multipliers are used as initial chromosomes to generate the population, and the fitness function jointly optimizes the area, delay and accuracy of the multipliers to balance energy consumption and performance.
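The evaluate, select and mutate loop of the generation module can be sketched as a (mu + lambda) evolution strategy with point mutation over integer genomes. This is a minimal illustration under assumptions made here, not the patent's implementation: the genome encoding, `gene_ranges`, `mu`, `lam`, `rate` and the toy fitness in the usage example are all hypothetical. In the patent the genome would be a gate-level multiplier chromosome and the fitness the penalized error/area/delay function.

```python
import random
from typing import Callable, List, Sequence

Genome = List[int]


def point_mutate(genome: Sequence[int], gene_ranges: Sequence[int],
                 rate: float, rng: random.Random) -> Genome:
    """Copy `genome`, resampling gene i from range(gene_ranges[i]) with probability `rate`."""
    child = list(genome)
    for i, limit in enumerate(gene_ranges):
        if rng.random() < rate:
            child[i] = rng.randrange(limit)
    return child


def evolve(initial: Sequence[Sequence[int]], fitness: Callable[[Sequence[int]], float],
           gene_ranges: Sequence[int], mu: int = 2, lam: int = 8, rate: float = 0.1,
           max_iters: int = 1000, target: float = 0.0, seed: int = 0) -> Genome:
    """Minimal (mu + lambda) loop: evaluate, keep the best mu, point-mutate, repeat."""
    rng = random.Random(seed)
    # Initial population (e.g. several exact multiplier chromosomes); keep the best mu.
    population = sorted((list(g) for g in initial), key=fitness)[:mu]
    for _ in range(max_iters):
        if fitness(population[0]) <= target:  # target condition reached
            break
        offspring = [point_mutate(rng.choice(population), gene_ranges, rate, rng)
                     for _ in range(lam)]
        # Offspring are listed first so that, on equal fitness, they replace their
        # parents (sorted() is stable); this neutral drift is common practice in CGP.
        population = sorted(offspring + population, key=fitness)[:mu]
    return population[0]


# Toy usage: drive an 8-gene genome toward a hypothetical target pattern.
TARGET = [3, 1, 2, 0, 3, 3, 1, 0]
best = evolve([[0] * 8, [1] * 8], lambda g: sum(x != y for x, y in zip(g, TARGET)),
              gene_ranges=[4] * 8, max_iters=500, seed=1)
```

Because the best individual is always carried over, the best fitness never gets worse between generations; only the stopping rule (target or iteration limit) ends the search.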
The fitness function Fitness is defined as follows:
[Equation image BDA0003657256650000021: definition of Fitness, not reproduced in the text]
r(x) = max(0, x)
where Error_i is the set target average error, Error(C) is the computed average error of the candidate approximate multiplication circuit C, Area(C) and Delay(C) are the computed area and delay of C normalized to the interval (0, 1), the penalty coefficient p is a set value greater than 1, and α, β and γ are the set weights of the three fitness operators (error, area and delay), with α + β + γ = 1.
The penalty coefficient p compensates for the error introduced by approximate addition in the second-stage accumulation. The errors of the second-stage approximate accumulation are biased toward negative values overall, and Cartesian genetic programming always searches in the direction of smaller fitness. A penalty coefficient greater than 1 is therefore applied in the error operator when a candidate circuit's error is smaller than the target error; this raises the fitness of that structure and makes it more likely to be eliminated, steering the search toward approximate multipliers whose errors are biased positive. The value of p is tuned according to how strongly the errors of the later approximate accumulation are biased negative, so that the errors of the two parts partly cancel and overall accuracy improves.
The bit width of the approximate multiplier is made reconfigurable by splitting the two 2n-bit unsigned operands and then applying approximate multiplication and addition. The operands A and B are first split into their high n bits AH and BH and low n bits AL and BL. AH × BH uses an exact multiplier, and its product is denoted H; AL × BH and AH × BL use approximate multipliers, and their products are denoted M1 and M2 respectively; AL × BL uses an approximate multiplier, and its product is denoted L. H is split into its high n bits HH and low n bits HL, and L into its high n bits LH and low n bits LL; HL and LH are concatenated into M3, and M1, M2 and M3 are accumulated with approximate adders.
The approximate adder includes an approximate line configuration module: the bit width of the approximate accumulation is adjusted by setting the position k of the approximate line, which dynamically adjusts the precision of the approximate multiplier. In the approximate accumulation of M1, M2 and M3, the k bits below the approximate line are added with OR gates instead of exact full adders, cutting the carry chain; the 2n-k bits above the line are still added with exact full adders, and the carry into these 2n-k bits is generated by ANDing the two addends' bits in the position immediately below the approximate line.
The method for implementing the energy-efficient approximate multiplier with reconfigurable precision and bit width comprises the following steps:
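The adder just described behaves like a lower-part OR adder. A minimal bit-level model is shown below, assuming unsigned operands of `width` bits and treating k as the number of bits below the approximate line; the function name and signature are illustrative, not taken from the patent.

```python
def approx_add(a: int, b: int, width: int, k: int) -> int:
    """Add two `width`-bit unsigned values with an approximate line at bit position k.

    Bits 0..k-1 (below the line) are ORed with no carry chain; bits k..width-1 are
    added exactly, with the carry into bit k taken as the AND of both addends' bit k-1.
    The carry out of the top bit is kept, so the result may be width + 1 bits wide.
    """
    if not 0 <= k <= width:
        raise ValueError("approximate line must satisfy 0 <= k <= width")
    low_mask = (1 << k) - 1
    low = (a | b) & low_mask
    carry = ((a >> (k - 1)) & (b >> (k - 1)) & 1) if k > 0 else 0
    high = (a >> k) + (b >> k) + carry
    return (high << k) | low


# k = 0 degenerates to an exact adder; larger k trades accuracy for fewer full adders.
print(approx_add(0b1011_0110, 0b0101_1100, width=8, k=4))  # 270 (exact sum: 274)
```

Since x + y = (x | y) + (x & y), the error of this model is exactly c·2^k - (a_low & b_low), where c is the AND-generated carry and a_low, b_low are the operand bits below the line. This closed form is convenient when reasoning about the error bias that the penalty coefficient p is meant to offset.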
Step one: construct exact multipliers as initial chromosomes. A multiplier chromosome consists of basic gates connected in sequence. The partial-product generation circuit built from AND gates is fixed, so the downstream circuit has n × n inputs; each downstream node is a gate drawn from the set of AND, OR, NOT, XOR, XNOR and NAND gates, each node's inputs connect to the outputs of two earlier nodes, and the circuit has n × n final outputs. Different types of exact multipliers, including Wallace tree multipliers, multipliers accumulated with ripple-carry adders and multipliers accumulated with carry-bypass adders, are constructed as initial chromosomes;
Step two: generate an n × n approximate multiplier with Cartesian genetic programming. Configure the fitness function parameters, set the target precision error, area and delay of the circuit, set the initial penalty coefficient p to 1, and iteratively search for multiplier structures with smaller fitness until a structure satisfying the conditions appears or the maximum number of iterations is reached; the search then stops, yielding an n × n low-order approximate multiplier structure;
Step three: construct the first-stage approximate multiplication circuit of the high-order multiplier. Split each of the two 2n-bit unsigned operands into high n bits and low n bits; use an exact multiplier for the high × high product, which has the greatest influence on precision, and the approximate multiplier from step two for the remaining products involving low bits, giving four 2n-bit partial products;
Step four: construct the second-stage approximate addition circuit of the high-order multiplier. Arrange the four 2n-bit partial products from step three by bit position and perform approximate accumulation on the middle 2n bits. Configure the position k of the approximate line to adjust the precision: bits above the approximate line are added with exact full adders, bits below it are added with OR gates and their intermediate carries are discarded, and the carry signal at the approximate line is the AND of the two addend bits immediately below it;
Step five: obtain the output of the high-order approximate multiplier by concatenating, in bit order, the high n bits of the product from step three, the 2n-bit output from step four and the low n bits of the product from step three, giving the final 4n-bit output;
Step six: adjust the penalty coefficient p for error compensation. Compute and evaluate the error distribution of the approximate multiplier constructed in the preceding steps, adjust p in the fitness function, and repeat steps three to five to shift the error bias and cancel part of the error, yielding an approximate multiplier with higher precision and an error distribution closer to normal.
Beneficial effects: the technical solution of the invention offers the following advantages.
(1) The invention exploits the error tolerance of neural networks: it generates low-order approximate multipliers with a Cartesian genetic programming algorithm and optimizes the circuit under constraints through evolutionary search, providing an automated solution to the multi-objective design and optimization of approximate multipliers with low power consumption, small area and high computational accuracy.
(2) Introducing a penalty coefficient into the fitness function steers the errors of the low-order approximate multipliers generated by Cartesian genetic programming in a specific direction; together with the approximate line setting of the approximate addition part, the errors of the two parts compensate each other, which further improves circuit performance, reduces hardware overhead and makes the precision reconfigurable.
(3) The high-order approximate multiplier is built with two-stage approximation of both multiplication and addition, which yields larger power and area savings than approximating only one of the two operations. Bit-width reconfiguration is achieved by splicing low-order approximate multipliers of different bit widths, which allows approximation schemes to be chosen more flexibly and specifically than with a static approximate multiplier.
(4) This energy-efficient dynamic computing technique suits neural network workloads: the approximate multiplier module generated by Cartesian genetic programming, the first-stage approximate multiplication module and the second-stage approximate addition module together strike a reasonable balance among key circuit parameters such as precision, power consumption and delay.
Drawings
FIG. 1 is a diagram of the overall architecture of the disclosed energy-efficient approximate multiplier with reconfigurable precision and bit width.
FIG. 2 is a flow chart of the disclosed Cartesian genetic programming approximate multiplier generation module with the penalty coefficient.
FIG. 3 shows a 4 × 4 low-order approximate multiplier generated by Cartesian genetic programming according to the invention.
FIG. 4 is a schematic diagram of the operation of the high-order approximate multiplier, taking an 8 × 8 approximate multiplier as an example.
Detailed Description
The technical solutions of the invention are described in detail below with reference to the drawings. These embodiments are merely illustrative and do not limit the scope of the invention; equivalent modifications made by those skilled in the art after reading this disclosure fall within the scope of the appended claims.
The energy-efficient approximate multiplier with reconfigurable precision and bit width comprises a Cartesian genetic programming module with a penalty coefficient that generates approximate multipliers, a first-stage approximate multiplication circuit based on mixed-precision multipliers, and a second-stage addition circuit based on a lower-part OR-gate adder with an approximate line;
the generation module takes exact multiplier structures as the initial population of Cartesian genetic programming, selects the best individuals according to the fitness function, produces the next generation from them by point mutation, and repeats evaluation, selection and mutation until the maximum number of iterations is reached or a stopping condition is met, yielding a low-order approximate multiplier;
the first-stage approximate multiplication circuit splits each of the two input operands into a low-order part and a high-order part, uses the approximate multiplier generated by Cartesian genetic programming for the products involving low-order parts and an exact multiplier for the high × high product, and outputs the resulting partial products in sequence to the second-stage approximate addition circuit;
the second-stage approximate addition circuit approximately adds the two low × high products from the preceding stage to obtain a middle-order product, extracts and concatenates the corresponding bits of the high-order and low-order products from the preceding stage to obtain a second middle-order product, and approximately adds the two middle-order products to obtain the middle-order output; the remaining bits are output directly. The output of the second-stage approximate addition circuit is the final output of the multiplier.
In the stage that generates the low-order approximate multiplier, the fitness function comprises three fitness operators, precision error, area and delay, computed as follows:
[Equation images BDA0003657256650000061 to BDA0003657256650000063: definitions of the error operator error(C), the area operator area(C) and the delay operator delay(C)]
where Y(C_acc, i) denotes the output of the exact (full-precision) multiplier C_acc for input i, Y(C, i) denotes the output of the approximate multiplier C for input i, and w is the input bit width of the multiplier; the error function error(C) of circuit C is defined as the average, over all inputs, of the absolute difference between the outputs of the exact and approximate multipliers. c_i denotes a node (gate) circuit of the approximate multiplier and area(c_i) its area; the area function area(C) is defined as the normalized sum of the areas of all node circuits. t_d(c_i) is the propagation delay of a single node circuit; the delay function delay(C) is defined as the normalized total delay along the longest propagation path. The area and delay of each gate are taken from the process library used. The fitness function is the weighted sum of the error, area and delay operators, with a penalty coefficient introduced into the error operator for subsequent error compensation.
In the first-stage approximate multiplication, the two 2n-bit operands are each split into high n bits and low n bits; the most critical high × high product is computed with an exact multiplier, and the remaining products use the approximate multiplier generated by Cartesian genetic programming, trading limited precision for higher energy efficiency. The split is given by the expressions below and the multiplier assignment by the table that follows. A and B denote the two operands, A_H and B_H their high n bits, A_L and B_L their low n bits, {X, Y} denotes bit concatenation, and "X << n" denotes shifting X left by n bits.
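The three operators can be sketched directly from these verbal definitions. This is a minimal sketch under assumptions made here: the gate area and delay values are hypothetical placeholders for process-library data, the netlist is assumed to contain only active gates, the fixed AND partial-product layer is treated as primary inputs arriving at t = 0 and excluded from the area, and normalization divides by a reference value (for example, the exact multiplier's area or delay). The patent's penalized weighted sum is not reproduced here.

```python
from typing import List, Sequence, Tuple

Node = Tuple[str, int, int]  # (gate, first input index, second input index)

# Hypothetical relative gate costs; real values come from the process (cell) library.
GATE_AREA = {"AND": 1.0, "OR": 1.0, "NOT": 0.5, "XOR": 2.0, "XNOR": 2.0, "NAND": 0.75}
GATE_DELAY = {"AND": 1.0, "OR": 1.0, "NOT": 0.5, "XOR": 1.5, "XNOR": 1.5, "NAND": 0.75}


def error_op(exact: Sequence[int], approx: Sequence[int]) -> float:
    """Mean absolute difference between exact and approximate outputs over all inputs."""
    return sum(abs(e - y) for e, y in zip(exact, approx)) / len(exact)


def area_op(nodes: Sequence[Node], ref_area: float) -> float:
    """Sum of gate areas, normalized by a reference area."""
    return sum(GATE_AREA[g] for g, _, _ in nodes) / ref_area


def delay_op(num_inputs: int, nodes: Sequence[Node], outputs: Sequence[int],
             ref_delay: float) -> float:
    """Longest input-to-output path delay, normalized by a reference delay."""
    arrival: List[float] = [0.0] * num_inputs  # primary inputs arrive at t = 0
    for gate, i1, i2 in nodes:
        # NOT has a single functional input; the second connection gene is ignored.
        fanin = arrival[i1] if gate == "NOT" else max(arrival[i1], arrival[i2])
        arrival.append(fanin + GATE_DELAY[gate])
    return max(arrival[o] for o in outputs) / ref_delay
```

For a netlist of two XOR and two AND gates (an exact 2 × 2 multiplier after its partial-product layer), `area_op` sums 2 + 1 + 2 + 1 = 6 units, and `delay_op` follows the XOR-after-AND path of 1.0 + 1.5 = 2.5 units.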
$A = \{A_H, A_L\}$
$B = \{B_H, B_L\}$
$A \times B = \big((A_H \times B_H) \ll 2n\big) + \big((A_H \times B_L + A_L \times B_H) \ll n\big) + A_L \times B_L$
| Partial-term multiplication | Multiplier used | Output product |
|---|---|---|
| A_H × B_H | Exact multiplier | H |
| A_H × B_L | Approximate multiplier generated by Cartesian genetic programming | M1 |
| A_L × B_H | Approximate multiplier generated by Cartesian genetic programming | M2 |
| A_L × B_L | Approximate multiplier generated by Cartesian genetic programming | L |
The four 2n-bit partial-term products H, M1, M2 and L generated by the first-stage approximate multiplication circuit are output in sequence to the second-stage approximate addition circuit. Before approximate accumulation, the products H and L are split, and the low n bits of H and the high n bits of L are recombined into M3, as follows:
$H = A_H \times B_H = \{H_H, H_L\}$
$L = A_L \times B_L = \{L_H, L_L\}$
$M3 = \{H_L, L_H\}$
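Substituting these splits into the product expansion shows why the output can be assembled from H_H, the accumulated middle term and L_L. The short derivation below is added for clarity; it treats M1, M2 and L as exact products, so it describes the error-free case.

```latex
\begin{aligned}
A \times B &= H \cdot 2^{2n} + (M1 + M2) \cdot 2^{n} + L \\
&= (H_H \cdot 2^{n} + H_L) \cdot 2^{2n} + (M1 + M2) \cdot 2^{n} + L_H \cdot 2^{n} + L_L \\
&= H_H \cdot 2^{3n} + \big(H_L \cdot 2^{n} + L_H + M1 + M2\big) \cdot 2^{n} + L_L \\
&= H_H \cdot 2^{3n} + (M1 + M2 + M3) \cdot 2^{n} + L_L, \qquad M3 = H_L \cdot 2^{n} + L_H .
\end{aligned}
```

The middle sum M1 + M2 + M3 can exceed 2n bits (M1 and M2 can each reach (2^n - 1)^2), so when the result is formed as the concatenation {H_H, M5, L_L}, a carry out of the middle 2n bits has to reach H_H for the exact case to be reproduced.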
The second-stage approximate addition circuit performs two approximate additions: the first adds M1 and M2 to give M4, and the second adds M3 and M4 to give M5. Both use the lower-part OR-gate adder configured with an approximate line: the k bits below the line use OR gates instead of exact full adders, the 2n-k bits above the line are still added with exact full adders, and the carry into those 2n-k bits is generated by ANDing the two addend bits immediately below the line. For input operands A and B, the output of the approximate multiplier is {H_H, M5, L_L}.
This approximate addition scheme reduces the overall power consumption and area of the circuit while maintaining high precision. Moving the approximate line changes how many bits of the two partial-term products are added approximately, which makes the precision reconfigurable, and changing the penalty coefficient p of the fitness function in Cartesian genetic programming compensates for the approximate-addition error. Low-order approximate multipliers of different bit widths generated by Cartesian genetic programming are spliced into the corresponding high-order approximate multipliers, which makes the bit width reconfigurable.
The invention also provides a method for implementing the energy-efficient approximate multiplier with reconfigurable precision and bit width, comprising the following steps:
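The two second-stage additions and the final output assembly can be modeled end to end as below. This is a sketch under stated assumptions: `approx_add` is a bit-level model of the lower-part OR adder described in the text (restated so the block runs on its own), `mul_nxn` is a hypothetical placeholder for the CGP-generated n × n multiplier (an exact product is plugged in for the demo), and M5 is added at weight 2^n so that any carry out of the middle 2n bits reaches H_H, because the text does not say how that carry is handled.

```python
from typing import Callable


def approx_add(a: int, b: int, width: int, k: int) -> int:
    """OR below the approximate line (k bits), exact above it, carry-in = AND of bit k-1."""
    if not 0 <= k <= width:
        raise ValueError("approximate line must satisfy 0 <= k <= width")
    low = (a | b) & ((1 << k) - 1)
    carry = (a >> (k - 1)) & (b >> (k - 1)) & 1 if k > 0 else 0
    return (((a >> k) + (b >> k) + carry) << k) | low


def approx_mul_2n(a: int, b: int, n: int, k: int,
                  mul_nxn: Callable[[int, int], int]) -> int:
    """2n x 2n unsigned multiply from one exact and three (possibly approximate) n x n products."""
    mask = (1 << n) - 1
    a_h, a_l = a >> n, a & mask
    b_h, b_l = b >> n, b & mask
    # Stage 1: exact high x high product, pluggable n x n multiplier elsewhere.
    h = a_h * b_h
    m1, m2, low = mul_nxn(a_h, b_l), mul_nxn(a_l, b_h), mul_nxn(a_l, b_l)
    m3 = ((h & mask) << n) | (low >> n)          # M3 = {H_L, L_H}
    # Stage 2: M4 = M1 (+) M2, then M5 = M3 (+) M4, both with the approximate adder.
    m4 = approx_add(m1, m2, 2 * n, k)
    m5 = approx_add(m3, m4, 2 * n, k)
    # Output {H_H, M5, L_L}, with M5 added at weight 2**n (carry-out assumption above).
    return ((h >> n) << (3 * n)) + (m5 << n) + (low & mask)


exact_4x4 = lambda x, y: x * y  # stand-in for the CGP-generated 4 x 4 multiplier
print(approx_mul_2n(200, 150, n=4, k=7, mul_nxn=exact_4x4), 200 * 150)
```

With k = 0 and exact n × n products the model reproduces A × B exactly; n = 4 and k = 7 correspond to the 8 × 8 embodiment. Each approximate addition changes the middle sum by at most 2^k - 1, so with exact sub-multipliers the total error is bounded by 2(2^k - 1)·2^n.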
Step one: construct exact multipliers as initial chromosomes. A multiplier chromosome consists of basic gates connected in sequence. The partial-product generation circuit built from AND gates is fixed, so the downstream circuit has n × n inputs; each downstream node is a gate drawn from the set of AND, OR, NOT, XOR, XNOR and NAND gates, each node's inputs connect to the outputs of two earlier nodes, and the circuit has n × n final outputs. Different types of exact multipliers, such as Wallace tree multipliers, multipliers accumulated with ripple-carry adders and multipliers accumulated with carry-bypass adders, are constructed as initial chromosomes.
Step two: generate an n × n approximate multiplier with Cartesian genetic programming. Configure the fitness function parameters, set the target precision error, area and delay of the circuit, set the initial penalty coefficient p to 1, and iteratively search for multiplier structures with smaller fitness until a structure satisfying the conditions appears or the maximum number of iterations is reached; the search then stops, yielding an n × n low-order approximate multiplier structure;
Step three: construct the first-stage approximate multiplication circuit of the high-order multiplier. Split each of the two 2n-bit input operands into high n bits and low n bits, arranged as shown in FIG. 1; use an exact multiplier for the high × high product, which has the largest influence on precision, and the approximate multiplier generated in step two for the remaining products involving low bits;
Step four: construct the second-stage approximate addition circuit of the high-order multiplier. Arrange the four 2n-bit partial-term products from step three by bit position and perform approximate addition on the middle 2n bits. The position k of the approximate line is configured to adjust the precision: bits above the approximate line are added with exact full adders, bits below it are added with OR gates and the intermediate carries are discarded, and the carry signal at the approximate line is the AND of the two addend bits immediately below it.
Step five: obtain the output of the high-order approximate multiplier. Concatenate, in bit order, the high n bits of the product from step three, the 2n-bit output from step four and the low n bits of the product from step three to obtain the final 4n-bit output of the high-order approximate multiplier.
Step six: adjust the penalty coefficient p for error compensation. Compute and evaluate the error distribution of the approximate multiplier constructed in the preceding steps, adjust p in the fitness function, and repeat steps three to five to shift the error bias and cancel part of the error, so that the error distribution is closer to normal and a new approximate multiplier structure with better precision is obtained.
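Step six depends on measuring the error distribution, and for the small widths involved (65,536 operand pairs at 8 × 8) exhaustive simulation is practical. The sketch below reports the signed bias, mean absolute error, standard deviation and share of negative errors. The `truncated` multiplier is a hypothetical stand-in with a purely negative bias, used only to exercise the harness; it is not the patent's design.

```python
from statistics import mean, pstdev
from typing import Callable, Dict


def error_stats(mul: Callable[[int, int], int], bits: int) -> Dict[str, float]:
    """Exhaustively compare `mul` with exact multiplication over all unsigned `bits`-bit pairs."""
    errors = [mul(a, b) - a * b for a in range(1 << bits) for b in range(1 << bits)]
    return {
        "mean_signed": mean(errors),              # bias: < 0 means the design underestimates
        "mean_abs": mean(abs(e) for e in errors),
        "std": pstdev(errors),
        "frac_negative": sum(e < 0 for e in errors) / len(errors),
    }


def truncated(a: int, b: int) -> int:
    """Hypothetical stand-in approximate multiplier: zero the two lowest product bits."""
    return (a * b) >> 2 << 2


print(error_stats(truncated, bits=2))
```

A mean_signed value near zero together with a symmetric spread is the kind of outcome the penalty coefficient p is tuned toward; comparing these statistics before and after changing p makes the effect of step six measurable.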
Example (c):
the overall architecture diagram of an energy-efficient approximate multiplier with reconfigurable precision and bit width is shown in fig. 1 and comprises a cartesian genetic programming generation approximate multiplier module, a first-stage approximate multiplication module and a second-stage approximate multiplication module. And combining approximate multiplication and approximate addition, introducing a penalty coefficient into a fitness function, and combining the approximate line configuration of the approximate addition to perform error complementation so as to realize dynamic regulation of calculation precision and bit width.
An 8 × 8 unsigned approximate multiplier is used as the example configuration.
First, a low-order 4 × 4 approximate multiplier is constructed by Cartesian genetic programming. The 16 partial-product circuits built from AND gates are fixed, so the downstream circuit has 16 inputs. The gate of each downstream node is drawn from the set {AND, OR, NOT, XOR, XNOR, NAND}, and each node's two inputs connect to the outputs of two upstream nodes. The circuit has 16 final outputs. In this example, the exact 4 × 4 Wallace tree multiplier serves as the initial chromosome for evolution.
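A minimal sketch of the chromosome described above. It assumes the common CGP encoding, in which each node is a (gate, input, input) triple that may read any earlier signal. The names `GATES`, `evaluate`, and `point_mutate` are illustrative. Evaluation is bit-parallel: each signal is an integer whose bit j holds its value on input vector j, so one pass computes a whole truth table. In the patent's setup, the fixed AND-gate partial products would be supplied as the primary inputs. The per-gene mutation rate is also an assumption.

```python
import random
from typing import List, Tuple

# Gate set named in the text. Signals are bit-parallel truth-table columns, so the
# inverting gates mask with m to stay within the table width; NOT ignores y.
GATES = {
    "AND":  lambda x, y, m: x & y,
    "OR":   lambda x, y, m: x | y,
    "NOT":  lambda x, y, m: ~x & m,
    "XOR":  lambda x, y, m: x ^ y,
    "XNOR": lambda x, y, m: ~(x ^ y) & m,
    "NAND": lambda x, y, m: ~(x & y) & m,
}
Node = Tuple[str, int, int]  # (gate name, first input index, second input index)


def evaluate(num_inputs: int, nodes: List[Node], outputs: List[int]) -> List[int]:
    """Evaluate a feed-forward genome on all 2**num_inputs input vectors at once.

    Signals 0..num_inputs-1 are primary inputs; node k drives signal num_inputs + k
    and may only read lower-numbered signals.
    """
    rows = 1 << num_inputs
    mask = (1 << rows) - 1
    signals = []
    for i in range(num_inputs):
        col = 0
        for j in range(rows):
            col |= ((j >> i) & 1) << j   # input i equals bit i of the row index
        signals.append(col)
    for k, (gate, a, b) in enumerate(nodes):
        if max(a, b) >= num_inputs + k:
            raise ValueError(f"node {k} reads a later signal")
        signals.append(GATES[gate](signals[a], signals[b], mask))
    return [signals[o] for o in outputs]


def point_mutate(num_inputs: int, nodes: List[Node], outputs: List[int],
                 rate: float, rng: random.Random) -> Tuple[List[Node], List[int]]:
    """Return a mutated copy in which each gene is redrawn with probability rate."""
    new_nodes = []
    for k, (gate, a, b) in enumerate(nodes):
        limit = num_inputs + k           # keeps the graph feed-forward
        if rng.random() < rate:
            gate = rng.choice(list(GATES))
        if rng.random() < rate:
            a = rng.randrange(limit)
        if rng.random() < rate:
            b = rng.randrange(limit)
        new_nodes.append((gate, a, b))
    total = num_inputs + len(nodes)
    new_outputs = [rng.randrange(total) if rng.random() < rate else o for o in outputs]
    return new_nodes, new_outputs
```

For example, `evaluate(2, [("XOR", 0, 1), ("AND", 0, 1)], [2, 3])` describes a half adder. It returns the sum and carry truth-table columns `[0b0110, 0b1000]`.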
The Cartesian genetic programming module then generates the low-order 4 × 4 approximate multiplier. The fitness function parameters are configured as follows: target error 0.05, error operator weight α = 0.4, area operator weight β = 0.3, delay operator weight γ = 0.3, and penalty coefficient p = 1. The search iterates by point mutation, keeping multiplier structures with smaller fitness, until the maximum of 5,000,000 iterations is reached. The result is a 4 × 4 low-order approximate multiplier structure.
Next, the first-stage approximate multiplication circuit of the high-order multiplier is constructed. Each of the two 8-bit unsigned operands is split into its high 4 bits and low 4 bits, arranged as shown in FIG. 4. The high × high product has the greatest influence on precision, so it uses the exact multiplier. The three remaining products, each involving a low half, use the approximate multiplier generated by Cartesian genetic programming.
Then the second-stage approximate addition circuit of the high-order multiplier is constructed. The four resulting 8-bit partial products are arranged in bit order, and approximate addition is applied to the middle 8 bits. The approximation line position k is set to 7. The 7 bits below the line are combined with OR gates, and their intermediate carries are discarded. The carry into the 8th bit is the AND of the two addends of the 7th bit, and the 8th bit is added with a precise full adder. The upper 4 bits of the high product and the lower 4 bits of the low product pass directly to the output, arranged as shown in FIG. 4. Together these give the 16-bit output of the complete approximate multiplier.
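The worst-case error of one two-operand addition with k = 7 can be bounded as follows. The derivation assumes each two-operand addition in the middle field behaves like a lower-part OR adder whose carry is the AND of the two addends' bits just below the line, following claim 6's wording. Here $\lor$ and $\land$ denote bitwise OR and AND on the k-bit lower parts, and $\tilde{s}$ is the approximate sum.

```latex
% a = a_H 2^k + a_L,  b = b_H 2^k + b_L,  0 <= a_L, b_L < 2^k,  c = a_{k-1} \wedge b_{k-1}
\begin{aligned}
\tilde{s} &= (a_H + b_H + c)\,2^k + (a_L \lor b_L), \qquad s = a + b,\\
a_L + b_L &= (a_L \lor b_L) + (a_L \land b_L)
  \;\Rightarrow\; \tilde{s} - s = c\,2^k - (a_L \land b_L),\\
c = 1 &\;\Rightarrow\; 2^{k-1} \le a_L \land b_L \le 2^k - 1
  \;\Rightarrow\; 1 \le \tilde{s} - s \le 2^{k-1},\\
c = 0 &\;\Rightarrow\; 0 \le a_L \land b_L \le 2^{k-1} - 1
  \;\Rightarrow\; -(2^{k-1} - 1) \le \tilde{s} - s \le 0,\\
|\tilde{s} - s| &\le 2^{k-1} = 2^{6} = 64 \quad (k = 7).
\end{aligned}
```

The middle field sits at weight $2^n = 16$ in the 16-bit product. Under this assumption, one such addition therefore contributes at most $64 \cdot 16 = 1024$ to the final error. The bound covers a single two-operand addition, not the full accumulation of three operands.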
Finally, the error distribution of the approximate multiplier constructed above is computed and evaluated. Based on the bias of that distribution, the penalty coefficient p in the fitness function is set to 1.6 to offset part of the error, and the Cartesian genetic programming search is run again. This yields a 4 × 4 low-order approximate multiplier with a deliberate error bias, whose circuit structure is shown in FIG. 3. It is combined with an exact 4 × 4 Wallace tree multiplier as shown in FIG. 4 to obtain the final 8 × 8 unsigned approximate multiplier. Compared with an exact multiplier, this 8 × 8 approximate multiplier with mutual error compensation saves 53.04% in power consumption and improves area by 58.73%. It is suitable for neural-network-based classification and recognition tasks in low-power scenarios.

Claims (7)

1. An energy-efficient approximate multiplier with reconfigurable precision and bit width, characterized by comprising an approximate multiplier generation module based on Cartesian genetic programming, a first-stage approximate multiplication circuit based on a mixed-precision multiplier, and a second-stage approximate addition circuit based on low-order OR-gate adders and an approximation-line configuration;
the approximate multiplier generation module based on Cartesian genetic programming generates the approximate multipliers used inside the high-order multiplier: a specific exact multiplier structure seeds the initial Cartesian genetic programming population, the best individuals are selected under the fitness function, the next generation is produced from them by point mutation, and evaluation, selection, and mutation repeat until the maximum number of iterations is reached or a target condition is met, yielding an n × n approximate multiplier;
the first-stage approximate multiplication circuit generates the partial products of the inputs: each of the two 2n-bit unsigned binary operands is split into a low n-bit part and a high n-bit part; the low × low and low × high products use the n × n approximate multiplier generated by the above Cartesian genetic programming module, the high × high product uses an n × n exact multiplier, and the resulting four 2n-bit partial products are passed in order to the second-stage approximate addition circuit;
the second-stage approximate addition circuit approximately adds the partial products generated by the preceding stage: the four products are first arranged from low order to high order and an approximation line is set; the bits below the approximation line use OR-gate adders with their carries discarded, the bits above it use precise full adders, and the accumulated result of the second-stage approximate addition circuit is the final output of the designed 2n × 2n unsigned approximate multiplier.
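Claim 1's evaluate, select, and mutate loop can be sketched as a (1 + λ) evolution strategy, the scheme most often used with CGP. The text only says that the best individuals are selected, so λ = 4, the acceptance of equally fit offspring, and the `evolve` name are assumptions. The loop does not depend on the genome type, and the toy bit-string problem below stands in for a multiplier chromosome.

```python
import random
from typing import Callable, List, TypeVar

G = TypeVar("G")


def evolve(parent: G,
           mutate: Callable[[G, random.Random], G],
           fitness: Callable[[G], float],
           lam: int = 4,
           max_iters: int = 10_000,
           target: float = 0.0,
           seed: int = 0) -> G:
    """(1 + lambda) evolution strategy; lower fitness is better.

    Each generation mutates the current parent lam times. The best child replaces
    the parent when it is at least as fit, so neutral mutations are accepted
    (a common CGP convention). Stops after max_iters generations or once the
    parent's fitness is <= target.
    """
    rng = random.Random(seed)
    best, best_fit = parent, fitness(parent)
    for _ in range(max_iters):
        if best_fit <= target:
            break
        children = [mutate(best, rng) for _ in range(lam)]
        scored = [(fitness(c), c) for c in children]
        fit, child = min(scored, key=lambda t: t[0])
        if fit <= best_fit:
            best, best_fit = child, fit
    return best


# Toy usage (not a multiplier): flip one random bit per mutation and minimise the
# number of zero bits; the search should end at the all-ones string.
def flip_one(bits: List[int], rng: random.Random) -> List[int]:
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]


result = evolve([0] * 20, flip_one, lambda g: g.count(0))
print(result)
```

In the multiplier setting, `mutate` would be a point mutation over the gate netlist and `fitness` the weighted error, area, and delay score from claim 3.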
2. The energy-efficient approximate multiplier with reconfigurable precision and bit width of claim 1, wherein different exact multipliers are used as initial chromosomes to generate the Cartesian genetic programming population, and the fitness function jointly optimizes the multiplier's area, delay, and accuracy to balance energy consumption against performance.
3. The energy-efficient approximate multiplier with reconfigurable precision and bit width of claim 2, characterized in that the fitness function Fitness is given by the following formula:
[Fitness formula: equation image not reproduced in this text]
r(x)=max(0,x)
where Error_i is the preset target mean error; Error(C) is the computed mean error of the candidate approximate multiplier circuit C; Area(C) and Delay(C) are the computed area and delay of C, normalized to the interval (0, 1); the penalty coefficient p is a preset value greater than 1; and α, β, γ are the preset weights of the error, area, and delay fitness operators, respectively (α + β + γ = 1).
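The formula image is not reproduced, so the sketch below encodes only one plausible form. It uses the listed symbols and matches claim 4's description, in which a shortfall below the target error is weighted by p. This form is an assumption, not the patent's formula. The default weights are the example's α = 0.4, β = 0.3, γ = 0.3, and the function names are hypothetical.

```python
def relu(x: float) -> float:
    """r(x) = max(0, x), as defined in the claim."""
    return max(0.0, x)


def fitness(error: float, area: float, delay: float, target: float,
            p: float = 1.0, alpha: float = 0.4, beta: float = 0.3,
            gamma: float = 0.3) -> float:
    """Hypothetical fitness (lower is better); NOT the patent's exact formula.

    Assumed form:
        alpha * [r(Error(C) - Error_i) + p * r(Error_i - Error(C))]
        + beta * Area(C) + gamma * Delay(C)
    Falling below the target error is weighted by p, so p > 1 pushes the search
    toward larger, more positive error. area and delay are assumed normalized to (0, 1).
    """
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    err_term = relu(error - target) + p * relu(target - error)
    return alpha * err_term + beta * area + gamma * delay
```

Under this form, raising p from 1 to 1.6 penalizes only candidates whose error falls below the target. Candidates above the target are scored the same as before.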
4. The energy-efficient approximate multiplier with reconfigurable precision and bit width of claim 3, characterized in that the penalty coefficient p compensates for the error introduced by approximate addition in the second-stage accumulation. The overall error distribution of the second-stage approximate accumulation tends to be biased toward negative values, and Cartesian genetic programming always searches in the direction of smaller fitness. A penalty coefficient greater than 1 is therefore applied in the error operator when a candidate structure's error is below the target error. This raises the fitness of such structures and makes them more likely to be eliminated, so the Cartesian genetic programming search favors approximate multipliers with more positive errors. The value of p is tuned according to the negative bias of the subsequent approximate accumulation so that the errors of the two parts partially cancel, improving overall precision.
5. The energy-efficient approximate multiplier with reconfigurable precision and bit width of claim 1, characterized in that the bit width of the approximate multiplier is reconfigured by splitting the two 2n-bit unsigned operands and then applying approximate multiplication and addition. The operands A and B are first split into high n bits AH, BH and low n bits AL, BL. AH × BH uses an exact multiplier, and its product is denoted H. AL × BH and AH × BL use approximate multipliers, and their products are denoted M1 and M2, respectively. AL × BL uses an approximate multiplier, and its product is denoted L. H is split into its high n bits HH and low n bits HL, and L into its high n bits LH and low n bits LL. HL and LH are concatenated into M3, and M1, M2, and M3 are accumulated with approximate adders.
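The splitting in claim 5 is exact before any approximation is applied. Write $A = A_H 2^n + A_L$ and $B = B_H 2^n + B_L$, with $A_H$ standing for AH and so on. The product then regroups into exactly the HH, M1 + M2 + M3, and LL fields:

```latex
\begin{aligned}
AB &= \underbrace{A_H B_H}_{H}\,2^{2n}
      + \bigl(\underbrace{A_L B_H}_{M_1} + \underbrace{A_H B_L}_{M_2}\bigr)\,2^{n}
      + \underbrace{A_L B_L}_{L},\\
H &= H_H\,2^{n} + H_L, \qquad L = L_H\,2^{n} + L_L,\\
AB &= H_H\,2^{3n} + \bigl(M_1 + M_2 + \underbrace{H_L\,2^{n} + L_H}_{M_3}\bigr)\,2^{n} + L_L.
\end{aligned}
```

Errors therefore enter only through the three approximate n × n products and the approximate accumulation of M1, M2, and M3.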
6. The energy-efficient approximate multiplier with reconfigurable precision and bit width of claim 5, characterized in that the approximate adder includes an approximation-line configuration module. This module adjusts the bit width of the approximate accumulation by setting the position k of the approximation line, which dynamically adjusts the precision of the approximate multiplier. In the approximate accumulation of M1, M2, and M3, the k bits below the approximation line are approximated with OR gates instead of precise full adders, while the 2n − k bits above the line are added with precise full adders. The carry into the upper 2n − k bits is generated by an AND gate from the two partial products at the first bit below the line.
7. A method for implementing the energy-efficient approximate multiplier with reconfigurable precision and bit width of claim 6, comprising the following steps:
Step one: construct exact multipliers as initial chromosomes. Each chromosome is a sequence of connected basic gates. The partial-product circuit formed by AND gates is fixed, and the downstream circuit has n × n inputs. The gate of each downstream node is drawn from the set {AND, OR, NOT, XOR, XNOR, NAND}, each node's inputs connect to the outputs of two upstream nodes, and the circuit has n × n final outputs. Different types of exact multipliers are constructed as initial chromosomes, including Wallace tree multipliers, multipliers that accumulate with ripple-carry adders, and multipliers that accumulate with carry-bypass adders;
Step two: generate an n × n approximate multiplier through Cartesian genetic programming. Configure the fitness function parameters, set the target error, area, and delay of the circuit, and set the initial penalty coefficient p to 1. Iteratively search for multiplier structures with smaller fitness, stopping when a structure meeting the conditions appears or the maximum number of iterations is reached, to obtain an n × n low-order approximate multiplier structure;
Step three: construct the first-stage approximate multiplication circuit of the high-order multiplier. Split each of the two 2n-bit unsigned operands into high n bits and low n bits. Use an exact multiplier for the high × high product, which has the greatest influence on precision, and use the approximate multiplier generated in step two for the remaining products involving the low n bits, yielding four 2n-bit partial products;
Step four: construct the second-stage approximate addition circuit of the high-order multiplier. Arrange the four 2n-bit partial products obtained in step three in bit order and perform approximate accumulation on the middle 2n bits. Configure the position k of the approximation line to adjust the precision: add the bits above the approximation line with precise full adders, add the bits below it with OR gates while truncating the intermediate carries, and generate the carry signal at the approximation line by ANDing the two addends of the bit just below it;
Step five: obtain the output of the high-order approximate multiplier. Arrange, in bit order, the high n bits of the high-order product from step three, the 2n-bit output of step four, and the low n bits of the low-order product from step three, forming the final 4n-bit output of the high-order approximate multiplier;
Step six: adjust the penalty coefficient p to compensate for error. Compute and evaluate the error distribution of the approximate multiplier constructed in the preceding steps, adjust the penalty coefficient p in the fitness function, and repeat steps three to five to shift the error bias and cancel part of the error. The result is an approximate multiplier structure with higher precision and an error distribution closer to normal.
CN202210564410.3A 2022-05-23 2022-05-23 High-energy-efficiency approximate multiplier with reconfigurable precision and bit width Pending CN115033204A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210564410.3A CN115033204A (en) 2022-05-23 2022-05-23 High-energy-efficiency approximate multiplier with reconfigurable precision and bit width

Publications (1)

Publication Number Publication Date
CN115033204A 2022-09-09

Family

ID=83120199

Country Status (1)

Country Link
CN (1) CN115033204A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117170623A (en) * 2023-11-03 2023-12-05 南京美辰微电子有限公司 Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation
CN117170623B (en) * 2023-11-03 2024-01-30 南京美辰微电子有限公司 Multi-bit wide reconstruction approximate tensor multiplication and addition method and system for neural network calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination