CN114327640A - SIMD multiplier and digital processor - Google Patents


Info

Publication number
CN114327640A
Authority
CN
China
Prior art keywords
multiplier
unit
partial product
product
partial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111643953.6A
Other languages
Chinese (zh)
Inventor
王震宇
赵芮
王刚
王平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Denglin Technology Co ltd
Original Assignee
Shanghai Denglin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Denglin Technology Co ltd filed Critical Shanghai Denglin Technology Co ltd
Priority to CN202111643953.6A priority Critical patent/CN114327640A/en
Publication of CN114327640A publication Critical patent/CN114327640A/en
Pending legal-status Critical Current

Landscapes

  • Complex Calculations (AREA)

Abstract

The present disclosure provides a new SIMD multiplier and digital processor in which an input unit assigns a multiplicand and a multiplier to each partial product generation unit or a constant multiplication unit corresponding thereto according to a control signal; each selector selects to provide the partial product generated by the partial product generating unit or the constant multiplying unit corresponding to the partial product generating unit to the partial product compressing unit according to the control signal; the partial product compressing unit outputs signals obtained by compressing the partial products received from each selector to an adder corresponding to each partial product generating unit to be combined to generate a first product, and outputs signals obtained by compressing all the partial products provided by each selector to a final product synthesizing unit to generate a second product; the output unit selects whether to output a plurality of first products or a second product according to the control signal. The scheme reuses the existing components and improves the operation speed of the multiplier through a small amount of hardware modification.

Description

SIMD multiplier and digital processor
Technical Field
The present invention relates to digital signal processing, and more particularly, to a multiplier and a digital processor including the same.
Background
The multiplier is an important basic component of microprocessors and digital signal processors and is the core of real-time signal processing such as image processing; its performance greatly influences the performance of the data processing system. A traditional hardware multiplier usually combines "serial shift" and "parallel addition", but this serial shift-and-accumulate approach can hardly meet the requirements of real-time signal processing. Therefore, many high-performance digital signal processors add SIMD (Single Instruction, Multiple Data) multiplier components to improve the parallelism of data processing.
Typically, a high-order SIMD multiplier may be implemented using a plurality of low-order multipliers. For example, a 64-bit multiplier may be implemented by 2 32 × 32-bit multipliers or 4 64 × 16-bit multipliers, a 32 × 32-bit multiplier may be implemented by 4 32 × 8-bit multipliers or 2 32 × 16-bit multipliers, and a 64 × 16-bit multiplier may be implemented by 2 64 × 8-bit multipliers. However, an N-bit multiplier can still complete only one N × N-bit multiplication per cycle.
Disclosure of Invention
The inventors have found in their research that high-performance computing involves a large number of variable-by-constant multiplications. For example, hash operations such as the FNV hash algorithm repeatedly multiply the data to be hashed by a certain constant. A conventional N-bit multiplier can produce the product of an N-bit variable and a constant (hereinafter referred to as Con) only once per cycle, so its execution efficiency is not high. The embodiments of the invention provide a novel N-bit SIMD multiplier and a digital processor, which can not only perform one N × N-bit multiplication in one cycle but also support multiple parallel N × Con multiplications, where Con is an N-bit binary constant, so as to improve the efficiency of the multiplier.
According to a first aspect of embodiments of the present invention, there is provided a SIMD multiplier comprising an input unit, an output unit, a plurality of partial product generation units, a partial product compression unit, a final product synthesis unit, a constant multiplication unit corresponding to each partial product generation unit, a selector, and an adder. Wherein the input unit is configured to assign the respective multiplicand and multiplier to either respective partial product generation units or respective constant multiplication units in dependence on the received control signal. Each selector is configured to select to provide the partial product generated by the partial product generation unit or the constant multiplication unit corresponding thereto to the partial product compression unit according to the control signal. The partial product compression unit is configured for compressing the partial products received from the selectors to obtain a sum signal and a carry signal and providing the sum signal and the carry signal to the final product synthesis unit, and for compressing the partial products provided by each selector to obtain a sum signal and a carry signal and providing the sum signal and the carry signal to the corresponding adder to be combined to generate the first product. The final product synthesis unit is configured to combine the sum signal from the partial product compression unit with the carry signal to produce a second product. The output unit is configured to output the first product generated by each adder or the second product generated by the final product synthesis unit according to the control signal.
In some embodiments of the present invention, the multiplicand and the multiplier allocated to each constant multiplication unit have the same number of bits, the multiplier is a constant, and the number of bits set to 1 in the multiplier does not exceed the ratio between the bit width of said multiplier and the number of partial product generation units.
In some embodiments of the invention, the input unit is configured to: in response to a control signal indicating that a normal multiplication operation is performed, segmenting the multiplier by the number of partial product generating units, and distributing the multiplicand and the segmented multiplier to each partial product generating unit; each multiplicand and corresponding multiplier are assigned to each constant multiplication unit in response to control signals indicating that constant multiplication operations are to be performed.
In some embodiments of the invention, the partial product compression unit may be a Wallace tree structure that utilizes carry-save adders for 4-2 compression.
In some embodiments of the present invention, the multiplier may be 32, 64, or 128 bits wide. The number of partial product generation units may be 2, 4, 8 or 16.
According to a second aspect of embodiments of the present invention, there is provided a digital processor comprising a controller and a multiplier according to the first aspect of embodiments of the present invention, wherein the multiplier performs respective multiplication operations according to a control signal received from the controller, a multiplicand and a multiplier.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
The multiplier can not only reuse the components of an existing multiplier to perform normal N × N-bit multiplication, but also support multiple parallel N × N-bit constant multiplications in one cycle, thereby improving the operation speed of the multiplier to a certain extent and significantly improving the execution efficiency of algorithms that contain a large number of N × constant multiplications. For example, FNV hash operations are frequently called in loops in encryption and decryption algorithms; with the multiplier according to the embodiment of the present invention, these FNV hash operations can be executed quickly, thereby improving the utilization of computing resources.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
In the drawings:
FIG. 1 is a schematic diagram of a binary multiplication process;
FIG. 2 is a schematic diagram of a 32-bit SIMD multiplier comprising 4 32x 8-bit multipliers;
FIG. 3 is a schematic diagram of a 32-bit SIMD multiplier according to one embodiment of the present invention;
FIG. 4 is a block diagram of a digital processor including a multiplier according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 shows a schematic diagram of a binary multiplication process. As shown in Fig. 1, when performing multiplication, the first operand is the multiplicand (e.g. 101010) and the second operand is the multiplier (e.g. 1011). The result of multiplying each bit of the multiplier by the multiplicand is called a partial product, and the result of adding all the partial products is called the product. In Fig. 1, the multiplier is 4 bits wide, so 4 partial products are obtained (101010, 101010, 000000 and 101010, from the least significant multiplier bit to the most significant), and shifting and adding these four partial products gives the final product 111001110. As can be seen, the multiplication operation is mainly completed in three steps: generating the partial products, accumulating the partial products, and performing the final addition.
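The shift-and-add process of Fig. 1 can be sketched in a few lines of Python. This is only an illustrative software model, not the circuit itself; the function names `partial_products` and `shift_add_multiply` are hypothetical and do not appear in the patent.

```python
def partial_products(multiplicand: int, multiplier: int, width: int) -> list[int]:
    """Return one partial product per multiplier bit, already shifted to its weight."""
    return [
        (multiplicand if (multiplier >> i) & 1 else 0) << i
        for i in range(width)
    ]


def shift_add_multiply(multiplicand: int, multiplier: int, width: int) -> int:
    """Accumulate the shifted partial products to obtain the product."""
    return sum(partial_products(multiplicand, multiplier, width))


# Values taken from the Fig. 1 description: 101010 x 1011.
pps = partial_products(0b101010, 0b1011, 4)
# Undo the shift to show each partial product as it is written in the figure.
print([format(pp >> i, "06b") for i, pp in enumerate(pps)])
# ['101010', '101010', '000000', '101010']
print(format(shift_add_multiply(0b101010, 0b1011, 4), "b"))
# 111001110
```

Each 1 bit of the multiplier contributes a shifted copy of the multiplicand and each 0 bit contributes zero, which is why the number of partial products equals the multiplier width.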
A common circuit structure for a multiplier generally includes an input unit, a partial product generation unit, a partial product compression unit, a final product synthesis unit, and an output unit. The input unit inputs the multiplier and the multiplicand into the partial product generation unit. The partial product generation unit operates on each bit of the multiplier and the multiplicand received from the input unit to generate partial products. The number of partial products directly affects the performance of the multiplier, and a Booth encoding algorithm is generally employed to reduce the number of generated partial products. The partial product generation unit outputs the generated partial products to the partial product compression unit, which can use a Wallace tree algorithm to speed up their accumulation; the accumulation speed of the partial products has a direct influence on the delay of the multiplier. The Wallace tree is an adder array structure commonly used for multiplication. The Wallace tree commonly used in SIMD multipliers is built from CSA 4-2 compressors, i.e. a Wallace tree structure with stage-by-stage 4-2 compression is constructed using the Carry-Save Adder (CSA) as the basic unit to accumulate and compress the partial products. The sum signal and the carry signal obtained by the partial product compression unit from this carry-save accumulation and compression of the partial products are provided to the final product synthesis unit. The final product synthesis unit combines the sum signal and the carry signal from the partial product compression unit to generate a product, and provides the product to the output unit.
The output unit may output the signal received from the final product combining unit in an asynchronous operation or a synchronous operation.
A high-order SIMD multiplier may be implemented using multiple low-order multipliers. For example, a 32-bit SIMD multiplier may be implemented by 4 32 × 8-bit multipliers, and a 64-bit SIMD multiplier may be implemented by 8 32 × 8-bit multipliers, 4 32 × 16-bit multipliers, or two 32 × 32-bit multipliers. For convenience of description, the specific structure of the high-order SIMD multiplier is described herein with a 32 × 8-bit multiplier as an example of the low-order multiplier, but this is not intended to be limiting in any way. As shown in Fig. 2, to improve the parallelism of data processing, a 32-bit SIMD multiplier can perform a 32 × 32-bit multiplication by operating 4 32 × 8-bit multipliers in parallel. Each 32 × 8-bit multiplier is used as a partial product generation unit. The input unit divides the 32-bit multiplier B in the received multiplication instruction into multiplier sections B[7:0], B[15:8], B[23:16] and B[31:24] according to the number of partial product generation units, and distributes the multiplier sections and the multiplicand A[31:0] to the partial product generation units. Since each multiplier section has 8 bits, each partial product generation unit produces 8 partial products, and the 4 partial product generation units produce 32 partial products in total.
The partial product compression unit consists of a four-stage 4-2 compression Wallace tree constructed with the carry-save adder (CSA) as the basic unit. At the first stage, every 4 partial products from the partial product generation units are processed by a CSA 4:2 compressor to generate a sum signal and a carry signal, which are passed to a CSA 4:2 compressor at the next stage. Each CSA 4:2 compressor at the second stage processes four signals from the previous stage and outputs two signals (a sum signal and a carry signal), which are passed on to a CSA 4:2 compressor at the next stage, and so on until the last two signals are obtained at the fourth stage. If the number of partial products generated by the partial product generation units differs, the number of stages of the Wallace tree increases or decreases accordingly. The two signals (sum signal and carry signal) obtained by accumulating and compressing the 32 partial products are transmitted to the final product synthesis unit. The final product synthesis unit combines the sum signal and the carry signal from the partial product compression unit with a 64-bit adder to generate the product, and provides the product to the output unit.
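The four-stage 4-2 compression described above can be modelled in software as follows. This is a behavioural sketch under stated assumptions: each 4:2 compressor is modelled as two chained 3:2 carry-save adders (a common construction, not necessarily the one used in the patent), the operand count is assumed to be a power of two of at least 4, and all names (`csa_3_2`, `csa_4_2`, `wallace_4_2`) are hypothetical.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit datapath


def csa_3_2(a: int, b: int, c: int) -> tuple[int, int]:
    """Carry-save adder: a + b + c == sum + carry (modulo 2**64)."""
    sum_bits = a ^ b ^ c
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1
    return sum_bits & MASK64, carry_bits & MASK64


def csa_4_2(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """4:2 compressor modelled as two chained 3:2 carry-save adders."""
    s1, c1 = csa_3_2(a, b, c)
    return csa_3_2(s1, c1, d)


def wallace_4_2(operands: list[int]) -> tuple[int, int, int]:
    """Compress operands to (sum, carry) and report the number of stages used.

    Assumes len(operands) is a power of two and at least 4.
    """
    stages = 0
    while len(operands) > 2:
        next_level = []
        for k in range(0, len(operands), 4):
            next_level.extend(csa_4_2(*operands[k:k + 4]))
        operands = next_level
        stages += 1
    return operands[0], operands[1], stages


# 32 shifted partial products of a 32 x 32-bit multiplication.
a, b = 0xDEADBEEF, 0x12345678
pps = [(a if (b >> i) & 1 else 0) << i for i in range(32)]
final_sum, final_carry, stages = wallace_4_2(pps)
product = (final_sum + final_carry) & MASK64  # the final 64-bit addition
```

With 32 partial products the loop runs four times (32, 16, 8, 4, then 2 signals), matching the four stages in the text, and only the last step needs a carry-propagating adder.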
For such a 32-bit SIMD multiplier, 1 32 × 32-bit multiplication operation or 4 32 × 8 multiplication operations may be performed within one cycle. However, in high performance operations there are a large number of multiplication operations in which a variable is multiplied by a constant, for example in hash operations such as the FNV hash algorithm, a large number of operations are involved in which data to be hashed is multiplied by a certain constant. The existing N-bit multiplier can only obtain the multiplication result of the N-bit variable and the N-bit constant for 1 time in each cycle, and the data processing efficiency is not high.
Therefore, in the embodiment of the present invention, a new N-bit SIMD multiplier is provided. In addition to the input unit, the plurality of partial product generation units, the partial product compression unit, the final product synthesis unit, and the output unit described above, it further includes a constant multiplication unit, a selector, and an adder corresponding to each partial product generation unit, and can perform either one N × N-bit multiplication or multiple parallel N × Con multiplications (hereinafter simply referred to as constant multiplication operations for convenience) in one cycle, so as to improve the efficiency with which the multiplier processes data. In the embodiment of the present application, Con is an N-bit binary constant, and N may be 2^t, where t is a natural number of 4 or more. N is preferably 32, 64, 128 or 256. For convenience of description and comparison with Fig. 2, the structure of the SIMD multiplier is illustrated below with N being 32, and the low-order multiplier is again a 32 × 8-bit multiplier. It should be understood that similar structures can be extended to 64-, 128- and 256-bit multipliers, and that the low-order multipliers are not limited to 32 × 8-bit multipliers.
FIG. 3 is a diagram of a 32-bit SIMD multiplier according to one embodiment of the present invention. Compared to the existing 32-bit SIMD multiplier shown in fig. 2, four constant multiplication units a, b, c, d are added in this embodiment. The four constant multiplication units have the same function, are all used for constant multiplication operation with one operand being constant, and respectively correspond to the original partial product generation units (namely, multipliers a, b, c and d). The operands of the multipliers a, b, c and d are 32-bit multiplicands and 8-bit multipliers; while the operands of constant multiplication units a, b, c, d may be 32-bit multiplicands and 32-bit constants. Compared with fig. 2, selectors a, b, c and d corresponding to each partial product generation unit are also added to the multiplier structure of fig. 3. Thus, when a normal multiplication operation is performed, the selectors a, b, c, d select the 8 partial products output from the multipliers a, b, c, d, respectively, to be supplied to the subsequent partial product compression unit. And when the constant multiplication operation is executed, the selectors a, b, c and d select 8 partial products output by the constant multiplication units a, b, c and d respectively and provide the selected products to the subsequent partial product compression unit.
Similar to Fig. 2, the partial product compression unit is a four-stage 4-2 compression Wallace tree constructed with the carry-save adder (CSA) as the basic unit. Through each selector, the 8 partial products generated by the corresponding partial product generation unit are sent to a 4:2 compressor module constructed with the CSA as the basic unit (hereinafter also referred to as a CSA_4_2 module) at the first stage and then to a CSA_4_2 module at the second stage, and each CSA_4_2 module compresses its input partial products into two signals, namely a sum signal and a carry signal. As shown in Fig. 3, after compression by the first-stage and second-stage CSA_4_2 modules, the partial products from each partial product generation unit are compressed into one sum signal and one carry signal, which may be denoted as {sum0, carry0}, {sum1, carry1}, {sum2, carry2} and {sum3, carry3}.
In contrast to Fig. 2, in the embodiment of the present invention a corresponding adder is further provided for each partial product generation unit and is used to generate the product corresponding to that partial product generation unit. As shown in Fig. 3, the signals {sum0, carry0}, {sum1, carry1}, {sum2, carry2} and {sum3, carry3} corresponding to the partial product generation units are supplied to their corresponding 32-bit adders a, b, c and d, respectively. The 32-bit adders generate a first product a, a first product b, a first product c and a first product d, respectively, and provide them directly to the output unit. It is understood that, similar to Fig. 2, the signals {sum0, carry0}, {sum1, carry1}, {sum2, carry2} and {sum3, carry3} may also be output to the third-stage CSA_4_2 module: {sum0, carry0} are output to it directly, while {sum1, carry1} are each left-shifted by 8 bits, {sum2, carry2} by 16 bits, and {sum3, carry3} by 24 bits before being output to it. The third-stage CSA_4_2 module compresses the 8 received signals into 4 signals and provides them to the fourth-stage CSA_4_2 module, which compresses the 4 received signals into 2 signals. The 2 signals resulting from the fourth stage of compression, denoted as {final_sum, final_carry}, are output to the final product synthesis unit, implemented as a 64-bit adder, to obtain a second product representing the result of a normal multiplication operation. In this embodiment, the output unit may select whether to output one second product or four first products as the execution result of the SIMD multiplier, according to a control signal from the input unit.
It can be seen that the multiplier architecture of Fig. 3 can be used not only to perform one normal N × N-bit multiplication in one cycle, but also to perform multiple N × N-bit constant multiplications (e.g., FNV operations) in one cycle. The multiplier architecture shown in Fig. 2 can only perform one 32 × 32-bit multiplication and cannot perform multiple FNV operations in one cycle. Compared with the architecture of Fig. 2, the architecture illustrated in Fig. 3 reuses the 32-bit multiplier architecture and adds only a small amount of hardware, yet it can perform multiple parallel FNV operations in one cycle, so that FNV performance can be increased to 4 times that of a normal 32 × 32 multiplier.
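The recombination performed by the third and fourth stages in normal mode can be checked with a small behavioural model. In this sketch each pair {sum_k, carry_k} is represented by its combined value (the sum of the 8 partial products of one 32 × 8 unit), which is a simplification; the function names `group_value` and `second_product` are hypothetical.

```python
MASK64 = (1 << 64) - 1


def group_value(a: int, segment: int) -> int:
    """Value carried by {sum_k, carry_k}: the 8 partial products of one 32x8 unit."""
    return sum((a if (segment >> i) & 1 else 0) << i for i in range(8))


def second_product(a: int, b: int) -> int:
    """Shift group k left by 8*k bits and accumulate, as in stages three and four."""
    total = 0
    for k in range(4):
        segment = (b >> (8 * k)) & 0xFF  # B[8k+7:8k]
        total += group_value(a, segment) << (8 * k)
    return total & MASK64


print(hex(second_product(0xDEADBEEF, 0x12345678)))
```

The shifts of 0, 8, 16 and 24 bits restore the weight that each multiplier segment lost when it was handed to its 32 × 8 unit as an 8-bit value, which is why the accumulated result equals the full 32 × 32-bit product.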
It is understood that, based on the principle of the structure shown in fig. 3, in other embodiments, the specific number of each unit module in the multiplier architecture may be set according to actual requirements.
In an embodiment of the invention, the input unit receives, in addition to the multiplicand and the multiplier, a control signal indicating whether to perform a normal multiplication operation or a constant multiplication operation. The input unit may distribute the respective multiplicand and multiplier to either a partial product generation unit (e.g., the 32x8 multiplier shown in fig. 3) or a constant multiplication unit depending on the received control signal. For example, if the control signal is 1, the input unit assigns the received operand to the partial product generation unit, and if the control signal is 0, the input unit assigns the received operand to the constant multiplication unit, or vice versa. The following description will be made by taking an example in which a control signal of 1 indicates that a normal multiplication operation is performed, and a control signal of 0 indicates that a constant multiplication operation is performed. When receiving a control signal of 1, the input unit segments the multiplier according to the number of the partial product generating units, and distributes the multiplicand and the segmented multiplier to each partial product generating unit. When receiving a control signal of 0, the input unit distributes a plurality of multiplicands and constants Con received simultaneously to the constant multiplying units.
Taking the multiplier configuration shown in Fig. 3 as an example, if the received control signal is 1 (indicating that a normal multiplication operation is performed, i.e., one N × N-bit multiplication such as A × B), then 4 identical multiplicands (A, A, A, A) and the segmented multipliers (B[31:24], B[23:16], B[15:8], B[7:0]) are assigned to the 4 32 × 8 multipliers. If the received control signal is 0 (indicating that constant multiplication operations are performed, in which 4 N × N-bit multiplications may be performed simultaneously, e.g., A0 × B0, A1 × B1, A2 × B2 and A3 × B3), then 4 different multiplicands (A0, A1, A2, A3) and the corresponding multipliers (B0, B1, B2, B3, all of which are constants) are assigned to the 4 constant multiplication units, respectively. The constant multipliers B0, B1, B2 and B3 may be different binary constants or the same binary constant. In some embodiments, the input unit may also set the respective multiplier segments input to the 32 × 8 multipliers to 0 when it determines that the control signal indicates a constant multiplication operation.
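The operand routing described above might be modelled as follows. This is a hypothetical sketch: the function name `input_unit`, the tuple layout, and the unit labels are illustrative choices, and the encoding (1 for normal multiplication, 0 for constant multiplication) follows the example convention used in the text, which the patent notes could equally be reversed.

```python
def input_unit(control, multiplicands, multipliers, units=4, segment_bits=8):
    """Return one (target_unit, multiplicand, multiplier) tuple per lane."""
    if control == 1:
        # Normal mode: one A x B; the same A goes to every 32x8 unit together
        # with one 8-bit segment of B (lane k receives B[8k+7:8k]).
        a, b = multiplicands[0], multipliers[0]
        mask = (1 << segment_bits) - 1
        return [
            ("partial_product_unit", a, (b >> (segment_bits * k)) & mask)
            for k in range(units)
        ]
    # Constant mode: each lane gets its own multiplicand and full-width constant.
    if len(multiplicands) != units or len(multipliers) != units:
        raise ValueError("constant mode expects one operand pair per unit")
    return [
        ("constant_unit", a, con)
        for a, con in zip(multiplicands, multipliers)
    ]


normal = input_unit(1, [0xAABBCCDD], [0x11223344])
constant = input_unit(0, [1, 2, 3, 4], [0x01000193] * 4)
```

In normal mode lane 0 receives B[7:0] and lane 3 receives B[31:24]; in constant mode the four lanes are independent, which is what allows four products per cycle.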
It should be appreciated that the above control signals are merely exemplary and may take a variety of forms to indicate whether to perform a normal multiplication operation or a constant multiplication operation. Optionally, the number of the control signals may be more than one, for example, two signals may be set: the first control signal and the second control signal instruct normal multiplication operation to be executed when the first control signal is effective, and instruct constant multiplication operation to be executed when the second control signal is effective.
Optionally, to better utilize the multiplier of the embodiment of the present invention, a new parallel constant multiplication SIMD instruction, which may include multiple multiplicands and multipliers therein, may be set in the instruction set of the processor.
In the embodiment of the present invention, when a constant multiplication operation is performed, the constant multiplier Con has the same number of bits as the multiplicand, and the number of bits set to 1 in the multiplier does not exceed the ratio between the bit width of the multiplier and the number of partial product generation units. Taking the above 32-bit SIMD multiplier comprising 4 32 × 8-bit multipliers as an example, when a 32-bit variable × 32-bit constant operation is performed, the number of 1 bits in the 32-bit constant does not exceed 32/4, i.e., 8.
As mentioned above, the selector corresponding to each partial product generation unit may determine, according to the control signal, whether to output the result produced by the partial product generation unit or that of the constant multiplication unit to the subsequent partial product compression unit. Optionally, the partial product compression unit may also selectively deliver the result obtained after compression according to the control signal. Optionally, the output unit may also select whether to output one second product or a plurality of first products according to the control signal.
With continued reference to Fig. 3, the input unit may assign the multiplicands and multipliers to be multiplied to each partial product generation unit or constant multiplication unit according to the control signal. The input unit may also provide control signals to the respective selectors and the output unit. The outputs of each partial product generation unit and its corresponding constant multiplication unit are connected to a selector. The selector chooses, according to the control signal, whether to provide the partial products generated by the corresponding partial product generation unit or those of the constant multiplication unit to the partial product compression unit. Still taking the 32-bit SIMD multiplier comprising 4 32 × 8-bit multipliers as an example, each partial product generation unit outputs 8 partial products. If a constant multiplication unit corresponding to each partial product generation unit is added to the multiplier to perform a 32-bit variable × 32-bit constant multiplication, then, since the number of 1 bits in the 32-bit constant does not exceed 8, the number of partial products generated by the added constant multiplication unit also does not exceed 8, so that the partial product compression unit of the existing multiplier can essentially be reused directly.
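The constraint on the constant can be checked with a simple population count. The sketch below is illustrative only; the function name `fits_constant_unit` is hypothetical. The 32-bit FNV prime 0x01000193 (16777619) is used as the example because the document cites FNV hashing as a motivating workload.

```python
def fits_constant_unit(con: int, width: int = 32, units: int = 4) -> bool:
    """True if `con` is a `width`-bit constant with at most width // units one-bits."""
    if not 0 <= con < (1 << width):
        return False
    return bin(con).count("1") <= width // units


FNV_PRIME_32 = 0x01000193  # 16777619, the 32-bit FNV prime

print(bin(FNV_PRIME_32).count("1"))      # 6
print(fits_constant_unit(FNV_PRIME_32))  # True
print(fits_constant_unit(0x0000FFFF))    # False (16 one-bits)
```

The FNV prime has six 1 bits (positions 24, 8, 7, 4, 1 and 0), so it stays within the limit of 8 for the 32-bit, four-unit configuration.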
In this embodiment, the partial product compression unit may perform carry-save adder accumulation compression on all partial products from each selector to obtain a sum signal and a carry signal, and output the sum signal and the carry signal to the final product synthesis unit. The partial product compression unit can also carry out carry-save adder accumulation compression on the partial products provided by each selector respectively to obtain a sum signal and a carry signal, and the sum signal and the carry signal are output to the adder corresponding to the selector to be combined to generate a first product, and then the first product is provided to the output unit. The final product synthesis unit combines the sum signal from the partial product compression unit and the carry signal to generate a second product, and provides the second product to the output unit. The output unit may selectively output the plurality of first products generated from the respective adders or output the second product generated by the final product combining unit according to a control signal from the input unit. For example, when the control signal is indicative that a normal multiplication operation is being performed, the output unit outputs the second product; the output unit outputs a plurality of first products when the control signal indicates that the multi-path parallel constant multiplication operation is being performed.
Compared with a conventional general-purpose 32 × 32 multiplier, the 32 × 32 multiplier according to the embodiment of the present invention shown in fig. 3 only needs to add 4 constant multiplication units and 4 32-bit adders while reusing the existing partial product compression unit circuit, so that either one 32 × 32-bit multiplication or a 4-way parallel 32 × Con multiplication (e.g., an FNV operation) can be completed in one cycle, thereby improving the data processing efficiency of the multiplier.
The operation of the partial product in the multiplier according to the embodiment of the present invention is described in more detail below with reference to fig. 3. Taking a 32 × 32-bit multiplication A[31:0] × B[31:0] as an example, the operation splits the multiplier B[31:0] into 4 portions of 8 bits: {B[31:24], B[23:16], B[15:8], B[7:0]}, and the partial products are generated by the four modules multiplier a, b, c and d. The 8 partial products output by each module after encoding are as follows:
mul0_pp_{i}=(src1[31:0]*src2[0+i])<<i;
mul1_pp_{i}=(src1[31:0]*src2[8+i])<<i;
mul2_pp_{i}=(src1[31:0]*src2[16+i])<<i;
mul3_pp_{i}=(src1[31:0]*src2[24+i])<<i;
where i is a natural number in [0,7].
Since each of the four modules multiplier a, b, c and d outputs 8 partial products, the constant multiplication units a, b, c and d corresponding to them should each output no more than 8 partial products, so that the subsequent Wallace compression tree structure can be reused in full to compress the partial products. Therefore, in the constant multiplication supported by the 32-bit SIMD multiplier with 4 partial product generation units shown in fig. 3, the number of 1 bits in the constant operand should be less than or equal to 32/4 = 8. If the number of 1 bits in the constant is less than 8, the output partial products can be padded to 8 with zeros.
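The partial product formulas for the ordinary multiplication can be checked with a short Python sketch. It is a behavioral illustration only. The formulas above show the shift by i inside each module; the sketch additionally assumes that the outputs of module k are weighted by a further 8·k bits when they are accumulated, and that the full 64-bit product is kept. Both points, and the function names, are assumptions not stated in the text.

```python
def module_partial_products(src1, src2, k):
    """Eight partial products of module k (k = 0..3), following mul{k}_pp_{i}."""
    pps = []
    for i in range(8):
        bit = (src2 >> (8 * k + i)) & 1   # src2[8k + i]
        pps.append((src1 * bit) << i)     # (src1 * src2[8k + i]) << i
    return pps


def multiply_32x32(src1, src2):
    """Accumulate the 4 x 8 partial products into the full product."""
    total = 0
    for k in range(4):
        # Assumption: module k's outputs carry an extra weight of 8*k bits.
        total += sum(module_partial_products(src1, src2, k)) << (8 * k)
    return total


a, b = 0xDEADBEEF, 0x01000193
assert multiply_32x32(a, b) == a * b
```

Each module always emits exactly 8 partial products, which is the reason the constant multiplication units are limited to 8 partial products as well.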
In this example, the encoding process of each constant multiplication unit is as follows. Assume that the constant involved in the multiplication is Con and that the bits of Con equal to 1 are Con[j0], Con[j1], Con[j2], Con[j3], Con[j4], Con[j5], Con[j6] and Con[j7], where j0, j1, j2, j3, j4, j5, j6 and j7 denote the positions of the 1 bits in the constant Con. The 8 partial products output by the constant multiplication unit after encoding are then as follows:
FNV1_pp0=src1[31:0]<<j0;
FNV1_pp1=src1[31:0]<<j1;
FNV1_pp2=src1[31:0]<<j2;
FNV1_pp3=src1[31:0]<<j3;
FNV1_pp4=src1[31:0]<<j4;
FNV1_pp5=src1[31:0]<<j5;
FNV1_pp6=src1[31:0]<<j6;
FNV1_pp7=src1[31:0]<<j7;
If the number of 1 bits in the constant Con is smaller than 8, partial products are generated only for the bits of Con that are 1, and the remaining partial products are set to 0 so that 8 partial products are still output.
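The encoding rule of the constant multiplication unit can be sketched in Python as follows. This is a minimal behavioral model under stated assumptions: the names `set_bit_positions` and `constant_partial_products` are hypothetical, and the partial products are truncated to 32 bits, so the result is the low 32 bits of src1 × Con.

```python
MASK32 = 0xFFFFFFFF
SLOTS = 8  # 32-bit width / 4 partial product generation units


def set_bit_positions(con):
    """Positions j0, j1, ... of the bits of the 32-bit constant that are 1."""
    return [j for j in range(32) if (con >> j) & 1]


def constant_partial_products(src1, con):
    """One shifted copy of src1 per 1 bit of con, zero-padded to 8 entries."""
    positions = set_bit_positions(con)
    if len(positions) > SLOTS:
        raise ValueError("constant has more than 8 one-bits")
    pps = [(src1 << j) & MASK32 for j in positions]   # FNV1_pp = src1 << j
    return pps + [0] * (SLOTS - len(pps))             # padded partial products are 0


pps = constant_partial_products(0x811C9DC5, 0x01000193)
assert len(pps) == 8
assert sum(pps) & MASK32 == (0x811C9DC5 * 0x01000193) & MASK32
```

Because the constant is known in advance, each partial product is a fixed shift of src1, so no multiplier bit has to be examined at run time.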
The following description takes the constant Con = 0x1000193 as an example; this constant is used only for illustration and not as a limitation. The bits of the constant 0x1000193 that are 1 are encoded. The number of 1 bits is 6, which is less than or equal to 8, so the constant is suitable for the multiplier structure of the present invention. The 6 bits of Con equal to 1 are Con[0], Con[1], Con[4], Con[7], Con[8] and Con[24]; the constant multiplication unit encodes them into 6 partial products, and the other 2 padded partial products are 0. The partial products are as follows:
FNV1_pp0[31:0]={src1[31:0]};
FNV1_pp1[31:0]={src1[30:0],1'b0};
FNV1_pp2[31:0]={src1[27:0],4'b0};
FNV1_pp3[31:0]={src1[24:0],7'b0};
FNV1_pp4[31:0]={src1[23:0],8'b0};
FNV1_pp5[31:0]={src1[7:0],24'b0};
FNV1_pp6[31:0]=32'b0;
FNV1_pp7[31:0]=32'b0.
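The eight partial products listed for Con = 0x1000193 can be verified numerically. The Python sketch below models each concatenation {src1[msb:0], n'b0} by masking and shifting, and checks that the sum of the partial products equals the low 32 bits of src1 × 0x1000193. The helper name `concat_low_zeros` and the test value of src1 are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def concat_low_zeros(src1, msb, zeros):
    """Model {src1[msb:0], zeros'b0} as a 32-bit value."""
    kept = src1 & ((1 << (msb + 1)) - 1)   # src1[msb:0]
    return kept << zeros                   # append `zeros` zero bits on the right


def fnv_partial_products(src1):
    """FNV1_pp0 .. FNV1_pp7 for the constant 0x1000193."""
    return [
        src1 & MASK32,                     # FNV1_pp0 = {src1[31:0]}
        concat_low_zeros(src1, 30, 1),     # FNV1_pp1
        concat_low_zeros(src1, 27, 4),     # FNV1_pp2
        concat_low_zeros(src1, 24, 7),     # FNV1_pp3
        concat_low_zeros(src1, 23, 8),     # FNV1_pp4
        concat_low_zeros(src1, 7, 24),     # FNV1_pp5
        0,                                 # FNV1_pp6, padding
        0,                                 # FNV1_pp7, padding
    ]


src1 = 0x89ABCDEF
pps = fnv_partial_products(src1)
assert all(p <= MASK32 for p in pps)
assert sum(pps) & MASK32 == (src1 * 0x1000193) & MASK32
```

In every entry the number of kept bits plus the number of appended zeros is 32, so each concatenation is simply a left shift of src1 truncated to 32 bits.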
In yet another embodiment of the present invention, there is also provided a digital processor supporting the above-described multi-way parallel N × N-bit constant multiplication. As shown in fig. 4, the digital processor 400 includes a controller 401 and a multiplier 402, where the multiplier 402 is a SIMD multiplier as described above in connection with fig. 3. The multiplier 402 performs the corresponding multiplication operation according to the control signal, the multiplicand and the multiplier received from the controller 401. To make better use of the multiplier of embodiments of the present invention, a new parallel constant multiplication SIMD instruction, which may include multiple multiplicands and multipliers, may be added to the instruction set of the processor.
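How software might use such a parallel constant multiplication instruction can be sketched as follows. The Python model below is purely illustrative: `simd_mul_const` stands in for a hypothetical 4-way instruction and is not an instruction defined in the text. It assumes the 32-bit FNV-1 variant (multiply by the prime, then XOR the next byte) with the standard 32-bit offset basis 0x811C9DC5, and it advances four independent hashes with one multiply step per input byte.

```python
MASK32 = 0xFFFFFFFF
FNV_PRIME = 0x01000193    # the constant used in the example above
FNV_OFFSET = 0x811C9DC5   # standard 32-bit FNV offset basis


def simd_mul_const(lanes, constants):
    """Hypothetical 4-way instruction: lane-wise 32 x Con multiply, low 32 bits."""
    return [(x * c) & MASK32 for x, c in zip(lanes, constants)]


def fnv1_four_streams(streams):
    """FNV-1 over four equal-length byte strings, all four advanced together."""
    if len(streams) != 4 or len({len(s) for s in streams}) != 1:
        raise ValueError("expected four byte strings of equal length")
    hashes = [FNV_OFFSET] * 4
    for column in zip(*streams):                      # one byte from each stream
        hashes = simd_mul_const(hashes, [FNV_PRIME] * 4)
        hashes = [h ^ byte for h, byte in zip(hashes, column)]
    return hashes


def fnv1_scalar(data):
    """Reference FNV-1 for a single byte string."""
    h = FNV_OFFSET
    for byte in data:
        h = ((h * FNV_PRIME) & MASK32) ^ byte
    return h


streams = [b"abcd", b"efgh", b"ijkl", b"mnop"]
assert fnv1_four_streams(streams) == [fnv1_scalar(s) for s in streams]
```

In this model the four lanes share the same constant, but the instruction format described above allows a separate multiplier per lane, which is why `simd_mul_const` takes a list of constants.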
It should be appreciated that the above-described embodiments may be implemented in any number of ways. For example, the embodiments may be implemented using hardware, software, or a combination of hardware and software. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers.
Reference in the specification to "various embodiments," "some embodiments," "one embodiment," or "an embodiment," etc., means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in various embodiments," "in some embodiments," "in one embodiment," or "in an embodiment," or the like, in various places throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Thus, a particular feature, structure, or characteristic illustrated or described in connection with one embodiment may be combined, in whole or in part, with a feature, structure, or characteristic of one or more other embodiments without limitation, as long as the combination is not illogical or inoperative.
The terms "comprises," "comprising," and "having," and similar referents in this specification, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The word "a" or "an" does not exclude a plurality. Additionally, the various elements of the drawings of the present application are merely schematic illustrations and are not drawn to scale.
Although the present invention has been described by the above embodiments, the present invention is not limited to the embodiments described herein, and various changes and modifications may be made without departing from the scope of the present invention.

Claims (8)

1. A SIMD multiplier, comprising: an input unit, an output unit, a plurality of partial product generation units, a partial product compression unit, a final product synthesis unit, a constant multiplication unit corresponding to each partial product generation unit, a selector, and an adder, wherein:
the input unit is used for distributing corresponding multiplicands and multipliers to each partial product generation unit or each constant multiplication unit according to received control signals;
each selector is used for selecting to provide the partial product generated by the partial product generating unit or the corresponding constant multiplying unit thereof to the partial product compressing unit according to the control signal;
a partial product compressing unit for compressing the partial products received from the selectors to obtain sum signals and carry signals and supplying the sum signals and the carry signals to the final product synthesizing unit, and for compressing the partial products supplied by each selector to obtain sum signals and carry signals and supplying the sum signals and the carry signals to the corresponding adder to be combined to generate a first product;
a final product synthesis unit for combining the sum signal from the partial product compression unit and the carry signal to generate a second product;
and the output unit is used for outputting the first product generated by each adder or the second product generated by the final product synthesis unit according to the control signal.
2. The multiplier of claim 1, wherein the multiplicand and the multiplier assigned to each constant multiplication unit have the same number of bits, the multiplier is a constant, and the number of bits equal to 1 in the multiplier does not exceed the ratio of the bit width of the multiplier to the number of partial product generation units.
3. The multiplier of claim 1, wherein the input unit is configured to:
in response to a control signal indicating that a normal multiplication operation is performed, segmenting the multiplier by the number of partial product generating units, and distributing the multiplicand and the segmented multiplier to each partial product generating unit;
each multiplicand and corresponding multiplier are assigned to each constant multiplication unit in response to control signals indicating that constant multiplication operations are to be performed.
4. The multiplier of claim 1, wherein the partial product compression unit is a Wallace tree structure that performs 4-2 compression using carry-save adders.
5. The multiplier of any of claims 1-4, characterized in that it is 32, 64 or 128 bits wide.
6. The multiplier of claim 5, wherein the number of partial product generation units is 2, 4, 8 or 16.
7. The multiplier of claim 3, wherein the constant multiplication operation is an FNV operation.
8. A digital processor comprising a controller and a multiplier according to any of claims 1-7, wherein the multiplier performs a corresponding multiplication operation in accordance with a control signal, a multiplicand and a multiplier received from the controller.
CN202111643953.6A 2021-12-30 2021-12-30 SIMD multiplier and digital processor Pending CN114327640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111643953.6A CN114327640A (en) 2021-12-30 2021-12-30 SIMD multiplier and digital processor

Publications (1)

Publication Number Publication Date
CN114327640A true CN114327640A (en) 2022-04-12

Family

ID=81016655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111643953.6A Pending CN114327640A (en) 2021-12-30 2021-12-30 SIMD multiplier and digital processor

Country Status (1)

Country Link
CN (1) CN114327640A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024027226A1 (en) * 2022-08-04 2024-02-08 深圳市中兴微电子技术有限公司 Multiplying unit
CN115796239A (en) * 2022-12-14 2023-03-14 北京登临科技有限公司 AI algorithm architecture implementation device, convolution calculation unit and related method and equipment
CN115796239B (en) * 2022-12-14 2023-10-31 北京登临科技有限公司 Device for realizing AI algorithm architecture, convolution computing device, and related methods and devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination