TWI763079B

TWI763079B - Multiplier and method for floating-point arithmetic, integrated circuit chip, and computing device

Info

Publication number: TWI763079B
Application number: TW109135585A
Authority: TW
Inventors: 張堯; 劉少禮
Original assignee: 大陸商安徽寒武紀信息科技有限公司
Priority date: 2019-10-14
Filing date: 2020-10-14
Publication date: 2022-05-01
Also published as: TW202115560A; CN112732220A; CN112732221A

Abstract

The present invention relates to a multiplier and a method for floating-point arithmetic, an integrated circuit chip, and a computing device. The computing device can be included in a combined processing device. The combined processing device may also include a universal interconnection interface and one or more other processing devices. The computing device interacts with the other processing devices to jointly complete an arithmetic operation specified by a user. The combined processing device may also include a storage device, which is connected to the computing device and the other processing devices and is used to store their data. The schemes of the present invention can be widely applied to various floating-point arithmetic operations.

Description

Multiplier, method, integrated circuit chip, and computing device for floating-point operations

The present disclosure relates to the technical field of floating-point operations, and more particularly to a method, a multiplier, an integrated circuit chip, and a computing device for floating-point operations.

Many current signal processing algorithms, such as inner products between vectors and convolutions of matrices, use a large number of multiply-add operations, and the efficiency of these operations often depends on the execution speed of the multiplier. Although current multipliers have improved significantly in execution efficiency, there is still room for improvement in how they handle floating-point data. How to obtain a high-efficiency, low-power, and low-cost multiplier for multiplying floating-point data is therefore a problem that the prior art needs to solve.

To at least partially solve the technical problems mentioned above, the present disclosure provides a multiplier for floating-point operations, a method, an integrated circuit chip including the multiplier, and a computing device.

In one aspect, the present disclosure provides a multiplier for performing a multiplication operation on floating-point numbers, wherein the multiplier includes a mantissa processing unit for obtaining the mantissa after the multiplication operation from the mantissas of the floating-point numbers. The mantissa processing unit includes a control circuit, and the control circuit is configured to invoke the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time.

In another aspect, the present disclosure provides a method for performing a floating-point multiplication operation using a multiplier, wherein a mantissa processing unit of the multiplier is used to obtain the mantissa after the multiplication operation from the mantissas of the floating-point numbers. The mantissa processing unit includes a control circuit, and the control circuit is configured to invoke the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time.

In yet another aspect, the present disclosure provides an integrated circuit chip including the multiplier. In one or more embodiments, the multiplier of the present disclosure may constitute an independent integrated circuit chip, or may be arranged on an integrated circuit chip or a computing device, to perform operations on floating-point numbers in a variety of data formats.

With the multiplier, the corresponding operation method, the integrated circuit chip, and the computing device of the present disclosure, operations on multiple floating-point data types can be supported without providing a separate multiplier for each type. The multiplier of the present disclosure is therefore flexible and can be widely applied to various floating-point data operations. In addition, when processing input data with a larger bit width, the multiplier supports iterative reuse of the same hardware, so that no additional processing chips need to be arranged, which also reduces the layout area of the integrated circuit.
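The iterative reuse mentioned above can be illustrated with a small software model. This is only a sketch under assumptions: the text does not say how a wide mantissa is split or how the partial results are combined, so the chunking scheme and the names `mantissa_unit` and `wide_mantissa_multiply` are hypothetical.

```python
from typing import List, Tuple


def mantissa_unit(a: int, b: int, width: int) -> int:
    """Model of a fixed-width datapath: multiplies operands of at most `width` bits."""
    assert 0 <= a < (1 << width) and 0 <= b < (1 << width)
    return a * b


def split_chunks(value: int, width: int) -> List[int]:
    """Split a non-negative integer into `width`-bit chunks, least significant first."""
    mask = (1 << width) - 1
    chunks = []
    while True:
        chunks.append(value & mask)
        value >>= width
        if value == 0:
            return chunks


def wide_mantissa_multiply(a: int, b: int, width: int) -> Tuple[int, int]:
    """Multiply mantissas wider than the unit by invoking the unit once per chunk pair.

    Returns (product, number_of_invocations).
    """
    product, calls = 0, 0
    for i, chunk_a in enumerate(split_chunks(a, width)):
        for j, chunk_b in enumerate(split_chunks(b, width)):
            # Each invocation contributes a partial result at weight 2**(width * (i + j)).
            product += mantissa_unit(chunk_a, chunk_b, width) << (width * (i + j))
            calls += 1
    return product, calls
```

For example, two 24-bit significands processed on an assumed 12-bit unit need four invocations, while operands that already fit need one.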

The technical solutions of the present disclosure generally provide a multiplier, a method, an integrated circuit chip, and a computing device for floating-point operations. Unlike prior-art floating-point multipliers, the multiplier of the present disclosure supports multiple operation modes, thereby overcoming the limitation that existing multipliers support only one type of floating-point operation. In particular, the present disclosure uses multiple operation modes to indicate different floating-point data types, and during the multiplication of floating-point numbers it performs the various operations on the data, including for example encoding, compression, summation, normalization, and rounding, based on one of the operation modes, so as to implement the operation associated with one of the floating-point data types. The multiplier of the present disclosure can thus operate in multiple modes, which further improves the flexibility of floating-point operations and reduces their cost.

The technical solutions of the present disclosure and their various embodiments are described in detail below with reference to the accompanying drawings. It should be understood that numerous specific details regarding floating-point operations are set forth in order to provide a thorough understanding of the embodiments described herein. However, a person of ordinary skill in the art, given the teachings of the present disclosure, may practice the described embodiments without these specific details. In other instances, well-known methods, procedures, and components are not described in detail, so as not to unnecessarily obscure the described embodiments. Furthermore, this description should not be construed as limiting the scope of the embodiments of the present disclosure.

FIG. 1 is a schematic diagram illustrating a floating-point data format 100 according to an embodiment of the present disclosure. As shown in FIG. 1, a floating-point number to which the technical solutions of the present disclosure can be applied may include three parts: a sign (or sign bit 102), an exponent (or exponent bits 104), and a mantissa (or mantissa bits 106). An unsigned floating-point number may have no sign or sign bit. In some embodiments, the floating-point numbers suitable for the multiplier of the present disclosure may include at least one of a half-precision floating-point number, a single-precision floating-point number, a brain floating-point number, a double-precision floating-point number, and a custom floating-point number. Specifically, in some embodiments, the floating-point format may conform to the IEEE 754 standard, for example a double-precision floating-point number (float64, abbreviated "FP64"), a single-precision floating-point number (float32, abbreviated "FP32"), or a half-precision floating-point number (float16, abbreviated "FP16"). In other embodiments, the floating-point format may be the existing 16-bit brain floating-point number (bfloat16, abbreviated "BF16"), or a custom floating-point format, such as an 8-bit brain floating-point number (bfloat8, abbreviated "BF8"), an unsigned half-precision floating-point number (unsigned float16, abbreviated "UFP16"), or an unsigned 16-bit brain floating-point number (unsigned bfloat16, abbreviated "UBF16"). For ease of understanding, Table 1 below shows some of these data formats; the sign, exponent, and mantissa bit widths are given for illustrative purposes only.

Table 1

Data type   Sign bit width   Exponent bit width   Mantissa bit width
FP16        1                5                    10
BF16        1                8                    7
FP32        1                8                    23
BF8         1                5                    3
UFP16       0                5 (or 6)             11 (or 10)
UBF16       0                8                    8
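The field widths of Table 1 can be captured as data and used to split a raw bit pattern into sign, exponent, and mantissa fields. The following is a minimal sketch; the names `FloatFormat`, `FORMATS`, and `split_fields` are hypothetical, and UFP16 uses the first of the two alternatives listed in the table.

```python
from collections import namedtuple

FloatFormat = namedtuple("FloatFormat", "sign_bits exp_bits man_bits")

# Widths copied from Table 1 (given there for illustration only).
FORMATS = {
    "FP16": FloatFormat(1, 5, 10),
    "BF16": FloatFormat(1, 8, 7),
    "FP32": FloatFormat(1, 8, 23),
    "BF8": FloatFormat(1, 5, 3),
    "UFP16": FloatFormat(0, 5, 11),  # Table 1 also allows 6 exponent / 10 mantissa bits
    "UBF16": FloatFormat(0, 8, 8),
}


def split_fields(bits: int, name: str):
    """Return (sign, exponent_field, mantissa_field) for a raw bit pattern."""
    fmt = FORMATS[name]
    mantissa = bits & ((1 << fmt.man_bits) - 1)
    exponent = (bits >> fmt.man_bits) & ((1 << fmt.exp_bits) - 1)
    # Unsigned formats have no sign bit; report the sign as 0.
    sign = (bits >> (fmt.man_bits + fmt.exp_bits)) & 1 if fmt.sign_bits else 0
    return sign, exponent, mantissa
```

For instance, the FP16 bit pattern `0x3C00` (the value 1.0) splits into sign 0, exponent field 15, and mantissa field 0.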

For the floating-point formats mentioned above, the multiplier of the present disclosure can at least support multiplication between two floating-point numbers in any of these formats, where the two numbers may have the same or different floating-point data formats. For example, the multiplication may be FP16*FP16, BF16*BF16, FP32*FP32, FP32*BF16, FP16*BF16, FP32*FP16, BF8*BF16, UBF16*UFP16, or UBF16*FP16.

FIG. 2 is a schematic structural block diagram illustrating a multiplier 200 according to an embodiment of the present disclosure. As mentioned above, the multiplier of the present disclosure supports multiplication of floating-point numbers in various data formats, and these data formats can be indicated by the operation modes of the present disclosure, so that the multiplier works in one of multiple operation modes.

As shown in FIG. 2, the multiplier of the present disclosure may generally include an exponent processing unit 202 and a mantissa processing unit 204, where the exponent processing unit processes the exponent bits of a floating-point number and the mantissa processing unit processes its mantissa bits. Optionally or additionally, in some embodiments, when the floating-point numbers processed by the multiplier have a sign bit, the multiplier may further include a sign processing unit 206 for processing floating-point numbers that include a sign bit.

In operation, the multiplier may perform a floating-point operation on a received, input, or buffered first floating-point number and second floating-point number according to one of the operation modes, where each of the two numbers has one of the floating-point data formats discussed above. For example, in a first operation mode the multiplier supports the multiplication FP16*FP16, and in a second operation mode it supports BF16*BF16. Similarly, in a third operation mode it supports FP32*FP32, and in a fourth operation mode it supports FP32*BF16. The example correspondence between operation modes and floating-point types is shown in Table 2 below.

Table 2

Operation mode number   Operand floating-point types
1                       FP16*FP16
2                       BF16*BF16
3                       FP32*FP32
4                       FP32*BF16

In one embodiment, Table 2 above may be stored in a memory of the multiplier, and the multiplier selects one of the operation modes in the table according to an instruction received from an external device, which may be, for example, the external device 1012 shown in FIG. 10. In another embodiment, the operation mode may also be supplied automatically by the mode selection unit 308 shown in FIG. 3. For example, when two FP16 floating-point numbers are input to the multiplier of the present disclosure, the mode selection unit can select the first operation mode according to the data formats of the two numbers. As another example, when an FP32 floating-point number and a BF16 floating-point number are input, the mode selection unit can select the fourth operation mode according to their data formats.
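The automatic selection described above amounts to a table lookup keyed by the two input formats. A minimal sketch follows, using the entries of Table 2; the names `MODE_TABLE` and `select_mode` are hypothetical, and how the hardware detects the input formats is not modeled here.

```python
# Operation modes from Table 2, keyed by (first operand format, second operand format).
MODE_TABLE = {
    ("FP16", "FP16"): 1,
    ("BF16", "BF16"): 2,
    ("FP32", "FP32"): 3,
    ("FP32", "BF16"): 4,
}


def select_mode(first_format: str, second_format: str) -> int:
    """Return the operation mode number for a pair of input formats."""
    try:
        return MODE_TABLE[(first_format, second_format)]
    except KeyError:
        raise ValueError(
            "no operation mode for %s*%s" % (first_format, second_format)
        ) from None
```

With this table, two FP16 inputs select mode 1, and an FP32 input paired with a BF16 input selects mode 4.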

As can be seen, the different operation modes of the present disclosure are associated with corresponding floating-point data types. That is, an operation mode can indicate the data format of the first floating-point number and the data format of the second floating-point number. In another embodiment, an operation mode can indicate not only the data formats of the first and second floating-point numbers but also the data format of the result of the multiplication. The operation modes extended from Table 2 are shown in Table 3 below.

Table 3

Operation mode number   Operand floating-point types   Output result type
11                      FP16*FP16                      FP16
12                      FP16*FP16                      BF16
13                      FP16*FP16                      FP32
21                      BF16*BF16                      FP16
22                      BF16*BF16                      BF16
23                      BF16*BF16                      FP32
31                      FP32*FP32                      FP16
32                      FP32*FP32                      BF16
33                      FP32*FP32                      FP32
41                      FP32*BF16                      FP16
42                      FP32*BF16                      BF16
43                      FP32*BF16                      FP32

Unlike the operation mode numbers shown in Table 2, the operation mode numbers in Table 3 are extended by one digit that indicates the data format of the result of the floating-point multiplication. For example, when the multiplier works in operation mode 21, it multiplies the two input BF16 floating-point numbers and outputs the product in the FP16 data format.

Indicating floating-point data formats by numbered operation modes, as above, is merely exemplary and not limiting. Based on the teachings of the present disclosure, it is also conceivable to build indexes into the operation mode in order to determine the formats of the multiplier and the multiplicand. For example, the operation mode may include two indexes, the first indicating the type of the first floating-point number and the second indicating the type of the second floating-point number. In operation mode 13, the first index "1" indicates that the first floating-point number (the multiplicand) is in the first floating-point format, namely FP16, and the second index "3" indicates that the second floating-point number (the multiplier) is in the second floating-point format, namely FP32. Further, a third index indicating the data format of the output result may be added to the operation mode. For example, the third index "1" in operation mode 131 can indicate that the output result is in the first floating-point format, namely FP16. As the number of operation modes grows, further indexes or index levels can be added as needed to establish the relationship between operation modes and data formats.
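The index-based scheme can be sketched as a simple digit decoder. Note that this scheme is separate from the numbering of Table 3, where mode 13 means FP16*FP16 with FP32 output. The text only fixes index "1" as FP16 and index "3" as FP32; mapping index "2" to BF16 is an assumption, and the names `INDEX_TO_FORMAT` and `decode_mode` are hypothetical.

```python
# Index digits to formats. "1" and "3" follow the text; "2" -> BF16 is an assumption.
INDEX_TO_FORMAT = {"1": "FP16", "2": "BF16", "3": "FP32"}


def decode_mode(mode: int):
    """Decode a two- or three-index mode into (first, second, output) formats.

    The output format is None when the mode carries no third index.
    """
    digits = str(mode)
    if len(digits) not in (2, 3):
        raise ValueError("expected a mode with two or three indexes")
    formats = [INDEX_TO_FORMAT[d] for d in digits]
    output = formats[2] if len(formats) == 3 else None
    return formats[0], formats[1], output
```

Under these assumptions, mode 13 decodes to an FP16 multiplicand and an FP32 multiplier, and mode 131 additionally requests FP16 output.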

In addition, although operation modes are referred to here by numbers, in other examples they may be referred to by other symbols or codes according to application needs, for example letters, symbols, digits, or combinations thereof, and such an expression both designates the operation mode and identifies the data formats of the first floating-point number, the second floating-point number, and the output result. When such an expression takes the form of an instruction, the instruction may include three fields: a first field indicating the data format of the first floating-point number, a second field indicating the data format of the second floating-point number, and a third field indicating the data format of the output result. Of course, these fields may also be merged into a single field, or new fields may be added to indicate further information related to floating-point data formats. As can be seen, the operation mode of the present disclosure can be associated not only with the data formats of the input floating-point numbers but can also be used to normalize the output result so as to obtain a product in the desired data format.

FIG. 3 is a block diagram illustrating the structure of a multiplier 300 in more detail according to an embodiment of the present disclosure. FIG. 3 not only includes the exponent processing unit 202, the mantissa processing unit 204, and the optional sign processing unit 206 shown in FIG. 2, but also shows the internal components these units may include and the units related to their operation. Exemplary operations of these units are described in detail below with reference to FIG. 3.

To perform the multiplication of floating-point numbers, the exponent processing unit can obtain the exponent after the multiplication operation from the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number. In one embodiment, the exponent processing unit may be implemented by an addition and subtraction circuit. For example, the exponent processing unit may add the exponent of the first floating-point number, the exponent of the second floating-point number, and the offset values of their respective input floating-point data formats, and then subtract the offset value of the output floating-point data format, to obtain the exponent of the product of the first and second floating-point numbers.
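As a rough illustration of this add-and-subtract structure, the sketch below uses the IEEE 754 convention in which the stored exponent field equals the true exponent plus a positive bias. Under that convention the input biases are removed and the output bias is applied; whether this matches the sign convention of the "offset values" in the text is an assumption, as are the names `BIAS` and `product_exponent`. The result is the exponent before any normalization adjustment.

```python
# Exponent biases under the IEEE 754 convention (BF16 shares the FP32 bias).
BIAS = {"FP16": 15, "BF16": 127, "FP32": 127}


def product_exponent(exp_a: int, fmt_a: str, exp_b: int, fmt_b: str, fmt_out: str) -> int:
    """Biased exponent field of the product, before normalization.

    exp_a and exp_b are the stored (biased) exponent fields of normal inputs.
    """
    true_exponent = (exp_a - BIAS[fmt_a]) + (exp_b - BIAS[fmt_b])
    return true_exponent + BIAS[fmt_out]
```

For example, FP16 2.0 (exponent field 16) times BF16 4.0 (exponent field 129) has true exponent 3, which is stored as 130 in an FP32 result.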

Further, the mantissa processing unit of the multiplier can obtain the mantissa after the multiplication operation from the operation mode, the first floating-point number, and the second floating-point number. In one embodiment, the mantissa processing unit may include a partial product operation unit 312 and a partial product summation unit 314. The partial product operation unit obtains intermediate results from the mantissa of the first floating-point number and the mantissa of the second floating-point number. In some embodiments, the intermediate results may be the multiple partial products obtained while multiplying the first and second floating-point numbers (as shown schematically in FIG. 5 and FIG. 6). The partial product summation unit sums the intermediate results to obtain a summation result and uses the summation result as the mantissa after the multiplication operation.

To obtain the intermediate results, in one embodiment the present disclosure uses a Booth encoding circuit that pads the high and low bits of the mantissa of the second floating-point number (for example, the multiplier in the floating-point operation) with 0, where padding the high bits with 0 converts the mantissa from an unsigned number to a signed number. It should be understood that, depending on the encoding method, the mantissa of the first floating-point number (for example, the multiplicand in the floating-point operation) may be encoded instead (for example, by padding its high and low bits with 0), or both mantissas may be encoded, in order to obtain the multiple partial products. Partial products are described further with reference to the accompanying drawings.
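The zero padding and encoding step can be illustrated with radix-4 Booth recoding. The text does not state which Booth radix is used, so radix 4 is an assumption here, and the names `booth_radix4_digits` and `booth_partial_products` are hypothetical.

```python
from typing import List


def booth_radix4_digits(multiplier: int, width: int) -> List[int]:
    """Recode an unsigned `width`-bit multiplier into radix-4 Booth digits in {-2..2}.

    A 0 is appended below the lowest bit, and 0s are added above the highest bit
    so that the value reads as a non-negative signed number of even length.
    """
    padded_width = width + 1          # one high 0 makes the unsigned value signed
    if padded_width % 2:
        padded_width += 1             # radix-4 consumes two bits per digit
    bits = multiplier << 1            # the low padding bit y[-1] = 0
    digits = []
    for i in range(padded_width // 2):
        window = (bits >> (2 * i)) & 0b111      # bits y[2i+1], y[2i], y[2i-1]
        low, mid, high = window & 1, (window >> 1) & 1, (window >> 2) & 1
        digits.append(low + mid - 2 * high)
    return digits


def booth_partial_products(multiplicand: int, multiplier: int, width: int) -> List[int]:
    """One signed partial product per Booth digit, already weighted by 4**i."""
    return [
        digit * multiplicand * (1 << (2 * i))
        for i, digit in enumerate(booth_radix4_digits(multiplier, width))
    ]
```

Summing the partial products reproduces the ordinary product, and an 11-bit significand (FP16 with its hidden 1) yields six partial products instead of eleven.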

In another embodiment, the partial product summation unit may include an adder that sums the intermediate results to obtain the summation result. In yet another embodiment, the partial product summation unit includes a Wallace tree and an adder, where the Wallace tree sums the intermediate results to obtain second intermediate results, and the adder sums the second intermediate results to obtain the summation result. In these embodiments, the adder may include at least one of a full adder, a serial adder, and a carry-lookahead adder.
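The two-step summation (a Wallace tree followed by an adder) can be modeled with word-level 3:2 carry-save compressors. This is a behavioral sketch only: it uses plain shift-and-add partial products rather than Booth-encoded ones, and the names `carry_save_add`, `wallace_reduce`, and `multiply_mantissas` are hypothetical.

```python
from typing import List, Tuple


def carry_save_add(a: int, b: int, c: int) -> Tuple[int, int]:
    """3:2 compressor on whole words: a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def wallace_reduce(terms: List[int]) -> List[int]:
    """Reduce non-negative terms to at most two using layers of 3:2 compressors."""
    terms = list(terms)
    while len(terms) > 2:
        full = len(terms) - len(terms) % 3
        next_terms = []
        for i in range(0, full, 3):
            next_terms.extend(carry_save_add(*terms[i:i + 3]))
        next_terms.extend(terms[full:])   # leftovers pass through to the next layer
        terms = next_terms
    return terms


def multiply_mantissas(a: int, b: int, width: int) -> int:
    """Partial products -> Wallace reduction -> one final carry-propagate addition."""
    partials = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    sum_word, carry_word = (wallace_reduce(partials) + [0, 0])[:2]
    return sum_word + carry_word
```

The two words left by the tree correspond to the second intermediate results, and the final `+` stands in for the adder.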

In one embodiment, the multiplier of the present disclosure further includes a regularization unit 318 and a rounding unit 320. The regularization unit can perform floating-point regularization on the mantissa and exponent after the multiplication operation to obtain a regularized exponent result and a regularized mantissa result, and use these as the exponent and mantissa after the multiplication operation. For example, according to the data format indicated by the operation mode, the regularization unit can adjust the bit widths of the exponent and the mantissa so that they meet the requirements of the indicated data format. The regularization unit can also make other adjustments to the exponent or mantissa. For example, in some application scenarios, when the value of the mantissa is not 0, the most significant bit of the mantissa should be 1; otherwise, the exponent bits can be modified and the mantissa bits shifted at the same time so that the number takes the normalized form. In another embodiment, the regularization unit may also adjust the exponent after the multiplication operation according to the mantissa after the multiplication operation. For example, when the highest bit of the mantissa after the multiplication operation is 1, the exponent obtained after the multiplication operation can be incremented by 1. Correspondingly, the rounding unit can perform a rounding operation on the regularized mantissa result according to a rounding mode and use the rounded mantissa as the mantissa after the multiplication operation. Depending on the application scenario, the rounding unit may perform rounding operations such as rounding down, rounding up, and rounding to the nearest representable value. In some application scenarios, the rounding unit may also round the 1s shifted out when the mantissa is shifted right.
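The exponent adjustment and rounding steps can be sketched for the simple case of two normal inputs that share one mantissa width. Round-to-nearest-even is chosen here as one of the modes the text allows; the function name `normalize_and_round` and its parameters are hypothetical, and exponent overflow and underflow are not handled.

```python
def normalize_and_round(product: int, exponent: int, man_bits_in: int, man_bits_out: int):
    """Normalize and round a significand product.

    product: integer product of two (man_bits_in + 1)-bit significands, hidden 1 included.
    Returns (mantissa_field, exponent) for an output with man_bits_out mantissa bits.
    """
    top = 2 * man_bits_in + 1              # index of the highest possible product bit
    if (product >> top) & 1:               # product in [2, 4): highest bit is 1
        exponent += 1
        fraction_bits = top
    else:                                  # product in [1, 2)
        fraction_bits = top - 1
    shift = fraction_bits - man_bits_out   # number of low bits to discard
    if shift <= 0:
        significand = product << -shift
    else:
        significand = product >> shift
        remainder = product & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        # Round to nearest, ties to even.
        if remainder > half or (remainder == half and significand & 1):
            significand += 1
    if significand >> (man_bits_out + 1):  # rounding carried out to 2.0
        significand >>= 1
        exponent += 1
    return significand & ((1 << man_bits_out) - 1), exponent
```

For FP16 inputs 1.5 and 1.5 (significand 0x600 each), the product 2.25 sets the highest bit, so the exponent is incremented and the mantissa field becomes 128, which encodes 1.125.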

In addition to the exponent processing unit and the mantissa processing unit, the multiplier of the present disclosure optionally includes a sign processing unit. When the input floating-point numbers carry a sign bit, the sign processing unit can obtain the sign after the multiplication operation from the sign of the first floating-point number and the sign of the second floating-point number. For example, in one embodiment, the sign processing unit may include an exclusive-OR logic circuit 322 that performs an exclusive-OR operation on the sign of the first floating-point number and the sign of the second floating-point number to obtain the sign after the multiplication operation. In another embodiment, the sign processing unit may also be implemented by a truth table or by logical judgment.
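Both variants of the sign logic are small enough to write out directly. In this sketch the names `product_sign` and `SIGN_TRUTH_TABLE` are hypothetical, and a sign bit of 1 is assumed to mean negative.

```python
def product_sign(sign_a: int, sign_b: int) -> int:
    """Sign bit of the product: exclusive-OR of the two input sign bits."""
    return sign_a ^ sign_b


# The equivalent truth-table form of the same logic.
SIGN_TRUTH_TABLE = {
    (0, 0): 0,  # positive * positive -> positive
    (0, 1): 1,  # positive * negative -> negative
    (1, 0): 1,  # negative * positive -> negative
    (1, 1): 0,  # negative * negative -> positive
}
```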

In addition, so that the input or received first and second floating-point numbers conform to the required format, in one embodiment the multiplier of the present disclosure may further include a normalization processing unit 324. When the first or second floating-point number is a denormalized non-zero floating-point number, this unit normalizes it according to the operation mode to obtain the corresponding exponent and mantissa. For example, when the selected operation mode is the second operation mode shown in Table 2 and the input first and second floating-point numbers are FP16 data, the normalization processing unit can normalize the FP16 data into BF16 data so that the multiplier can operate in the second operation mode. In one or more embodiments, the normalization processing unit can also preprocess (for example, extend) the mantissas of normalized floating-point numbers, which have an implicit 1, and of denormalized floating-point numbers, which do not, to facilitate the subsequent operation of the mantissa processing unit. From the above description it can be understood that the normalization processing unit 324 and the aforementioned regularization unit 318 may perform the same or similar operations in some embodiments. The difference is that the normalization processing unit 324 normalizes the input floating-point data, whereas the regularization unit 318 regularizes the mantissa and exponent that are about to be output.
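The mantissa preprocessing for inputs with and without an implicit 1 can be sketched as follows. The sketch assumes IEEE 754 style encoding of denormalized numbers (exponent field 0, true exponent 1 minus the bias) and ignores infinities and NaNs; the function name `unpack_operand` and its parameters are hypothetical.

```python
def unpack_operand(bits: int, exp_bits: int, man_bits: int, bias: int):
    """Return (sign, true_exponent, significand) with an explicit leading 1.

    The significand has man_bits + 1 bits for every non-zero input, so the value
    equals significand / 2**man_bits * 2**true_exponent.
    """
    sign = (bits >> (exp_bits + man_bits)) & 1
    exp_field = (bits >> man_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << man_bits) - 1)
    if exp_field != 0:
        # Normalized input: make the implicit 1 explicit.
        return sign, exp_field - bias, (1 << man_bits) | fraction
    if fraction == 0:
        return sign, 0, 0                 # zero has no leading 1 to expose
    # Denormalized input: shift left until the leading 1 reaches the hidden-bit position.
    exponent = 1 - bias
    while not (fraction >> man_bits) & 1:
        fraction <<= 1
        exponent -= 1
    return sign, exponent, fraction
```

For FP16, the smallest denormalized value `0x0001` unpacks to significand 1024 with true exponent -24, the same form as a normalized input.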

The multiplier of the present disclosure and its various embodiments have been described above with reference to FIG. 3. Based on this description, a person of ordinary skill in the art can understand that the solution of the present disclosure obtains the result of the multiplication (including the exponent, the mantissa, and the optional sign) by operating the multiplier. Depending on the application scenario, for example when the regularization and rounding described above are not required, the results obtained by the mantissa processing unit and the exponent processing unit can be regarded as the final operation result. When regularization and rounding are required, the exponent and mantissa obtained after regularization and rounding can be regarded as the final operation result, or as part of it (when the final sign is also considered). Further, the solution of the present disclosure uses multiple operation modes so that the multiplier supports floating-point numbers of different types or data formats, which allows the multiplier to be reused and thereby reduces both chip design overhead and computation cost. In addition, through the multiple-invocation mechanism, the multiplier of the present disclosure also supports the computation of floating-point numbers with large bit widths. Because the multiplication of the mantissas (also called the mantissa bits or mantissa part) is critical to the performance of the whole floating-point operation, the mantissa operation of the present disclosure is described below with reference to FIG. 4.

FIG. 4 is a schematic block diagram illustrating a mantissa processing unit operation 400 according to an embodiment of the present disclosure. As shown in FIG. 4, the mantissa processing operation of the present disclosure mainly involves two units, namely the partial product operation unit and the partial product summation unit discussed above in connection with FIG. 3. In terms of timing, the mantissa processing operation can be roughly divided into a first stage and a second stage. In the first stage the operation obtains an intermediate result, and in the second stage it obtains the mantissa result output from the adder 408.

In an exemplary operation, the first floating-point number and the second floating-point number received by the multiplier may each be divided into several parts, namely the aforementioned sign (optional), exponent, and mantissa. Optionally, after normalization, the mantissa parts of the two floating-point numbers are input to the mantissa processing unit (such as the mantissa processing unit in FIG. 2 or FIG. 3), and specifically to the partial product operation unit. As shown in FIG. 4, the present disclosure uses the Booth encoding circuit 402 to pad the high and low ends of the mantissa of the second floating-point number (i.e., the multiplier in the floating-point operation) with 0 and to perform Booth encoding, so that the intermediate result is obtained in the partial product generation circuit 404. The first and second floating-point numbers here are illustrative rather than limiting. In some application scenarios the first floating-point number may be the multiplier and the second floating-point number may be the multiplicand. Accordingly, in some encoding schemes the encoding operation may also be performed on the floating-point number that serves as the multiplicand.

For a better understanding of the technical solution of the present disclosure, Booth encoding is briefly introduced below. In general, when two binary numbers are multiplied, the multiplication produces a large number of intermediate results called partial products, which are then accumulated to obtain the final product. The more partial products there are, the larger the area and power consumption of the array multiplier, the slower its execution, and the harder its circuit is to implement. The purpose of Booth encoding is to reduce the number of partial product terms to be summed, thereby reducing the circuit area. The algorithm first encodes the input multiplier according to a set of rules. In one embodiment, the encoding rules may be, for example, those shown in Table 4 below:

Table 4

  Data to be encoded                  Encoded signal
  y_{2i+1}    y_{2i}    y_{2i-1}      PP_i
  0           0         0             0
  0           0         1             X
  0           1         0             X
  0           1         1             2X
  1           0         0             -2X
  1           0         1             -X
  1           1         0             -X
  1           1         1             -0 (= 0)

In Table 4, y_{2i+1}, y_{2i} and y_{2i-1} denote the values of each group of sub-data to be encoded (i.e., of the multiplier), and X denotes the mantissa of the first floating-point number (i.e., the multiplicand). After Booth encoding is applied to each group of data to be encoded, the corresponding encoded signal PP_i (i = 0, 1, 2, ..., n) is obtained. As Table 4 shows, the encoded signals obtained after Booth encoding fall into five classes: -2X, 2X, -X, X and 0. For example, based on the above encoding rules, if the received multiplicand X is 8-bit data, the following partial products can be obtained:

1) When the multiplier bits include the three consecutive bits "001" from the table above, the partial product is X, which can be expressed as 9-bit data in which the 9th bit is the sign bit;

2) When the multiplier bits include the three consecutive bits "011" from the table above, the partial product is 2X, which can be expressed as X shifted left by one bit, i.e., X with a 0 appended as the lowest bit;

3) When the multiplier bits include the three consecutive bits "101" from the table above, the partial product is -X, which is obtained by inverting X bit by bit and adding 1;

4) When the multiplier bits include the three consecutive bits "100" from the table above, the partial product is -2X, which is obtained by shifting X left by one bit, inverting the result bit by bit, and adding 1;

5) When the multiplier bits include the three consecutive bits "111" or "000" from the table above, the partial product is 0.
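The encoding rules of Table 4 can be sketched in software as follows. This is a minimal illustration, not the circuit of the disclosure: the function names `booth_radix4_digits` and `booth_multiply` are hypothetical, the multiplier is assumed to be unsigned, and the padding with one 0 below the lowest bit and zeros above the highest bit is one reading of the "pad the high and low ends with 0" step.

```python
def booth_radix4_digits(y: int, width: int) -> list[int]:
    """Return the Booth digits (each in {-2, -1, 0, 1, 2}) of an unsigned multiplier.

    Assumption: y fits in `width` bits. A 0 is appended below bit 0
    (y_{-1} = 0) and zeros are implied above the top bit.
    """
    # Table 4: (y_{2i+1}, y_{2i}, y_{2i-1}) -> multiple of X
    table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    n_groups = width // 2 + 1      # enough groups to reach a padded high 0 bit
    padded = y << 1                # bit 0 of `padded` plays the role of y_{-1}
    return [table[(padded >> (2 * i)) & 0b111] for i in range(n_groups)]


def booth_multiply(x: int, y: int, width: int) -> int:
    """Sum the partial products PP_i = digit_i * X, each weighted by 4**i."""
    digits = booth_radix4_digits(y, width)
    return sum((d * x) << (2 * i) for i, d in enumerate(digits))


if __name__ == "__main__":
    print(booth_radix4_digits(181, 8))
    print(booth_multiply(93, 181, 8) == 93 * 181)
```

Under these assumptions an 8-bit multiplier yields five digits instead of eight single-bit partial products, which is the reduction in summation terms that the text attributes to Booth encoding.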

It should be understood that the above description of how partial products are obtained, given in conjunction with Table 4, is exemplary rather than limiting. Under the teaching of the present disclosure, those of ordinary skill in the art may change the rules in Table 4 to obtain partial products different from those shown there. For example, when the multiplier bits contain a specific pattern of several consecutive bits (e.g., 3 bits or more), the resulting partial product may be the complement of the multiplicand. As another example, the "add 1" operation in items 3) and 4) above may be performed after the partial products have been summed.

As can be understood from the above introduction, by encoding the mantissa of the second floating-point number with the Booth encoding circuit and using the mantissa of the first floating-point number, the partial product generation circuit produces a plurality of partial products as the intermediate result. This intermediate result is fed to the Wallace tree compressor 406 in the partial product summation unit. It should be understood that obtaining the partial products by Booth encoding is only a preferred approach of the present disclosure, and those of ordinary skill in the art may obtain them in other ways. For example, they may be obtained by shift operations: depending on whether each multiplier bit is 1 or 0, either the shifted multiplicand or 0 is selected as the corresponding partial product. Similarly, using a Wallace tree compressor to add the partial products is exemplary rather than limiting, and other types of adders may be used for this addition. Such an adder may be, for example, one or more full adders, half adders, or various combinations of the two.
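The shift-based alternative mentioned above can be illustrated with a short sketch. The function name `shift_add_partial_products` is hypothetical, and unsigned operands are assumed.

```python
def shift_add_partial_products(x: int, y: int, width: int) -> list[int]:
    """One partial product per multiplier bit.

    Bit i of y selects either x shifted left by i (bit is 1) or 0 (bit is 0).
    Assumption: x and y are unsigned and y fits in `width` bits.
    """
    return [(x << i) if (y >> i) & 1 else 0 for i in range(width)]


if __name__ == "__main__":
    pps = shift_add_partial_products(93, 181, 8)
    print(len(pps), sum(pps) == 93 * 181)
```

This form produces one partial product per multiplier bit, so an 8-bit multiplier gives eight rows to sum. That is the cost Booth encoding is meant to reduce.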

The Wallace tree compressor (or simply Wallace tree) is mainly used to sum the above intermediate result (i.e., the plurality of partial products) while reducing the number of accumulation steps (i.e., compression). Typically, a Wallace tree compressor adopts a carry-save adder (CSA) architecture together with the Wallace tree algorithm. Computation with a Wallace tree array is much faster than conventional addition with carry propagation.

Specifically, the Wallace tree compressor can compute the sum of the partial product rows in parallel. For example, it can reduce the number of accumulation steps for N partial products from N-1 to log₂N, which increases the speed of the multiplier and makes more efficient use of resources. Depending on application needs, the Wallace tree compressor can be designed in various types, such as a 7-2 Wallace tree, a 4-2 Wallace tree, or a 3-2 Wallace tree. In one or more embodiments, the present disclosure uses the 7-2 Wallace tree as the example for implementing its floating-point operations, as described in detail later in conjunction with FIG. 5 and FIG. 6.

In some embodiments, the Wallace tree compression of the present disclosure may be arranged as Wallace trees that each have M inputs and N outputs, where the number of such trees is not less than K, N is a preset positive integer smaller than M, and K is a positive integer not less than the maximum bit width of the intermediate result. For example, M may be 7 and N may be 2, which gives the 7-2 Wallace tree described in detail below. When the maximum bit width of the intermediate result is 48, K may be 48, that is, there may be 48 Wallace trees.

In some embodiments, depending on the operation mode, one or more groups of the Wallace trees may be selected to sum the intermediate result, where each group has X Wallace trees and X is the number of bits of the intermediate result. Further, the Wallace trees within a group may be linked by successive carries, while no carry relationship exists between groups. In an exemplary connection, the carry output of a lower-order Wallace tree compressor is fed to the next higher-order Wallace tree (as Cin in FIG. 6), and the carry output (Cout) of that higher-order compressor in turn becomes the carry input of the next still higher-order compressor. In addition, when one or more Wallace tree compressors are selected from the available ones, the selection is arbitrary. For example, they may be selected in the order 0, 1, 2 and 3, or connected in the order 0, 2, 4 and 6, as long as the selected compressors follow the carry relationship described above.

The Wallace tree and its operation are described below using an illustrative example. Assume that the first and second floating-point numbers are 16-bit data (e.g., FP16*FP16), that the data bit width supported by the multiplier is 32 bits (so that two groups of 16-bit multiplications can run in parallel), and that each Wallace tree is a 7-2 Wallace tree compressor with 7 inputs (an example value of M above) and 2 outputs (an example value of N above). In this scenario, 48 Wallace trees (an example value of K above) can be used to complete the multiplications of the two groups of data in parallel.

Of these 48 Wallace trees, trees 0 to 23 (the 24 Wallace trees of the first group) complete the partial product summation of the first multiplication, and the Wallace trees within this group are connected in sequence by carries. Likewise, trees 24 to 47 (the 24 Wallace trees of the second group) complete the partial product summation of the second multiplication, and the Wallace trees within this group are also connected in sequence by carries. There is no carry relationship between tree 23 of the first group and tree 24 of the second group, that is, no carry relationship exists between Wallace trees of different groups.
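The effect of having no carry between the two groups can be modeled at word level. The sketch below is only an analogy for the grouping, not a model of the Wallace trees themselves: it packs two 24-bit lanes into a 48-bit word and adds them lane by lane, so a carry out of lane 0 never reaches lane 1. The names `LANE_BITS`, `pack` and `lane_add` are hypothetical.

```python
LANE_BITS = 24                      # width of each group in the example above
LANE_MASK = (1 << LANE_BITS) - 1


def pack(low: int, high: int) -> int:
    """Place two 24-bit values into bits 0..23 and 24..47 of one word."""
    return (high & LANE_MASK) << LANE_BITS | (low & LANE_MASK)


def lane_add(a: int, b: int) -> int:
    """Add two packed words per lane; the carry out of each lane is discarded."""
    low = ((a & LANE_MASK) + (b & LANE_MASK)) & LANE_MASK
    high = ((a >> LANE_BITS) + (b >> LANE_BITS)) & LANE_MASK
    return high << LANE_BITS | low


if __name__ == "__main__":
    a, b = pack(0xFFFFFF, 1), pack(1, 2)
    print(hex(lane_add(a, b)), hex(a + b))
```

With a plain 48-bit addition the overflow of the low lane would corrupt the high lane. Blocking the carry at bit 24 keeps the two multiplications independent.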

Returning to FIG. 4, after the Wallace tree compressor has summed and compressed the partial products, the compressed partial products are added by the adder to obtain the result of the mantissa multiplication. In one or more embodiments of the present disclosure, the adder may be one of a full adder, a serial adder, and a carry-lookahead adder. It adds the final two rows of partial products produced by the Wallace tree compressor to obtain the result of the mantissa multiplication.

It will be appreciated that the mantissa multiplication shown in FIG. 4, in particular with the exemplary use of Booth encoding and the Wallace tree, obtains the mantissa product efficiently. Specifically, Booth encoding reduces the number of partial product terms to be summed and thus the circuit area, while the Wallace tree compressor computes the sum of the partial product rows in parallel and thus increases the speed of the multiplier.

An example operation of the partial products and the 7-2 Wallace tree is described in detail below with reference to FIG. 5 and FIG. 6. This description is exemplary rather than limiting, and is intended only to aid understanding of the solution of the present disclosure.

FIG. 5 shows the partial products 500 obtained from the partial product generation circuit in the mantissa processing unit described above in conjunction with FIGS. 2 to 4. They appear as the four rows of white dots between the two dashed lines in the figure, where each row of white dots represents one partial product. To simplify the subsequent Wallace tree compression, the bit width may be extended in advance. For example, the black dots in FIG. 5 are copies of the most significant bit of each 9-bit partial product, so the partial products are extended and aligned to 16 (8+8) bits (i.e., the 8-bit width of the multiplicand mantissa plus the 8-bit width of the multiplier mantissa). In another embodiment, for example for a 25*13 binary multiplication, the partial products are extended to 38 (25+13) bits (i.e., the 25-bit width of the multiplicand mantissa plus the 13-bit width of the multiplier mantissa).
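The extension step, in which the most significant bit of a 9-bit partial product is copied into the upper positions, corresponds to ordinary sign extension. A minimal sketch follows. The helper name `sign_extend` is hypothetical, and the left shift that gives each partial product its weight is omitted.

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Copy bit (from_bits - 1) of `value` into bits from_bits .. to_bits - 1."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1) & 1:
        # Set every bit between the old and the new width.
        value |= ((1 << to_bits) - 1) ^ ((1 << from_bits) - 1)
    return value


if __name__ == "__main__":
    # A 9-bit partial product with its sign bit set, extended to 16 bits.
    print(format(sign_extend(0b1_0000_0001, 9, 16), "016b"))
```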

FIG. 6 shows an operation flow and a schematic block diagram 600 of a Wallace tree compressor according to an embodiment of the present disclosure.

As shown in FIG. 6, when the mantissas of two floating-point numbers are multiplied, for example by Booth-encoding the multiplier and combining it with the multiplicand as described above, the 7 partial products shown in FIG. 6 are obtained. The use of the Booth encoding algorithm reduces the number of partial products generated. For ease of understanding, a dashed box in the partial product area of the figure marks one Wallace tree comprising 7 elements, and arrows show how it is compressed from 7 elements to 2. In one embodiment, this compression (or summation) can be implemented with full adders, each of which takes three input elements and outputs two (a sum, "sum", and a carry to the next higher bit, "carry"). The schematic block diagram of the 7-2 Wallace tree compressor is shown on the right side of FIG. 6. The compressor has 7 inputs taken from one column of partial products (the seven elements in the dashed box on the left side of FIG. 6). In operation, the carry input of the Wallace tree of column 0 is 0, and the carry output Cout of each column's Wallace tree serves as the carry input Cin of the next column's Wallace tree.

As the left part of FIG. 6 shows, a Wallace tree comprising 7 elements is compressed to 2 elements after four rounds of compression. As mentioned above, the present disclosure uses 7-2 Wallace tree compressors to compress the 7 rows of partial products into two rows (i.e., the second intermediate result of the present disclosure), and then uses an adder (e.g., a carry-lookahead adder) to obtain the mantissa result.
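The compression of 7 rows into 2 rows in four rounds can be reproduced with a row-level carry-save model. The sketch below is not the 7-2 cell of FIG. 6 with its Cin/Cout wiring. It is an assumed simplification that applies bitwise full adders (3:2 compression) to whole rows, with hypothetical names `csa` and `compress_to_two`, and with rows taken as non-negative integers.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """Bitwise full adders over three rows: returns (sum row, carry row)."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carries move one bit higher
    return s, carry


def compress_to_two(rows: list[int]) -> tuple[list[int], int]:
    """Apply layers of 3:2 compression until two rows remain; count the layers."""
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3         # rows that form complete triples
        nxt: list[int] = []
        for i in range(0, full, 3):
            nxt.extend(csa(rows[i], rows[i + 1], rows[i + 2]))
        nxt.extend(rows[full:])                  # leftover rows pass through
        rows = nxt
        levels += 1
    return rows, levels


if __name__ == "__main__":
    rows = [3, 5, 7, 11, 13, 17, 19]
    two_rows, levels = compress_to_two(rows)
    # The final carry-propagating addition is done once, on two rows only.
    print(levels, sum(two_rows) == sum(rows))
```

In this model the row count goes 7, 5, 4, 3, 2, so four layers are needed. A single carry-propagating addition of the two remaining rows then gives the result.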

To further illustrate the principle of the solution of the present disclosure, the following describes by way of example how the multiplier completes the first-stage operation in four operation modes, FP16*FP16, BF16*BF16, FP32*FP32 and FP32*BF16, that is, up to the point where the Wallace tree compressor has summed the intermediate result to obtain the second intermediate result:

(1) FP16*FP16

In this operation mode of the multiplier, the mantissa of each floating-point number has 10 bits. To account for denormalized non-zero numbers under the IEEE 754 standard, it is extended by 1 bit, giving an 11-bit mantissa. In addition, because the mantissa is an unsigned number, a single 0 bit is added at the high end when the Booth encoding algorithm is used, so the total mantissa width is 12 bits. When the second floating-point number (the multiplier) is Booth-encoded and combined with the first floating-point number, the partial product generation circuit produces 7 partial products in each of the high and low parts, the seventh of which is 0. Each partial product is 24 bits wide, so the 48 7-2 Wallace trees can be used for compression, with the carry from the 23rd to the 24th Wallace tree being 0.
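The 10-bit to 11-bit to 12-bit extension can be sketched as follows. This is one interpretation and not taken from the disclosure: the extra bit is assumed to be the IEEE 754 leading bit (1 for normalized values, 0 for denormalized ones), special values such as infinity and NaN are ignored, and the names `fp16_fields` and `fp16_mantissa12` are hypothetical.

```python
import struct


def fp16_fields(x: float) -> tuple[int, int, int]:
    """Split the IEEE 754 half-precision encoding of x into (sign, exponent, fraction)."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF


def fp16_mantissa12(x: float) -> int:
    """Return the mantissa as a 12-bit unsigned value whose top bit is 0.

    Bits 0..9 hold the stored fraction, bit 10 the leading bit (assumed to be
    0 for denormalized inputs), and bit 11 the padded 0 used for Booth encoding.
    """
    _, exponent, fraction = fp16_fields(x)
    leading = 0 if exponent == 0 else 1
    return (leading << 10) | fraction


if __name__ == "__main__":
    print(hex(fp16_mantissa12(1.5)), hex(fp16_mantissa12(2.0 ** -24)))
```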

(2) BF16*BF16

In this operation mode of the multiplier, the mantissa of each floating-point number has 7 bits. To account for denormalized non-zero numbers under the IEEE 754 standard and for extension to a signed number, the mantissa is extended to 9 bits. When the second floating-point number (the multiplier) is Booth-encoded and combined with the first floating-point number, the partial product generation circuit produces 7 valid partial products in each of the high and low parts, the 6th and 7th of which are 0. Each partial product is 18 bits wide. Compression uses two groups of 7-2 Wallace trees, trees 0 to 17 and trees 24 to 41, with the carry from the 23rd to the 24th Wallace tree being 0.

(3) FP32*FP32

In this operation mode of the multiplier, the mantissa of each floating-point number may have 23 bits. To account for denormalized non-zero numbers under the IEEE 754 standard and for extension to a signed number, the mantissa is extended to 25 bits. To save area in the multiplication unit, the bit width supported by the multiplier may be made smaller, and the multiplier of the present disclosure is then called twice in this mode to complete one operation. Each mantissa multiplication is therefore 25 bits by 13 bits. The mantissa of the first floating-point number ina is extended with one 0 bit to form a 25-bit signed number. The 24-bit mantissa of the second floating-point number inb is split into a high part and a low part of 12 bits each, and each part is extended with one 0 bit to give two 13-bit multipliers, denoted inb_high13 and inb_low13. In operation, the first call to the multiplier computes ina*inb_low13, and the second call computes ina*inb_high13. Each call generates 7 valid partial products by Booth encoding, each 38 bits wide, which are compressed by 7-2 Wallace trees 0 to 37.
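The two-call scheme can be checked arithmetically. The text does not state how the two products are combined. The sketch below assumes the natural combination, in which the second product is weighted by 2^12 before the two are added, and uses the hypothetical function name `split_multiply`.

```python
def split_multiply(ina: int, inb: int) -> int:
    """Multiply two 24-bit mantissas using two passes over a 12-bit-wide multiplier.

    Assumption: the results of the two calls are combined as
    first + (second << 12).
    """
    inb_low = inb & 0xFFF            # low 12 bits; a leading 0 makes it inb_low13
    inb_high = (inb >> 12) & 0xFFF   # high 12 bits; a leading 0 makes it inb_high13
    first = ina * inb_low            # first call:  ina * inb_low13
    second = ina * inb_high          # second call: ina * inb_high13
    return first + (second << 12)


if __name__ == "__main__":
    ina, inb = 0xABCDEF, 0x923456
    print(split_multiply(ina, inb) == ina * inb)
```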

(4) FP32*BF16

In this operation mode of the multiplier, the mantissa of the first floating-point number ina has 23 bits and the mantissa of the second floating-point number inb has 7 bits. To account for denormalized non-zero numbers under the IEEE 754 standard and for extension to signed numbers, the mantissas are extended to 25 bits and 9 bits respectively, and a 25-bit by 9-bit multiplication is performed. This yields 7 valid partial products, the 6th and 7th of which are 0. Each partial product is 34 bits wide, and compression is performed by Wallace trees 0 to 33.

The specific examples above describe how the multiplier of the present disclosure completes the first-stage operation in four operation modes, preferably using the Booth encoding algorithm and the 7-2 Wallace tree. Based on this description, those of ordinary skill in the art will understand that the present disclosure uses 7 partial products in every mode, so that the 7-2 Wallace tree can be reused across the different operation modes.
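The figures quoted for the four modes can be cross-checked with a small table. The operand widths are taken from the text. The count of possibly non-zero Booth partial products is derived here as the multiplier width divided by 2 and rounded up, which is an assumption consistent with the stated zero partial products rather than a formula given in the disclosure. The names `MODES` and `mode_summary` are hypothetical.

```python
import math

# (multiplicand mantissa bits, multiplier mantissa bits) after extension
MODES = {
    "FP16*FP16": (12, 12),
    "BF16*BF16": (9, 9),
    "FP32*FP32": (25, 13),   # one of the two calls
    "FP32*BF16": (25, 9),
}


def mode_summary(multiplicand_bits: int, multiplier_bits: int) -> tuple[int, int]:
    """Return (partial product width, number of possibly non-zero partial products)."""
    width = multiplicand_bits + multiplier_bits
    nonzero = math.ceil(multiplier_bits / 2)   # Booth digits of the multiplier
    return width, nonzero


if __name__ == "__main__":
    for name, (a, b) in MODES.items():
        width, nonzero = mode_summary(a, b)
        print(f"{name}: width={width}, non-zero rows <= {nonzero}, zero rows = {7 - nonzero}")
```

Every mode fits within 7 rows, which is why a single 7-input compressor suffices for all four.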

The case in which the multiplier of the present disclosure (its mantissa processing unit and exponent processing unit) is called multiple times is described in more detail below.

According to another aspect of the present disclosure, as shown in FIG. 3, the mantissa processing unit may include a control circuit 316. The control circuit 316 may call the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time. This data bit width refers to the two bit widths supported by the mantissa processing unit (e.g., the multiplier bit width and the multiplicand bit width). The control circuit therefore decides to call the mantissa processing unit multiple times to obtain the mantissa of the product either on the basis of the mantissa bit width of one of the two floating-point numbers and one of the two supported bit widths, or on the basis of the mantissa bit widths of both floating-point numbers and both supported bit widths. Calling the mantissa processing unit repeatedly in this way avoids having to provide a large-area multiplier component for wide mantissa operations, and also avoids the situation where a small-area multiplier component cannot handle wide mantissa operations. The multiplier is thus more widely applicable while the chip area is reduced.

According to a first embodiment of the present disclosure, the two floating-point numbers include a first floating-point number and a second floating-point number, and the mantissa processing unit supports a first bit width and a second bit width. The mantissa of the first floating-point number serves as a first input corresponding to the first bit width, and the mantissa of the second floating-point number serves as a second input corresponding to the second bit width. The bit width of the first input is less than or equal to the first bit width. The control circuit calls the mantissa processing unit multiple times to obtain the mantissa of the product when the bit width of the second input is greater than the second bit width. In this embodiment, the bit width of one of the two inputs is known to be no greater than the corresponding supported bit width, so only the relationship between the other input and its corresponding supported bit width needs to be checked to decide whether the mantissa processing unit is called multiple times.

According to a second embodiment of the present disclosure, the two floating-point numbers include a first floating-point number and a second floating-point number, and the mantissa processing unit supports a first bit width and a second bit width. The mantissa of the first floating-point number serves as a first input corresponding to the first bit width, and the mantissa of the second floating-point number serves as a second input corresponding to the second bit width. The control circuit calls the mantissa processing unit multiple times to obtain the mantissa of the product in any of three cases: when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width; when the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width; or when the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width. In this embodiment, the relationship between the bit widths of the two inputs and the two supported bit widths is not known in advance, so each input must be compared with its corresponding supported bit width to decide whether the mantissa processing unit is called multiple times.

In this second embodiment, when the mantissa bit width of the first floating-point number is smaller than that of the second floating-point number and the first bit width is greater than the second bit width, or when the mantissa bit width of the first floating-point number is greater than that of the second floating-point number and the first bit width is smaller than the second bit width, the control circuit selects the mantissa of the first floating-point number as the second input corresponding to the second bit width and selects the mantissa of the second floating-point number as the first input corresponding to the first bit width. In other words, when the mantissas of the two floating-point numbers arrive in no particular order, they can first be matched to the two supported bit widths by pairing the wider mantissa with the wider supported width and the narrower mantissa with the narrower supported width. This avoids calling the unit multiple times for a multiplication that could have been completed in a single call.

Further, when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, the control circuit determines, from the bit width of the first input and the first bit width, the number of times the mantissa processing unit is called and the data input to the mantissa processing unit in each call. When the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, the control circuit determines the number of calls and the data input in each call from the bit width of the second input and the second bit width. When the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width, the control circuit determines the number of calls and the data input in each call from the bit width of the first input and the first bit width together with the bit width of the second input and the second bit width.

In the present disclosure, the terms first floating-point number and second floating-point number serve only to distinguish the two floating-point numbers; "first" and "second" have no limiting effect. Likewise, the terms first bit width and second bit width serve only to distinguish the two maximum processing bit widths supported by the mantissa processing unit, and the terms first input and second input serve only to distinguish the two inputs of the mantissa processing unit that correspond to those two maximum processing bit widths, so "first" and "second" have no limiting effect there either.

It is worth noting that the floating-point numbers input to the multiplier in the above embodiments are floating-point numbers that conform to the format required by the operation and suit the internal and external components of the multiplier, that is, floating-point numbers that have undergone preprocessing such as normalization. It should be understood that a floating-point number input to the multiplier may be normalized or non-normalized. In view of the above description of the normalization unit, if at least one of the two input floating-point numbers is a non-normalized non-zero floating-point number, that floating-point number can first be normalized by the normalization unit to obtain a normalized exponent and mantissa, and the normalized mantissa is then used as the input of the mantissa processing unit for the floating-point multiplication described above. In addition, the Booth encoding circuit mentioned earlier in the present disclosure performs signed fixed-point multiplication, so a single 0 bit must also be prepended to the mantissa, turning it into a signed positive number, and the sign-extended mantissa is then used as the input of the mantissa processing unit. Other preprocessing may of course also be applied, with the mantissa of the preprocessed floating-point number used as the input of the mantissa processing unit; one example is the normalization performed to suit the operation mode, mentioned in the above description of the normalization unit. The first and second embodiments of the present disclosure apply equally to floating-point operations performed according to the operation mode as described above.
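Why the extra 0 bit matters for a signed multiplier can be seen with a small sketch. The helper `as_signed` is a hypothetical name, and the 8-bit pattern is an arbitrary example of a normalized mantissa whose top bit is 1; it is not taken from the patent.

```python
def as_signed(bits, width):
    """Interpret the low `width` bits of `bits` as a two's-complement integer."""
    bits &= (1 << width) - 1
    return bits - (1 << width) if bits >> (width - 1) else bits


mantissa = 0b11000000            # 8-bit normalized mantissa, leading 1 set

print(as_signed(mantissa, 8))    # -64: a signed 8-bit unit would misread it
print(as_signed(mantissa, 9))    # 192: with a leading 0 it is a positive number
```

A normalized mantissa always has its top bit set, so a signed fixed-point circuit would treat it as negative unless a 0 sign bit is placed in front of it.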

Three examples of calling the mantissa processing unit multiple times according to the above second embodiment of the present disclosure are described in detail below. To make these three examples easier to follow, the first input may be, for example, the multiplier and the second input the multiplicand; the first bit width may be, for example, the maximum multiplier bit width supported by the mantissa processing unit, and the second bit width the maximum multiplicand bit width supported by the mantissa processing unit.

According to a first example of calling the mantissa processing unit multiple times, and in combination with the floating-point multiplication according to the operation mode described above, take the case in which the two floating-point numbers input to the multiplier are non-normalized non-zero floating-point numbers and the Booth encoding circuit used in the present disclosure performs signed fixed-point multiplication. The two floating-point numbers are first normalized, which extends each mantissa by 1 bit; then, for compatibility with the Booth encoding circuit of the embodiments, each mantissa is extended by a further 1 bit to form a signed number. After this preprocessing, the mantissas of the two floating-point numbers are matched to the inputs of the mantissa processing unit. When the bit width of the multiplier is greater than the maximum multiplier bit width and the bit width of the multiplicand is less than or equal to the maximum multiplicand bit width, the control circuit takes as the mantissa to be truncated the mantissa obtained by only normalizing the original mantissa corresponding to the multiplier (that is, without the sign extension), and a sign bit is added to each truncated part for compatibility with the Booth encoding circuit. So that the mantissa processing unit can process the mantissa to be truncated, a part of bit width A-1 is truncated from it in each call, where A is the maximum multiplier bit width supported by the mantissa processing unit. A single 0 is prepended to each truncated part of bit width A-1 as its sign, forming a multiplier part of bit width A, which serves as one input to the mantissa processing unit in that call. The multiplicand (in this embodiment, the normalized and sign-extended mantissa) is supplied to the mantissa processing unit as the other input in every call. The number of calls to the mantissa processing unit can therefore be determined by the following formula:

n = ceil((B+1)/(A-1)),

where n is the number of times the mantissa processing unit is called; B is the bit width of the mantissa before normalization and sign extension; B+1 is the bit width of the mantissa after normalization, which can also be read as (B+2)-1, that is, the bit width of the multiplier minus the sign bit; A is the bit width of the multiplier part (the maximum multiplier bit width supported by the mantissa processing unit); and A-1 is the bit width of the part truncated from the mantissa to be truncated in each call.
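The formula can be checked numerically with a short sketch. The function name `call_count` and its parameter names are hypothetical; the three value pairs are the ones used in the examples of this description.

```python
import math


def call_count(raw_mantissa_bits, unit_bits):
    """n = ceil((B + 1) / (A - 1)).

    raw_mantissa_bits: B, the mantissa width before normalization and
    sign extension. unit_bits: A, the maximum width the unit accepts on
    that input, one bit of which is reserved for the sign.
    """
    return math.ceil((raw_mantissa_bits + 1) / (unit_bits - 1))


print(call_count(7, 8))     # 2: BF16 mantissa on an 8-bit input
print(call_count(23, 16))   # 2: FP32 mantissa on a 16-bit input
print(call_count(23, 8))    # 4: FP32 mantissa on an 8-bit input
```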

For example, suppose the maximum multiplier bit width supported by the mantissa processing unit is 8 bits and the maximum multiplicand bit width is 32 bits, and the two floating-point numbers input to the multiplier are of type FP32 and BF16 respectively, so the multiplication is performed in the FP32*BF16 operation mode. The two floating-point numbers are non-normalized non-zero numbers, so their mantissas have bit widths of 23 bits and 7 bits respectively; taking the IEEE 754 standard into account, the two mantissas are extended to 24 bits and 8 bits. For compatibility with the Booth encoding circuit of the embodiments, each mantissa is further extended by one 0 bit, giving signed numbers of 25 bits and 9 bits. The control circuit therefore takes the 9-bit mantissa as the multiplier corresponding to the maximum multiplier bit width and the 25-bit mantissa as the multiplicand corresponding to the maximum multiplicand bit width. Since only the bit width of the multiplier (9 bits) exceeds the maximum multiplier bit width (8 bits), while the bit width of the multiplicand (25 bits) is smaller than the maximum multiplicand bit width (32 bits), the mantissa obtained by only normalizing the original mantissa corresponding to the multiplier is taken as the mantissa to be truncated, inb, and the multiplicand is supplied to the mantissa processing unit as the multiplicand ina. According to the above formula, ceil((7+1)/(8-1)) = 2, so the mantissa processing unit has to be called twice. In each call, 7 bits of data are truncated from inb; in the last (second) call fewer than 7 bits remain, so all the remaining data is taken and padded with leading 0s to 7 bits. Each 7-bit part is then extended by one 0 bit (the sign bit) to 8 bits to form the multiplier part inb_m. The calculation performed in each call is thus ina*inb_m, that is, the multiplication of the 25-bit multiplicand by an 8-bit multiplier part, which yields the mantissa result of that call. It is worth noting that the mantissa to be truncated may be truncated from the high-order bits to the low-order bits or from the low-order bits to the high-order bits. It is also worth noting that this example applies equally to the above first embodiment of the present disclosure.
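The slicing of inb into 8-bit multiplier parts can be sketched as below, cutting from the high-order end. The function name `multiplier_parts` and the mantissa value `0b10110011` are made up for illustration; only the widths (8-bit mantissa, 8-bit unit input) come from the example.

```python
def multiplier_parts(mantissa, mantissa_bits, unit_bits):
    """Cut `mantissa` from high to low into (unit_bits - 1)-bit pieces and
    return each piece as a unit_bits-wide bit string with a leading 0 sign.

    A short final piece is zero-padded in front, as in the example.
    """
    chunk = unit_bits - 1
    parts = []
    remaining = mantissa_bits              # bits not yet consumed
    while remaining > 0:
        take = min(chunk, remaining)
        remaining -= take
        piece = (mantissa >> remaining) & ((1 << take) - 1)
        # Zero-fill to unit_bits: front padding plus the 0 sign bit.
        parts.append(format(piece, f"0{unit_bits}b"))
    return parts


# Normalized 8-bit BF16 mantissa (hypothetical value), 8-bit multiplier input.
print(multiplier_parts(0b10110011, 8, 8))   # ['01011001', '00000001']
```

The first part carries the top 7 bits behind a 0 sign bit; the second carries the single remaining bit, padded with 0s in front.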

According to a second example of calling the mantissa processing unit multiple times, and in combination with the floating-point multiplication according to the operation mode described above, take the case in which the two floating-point numbers input to the multiplier are non-normalized non-zero floating-point numbers and the Booth encoding circuit used in the present disclosure performs signed fixed-point multiplication. The two floating-point numbers are first normalized, which extends each mantissa by 1 bit; then, for compatibility with the Booth encoding circuit of the embodiments, each mantissa is extended by a further 1 bit to form a signed number. After this preprocessing, the mantissas of the two floating-point numbers are matched to the inputs of the mantissa processing unit. When the bit width of the multiplicand is greater than the maximum multiplicand bit width and the bit width of the multiplier is less than or equal to the maximum multiplier bit width, the control circuit takes as the mantissa to be truncated the mantissa obtained by only normalizing the original mantissa corresponding to the multiplicand, and a sign bit is added to each truncated part for compatibility with the Booth encoding circuit. So that the mantissa processing unit can process the mantissa to be truncated, a part of bit width C-1 is truncated from it in each call, where C is the maximum multiplicand bit width supported by the mantissa processing unit. A single 0 is prepended to each truncated part of bit width C-1 as its sign, forming a multiplicand part of bit width C, which serves as one input to the mantissa processing unit in that call. The multiplier (in this embodiment, the normalized and sign-extended mantissa) is supplied to the mantissa processing unit as the other input in every call. The number of calls to the mantissa processing unit can therefore be determined by the following formula:

n = ceil((D+1)/(C-1)),

where n is the number of times the mantissa processing unit is called; D is the bit width of the mantissa before normalization and sign extension; D+1 is the bit width of the mantissa after normalization, which can also be read as (D+2)-1, that is, the bit width of the multiplicand minus the sign bit; C is the bit width of the multiplicand part (the maximum multiplicand bit width supported by the mantissa processing unit); and C-1 is the bit width of the part truncated from the mantissa to be truncated in each call.

For example, suppose the maximum multiplier bit width supported by the mantissa processing unit is 12 bits and the maximum multiplicand bit width is 16 bits, and the two floating-point numbers input to the multiplier are of type FP32 and BF16 respectively, so the multiplication is performed in the FP32*BF16 operation mode. The two floating-point numbers are non-normalized non-zero numbers, so their mantissas have bit widths of 23 bits and 7 bits respectively; taking the IEEE 754 standard into account, the two mantissas are extended to 24 bits and 8 bits. For compatibility with the Booth encoding circuit of the embodiments, each mantissa is further extended by one 0 bit, giving signed numbers of 25 bits and 9 bits. The control circuit therefore takes the 9-bit mantissa as the multiplier corresponding to the maximum multiplier bit width and the 25-bit mantissa as the multiplicand corresponding to the maximum multiplicand bit width. Since only the bit width of the multiplicand (25 bits) exceeds the maximum multiplicand bit width supported by the mantissa processing unit (16 bits), while the bit width of the multiplier (9 bits) is smaller than the maximum multiplier bit width (12 bits), the mantissa obtained by only normalizing the original mantissa corresponding to the multiplicand is taken as the mantissa to be truncated, ina, and the multiplier is supplied to the mantissa processing unit as the multiplier inb. According to the above formula, ceil((23+1)/(16-1)) = 2, so the mantissa processing unit has to be called twice. In each call, 15 bits of data are truncated from ina; in the last (second) call fewer than 15 bits remain, so the remaining data is padded with leading 0s to 15 bits. Each 15-bit part is then extended by one 0 bit (the sign bit) to 16 bits to form the multiplicand part ina_m. The calculation performed in each call is thus ina_m*inb, that is, the multiplication of a 16-bit multiplicand part by the 9-bit multiplier, which yields the mantissa result of that call. It is worth noting that the mantissa to be truncated may be truncated from the high-order bits to the low-order bits or from the low-order bits to the high-order bits. It is also worth noting that this example applies equally to the above first embodiment of the present disclosure.

According to a third example of calling the mantissa processing unit multiple times, and in combination with the floating-point multiplication according to the operation mode described above, take the case in which the two floating-point numbers input to the multiplier are non-normalized non-zero floating-point numbers and the Booth encoding circuit used in the present disclosure performs signed fixed-point multiplication. The two floating-point numbers are first normalized, which extends each mantissa by 1 bit; then, for compatibility with the Booth encoding circuit of the embodiments, each mantissa is extended by a further 1 bit to form a signed number. After this preprocessing, the mantissas of the two floating-point numbers are matched to the inputs of the mantissa processing unit. When the bit width of the multiplier is greater than the maximum multiplier bit width and the bit width of the multiplicand (in this embodiment, the normalized and sign-extended mantissa) is greater than the maximum multiplicand bit width, the control circuit takes as the mantissas to be truncated both the mantissa obtained by only normalizing the original mantissa corresponding to the multiplier and the mantissa obtained by only normalizing the original mantissa corresponding to the multiplicand, and a sign bit is added to each truncated part for compatibility with the Booth encoding circuit. So that the mantissa processing unit can process the two mantissas to be truncated, in each call a part of bit width A-1 is truncated from the mantissa corresponding to the multiplier and a part of bit width C-1 is truncated from the mantissa corresponding to the multiplicand, where A is the maximum multiplier bit width and C is the maximum multiplicand bit width supported by the mantissa processing unit. A single 0 is prepended to each truncated part of bit width A-1 as its sign, forming a multiplier part of bit width A, which serves as one input to the mantissa processing unit in that call; likewise, a single 0 is prepended to each truncated part of bit width C-1 as its sign, forming a multiplicand part of bit width C, which serves as the other input in that call. The number of calls to the mantissa processing unit can therefore be determined by the following formula:

n = ceil((B+1)/(A-1)) * ceil((D+1)/(C-1)),

where n is the number of times the mantissa processing unit is called. For the multiplier, B is the bit width of its mantissa before normalization and sign extension; B+1 is the bit width after normalization, which can also be read as (B+2)-1, that is, the bit width of the multiplier minus the sign bit; A is the bit width of the multiplier part (the maximum multiplier bit width supported by the mantissa processing unit); and A-1 is the bit width of the part truncated in each call from the mantissa to be truncated that corresponds to the multiplier. For the multiplicand, D is the bit width of its mantissa before normalization and sign extension; D+1 is the bit width after normalization, which can also be read as (D+2)-1, that is, the bit width of the multiplicand minus the sign bit; C is the bit width of the multiplicand part (the maximum multiplicand bit width supported by the mantissa processing unit); and C-1 is the bit width of the part truncated in each call from the mantissa to be truncated that corresponds to the multiplicand.

For example, suppose the maximum multiplier bit width supported by the mantissa processing unit is 8 bits and the maximum multiplicand bit width is 16 bits, and both floating-point numbers input to the multiplier are of type FP32, so the multiplication is performed in the FP32*FP32 operation mode. The two floating-point numbers are non-normalized non-zero numbers, so both mantissas have a bit width of 23 bits; taking the IEEE 754 standard into account, both mantissas are extended to 24 bits. For compatibility with the Booth encoding circuit of the embodiments, each mantissa is further extended by one 0 bit, giving 25-bit signed numbers. The control circuit selects the two mantissas as the multiplier corresponding to the maximum multiplier bit width and the multiplicand corresponding to the maximum multiplicand bit width (since the two mantissas have the same bit width after extension, either may serve as the multiplier and the other as the multiplicand). Since the bit width of the multiplier (25 bits) is greater than the maximum multiplier bit width (8 bits) and the bit width of the multiplicand (25 bits) is greater than the maximum multiplicand bit width (16 bits), the mantissa obtained by normalizing the original mantissa corresponding to the multiplier is taken as the mantissa to be truncated inb, and the mantissa obtained by normalizing the original mantissa corresponding to the multiplicand is taken as the mantissa to be truncated ina. According to the above formula, ceil((23+1)/(8-1)) * ceil((23+1)/(16-1)) = 8, so the mantissa processing unit has to be called eight times.

In each call, 7 bits of data are truncated from inb; when fewer than 7 bits remain, all the remaining data is taken and padded with leading 0s to 7 bits. Each 7-bit part is extended by one 0 bit (the sign bit) to 8 bits to form a multiplier part inb_m; since inb is cut into four parts, there are four multiplier parts inb_m1, inb_m2, inb_m3 and inb_m4. Similarly, 15 bits of data are truncated from ina in each call; when fewer than 15 bits remain, all the remaining data is taken and padded with leading 0s to 15 bits. Each 15-bit part is extended by one 0 bit (the sign bit) to 16 bits to form a multiplicand part ina_m; since ina is cut into two parts, there are two multiplicand parts ina_m1 and ina_m2. The eight calls to the mantissa processing unit may then, for example, perform the following calculations in sequence: ina_m1*inb_m1, ina_m1*inb_m2, ina_m1*inb_m3, ina_m1*inb_m4, ina_m2*inb_m1, ina_m2*inb_m2, ina_m2*inb_m3, ina_m2*inb_m4. Alternatively, they may perform the following calculations in sequence: inb_m1*ina_m1, inb_m1*ina_m2, inb_m2*ina_m1, inb_m2*ina_m2, inb_m3*ina_m1, inb_m3*ina_m2, inb_m4*ina_m1, inb_m4*ina_m2. Each call multiplies a 16-bit multiplicand part by an 8-bit multiplier part, which yields the mantissa result of that call. It is worth noting that the mantissas to be truncated may be truncated from the high-order bits to the low-order bits or from the low-order bits to the high-order bits.
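The first call order listed above can be reproduced with a brief sketch that works only with piece widths. The helper `chunk_widths` and the variable `schedule` are hypothetical names; the labels follow the ina_m/inb_m naming of the example.

```python
from itertools import product


def chunk_widths(total_bits, chunk_bits):
    """Widths of the pieces cut from a `total_bits` mantissa; the last
    piece may be shorter (it is zero-padded before use)."""
    full, rest = divmod(total_bits, chunk_bits)
    return [chunk_bits] * full + ([rest] if rest else [])


inb_widths = chunk_widths(24, 7)     # multiplier pieces:   [7, 7, 7, 3]
ina_widths = chunk_widths(24, 15)    # multiplicand pieces: [15, 9]

# Outer loop over multiplicand parts, inner loop over multiplier parts.
schedule = [(f"ina_m{i}", f"inb_m{j}")
            for i, j in product(range(1, len(ina_widths) + 1),
                                range(1, len(inb_widths) + 1))]

print(inb_widths, ina_widths)
print(len(schedule))                 # 8 calls in total
print(schedule[:4])                  # ina_m1 paired with inb_m1..inb_m4
```

Swapping the two `range` arguments and the tuple order would give the alternative sequence mentioned in the text.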

The above examples are illustrative only and not limiting. Based on these examples, a person of ordinary skill in the art can conceive of floating-point multiplication in other operation modes performed by calling, multiple times, a mantissa processing unit whose maximum supported bit widths are arbitrary.

To support the multiple calls described above, the mantissa processing unit may further include a shift-and-add circuit, which obtains the mantissa after the multiplication operation from the mantissa results produced by the individual calls to the mantissa processing unit.

Further, the shift-and-add circuit includes a shifter, an intermediate memory and an adder. When the control circuit calls the mantissa processing unit multiple times according to the operation mode, after the first call the shifter shifts the mantissa result obtained in the first call to obtain a shifted mantissa result, which is stored in the intermediate memory. From the second call onward, the shifter shifts the mantissa result obtained in the current call to obtain the current mantissa result, and the adder adds the current mantissa result to the result stored in the intermediate memory and stores the sum back in the intermediate memory, thereby updating it. The result stored in the intermediate memory after the last call is the mantissa after the multiplication operation.
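A software model of this shift-and-add behaviour might look like the sketch below. The class name `ShiftAddCircuit` and its method are hypothetical, the shift amount is supplied by the caller, and the operand values are arbitrary; this models only the data flow, not the hardware.

```python
class ShiftAddCircuit:
    """Toy model of the shifter, intermediate memory and adder."""

    def __init__(self):
        self.memory = None                  # intermediate memory, empty at start

    def accumulate(self, mantissa_result, shift):
        shifted = mantissa_result << shift  # shifter
        if self.memory is None:
            self.memory = shifted           # first call: store shifted result
        else:
            self.memory += shifted          # later calls: add, then update memory
        return self.memory


# Two calls: a 24-bit multiplicand times an 8-bit multiplier cut into 7 + 1 bits.
ina = 0xC00001
circuit = ShiftAddCircuit()
circuit.accumulate(ina * 0b1011001, 1)      # top 7 bits; 1 bit follows them
circuit.accumulate(ina * 0b1, 0)            # last bit; nothing follows
print(circuit.memory == ina * 0b10110011)   # True
```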

In this embodiment, for example, the mantissa to be truncated is truncated from the high-order bits to the low-order bits. On each call to the mantissa processing unit, the shifter shifts the mantissa result obtained in that call according to the following formula:

Y = k + j,

where Y is the number of bit positions by which the mantissa result obtained in the current call has to be shifted; k is the total number of bits of data that follow, in the mantissa to be truncated corresponding to the multiplier, the truncated part used in the current call; and j is the total number of bits of data that follow, in the mantissa to be truncated corresponding to the multiplicand, the truncated part used in the current call. It should be understood that if only the bit width of the multiplier exceeds the maximum multiplier bit width, or only the bit width of the multiplicand exceeds the maximum multiplicand bit width, then only the mantissa corresponding to the multiplier or only the mantissa corresponding to the multiplicand has to be truncated. The mantissa that is not truncated is used in full in every call, so no data follows it and the corresponding k or j is 0. Hence, when only the bit width of the multiplier exceeds the maximum multiplier bit width, the formula for the shift amount reduces to Y = k; when only the bit width of the multiplicand exceeds the maximum multiplicand bit width, it reduces to Y = j.
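The shift rule Y = k + j can be checked end to end with a short sketch that cuts both mantissas from high to low, multiplies the pieces and accumulates the shifted results. The function names and the two 24-bit operand values are made up for illustration, and sign bits are omitted because a leading 0 does not change a piece's value.

```python
def split_high_to_low(value, width, chunk):
    """Return (piece, bits_following_the_piece) pairs, high-order piece first."""
    parts, remaining = [], width
    while remaining > 0:
        take = min(chunk, remaining)
        remaining -= take
        parts.append(((value >> remaining) & ((1 << take) - 1), remaining))
    return parts


def multiply_in_pieces(ina, inb, width_a, width_b, chunk_a, chunk_b):
    """Accumulate piecewise products, each shifted left by Y = k + j."""
    total, calls = 0, 0
    for a_piece, j in split_high_to_low(ina, width_a, chunk_a):      # multiplicand
        for b_piece, k in split_high_to_low(inb, width_b, chunk_b):  # multiplier
            total += (a_piece * b_piece) << (k + j)
            calls += 1
    return total, calls


# FP32*FP32 case: 24-bit mantissas, 15-bit and 7-bit pieces.
ina, inb = 0xC00001, 0x8ABCDE
result, calls = multiply_in_pieces(ina, inb, 24, 24, 15, 7)
print(calls, result == ina * inb)    # 8 True
```

When a mantissa fits in a single piece, its only pair has zero following bits, so the shift reduces to Y = k or Y = j as stated above.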

For example, as described above, in the FP32*BF16 operation mode, when only the bit width of the multiplier exceeds the maximum multiplier bit width, the mantissa processing unit is invoked twice, and the mantissa to be truncated is, for example, truncated from the high-order bits to the low-order bits. Specifically, let the multiplier portions used in the two calls be inb_m1 and inb_m2. After the first call, the shifter shifts the result of ina*inb_m1 to the left. Because 7 bits are truncated in the first call, the total number of bits following those 7 bits is k = 8 - 7 = 1 bit, so Y = 1 by the above formula. The result is therefore shifted left by 1 bit to give R1, and the adder stores R1 in the intermediate memory. After the second (and last) call, the shifter shifts the result of ina*inb_m2 to the left. Because the second call takes the final 1 bit, no data follows the bit used in this call, so Y = 0 by the above formula; the shift is 0 bits, that is, no shift, giving R2. The adder adds R2 to the R1 stored in the intermediate memory and stores the sum back in the intermediate memory to update it. Since the second call is the last call, the result held in the intermediate memory after the second call is the mantissa after the multiplication operation. The shift-add circuit can work in the same way for the above case in which only the bit width of the multiplicand exceeds the maximum multiplicand bit width.

For example, as described above, in the FP32*FP32 operation mode, when the bit width of the multiplier exceeds the maximum multiplier bit width and the bit width of the multiplicand exceeds the maximum multiplicand bit width, the mantissa processing unit is invoked eight times, and the mantissas to be truncated are, for example, truncated from the high-order bits to the low-order bits. Specifically, let the multiplier portions used in the eight calls be inb_m1, inb_m2, inb_m3 and inb_m4, and the multiplicand portions be ina_m1 and ina_m2. The eight calls of the mantissa processing unit then perform, for example, the following calculations in sequence: ina_m1*inb_m1, ina_m1*inb_m2, ina_m1*inb_m3, ina_m1*inb_m4, ina_m2*inb_m1, ina_m2*inb_m2, ina_m2*inb_m3, ina_m2*inb_m4.

In the first call, the shifter shifts the result of ina_m1*inb_m1 to the left. Because the first call truncates 7 bits from the mantissa corresponding to the multiplier, the total number of bits following those 7 bits in that mantissa is k = 24 - 7 = 17 bits; and because it truncates 15 bits from the mantissa corresponding to the multiplicand, the total number of bits following those 15 bits in that mantissa is j = 24 - 15 = 9 bits. By the above formula, Y = 17 + 9 = 26, so the result is shifted left by 26 bits to give S1, and the adder stores S1 in the intermediate memory.

After the second call, the shifter shifts the result of ina_m1*inb_m2 to the left. The second call truncates the next 7 bits from the mantissa corresponding to the multiplier, so the total number of bits following the 7 bits used in this call is k = 24 - 7 - 7 = 10 bits. From the mantissa corresponding to the multiplicand it uses the same 15 bits as in the previous call, so the total number of bits following them is still j = 24 - 15 = 9 bits. By the above formula, Y = 10 + 9 = 19, so the result is shifted left by 19 bits to give S2. The adder adds S2 to the S1 stored in the intermediate memory and stores the sum in the intermediate memory to update it.

The mantissa processing unit is invoked repeatedly in this manner up to the fourth call. In the fourth call, the shifter shifts the result of ina_m1*inb_m4 to the left. The fourth call truncates the last 3 bits of the mantissa corresponding to the multiplier, so no data follows the 3 bits used in this call and k = 0. From the mantissa corresponding to the multiplicand it again uses the same 15 bits as in the previous call, so j is still 24 - 15 = 9 bits. By the above formula, Y = 0 + 9 = 9, so the result is shifted left by 9 bits to give S4. The adder adds S4 to the result stored in the intermediate memory and stores the sum in the intermediate memory to update it.

In the fifth to eighth calls, the last 9 bits of the mantissa corresponding to the multiplicand are used, and no data follows those 9 bits, so j = 0 in all of these calls. In the fifth call, the shifter shifts the result of ina_m2*inb_m1 to the left. The fifth call truncates the same 7 bits from the mantissa corresponding to the multiplier as the first call did, so k = 24 - 7 = 17 bits and, by the above formula, Y = 17 + 0 = 17. The result is therefore shifted left by 17 bits to give S5, and the adder adds S5 to the result stored in the intermediate memory and stores the sum in the intermediate memory to update it. The mantissa processing unit is invoked repeatedly in this manner up to the eighth call. In the eighth call, the shifter shifts the result of ina_m2*inb_m4 to the left. The eighth call truncates the last 3 bits of the mantissa corresponding to the multiplier, so no data follows the 3 bits used in this call and k = 0; by the above formula, Y = 0 + 0 = 0, so the shift is 0 bits, that is, no shift, giving the unshifted result S8. The adder adds S8 to the result stored in the intermediate memory and stores the sum in the intermediate memory to update it. Since the eighth call is the last call, the result held in the intermediate memory after the eighth call is the mantissa after the multiplication operation.
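The shift-and-accumulate procedure of the two examples above can be sketched in Python as follows. This is only an illustrative software model under stated assumptions, not the circuit of the disclosure: the function names `split_high_to_low` and `segmented_multiply` and the sample operand values are hypothetical, and an unbounded Python integer stands in for the intermediate memory.

```python
def split_high_to_low(value, widths):
    """Split an unsigned integer into segments, most significant first.

    Returns (segment, bits_after) pairs, where bits_after is the number of
    bits of the original value that lie below that segment.
    """
    remaining = sum(widths)
    parts = []
    for w in widths:
        remaining -= w
        parts.append(((value >> remaining) & ((1 << w) - 1), remaining))
    return parts


def segmented_multiply(ina, inb, a_widths, b_widths):
    """Multiply two mantissas through repeated narrow multiplications.

    Each partial product is shifted left by Y = k + j and accumulated.
    Returns (product, list of shift amounts in call order).
    """
    acc = 0        # hypothetical stand-in for the intermediate memory
    shifts = []
    for a_part, j in split_high_to_low(ina, a_widths):      # multiplicand
        for b_part, k in split_high_to_low(inb, b_widths):  # multiplier
            y = k + j
            acc += (a_part * b_part) << y
            shifts.append(y)
    return acc, shifts


# FP32*FP32 example: 24-bit mantissas (arbitrary sample values),
# multiplicand split 15 + 9, multiplier split 7 + 7 + 7 + 3.
ina, inb = 0xC90FDB, 0xADF854
product, shifts = segmented_multiply(ina, inb, [15, 9], [7, 7, 7, 3])
assert product == ina * inb
print(shifts)  # [26, 19, 12, 9, 17, 10, 3, 0]
```

Calling `segmented_multiply(ina, inb, [24], [7, 1])` with an 8-bit `inb` models the FP32*BF16 case, where only the multiplier is split and the shifts are 1 and 0.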

On the other hand, to further reduce the area of the multiplier, the exponent processing unit includes a second control circuit (not shown in the figures). The second control circuit is configured to determine, according to the exponent bit width of one of the two floating-point numbers and one of the two bit widths supported by the exponent processing unit, or according to the exponent bit widths of both floating-point numbers and the two bit widths supported by the exponent processing unit, that the exponent processing unit is to be invoked multiple times to obtain the exponent after the multiplication operation.

According to a third embodiment of the present disclosure, the two floating-point numbers include a first floating-point number and a second floating-point number, and the exponent processing unit supports a third bit width and a fourth bit width. The exponent of the first floating-point number serves as a third input corresponding to the third bit width, and the exponent of the second floating-point number serves as a fourth input corresponding to the fourth bit width. The bit width of the third input is less than or equal to the third bit width, and the second control circuit is configured to invoke the exponent processing unit multiple times to obtain the exponent after the multiplication operation when the bit width of the fourth input is greater than the fourth bit width. In this embodiment, the bit width of one of the two inputs is known to be always less than or equal to the corresponding bit width supported by the exponent processing unit. It is therefore sufficient to compare the other input with its corresponding supported bit width to determine whether the exponent processing unit is to be invoked multiple times.

According to a fourth embodiment of the present disclosure, the two floating-point numbers include a first floating-point number and a second floating-point number, and the exponent processing unit supports a third bit width and a fourth bit width. The exponent of the first floating-point number serves as a third input corresponding to the third bit width, and the exponent of the second floating-point number serves as a fourth input corresponding to the fourth bit width. The second control circuit is configured to invoke the exponent processing unit multiple times to obtain the exponent after the multiplication operation in any of the following cases: when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is less than or equal to the fourth bit width; when the bit width of the fourth input is greater than the fourth bit width and the bit width of the third input is less than or equal to the third bit width; or when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is greater than the fourth bit width. In this embodiment, the relationship between the bit widths of the two inputs and the two bit widths supported by the exponent processing unit is not known in advance, so each input must be compared with its corresponding supported bit width to determine whether the exponent processing unit is to be invoked multiple times.

According to the fourth embodiment, when the exponent bit width of the first floating-point number is smaller than that of the second floating-point number and the third bit width is greater than the fourth bit width, or when the exponent bit width of the first floating-point number is greater than that of the second floating-point number and the third bit width is smaller than the fourth bit width, the second control circuit selects the exponent of the first floating-point number as the fourth input corresponding to the fourth bit width and selects the exponent of the second floating-point number as the third input corresponding to the third bit width. It should be understood that when the exponents of the two floating-point numbers arrive in no fixed order, they can first be matched to the two bit widths supported by the exponent processing unit on a wide-to-wide, narrow-to-narrow basis. This avoids making multiple invocations for an exponent operation on two floating-point numbers that could have been completed in a single pass.
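The wide-to-wide, narrow-to-narrow matching rule just described can be sketched in Python as follows. The function name `match_exponent_inputs` and its string labels are hypothetical, introduced only for illustration; the sketch models the selection condition and nothing else.

```python
def match_exponent_inputs(exp1_width, exp2_width, third_width, fourth_width):
    """Decide which exponent feeds the third input and which the fourth.

    Returns (third_input, fourth_input), where each entry is "first" or
    "second" and names the floating-point number whose exponent is routed
    to that input of the exponent processing unit.
    """
    swap = (
        (exp1_width < exp2_width and third_width > fourth_width)
        or (exp1_width > exp2_width and third_width < fourth_width)
    )
    return ("second", "first") if swap else ("first", "second")


# Assumed widths: a 5-bit and an 8-bit exponent, with the unit supporting
# 8 bits on its third input and 5 bits on its fourth input.
print(match_exponent_inputs(5, 8, 8, 5))  # ('second', 'first')
```

With the inputs swapped in this way, the 8-bit exponent meets the 8-bit port and the 5-bit exponent meets the 5-bit port, so a single invocation suffices.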

Further, in any of the three cases above (the bit width of the third input is greater than the third bit width and the bit width of the fourth input is less than or equal to the fourth bit width; the bit width of the fourth input is greater than the fourth bit width and the bit width of the third input is less than or equal to the third bit width; or the bit width of the third input is greater than the third bit width and the bit width of the fourth input is greater than the fourth bit width), the second control circuit is configured, when the bit width of the third input is less than or equal to the bit width of the fourth input and the third bit width is less than or equal to the fourth bit width, to determine the number of invocations of the exponent processing unit and the data input to the exponent processing unit in each invocation according to the bit width of the fourth input and the third bit width. It is worth noting that in all three cases, the number of invocations of the exponent processing unit and the data input to it in each invocation are determined from the larger of the bit widths of the third and fourth inputs and the smaller of the third and fourth bit widths. Of course, when the third and fourth inputs have the same bit width, or the third and fourth bit widths are equal, either of the two equal values may be chosen.

In this embodiment, the terms "first floating-point number" and "second floating-point number" serve only to distinguish the two floating-point numbers; "first" and "second" have no limiting effect. Likewise, the terms "third input" and "fourth input" serve only to distinguish the two inputs of the exponent processing unit, and the terms "third bit width" and "fourth bit width" serve only to distinguish the two maximum processing bit widths that the exponent processing unit supports for those two inputs; "third" and "fourth" therefore have no limiting effect either.

It is worth noting that the floating-point numbers input to the multiplier in the embodiments above are floating-point numbers that conform to the format required by the operation and suit the internal and external components of the multiplier, that is, floating-point numbers that have undergone preprocessing such as normalization. It should be understood that a floating-point number input to the multiplier may be normalized or denormalized. In view of the description of the normalization unit above, if at least one of the two input floating-point numbers is a denormalized non-zero floating-point number, that number may first be normalized by the normalization unit to obtain a normalized exponent and mantissa, and the normalized exponent is then used as the input of the exponent processing unit to perform the floating-point multiplication described above. Of course, other preprocessing may also be applied to a floating-point number, with the exponent of the preprocessed number used as the input of the exponent processing unit for the floating-point multiplication described above; one example is the normalization, mentioned in the description of the normalization unit above, that adapts floating-point numbers to the operation mode. The third and fourth embodiments of the present disclosure are equally applicable to the floating-point operations performed according to the operation mode as described above.

An example of invoking the exponent processing unit multiple times is described in detail below. To make the example clearer and more intuitive, the third input may be, for example, an addend and the fourth input may be, for example, an augend; the third bit width may be, for example, the maximum addend bit width supported by the exponent processing unit, and the fourth bit width may be, for example, the maximum augend bit width supported by the exponent processing unit.

According to an example of the present disclosure in which the exponent processing unit is invoked multiple times, and in combination with the operation-mode-based floating-point multiplication described above, take the case in which the two floating-point numbers input to the multiplier of the present disclosure are denormalized non-zero floating-point numbers. The two floating-point numbers are first normalized, so the mantissa of each is extended by 1 bit. After this preprocessing, the exponents of the two floating-point numbers are matched to the inputs of the exponent processing unit. Then, when the bit width of the addend is greater than the maximum addend bit width and the bit width of the augend is less than or equal to the maximum augend bit width, when the bit width of the augend is greater than the maximum augend bit width and the bit width of the addend is less than or equal to the maximum addend bit width, or when the bit width of the addend is greater than the maximum addend bit width and the bit width of the augend is greater than the maximum augend bit width, the second control circuit may determine the number of invocations of the exponent processing unit according to the following formula:

m = ceil(P/(Q-1)),

where m is the number of invocations of the exponent processing unit, P is the bit width of the augend, Q is the maximum addend bit width, and Q-1 is the bit width of the portion truncated from the addend and from the augend in each call. In each call, a portion Q-1 bits wide is truncated from the addend and from the augend at the same time, so that portions of equal width taken from the same bit positions of the addend and the augend are added together. If the portion truncated in a call holds fewer than Q-1 bits of data, or no data at all, zeros are padded in front of it, or it is filled entirely with zeros, to make up Q-1 bits. A carry bit is then prepended to each truncated portion to form the addend part and the augend part that are input to the exponent processing unit; Q therefore also represents the bit width of the addend part and the augend part input to the exponent processing unit in each call.

Thus, each time the exponent processing unit is invoked, the second control circuit can truncate a (Q-1)-bit portion from the addend and from the augend in the same order as the input of the exponent processing unit, obtain the exponent result of that call through the exponent processing unit, and obtain the final exponent after the exponent processing unit has been invoked m times. It is worth noting that this common order may be from the high-order bits to the low-order bits or from the low-order bits to the high-order bits.

For example, suppose the addend is 6 bits wide, the augend is 9 bits wide, and the maximum addend bit width and maximum augend bit width supported by the exponent processing unit are both 8 bits. The number of invocations of the exponent processing unit is then ceil(9/(8-1)) = 2. The addend is first padded with zeros in front so that its bit width equals that of the augend. In each call, a 7-bit portion is then truncated from the addend and from the augend at the same time, in order from the high-order bits to the low-order bits, and each of the two truncated portions is extended by one carry bit to form two 8-bit operands with carry, which are added. In the second (that is, the last) call, only 2 bits can be truncated from the addend and from the augend, because only 2 bits remain. The 2 bits truncated in the second call are therefore padded with zeros in front to make up 7 bits and extended by one carry bit, again forming two 8-bit operands with carry, which are added.
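The call count and the zero-padded slices in this example can be reproduced with a short Python sketch. The helper names `call_count` and `slices_high_to_low` and the sample operand values are assumptions made for illustration; the sketch covers only the high-to-low order and models the slices as integers, printing them as 7-bit strings so the padding is visible.

```python
def call_count(p, q):
    """m = ceil(P / (Q - 1)), computed with integer arithmetic."""
    w = q - 1
    return (p + w - 1) // w


def slices_high_to_low(value, p, q):
    """Cut a P-bit value into (Q-1)-bit slices, most significant first.

    The last slice may hold fewer than Q-1 data bits; as an integer it is
    implicitly zero-padded in front.
    """
    w = q - 1
    remaining = p
    out = []
    for _ in range(call_count(p, q)):
        take = min(w, remaining)
        remaining -= take
        out.append((value >> remaining) & ((1 << take) - 1))
    return out


P, Q = 9, 8                # augend width and maximum supported width
addend = 0b101101          # 6-bit sample value, zero-extended to 9 bits
augend = 0b110010111       # 9-bit sample value

print(call_count(P, Q))                                          # 2
print([f"{s:07b}" for s in slices_high_to_low(addend, P, Q)])    # ['0001011', '0000001']
print([f"{s:07b}" for s in slices_high_to_low(augend, P, Q)])    # ['1100101', '0000011']
```

Each printed slice would then receive one leading carry bit to form the 8-bit operand fed to the adder.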

It is worth noting that the way the exponent processing unit is invoked in this example, both when the bit width of the addend is greater than the maximum addend bit width while the bit width of the augend is less than or equal to the maximum augend bit width, and when the bit width of the augend is greater than the maximum augend bit width while the bit width of the addend is less than or equal to the maximum addend bit width, applies equally to the third embodiment of the present disclosure described above.

According to an embodiment, the exponent processing unit may further include a second shift-add circuit configured to obtain the exponent after the multiplication operation from the exponent results obtained in each invocation of the exponent processing unit.

Further, the second shift-add circuit includes a second shifter, a second intermediate memory and a second adder. When the second control circuit invokes the exponent processing unit multiple times, after the first call the second shifter shifts the exponent result obtained in the first call and stores the shifted exponent result in the second intermediate memory. From the second call of the exponent processing unit onward, the second shifter shifts the exponent result obtained in the current call, and the second adder adds the shifted exponent result to the value stored in the second intermediate memory and stores the sum in the second intermediate memory to update it. The value stored in the second intermediate memory in the last call is taken as the exponent after the multiplication operation.

Each time the exponent processing unit is invoked, the second shifter shifts the exponent result obtained in the current call as follows: if the addend and the augend are truncated from the high-order bits to the low-order bits when the exponent processing unit is invoked, the result obtained from the portions truncated from the addend and the augend in the current call is shifted to the left, and the number of bit positions shifted equals the number of bits that follow the portion truncated from the augend in the current call.

For example, continuing the example above, suppose the addend is 6 bits wide, the augend is 9 bits wide, the maximum addend bit width and maximum augend bit width supported by the exponent processing unit are both 8 bits, and in each call a 7-bit portion is truncated from the addend and from the augend at the same time, in order from the high-order bits to the low-order bits. Specifically, after the first call of the exponent processing unit, the second shifter shifts the exponent result obtained in the first call to the left by 2 bits (because 2 bits of data follow the portion truncated from the augend in this call) and stores the shifted exponent result in the second intermediate memory. From the second call of the exponent processing unit onward, the second shifter shifts the exponent result obtained in the current call to the left. Because no data follows the portion truncated in the second call, the shift is 0 bits, that is, no shift. The second adder adds the exponent result shifted by 0 bits to the value stored in the second intermediate memory and stores the sum in the second intermediate memory to update it. Since the second call is the last call, the value stored in the second intermediate memory after the second call is the exponent after the multiplication operation.
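The complete shift-and-accumulate addition of this example can be checked with the following Python sketch. It is an illustrative model only: the function name `segmented_add` and the sample values are hypothetical, a plain integer stands in for the second intermediate memory, and only the high-to-low truncation order is modelled.

```python
def segmented_add(addend, augend, p, q):
    """Add two values of up to P bits with an adder limited to Q-bit operands.

    Each call adds one (Q-1)-bit slice of each operand; the partial sum is
    shifted left by the number of bits that follow the slice, then accumulated.
    Returns (sum, list of shift amounts in call order).
    """
    w = q - 1                      # data bits per call; one bit is kept for the carry
    m = (p + w - 1) // w           # m = ceil(P / (Q - 1))
    acc = 0                        # stand-in for the second intermediate memory
    remaining = p
    shifts = []
    for _ in range(m):
        take = min(w, remaining)
        remaining -= take          # bits that follow the slice used in this call
        mask = (1 << take) - 1
        partial = ((addend >> remaining) & mask) + ((augend >> remaining) & mask)
        assert partial < (1 << q)  # the partial sum fits in Q bits including carry
        acc += partial << remaining
        shifts.append(remaining)
    return acc, shifts


# 6-bit addend, 9-bit augend, 8-bit adder inputs (sample values are arbitrary).
total, shifts = segmented_add(0b101101, 0b110010111, 9, 8)
assert total == 0b101101 + 0b110010111
print(shifts)  # [2, 0]
```

The shift amounts 2 and 0 match the two calls described above, and the accumulated value equals the direct sum of the two operands.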

From the cases described in detail above in which the multiplier of the present disclosure (the mantissa processing unit and the exponent processing unit) is invoked multiple times, it can be seen that the control module may include a plurality of sub-modules, which may respectively be used to perform the various operations involved in multiple invocation, for example determining that the mantissa processing unit is to be invoked multiple times, determining the number of invocations, determining the data input to the mantissa processing unit in each invocation, judging whether the mantissa bit width matches the bit width supported by the mantissa processing unit, and adjusting the mantissa inputs. The second control module may likewise include a plurality of sub-modules, which may respectively perform the various operations involved in multiple invocation.

The operations performed by the multiplier of the present disclosure to multiply the mantissas of the first floating-point number and the second floating-point number in a floating-point operation have been described in detail above with reference to FIGS. 4 to 6. To focus on the operation of the mantissa processing unit of the multiplier, FIG. 4 neither depicts nor describes the other units, such as the exponent processing unit and the sign processing unit. The multiplier of the present disclosure is described as a whole below with reference to FIG. 7; the foregoing description of the mantissa processing unit applies equally to the arrangement depicted in FIG. 7.

FIG. 7 is an overall schematic block diagram of a multiplier 700 according to an embodiment of the present disclosure. It should be understood that the positions, presence and connections of the units shown in the figure are merely exemplary and not limiting; for example, some of the units may be integrated, while others may be separated, or omitted or replaced depending on the application scenario.

In each operation mode, the operation of the multiplier of the present disclosure can be divided by way of example into a first stage and a second stage according to the operation flow, as indicated by the dashed line in the figure. In summary, the first stage outputs the calculation result of the sign bit, the intermediate calculation result of the exponent bits, and the intermediate calculation result of the mantissa bits (including, for example, the Booth-algorithm encoding process for fixed-point multiplication of the input mantissa bits and the Wallace tree compression process described above). The second stage performs normalization and rounding on the exponent and mantissa to output the calculation result of the exponent and the calculation result of the mantissa.

As shown in FIG. 7, the multiplier of the present disclosure may include a mode selection unit 702 and a normalization processing unit 704, where the mode selection unit may select an operation mode according to an input mode signal (in_mode). In one embodiment, the input mode signal may correspond to an operation mode number in Table 2. For example, when the input mode signal indicates operation mode number "1" in Table 2, the multiplier operates in the FP16*FP16 operation mode, and when it indicates operation mode number "3" in Table 2, the multiplier operates in the FP32*FP32 operation mode. For illustration purposes, FIG. 7 shows only four exemplary operation modes: FP16*FP16, BF16*BF16, FP32*FP32, and FP32*BF16. As noted above, however, the multiplier of the present disclosure also supports a variety of other operation modes.

The normalization processing unit may be configured to, when the first or the second floating-point number is a denormalized non-zero floating-point number, normalize that floating-point number according to the operation mode to obtain the corresponding exponent and mantissa, for example by normalizing, in accordance with the IEEE 754 standard, a floating-point number in the data format indicated by the operation mode.
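
As a minimal sketch of what such input normalization can look like in software, the following Python function unpacks a bit pattern and, for a denormalized non-zero input, shifts the fraction up until the implicit leading 1 is in place while lowering the exponent accordingly. The field widths are the standard ones for FP16, BF16, and FP32; the function name, the table, and the return convention are illustrative assumptions, and infinities and NaNs are deliberately not handled.

```python
# (exponent bits, fraction bits) for the formats named in the text.
FORMATS = {"FP16": (5, 10), "BF16": (8, 7), "FP32": (8, 23)}

def unpack_normalized(bits, fmt):
    """Return (sign, unbiased exponent, significand with explicit leading 1).

    Hypothetical helper: the value equals
    (-1)**sign * significand * 2**(exponent - frac_bits).
    Infinity and NaN encodings are not treated specially here.
    """
    exp_bits, frac_bits = FORMATS[fmt]
    bias = (1 << (exp_bits - 1)) - 1
    sign = (bits >> (exp_bits + frac_bits)) & 1
    exp_field = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    frac = bits & ((1 << frac_bits) - 1)
    if exp_field == 0:
        if frac == 0:
            return sign, 0, 0                      # zero: nothing to normalize
        # Denormalized input: count how far the leading 1 is from the
        # hidden-bit position and shift it there.
        shift = frac_bits + 1 - frac.bit_length()
        return sign, 1 - bias - shift, frac << shift
    # Normal input: just make the hidden bit explicit.
    return sign, exp_field - bias, frac | (1 << frac_bits)
```

For instance, the smallest FP16 denormal `0x0001` (value 2^-24) comes out with an 11-bit significand of `1 << 10` and an exponent of -24.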

Further, the multiplier includes a mantissa processing unit to perform the multiplication of the mantissa of the first floating-point number by the mantissa of the second floating-point number. To this end, in one or more embodiments, the mantissa processing unit may include a bit extension circuit 706, a Booth encoder 708, a partial product generation circuit 710, a Wallace tree compressor 712, and an adder 714. The bit extension circuit may be used to extend the bit width of the mantissa of at least one of the first floating-point number and the second floating-point number, for example by padding the high-order bits with 0, so that the mantissa suits the operation of the Booth encoder. The control circuit may carry out the above-described multiple invocations of the mantissa processing unit on the mantissa obtained after the bit extension circuit has performed sign-bit extension on it. Since the Booth encoder, the partial product generation circuit, the Wallace tree compressor, and the adder have already been described in detail with reference to FIGS. 4 to 6, that description applies equally here and is not repeated.
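
To illustrate why an unsigned mantissa is padded with a high-order 0 before Booth encoding, the sketch below recodes a mantissa into radix-4 Booth digits and rebuilds the product from the resulting partial products. Radix-4 recoding is an assumption made for the example (this passage does not state the radix), the function names are hypothetical, and the plain Python sum stands in for the Wallace tree compressor and adder.

```python
# Radix-4 Booth table: window (b[2k+1], b[2k], b[2k-1]) -> signed digit.
BOOTH_DIGIT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_digits(mantissa, width):
    """Radix-4 Booth digits of an unsigned mantissa, least significant first."""
    assert 0 <= mantissa < (1 << width)
    # Zero-extend to an even width strictly greater than `width`, so the
    # top bit is 0 and the recoded value is not read as negative.
    ext_width = width + 1 + (width + 1) % 2
    padded = mantissa << 1          # implicit 0 below the least significant bit
    return [BOOTH_DIGIT[(padded >> i) & 0b111] for i in range(0, ext_width, 2)]

def booth_multiply(a, b, width):
    """Sum the partial products d_k * a * 4**k selected by the Booth digits of b."""
    return sum(d * a * 4 ** k for k, d in enumerate(booth_digits(b, width)))
```

Without the extension, the 4-bit mantissa `0b1111` would be recoded as `[-1, 0]`, that is, as -1 rather than 15; with it the digits are `[-1, 0, 1]`.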

In some embodiments, the multiplier of the present disclosure further includes a regularization unit 716 and a rounding unit 718, which have the same functions as the corresponding units shown in FIG. 3. Specifically, the regularization unit may, according to the data format indicated by the output mode signal "out_mode" shown in FIG. 7, perform floating-point regularization on the summation result and on the exponent data from the exponent processing unit to obtain a regularized exponent result and a regularized mantissa result. For example, according to the data format indicated by the output mode signal, the regularization unit may adjust the bit widths of the exponent and the mantissa so that they meet the requirements of that data format. As another example, when the highest bit of the mantissa is 0 and the mantissa is not 0, the regularization unit may repeatedly shift the mantissa left by 1 bit and decrement the exponent by 1 until the highest bit is 1. As for the rounding unit, in one embodiment it may be used to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain a rounded mantissa, and to take the rounded mantissa as the mantissa after the multiplication.
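
The shift-and-decrement rule just described can be sketched in a few lines of Python. The function name and the explicit `width` parameter are assumptions for illustration; a hardware unit would typically use a leading-zero counter and a barrel shifter rather than a loop.

```python
def regularize(mantissa, exponent, width):
    """Shift left until the top bit of a `width`-bit mantissa is 1.

    Each 1-bit left shift is compensated by decrementing the exponent.
    A zero mantissa is returned unchanged, as in the text.
    """
    assert 0 <= mantissa < (1 << width)
    if mantissa == 0:
        return mantissa, exponent
    msb = 1 << (width - 1)
    while not mantissa & msb:
        mantissa <<= 1
        exponent -= 1
    return mantissa, exponent
```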

In one or more embodiments, the aforementioned output mode signal may be part of the operation mode and may indicate the data format after the multiplication. For example, as described in Table 3 above, when the operation mode number is "12", the digit "1" may correspond to the aforementioned "in_mode" signal and indicate that an FP16*FP16 multiplication is to be performed, while the digit "2" may correspond to the "out_mode" signal and indicate that the data type of the output result is BF16. It can therefore be understood that, in some application scenarios, the output mode signal may be combined with the aforementioned input mode signal and provided to the mode selection unit. Based on this combined mode signal, the mode selection unit can determine the data formats of both the input data and the output result at the initial stage of the multiplier's operation, without separately providing the output mode signal to the regularization unit, which further simplifies operation.
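
The combined mode number can be split back into its two parts as sketched below. The decimal two-digit encoding follows the "12" example in the text; the lookup tables contain only the entries quoted in this passage (the full Tables 2 and 3 are not reproduced here), and all names are hypothetical.

```python
# Partial tables: only the entries mentioned in the surrounding text.
IN_MODE = {1: "FP16*FP16", 3: "FP32*FP32"}
OUT_MODE = {2: "BF16"}

def split_mode(mode_number):
    """Split a combined mode number such as 12 into (in_mode, out_mode)."""
    in_mode, out_mode = divmod(mode_number, 10)
    return in_mode, out_mode

def describe_mode(mode_number):
    """Look up the input and output formats, or 'unknown' if not tabulated."""
    in_mode, out_mode = split_mode(mode_number)
    return IN_MODE.get(in_mode, "unknown"), OUT_MODE.get(out_mode, "unknown")
```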

In one or more embodiments, the aforementioned rounding operation may exemplarily include the following five rounding modes.

(1) Round to nearest: in this mode, ties go to even. The result is rounded to the nearest representable value; when two representable values are equally near, the even one (the one ending in 0 in binary) is taken as the rounded result;

(2) Round half up: see the example below for an exemplary operation;

(3) Round toward +∞: under this rule, the result is rounded in the direction of positive infinity;

(4) Round toward -∞: under this rule, the result is rounded in the direction of negative infinity; and

(5) Round toward 0: under this rule, the result is rounded in the direction of 0.
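
The five modes above can be compared side by side with a small sketch that rounds a sign-magnitude mantissa by dropping its low-order bits. The mode names, the function name, and the sign-magnitude convention (a separate `negative` flag) are assumptions for the example, not identifiers from the disclosure.

```python
def round_magnitude(value, drop_bits, mode, negative=False):
    """Drop `drop_bits` (>= 1) low bits of an unsigned magnitude under `mode`."""
    kept = value >> drop_bits
    rem = value & ((1 << drop_bits) - 1)      # the discarded bits
    half = 1 << (drop_bits - 1)               # exactly halfway
    if mode == "nearest_even":
        up = rem > half or (rem == half and kept & 1)
    elif mode == "half_up":
        up = rem >= half                      # top discarded bit is 1
    elif mode == "toward_pos_inf":
        up = rem != 0 and not negative        # magnitude grows only if positive
    elif mode == "toward_neg_inf":
        up = rem != 0 and negative            # magnitude grows only if negative
    elif mode == "toward_zero":
        up = False                            # plain truncation
    else:
        raise ValueError(mode)
    return kept + (1 if up else 0)
```

The tie case shows the difference between modes (1) and (2): dropping two bits from `0b1010` gives 2 under round to nearest (even) but 3 under round half up.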

An example of mantissa rounding in the "round half up" mode is as follows. Multiplying the 24-bit mantissas of two normalized floating-point numbers yields a 48-bit mantissa (bits 47 to 0). After normalization (if the highest bit of the mantissa is 0, the mantissa is shifted left by 1 bit; if the highest bit is 1, the mantissa is left unchanged and the previously computed temporary exponent is incremented by 1), only bits 46 to 24 are taken for output. When bit 23 of the mantissa is 0, bits 23 to 0 are discarded; when bit 23 is 1, a carry of 1 is added into bit 24 and bits 23 to 0 are discarded.
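
The worked example can be traced with the following sketch, which follows the steps of the paragraph for two 24-bit mantissas with an explicit leading 1. The function name and return convention are illustrative. The final renormalization after a rounding carry is an added assumption, since the text does not say what happens when the carry ripples out of bit 47.

```python
def multiply_and_round(m1, m2):
    """Return (23-bit fraction, exponent adjustment) for two 24-bit mantissas."""
    assert (m1 >> 23) == 1 and (m2 >> 23) == 1   # normalized: leading 1 at bit 23
    product = m1 * m2                            # 48-bit product, bits 47..0
    if product >> 47:
        exp_adjust = 1                           # top bit set: bump the exponent
    else:
        product <<= 1                            # top bit clear: shift left by 1
        exp_adjust = 0
    # Keep bits 47..24 and add bit 23 as the round-half-up carry into bit 24.
    sig = (product >> 24) + ((product >> 23) & 1)
    if sig >> 24:                                # assumption: renormalize on carry-out
        sig >>= 1
        exp_adjust += 1
    return sig & 0x7FFFFF, exp_adjust            # bits 46..24 form the fraction
```

For example, 1.5 × 1.5 = 2.25 = 1.125 × 2, so `multiply_and_round(0xC00000, 0xC00000)` yields the fraction `0x100000` (0.125) with an exponent adjustment of 1.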

Returning to FIG. 7, the multiplier of the present disclosure further includes an exponent processing unit 720 and a sign processing unit 722. The exponent processing unit may be used to obtain the exponent after the multiplication according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number. For example, the exponent processing circuit may add the exponent bit data of the first floating-point number, the exponent bit data of the second floating-point number, and the offset values of their respective input floating-point data types, and subtract the offset value of the output floating-point data type, to obtain the exponent bit data of the product of the first and second floating-point numbers. In one or more embodiments, the exponent processing unit may be implemented as or include an addition-subtraction circuit that obtains the exponent after the multiplication according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number.
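
The exponent path can be sketched as a single add-subtract expression. The sketch uses the usual IEEE 754 convention, in which the stored exponent field equals the true exponent plus a format-specific bias (15 for FP16, 127 for BF16 and FP32); the sign attached to the "offset value" in the text may be defined differently, so the signs below are an assumption. The result is the intermediate exponent before any adjustment by normalization or rounding.

```python
# Exponent biases of the formats named in the text (standard values).
BIAS = {"FP16": 15, "BF16": 127, "FP32": 127}

def product_exponent_field(e1, fmt1, e2, fmt2, out_fmt):
    """Biased exponent field of the product, before normalization and rounding.

    e1 and e2 are the stored (biased) exponent fields of the two inputs.
    """
    true_exponent = (e1 - BIAS[fmt1]) + (e2 - BIAS[fmt2])
    return true_exponent + BIAS[out_fmt]
```

For example, FP16 2.0 (field 16) times FP16 4.0 (field 17) with a BF16 output gives 16 + 17 - 30 + 127 = 130, the BF16 exponent field of 8.0.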

In one embodiment, the sign processing unit may be implemented as an XOR circuit that performs an XOR operation on the sign bit data of the first and second floating-point numbers to obtain the sign bit data of their product.
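
A short sketch of the sign path, assuming the sign is the most significant bit of each operand's bit pattern (as in FP16, BF16, and FP32); the function name and the explicit width parameters are hypothetical.

```python
def product_sign(bits1, width1, bits2, width2):
    """XOR of the two sign bits: 1 if exactly one operand is negative."""
    sign1 = (bits1 >> (width1 - 1)) & 1
    sign2 = (bits2 >> (width2 - 1)) & 1
    return sign1 ^ sign2
```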

The multiplier of the present disclosure as a whole has been described in detail above with reference to FIG. 7. From this description, those of ordinary skill in the art will understand that the multiplier of the present disclosure supports operation in multiple operation modes, thereby overcoming the drawback of prior-art multipliers that support only a single type of floating-point operation. Further, because the multiplier of the present disclosure can be reused, it also supports floating-point data of high bit width, which reduces computation cost and overhead. In one or more embodiments, the multiplier of the present disclosure may also be arranged as, or included in, an integrated circuit chip or a computing device in order to perform multiplication of floating-point numbers in multiple operation modes.

FIG. 8 is a flowchart illustrating a method 800 of performing floating-point multiplication using a multiplier according to an embodiment of the present disclosure. It will be understood that the multiplier referred to here is the multiplier described in detail above with reference to FIGS. 1 to 7, so the foregoing description of that multiplier and of its internal composition, functions, and operations applies equally here.

As shown in FIG. 8, the method 800 may include, at step S802, using the exponent processing unit of the multiplier to obtain the exponent after the multiplication according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number. As described above, the operation mode may be one of multiple operation modes and may be used to indicate the data format of the floating-point numbers. In one or more embodiments, the operation mode may also be used to determine the floating-point data format of the output result.

Next, at step S804, the method 800 may use the mantissa processing unit of the multiplier to obtain the mantissa after the multiplication according to the operation mode, the first floating-point number, and the second floating-point number. For the mantissa operations, some preferred embodiments of the present disclosure use the Booth encoding algorithm and a Wallace tree compressor to improve the efficiency of mantissa processing. In addition, when the first and second floating-point numbers are signed, the method 800 may further, at step S806, obtain the sign after the multiplication according to the sign of the first floating-point number and the sign of the second floating-point number.

Although the above method presents floating-point multiplication with the multiplier of the present disclosure as a sequence of steps, this ordering does not mean that the steps must be performed in the stated order; they may be processed in other orders or in parallel. In addition, other steps of the method 800 are not set out here for brevity, but those of ordinary skill in the art will understand from the present disclosure that the method may also use the multiplier to perform the various operations described above with reference to FIGS. 1 to 7.

In the above embodiments of the present disclosure, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

FIG. 9 is a structural diagram illustrating a combined processing device 900 according to an embodiment of the present disclosure. As shown, the combined processing device 900 includes a computing device 902, which may include the multiplier of the present disclosure described above with reference to the figures. The combined processing device further includes a universal interconnection interface 904 and other processing devices 906. The computing device according to the present disclosure interacts with the other processing devices to jointly complete an operation specified by the user.

According to the solution of the present disclosure, the other processing devices may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit ("CPU"), a graphics processing unit ("GPU"), and a neural network processor; their number is not limited and is determined by actual needs. In one or more embodiments, the other processing devices may serve as the interface between the computing device of the present disclosure (which may be embodied as a machine learning computing device) and external data and control, performing tasks including but not limited to data movement and basic control of the machine learning computing device such as starting and stopping it; the other processing devices may also cooperate with the machine learning computing device to complete computing tasks jointly.

According to the solution of the present disclosure, the universal interconnection interface may be used to transfer data and control instructions between the computing device and the other processing devices. For example, the computing device may obtain required input data from the other processing devices via the universal interconnection interface and write it into an on-chip storage device of the computing device. Further, the computing device may obtain control instructions from the other processing devices via the universal interconnection interface and write them into an on-chip control cache of the computing device. Alternatively or optionally, the universal interconnection interface may also read data from a storage module of the computing device and transfer it to the other processing devices.

Optionally, the combined processing device may further include a storage device 908, which may be connected to the computing device and to the other processing devices respectively. In one or more embodiments, the storage device may be used to store data of the computing device and of the other processing devices, and is particularly suited to data to be operated on that cannot be held entirely in the internal storage of the computing device or of the other processing devices.

Depending on the application scenario, the combined processing device of the present disclosure may serve as a system on chip (SoC) for devices such as mobile phones, robots, drones, video capture equipment, and video surveillance equipment, thereby effectively reducing the core area of the control portion, increasing processing speed, and lowering overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the device, which may be, for example, a monitor, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.

In some embodiments, the present disclosure also discloses a chip or integrated circuit chip that includes the above computing device, the combined processing device, and the multiplier of the present disclosure. In other embodiments, the present disclosure also discloses a chip package structure that includes the above chip.

In some embodiments, the present disclosure also discloses a board card that includes the above chip package structure. FIG. 10 shows such an exemplary board card 1000. In addition to the above chip 1002, the board card 1000 may include other supporting components, including but not limited to a storage device 1004, an interface device 1006, and a control device 1008.

The storage device is connected through a bus to the chip in the chip package structure and is used to store data. The storage device may include multiple groups of storage units 1010, each group being connected to the chip through a bus. It will be understood that each group of storage units may be DDR SDRAM (double data rate synchronous dynamic random access memory).

DDR doubles the speed of SDRAM without raising the clock frequency, because it allows data to be read on both the rising and the falling edge of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the storage device may include four groups of storage units, and each group may include multiple DDR4 devices (chips). In one embodiment, the chip may contain four 72-bit DDR4 controllers; of the 72 bits of each controller, 64 bits are used for data transfer and 8 bits for ECC checking.

In one embodiment, each group of storage units may include multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice within one clock cycle. A controller for the DDR is provided in the chip to control the data transfer and data storage of each storage unit.

The interface device is electrically connected to the chip in the chip package structure and is used to transfer data between the chip and an external device 1012 (for example, a server or a computer). In one embodiment, for example, the interface device may be a standard PCIe interface, in which case the data to be processed is transferred from the server to the chip through the standard PCIe interface. In another embodiment, the interface device may be another kind of interface; the present disclosure does not limit the specific form of such other interfaces, provided the interface unit can perform the transfer function. In addition, the calculation results of the chip are sent back to the external device (for example, the server) by the interface device.

The control device is electrically connected to the chip in order to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit ("MCU"). The chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads. The chip can therefore be in different working states, such as multi-load and light-load states. The control device makes it possible to regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.

In some embodiments, the present disclosure also discloses an electronic device or apparatus that includes the above board card. Depending on the application scenario, the electronic device or apparatus may include a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a monitor, a server, a cloud server, a camera, a video camera, a projector, a watch, headphones, mobile storage, a wearable device, a vehicle, a household appliance, and/or medical equipment. The vehicle includes an airplane, a ship, and/or a motor vehicle; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical equipment includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.

It should be noted that, for simplicity of description, the foregoing method embodiments are each presented as a series of combined actions; those of ordinary skill in the art should understand, however, that the present disclosure is not limited by the described order of actions, because according to the present disclosure certain steps may be performed in other orders or simultaneously. Those of ordinary skill in the art should also understand that the embodiments described in the specification are all optional embodiments, and that the actions and modules involved are not necessarily required by the present disclosure.

In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through certain interfaces, devices, or units, and may be electrical, optical, acoustic, magnetic, or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, may each exist physically on their own, or may be integrated two or more to a unit. The integrated unit may be implemented in hardware or as a software program module.

If the integrated unit is implemented as a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. On this understanding, the technical solution of the present disclosure may be embodied in the form of a software product; the computer software product is stored in a memory and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, read-only memory ("ROM"), random access memory ("RAM"), a removable hard disk, a magnetic disk, or an optical disc.

The foregoing may be better understood in light of the following clauses:

Clause A1. A multiplier for performing multiplication of floating-point numbers, wherein the multiplier comprises:

a mantissa processing unit configured to obtain the mantissa after the multiplication according to the mantissas of the floating-point numbers,

the mantissa processing unit comprising a control circuit configured to invoke the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time.
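
The reuse described in Clause A1 can be sketched as follows: a fixed-width multiplier is invoked once per pair of operand slices and the partial results are combined by shift-and-add. The 12-bit unit width, the function names, and the shift-and-add combination are assumptions chosen for illustration; the clause itself only requires that the unit be invoked multiple times.

```python
UNIT_WIDTH = 12  # hypothetical width handled by one invocation

def mantissa_unit(a, b):
    """Stand-in for one pass through the fixed-width mantissa processing unit."""
    assert a < (1 << UNIT_WIDTH) and b < (1 << UNIT_WIDTH)
    return a * b

def multiply_with_reuse(a, b, width):
    """Multiply two `width`-bit mantissas; return (product, number of invocations)."""
    mask = (1 << UNIT_WIDTH) - 1
    chunks = -(-width // UNIT_WIDTH)             # ceiling division
    total, calls = 0, 0
    for i in range(chunks):
        for j in range(chunks):
            part = mantissa_unit((a >> (i * UNIT_WIDTH)) & mask,
                                 (b >> (j * UNIT_WIDTH)) & mask)
            total += part << ((i + j) * UNIT_WIDTH)   # align the partial product
            calls += 1
    return total, calls
```

With these assumptions, two 24-bit mantissas need four invocations, while two 11-bit mantissas fit in a single one.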

Clause A2. The multiplier of Clause A1, wherein the two floating-point numbers comprise a first floating-point number and a second floating-point number; the mantissa processing unit supports a first bit width and a second bit width; the mantissa of the first floating-point number serves as a first input corresponding to the first bit width, and the mantissa of the second floating-point number serves as a second input corresponding to the second bit width; the bit width of the first input is less than or equal to the first bit width; and the control circuit is configured to invoke the mantissa processing unit multiple times to obtain the mantissa after the multiplication when the bit width of the second input is greater than the second bit width.

Clause A3. The multiplier of Clause A1 or Clause A2, wherein the two floating-point numbers comprise a first floating-point number and a second floating-point number; the mantissa processing unit supports a first bit width and a second bit width; the mantissa of the first floating-point number serves as a first input corresponding to the first bit width, and the mantissa of the second floating-point number serves as a second input corresponding to the second bit width; and the control circuit is configured to invoke the mantissa processing unit multiple times to obtain the mantissa after the multiplication when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, when the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, or when the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width.

Clause A4. The multiplier of any one of clauses A1-A3, wherein, when the mantissa bit width of the first floating-point number is less than the mantissa bit width of the second floating-point number and the first bit width is greater than the second bit width, or when the mantissa bit width of the first floating-point number is greater than the mantissa bit width of the second floating-point number and the first bit width is less than the second bit width, the control circuit selects the mantissa of the first floating-point number as the second input corresponding to the second bit width and selects the mantissa of the second floating-point number as the first input corresponding to the first bit width.

Clause A5. The multiplier of any one of clauses A1-A4, wherein, when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, the control circuit determines, according to the bit width of the first input and the first bit width, the number of times the mantissa processing unit is invoked and the data input to the mantissa processing unit in each invocation.

Clause A6. The multiplier of any one of clauses A1-A5, wherein, when the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, the control circuit determines, according to the bit width of the second input and the second bit width, the number of times the mantissa processing unit is invoked and the data input to the mantissa processing unit in each invocation.

Clause A7. The multiplier of any one of clauses A1-A6, wherein, when the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width, the control circuit determines, according to the bit width of the first input and the first bit width together with the bit width of the second input and the second bit width, the number of times the mantissa processing unit is invoked and the data input to the mantissa processing unit in each invocation.
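As a non-limiting illustration of clauses A5 to A7, the following sketch shows one way a control circuit might derive the number of invocations and the operand segments for each invocation. The function name `plan_calls`, the low-to-high segmentation order, and the use of a ceiling division are assumptions made for illustration; the clauses do not prescribe them.

```python
from math import ceil

def plan_calls(a_width, b_width, unit_a_width, unit_b_width):
    """Return one (a_shift, b_shift) pair per invocation of the mantissa unit.

    a_width / b_width: bit widths of the first and second inputs.
    unit_a_width / unit_b_width: first and second bit widths supported by the unit.
    Assumption: each input is cut into segments no wider than the supported
    width, starting from the least significant bit.
    """
    a_segments = max(1, ceil(a_width / unit_a_width))
    b_segments = max(1, ceil(b_width / unit_b_width))
    return [(i * unit_a_width, j * unit_b_width)
            for i in range(a_segments)
            for j in range(b_segments)]

# Only the second input is too wide: two invocations (the clause A6 case).
print(plan_calls(11, 24, 12, 12))
# Both inputs are too wide: four invocations (the clause A7 case).
print(len(plan_calls(24, 24, 12, 12)))
```

Under these assumptions the number of invocations is the product of the two segment counts, and each returned pair records how far the corresponding partial result would later need to be shifted.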

Clause A8. The multiplier of any one of clauses A1-A7, wherein the mantissa processing unit further comprises a shift-add circuit configured to obtain the mantissa after the multiplication operation from the mantissa results obtained in each invocation of the mantissa processing unit.

Clause A9. The multiplier of any one of clauses A1-A8, wherein the shift-add circuit comprises a shifter, an intermediate memory, and an adder. When the control circuit invokes the mantissa processing unit multiple times: after the first invocation, the shifter shifts the mantissa result obtained in the first invocation to obtain a shifted mantissa result and stores the shifted mantissa result in the intermediate memory; from the second invocation onward, the shifter shifts the mantissa result obtained in the current invocation to obtain a current mantissa result, and the adder adds the current mantissa result to the result stored in the intermediate memory and stores the sum in the intermediate memory to update it; and the result stored in the intermediate memory after the last invocation serves as the mantissa after the multiplication operation.
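The shift-and-accumulate behaviour of clause A9 can be sketched in software as follows. This is a minimal model under stated assumptions, not a description of the actual circuit: the function name `multiply_wide` is hypothetical, operands are treated as unsigned integers, and segments are taken from the least significant end.

```python
def multiply_wide(a, b, a_width, b_width, unit_a_width, unit_b_width):
    """Multiply two unsigned mantissas using a narrower multiplier repeatedly.

    Each loop iteration models one invocation of the mantissa processing
    unit; `acc` models the intermediate memory of the shift-add circuit.
    """
    mask_a = (1 << unit_a_width) - 1
    mask_b = (1 << unit_b_width) - 1
    acc = 0  # intermediate memory, empty before the first invocation
    for a_shift in range(0, a_width, unit_a_width):
        for b_shift in range(0, b_width, unit_b_width):
            # One invocation: multiply a segment of each operand.
            partial = ((a >> a_shift) & mask_a) * ((b >> b_shift) & mask_b)
            # Shifter aligns the partial result; adder updates the memory.
            acc += partial << (a_shift + b_shift)
    return acc

# A 24-bit by 24-bit product built from four 12-bit by 12-bit invocations.
x, y = 0xABCDEF, 0x123456
assert multiply_wide(x, y, 24, 24, 12, 12) == x * y
```

The result equals the full-width product because multiplication distributes over the segment sums; the shift applied to each partial result is the sum of the two segment offsets.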

Clause A10. The multiplier of any one of clauses A1-A9, wherein the multiplier further comprises an exponent processing unit configured to obtain the exponent after the multiplication operation according to the exponents of the two floating-point numbers, and the exponent processing unit comprises a second control circuit configured to determine, according to the exponent bit width of one of the two floating-point numbers and one of the two bit widths supported by the exponent processing unit, or according to the exponent bit widths of the two floating-point numbers and the two bit widths supported by the exponent processing unit, that the exponent processing unit is to be invoked multiple times to obtain the exponent after the multiplication operation.

Clause A11. The multiplier of any one of clauses A1-A10, wherein the two floating-point numbers comprise a first floating-point number and a second floating-point number; the exponent processing unit supports a third bit width and a fourth bit width; the exponent of the first floating-point number serves as a third input corresponding to the third bit width, and the exponent of the second floating-point number serves as a fourth input corresponding to the fourth bit width; the bit width of the third input is less than or equal to the third bit width; and the second control circuit is configured to invoke the exponent processing unit multiple times to obtain the exponent after the multiplication operation when the bit width of the fourth input is greater than the fourth bit width.

Clause A12. The multiplier of any one of clauses A1-A11, wherein the two floating-point numbers comprise a first floating-point number and a second floating-point number; the exponent processing unit supports a third bit width and a fourth bit width; the exponent of the first floating-point number serves as a third input corresponding to the third bit width, and the exponent of the second floating-point number serves as a fourth input corresponding to the fourth bit width; and the second control circuit is configured to invoke the exponent processing unit multiple times to obtain the exponent after the multiplication operation in any of the following cases: when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is less than or equal to the fourth bit width; when the bit width of the fourth input is greater than the fourth bit width and the bit width of the third input is less than or equal to the third bit width; or when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is greater than the fourth bit width.

Clause A13. The multiplier of any one of clauses A1-A12, wherein, when the exponent bit width of the first floating-point number is less than the exponent bit width of the second floating-point number and the third bit width is greater than the fourth bit width, or when the exponent bit width of the first floating-point number is greater than the exponent bit width of the second floating-point number and the third bit width is less than the fourth bit width, the second control circuit selects the exponent of the first floating-point number as the fourth input corresponding to the fourth bit width and selects the exponent of the second floating-point number as the third input corresponding to the third bit width.

Clause A14. The multiplier of any one of clauses A1-A13, wherein the second control circuit is configured to, when the bit width of the third input is less than or equal to the bit width of the fourth input and the third bit width is less than or equal to the fourth bit width, determine, according to the bit width of the fourth input and the third bit width, the number of times the exponent processing unit is invoked and the data input to the exponent processing unit in each invocation.

Clause A15. The multiplier of any one of clauses A1-A14, wherein the exponent processing unit further comprises a second shift-add circuit configured to obtain the exponent after the multiplication operation from the exponent results obtained in each invocation of the exponent processing unit.

Clause A16. The multiplier of any one of clauses A1-A15, wherein the mantissa processing unit comprises a partial product operation unit and a partial product summation unit, the partial product operation unit being configured to obtain intermediate results according to the mantissas of the two floating-point numbers, and the partial product summation unit being configured to sum the intermediate results to obtain a summation result and to use the summation result as the mantissa after the multiplication operation.

Clause A17. The multiplier of any one of clauses A1-A16, wherein the partial product operation unit comprises a Booth encoding circuit configured to perform Booth encoding on the mantissa of the first floating-point number or the second floating-point number to obtain the intermediate results.
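By way of illustration only, the sketch below recodes an unsigned mantissa into radix-4 Booth digits, from which partial products can be formed. The choice of radix 4 and the function name `booth_radix4_digits` are assumptions; clause A17 requires only that Booth encoding be applied.

```python
def booth_radix4_digits(multiplier, width):
    """Recode an unsigned `width`-bit multiplier into radix-4 Booth digits.

    Returns digits in {-2, -1, 0, 1, 2}, least significant first, such that
    multiplier == sum(d * 4**i for i, d in enumerate(digits)).
    """
    n_digits = width // 2 + 1      # one extra digit covers zero extension
    padded = multiplier << 1       # append the implicit bit y[-1] = 0
    digits = []
    for i in range(n_digits):
        window = (padded >> (2 * i)) & 0b111   # bits y[2i+1], y[2i], y[2i-1]
        low = window & 1
        mid = (window >> 1) & 1
        high = (window >> 2) & 1
        digits.append(low + mid - 2 * high)
    return digits

def booth_multiply(multiplicand, multiplier, width):
    """Form one partial product per Booth digit and sum them."""
    digits = booth_radix4_digits(multiplier, width)
    return sum((d * multiplicand) << (2 * i) for i, d in enumerate(digits))

assert booth_multiply(0b10110011, 0b11010110, 8) == 0b10110011 * 0b11010110
```

Under these assumptions an 8-bit mantissa yields five partial products rather than eight, each a simple multiple (0, ±1, ±2) of the other mantissa.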

Clause A18. The multiplier of any one of clauses A1-A17, wherein the partial product summation unit comprises an adder configured to sum the intermediate results to obtain the summation result.

Clause A19. The multiplier of any one of clauses A1-A18, wherein the partial product summation unit comprises a Wallace tree and an adder, the Wallace tree being configured to sum the intermediate results to obtain second intermediate results, and the adder being configured to sum the second intermediate results to obtain the summation result.
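The two-stage summation of clause A19 can be modelled, as a simplified sketch, by reducing the partial products with 3:2 carry-save compressors until two rows remain and then adding those rows with an ordinary adder. The sketch operates on whole rows rather than on per-bit trees, which is a simplification, and the function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 compressor on whole rows: a + b + c == sum_row + carry_row."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row

def wallace_sum(rows):
    """Reduce non-negative rows to at most two, then add them."""
    rows = list(rows)
    while len(rows) > 2:
        next_rows = []
        # Compress every complete group of three rows into two rows.
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        # Rows left over (zero, one, or two) pass through unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
    return sum(rows)  # final carry-propagating adder

assert wallace_sum([3, 5, 7, 9, 11]) == 35
```

The two rows remaining after the loop correspond to the second intermediate results, and the final `sum` stands in for the adder that produces the summation result.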

Clause A20. The multiplier of any one of clauses A1-A19, wherein the adder comprises at least one of a full adder, a serial adder, and a carry-lookahead adder.

Clause A21. The multiplier of any one of clauses A1-A20, wherein, when the number of intermediate results is less than M, zero values are added as intermediate results so that the number of intermediate results equals M, where M is a preset positive integer.

Clause A22. The multiplier of any one of clauses A1-A21, wherein each Wallace tree has M inputs and N outputs, and the number of Wallace trees is not less than K, where N is a preset positive integer less than M, and K is a positive integer not less than the maximum bit width of the intermediate results.

Clause A23. The multiplier of any one of clauses A1-A22, wherein the partial product summation unit is configured to select one or more groups of the Wallace trees to sum the intermediate results, each group comprising X Wallace trees, X being the number of bits of the intermediate results, wherein carries pass sequentially between the Wallace trees within each group, and no carry relationship exists between Wallace trees of different groups.

Clause A24. The multiplier of any one of clauses A1-A23, wherein the multiplier further comprises:

a normalization processing unit configured to, when at least one of the two floating-point numbers is a denormalized non-zero floating-point number, normalize the at least one floating-point number to obtain the corresponding exponent and mantissa.
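As a hedged illustration of the normalization in clause A24, the sketch below converts a denormalized non-zero mantissa field into a significand with its leading bit set and an adjusted exponent. It assumes an IEEE 754 style encoding in which a denormalized value equals mantissa × 2^(1 − bias − mantissa_bits); the function name and return convention are hypothetical.

```python
def normalize_subnormal(mantissa, mantissa_bits, bias):
    """Normalize a denormalized non-zero number.

    Returns (exponent, significand), where the significand has
    mantissa_bits + 1 bits with the top (normally hidden) bit set, and the
    value equals significand * 2**(exponent - mantissa_bits).
    """
    if mantissa == 0:
        raise ValueError("zero has no normalized form")
    # Number of positions needed to bring the leading 1 to the hidden-bit slot.
    shift = mantissa_bits + 1 - mantissa.bit_length()
    significand = mantissa << shift
    exponent = (1 - bias) - shift   # unbiased exponent after the shift
    return exponent, significand

# Smallest half-precision denormal (10 mantissa bits, bias 15): 2**-24.
print(normalize_subnormal(1, 10, 15))
```

After this step the mantissa path can treat denormalized and normalized operands uniformly, at the cost of a wider internal exponent range.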

Clause A25. The multiplier of any one of clauses A1-A24, wherein the multiplier is configured to perform the multiplication operation on the two floating-point numbers according to an operation mode that indicates the data formats of the two floating-point numbers; the mantissa processing unit is configured to obtain the mantissa after the multiplication operation according to the operation mode and the mantissas of the two floating-point numbers; and the exponent processing unit is configured to obtain the exponent after the multiplication operation according to the operation mode and the exponents of the two floating-point numbers.

Clause A26. The multiplier of any one of clauses A1-A25, wherein the normalization processing unit is further configured to normalize, according to the operation mode, at least one of the two floating-point numbers to obtain the corresponding exponent and mantissa.

Clause A27. The multiplier of any one of clauses A1-A26, wherein the data format comprises at least one of a half-precision floating-point number, a single-precision floating-point number, a brain floating-point number, a double-precision floating-point number, and a custom floating-point number.
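For reference, the sketch below lists the exponent and mantissa field widths of the standard formats named in clause A27 and splits a raw bit pattern into its fields. The widths shown are those of IEEE 754 half, single, and double precision and of the common 16-bit brain floating-point layout; the dictionary keys and the function name `split_fields` are illustrative assumptions, and custom formats are not covered.

```python
# (exponent bits, mantissa bits); every format has one leading sign bit.
FORMATS = {
    "fp16": (5, 10),    # half precision
    "bf16": (8, 7),     # brain floating point
    "fp32": (8, 23),    # single precision
    "fp64": (11, 52),   # double precision
}

def split_fields(bits, fmt):
    """Split a raw bit pattern into (sign, biased exponent, mantissa)."""
    exp_bits, man_bits = FORMATS[fmt]
    mantissa = bits & ((1 << man_bits) - 1)
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    sign = (bits >> (exp_bits + man_bits)) & 1
    return sign, exponent, mantissa

# 1.0 in half precision and -1.0 in brain floating point.
print(split_fields(0x3C00, "fp16"))
print(split_fields(0xBF80, "bf16"))
```

Because the mantissa widths range from 7 to 52 bits, a unit sized for the narrower formats would need multiple invocations for the wider ones, which is the situation the earlier clauses address.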

Clause A28. The multiplier of any one of clauses A1-A27, wherein the mantissa processing unit comprises a bit-width extension circuit configured to extend the number of bits of the mantissa of at least one of the first floating-point number and the second floating-point number.

Clause A29. The multiplier of any one of clauses A1-A28, wherein each floating-point number further comprises a sign, and the multiplier further comprises:

a sign processing unit configured to obtain the sign after the multiplication operation according to the signs of the two floating-point numbers.

Clause A30. The multiplier of any one of clauses A1-A29, wherein the sign processing unit comprises an exclusive-OR (XOR) logic circuit configured to perform an XOR operation on the signs of the two floating-point numbers to obtain the sign after the multiplication operation.

Clause A31. The multiplier of any one of clauses A1-A30, further comprising a regularization unit configured to:

perform floating-point regularization on the mantissa and the exponent after the multiplication operation to obtain a regularized exponent result and a regularized mantissa result, and use the regularized exponent result and the regularized mantissa result as the exponent after the multiplication operation and the mantissa after the multiplication operation.

Clause A32. The multiplier of any one of clauses A1-A31, further comprising:

a rounding unit configured to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain a rounded mantissa, and to use the rounded mantissa as the mantissa after the multiplication operation.
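The regularization and rounding of clauses A31 and A32 can be sketched together as follows. The sketch assumes that both operands are normalized p-bit significands with the leading bit set and that the rounding mode is round-to-nearest, ties-to-even; the clauses allow other rounding modes, and the function name is hypothetical.

```python
def normalize_and_round(product, exponent, p):
    """Regularize and round the product of two normalized p-bit significands.

    product: integer in [2**(2p-2), 2**(2p)), i.e. a value in [1, 4).
    Returns (p-bit significand with leading bit set, adjusted exponent).
    Requires p >= 2.
    """
    if product >> (2 * p - 1):      # value in [2, 4): shift one extra place
        exponent += 1
        shift = p
    else:                           # value in [1, 2)
        shift = p - 1
    kept = product >> shift
    remainder = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    # Round to nearest; on an exact tie, round to the even significand.
    if remainder > half or (remainder == half and (kept & 1)):
        kept += 1
        if kept >> p:               # rounding overflowed into a new bit
            kept >>= 1
            exponent += 1
    return kept, exponent

# 1.125 * 1.5 = 1.6875 is a tie between 1.625 and 1.75; it rounds to 1.75.
assert normalize_and_round(0b1001 * 0b1100, 0, 4) == (0b1110, 0)
```

The second exponent adjustment handles the case where rounding carries out of the significand, for example when a product just below 2 rounds up to exactly 2.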

Clause A33. A method of performing a floating-point multiplication operation using a multiplier, wherein

the mantissa processing unit of the multiplier is used to obtain the mantissa after the multiplication operation according to the mantissas of the floating-point numbers,

Clause A34. An integrated circuit chip comprising the multiplier of any one of clauses A1-A31.

Clause A35. A computing device comprising the multiplier of any one of clauses A1-A31 or the integrated circuit chip of clause A34.

The embodiments of the present disclosure have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the description of the above embodiments is intended only to aid understanding of the method and core idea of the present disclosure. Moreover, those of ordinary skill in the art may, based on the idea of the present disclosure, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present disclosure.

It should be understood that the terms "first", "second", "third", "fourth", and the like in the claims, specification, and drawings of the present disclosure are used to distinguish different objects rather than to describe a particular order. The terms "comprise" and "include" used in the specification and claims of the present disclosure indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the specification and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.

As used in this specification and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".

The embodiments of the present disclosure have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the description of the above embodiments is intended only to aid understanding of the method and core idea of the present disclosure. Moreover, any changes or variations that those of ordinary skill in the art make to the specific implementations and scope of application, based on the idea of the present disclosure, fall within the scope of protection of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.

100: Floating-point data format
102: Sign bit
104: Exponent bits
106: Mantissa bits
200, 300, 700: Multiplier
202, 720: Exponent processing unit
204: Mantissa processing unit
206, 722: Sign processing unit
308: Mode selection unit
312: Partial product operation unit
314: Partial product summation unit
316: Control circuit
318, 716: Regularization unit
320: Rounding unit
322: XOR logic circuit
324: Normalization processing unit
400: Mantissa processing unit operation
402: Booth encoding circuit
404: Partial product generation circuit
406: Wallace compressor
408, 714: Adder
500: Partial products
600: Operation flow and schematic block diagram of the Wallace tree compressor
702: Mode selection unit
704: Normalization processing unit
706: Bit-width extension circuit
708: Booth encoder
710: Partial product generation circuit
712: Wallace tree compressor
800: Method of performing a floating-point multiplication operation using a multiplier
900: Combined processing device
902: Computing device
904: Universal interconnect interface
906: Other processing devices
908: Storage device
1000: Board card
1002: Chip
1004: Memory device
1006: Interface device
1008: Control device
1010: Storage unit
1012: External device
S802-S806: Steps

[Fig. 1] A schematic diagram of a floating-point data format according to an embodiment of the present disclosure.
[Fig. 2] A schematic structural block diagram of a multiplier according to an embodiment of the present disclosure.
[Fig. 3] A structural block diagram showing further details of the multiplier according to an embodiment of the present disclosure.
[Fig. 4] A schematic block diagram of a mantissa processing unit according to an embodiment of the present disclosure.
[Fig. 5] A schematic diagram of a partial product operation according to an embodiment of the present disclosure.
[Fig. 6] An operation flow and schematic block diagram of a Wallace tree compressor according to an embodiment of the present disclosure.
[Fig. 7] An overall schematic block diagram of a multiplier according to an embodiment of the present disclosure.
[Fig. 8] A flowchart of a method of performing a floating-point multiplication operation using a multiplier according to an embodiment of the present disclosure.
[Fig. 9] A structural diagram of a combined processing device according to an embodiment of the present disclosure.
[Fig. 10] A schematic structural diagram of a board card according to an embodiment of the present disclosure.


Claims

A multiplier for performing a multiplication operation on floating-point numbers, the multiplier comprising: a mantissa processing unit configured to obtain the mantissa after the multiplication operation according to the mantissas of the floating-point numbers, the mantissa processing unit comprising a control circuit configured to invoke the mantissa processing unit multiple times when the mantissa bit width of at least one of two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time; and an exponent processing unit configured to obtain the exponent after the multiplication operation according to the exponents of the two floating-point numbers, the exponent processing unit comprising a second control circuit configured to determine, according to the exponent bit width of one of the two floating-point numbers and one of the two bit widths supported by the exponent processing unit, or according to the exponent bit widths of the two floating-point numbers and the two bit widths supported by the exponent processing unit, that the exponent processing unit is to be invoked multiple times to obtain the exponent after the multiplication operation.

The multiplier of claim 1, wherein the two floating-point numbers comprise a first floating-point number and a second floating-point number; the mantissa processing unit supports a first bit width and a second bit width; the mantissa of the first floating-point number serves as a first input corresponding to the first bit width, and the mantissa of the second floating-point number serves as a second input corresponding to the second bit width; the bit width of the first input is less than or equal to the first bit width; and the control circuit is configured to invoke the mantissa processing unit multiple times to obtain the mantissa after the multiplication operation when the bit width of the second input is greater than the second bit width.

The multiplier of claim 1, wherein the two floating-point numbers comprise a first floating-point number and a second floating-point number; the mantissa processing unit supports a first bit width and a second bit width; the mantissa of the first floating-point number serves as a first input corresponding to the first bit width, and the mantissa of the second floating-point number serves as a second input corresponding to the second bit width; and the control circuit is configured to invoke the mantissa processing unit multiple times to obtain the mantissa after the multiplication operation when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, when the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, or when the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width.

The multiplier of claim 3, wherein, when the mantissa bit width of the first floating-point number is less than the mantissa bit width of the second floating-point number and the first bit width is greater than the second bit width, or when the mantissa bit width of the first floating-point number is greater than the mantissa bit width of the second floating-point number and the first bit width is less than the second bit width, the control circuit selects the mantissa of the first floating-point number as the second input corresponding to the second bit width and selects the mantissa of the second floating-point number as the first input corresponding to the first bit width.

The multiplier of claim 4, wherein, when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, the control circuit determines, according to the bit width of the first input and the first bit width, the number of times the mantissa processing unit is invoked and the data input to the mantissa processing unit in each invocation.

The multiplier of claim 4, wherein, when the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, the control circuit determines, according to the bit width of the second input and the second bit width, the number of times the mantissa processing unit is invoked and the data input to the mantissa processing unit in each invocation.

The multiplier of claim 4, wherein, when the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width, the control circuit determines, according to the bit width of the first input and the first bit width together with the bit width of the second input and the second bit width, the number of times the mantissa processing unit is invoked and the data input to the mantissa processing unit in each invocation.

The multiplier of any one of claims 2 to 7, wherein the mantissa processing unit further comprises a shift-add circuit configured to obtain the mantissa after the multiplication operation from the mantissa results obtained in each invocation of the mantissa processing unit.

The multiplier of claim 8, wherein the shift-add circuit comprises a shifter, an intermediate memory, and an adder, and, when the control circuit invokes the mantissa processing unit multiple times: after the first invocation, the shifter shifts the mantissa result obtained in the first invocation to obtain a shifted mantissa result and stores the shifted mantissa result in the intermediate memory; from the second invocation onward, the shifter shifts the mantissa result obtained in the current invocation to obtain a current mantissa result, and the adder adds the current mantissa result to the result stored in the intermediate memory and stores the sum in the intermediate memory to update it; and the result stored in the intermediate memory after the last invocation serves as the mantissa after the multiplication operation.

The multiplier of claim 1, wherein the two floating-point numbers comprise a first floating-point number and a second floating-point number; the exponent processing unit supports a third bit width and a fourth bit width; the exponent of the first floating-point number serves as a third input corresponding to the third bit width, and the exponent of the second floating-point number serves as a fourth input corresponding to the fourth bit width; the bit width of the third input is less than or equal to the third bit width; and the second control circuit is configured to invoke the exponent processing unit multiple times to obtain the exponent after the multiplication operation when the bit width of the fourth input is greater than the fourth bit width.

The multiplier of claim 1, wherein the two floating-point numbers include a first floating-point number and a second floating-point number, the exponent processing unit supports a third bit width and a fourth bit width, the exponent of the first floating-point number is used as a third input corresponding to the third bit width, the exponent of the second floating-point number is used as a fourth input corresponding to the fourth bit width, and the second control circuit is configured to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation when: the bit width of the third input is greater than the third bit width and the bit width of the fourth input is less than or equal to the fourth bit width; the bit width of the fourth input is greater than the fourth bit width and the bit width of the third input is less than or equal to the third bit width; or the bit width of the third input is greater than the third bit width and the bit width of the fourth input is greater than the fourth bit width.

The multiplier of claim 11, wherein when the exponent bit width of the first floating-point number is smaller than the exponent bit width of the second floating-point number and the third bit width is greater than the fourth bit width, or when the exponent bit width of the first floating-point number is greater than the exponent bit width of the second floating-point number and the third bit width is smaller than the fourth bit width, the second control circuit selects the exponent of the first floating-point number as the fourth input corresponding to the fourth bit width and selects the exponent of the second floating-point number as the third input corresponding to the third bit width.

The multiplier of claim 12, wherein the second control circuit is configured to, when the bit width of the third input is less than or equal to the bit width of the fourth input and the third bit width is less than or equal to the fourth bit width, determine the number of times to call the exponent processing unit and the data input to the exponent processing unit in each call according to the bit width of the fourth input and the third bit width.

The multiplier of any one of claims 10 to 13, wherein the exponent processing unit further comprises a second shift-and-add circuit, the second shift-and-add circuit being configured to obtain the exponent after the multiplication operation from the exponent results obtained by each call of the exponent processing unit.

The multiplier of claim 1, wherein the mantissa processing unit comprises a partial product operation unit and a partial product summation unit, wherein the partial product operation unit is configured to obtain intermediate results according to the mantissas of the two floating-point numbers, and the partial product summation unit is configured to perform an addition operation on the intermediate results to obtain an addition result and to use the addition result as the mantissa after the multiplication operation.

The multiplier of claim 15, wherein the partial product operation unit comprises a Booth encoding circuit, and the Booth encoding circuit is configured to perform Booth encoding on the mantissa of the first floating-point number or the second floating-point number to obtain the intermediate results.
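
To make the Booth encoding step concrete, the sketch below recodes an unsigned mantissa into radix-4 Booth digits and forms one partial product per digit. The claim names Booth encoding only, so the radix-4 variant, the function names, and the use of signed Python integers in place of two's-complement hardware rows are assumptions made for illustration.

```python
# Radix-4 Booth table, indexed by the bit triple (b[2i+1], b[2i], b[2i-1]).
BOOTH_TABLE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_radix4_digits(multiplier, bits):
    """Recode an unsigned `bits`-wide value into digits in {-2..2}, LSB first."""
    n_digits = bits // 2 + 1   # extra digit so the top triple sees a zero MSB
    padded = multiplier << 1   # implicit zero below the least significant bit
    return [BOOTH_TABLE[(padded >> (2 * i)) & 0b111] for i in range(n_digits)]


def booth_partial_products(multiplicand, multiplier, bits):
    """One signed partial product per Booth digit, already weighted by 4**i."""
    digits = booth_radix4_digits(multiplier, bits)
    return [(d * multiplicand) << (2 * i) for i, d in enumerate(digits)]
```

For example, the 4-bit value 13 recodes to the digits 1, -1, 1 (that is, 1 - 4 + 16), so a 4-bit multiplier yields three partial products instead of four, and their sum equals the ordinary product.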

The multiplier of claim 16, wherein the partial product summation unit includes an adder for adding the intermediate results to obtain the addition result.

The multiplier of claim 16, wherein the partial product summation unit comprises a Wallace tree and an adder, wherein the Wallace tree is configured to sum the intermediate results to obtain a second intermediate result, and the adder is configured to add the second intermediate result to obtain the addition result.
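
The division of work between the Wallace tree and the final adder can be illustrated with word-level carry-save addition. This is a simplified sketch under stated assumptions: it applies 3:2 compressors to whole non-negative integers, whereas the claims describe per-bit Wallace trees, and the names `carry_save_add` and `wallace_reduce` are hypothetical.

```python
def carry_save_add(x, y, z):
    """Bitwise 3:2 compression: returns (s, c) such that x + y + z == s + c."""
    s = x ^ y ^ z                                # per-bit sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1       # per-bit carries, moved up one place
    return s, c


def wallace_reduce(operands):
    """Reduce non-negative operands to at most two rows with the same total."""
    rows = list(operands)
    while len(rows) > 2:
        next_rows = []
        for i in range(0, len(rows) - 2, 3):     # compress each full group of three
            s, c = carry_save_add(rows[i], rows[i + 1], rows[i + 2])
            next_rows += [s, c]
        next_rows += rows[len(rows) - len(rows) % 3:]  # leftover rows pass through
        rows = next_rows
    return rows


def sum_partial_products(operands):
    """Wallace-style reduction followed by one ordinary (carry-propagating) add."""
    return sum(wallace_reduce(operands))
```

The reduction layers never propagate a carry across a whole word; only the last addition of the two remaining rows does, which is the role the claim assigns to the adder.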

The multiplier of claim 17 or 18, wherein the adder comprises at least one of a full adder, a serial adder, and a carry-lookahead adder.

The multiplier of claim 18, wherein when the number of the intermediate results is less than M, a zero value is added as the intermediate result, so that the number of the intermediate results is equal to M, where M is a preset positive integer.

The multiplier of claim 20, wherein each of the Wallace trees has M inputs and N outputs, and the number of the Wallace trees is not less than K, where N is a preset positive integer less than M, and K is a positive integer not less than the maximum bit width of the intermediate results.

The multiplier of claim 21, wherein the partial product summation unit is configured to select one or more groups of the Wallace trees to sum the intermediate results, wherein each group has X Wallace trees, X being the number of bits of the intermediate results, the Wallace trees within each group have a sequential carry relationship, and the Wallace trees in different groups have no carry relationship.

The multiplier of claim 1, wherein the multiplier further comprises: a normalization processing unit configured to, when at least one of the two floating-point numbers is a denormalized non-zero floating-point number, normalize the at least one floating-point number to obtain the corresponding exponent and mantissa.
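
Normalizing a denormalized (subnormal) operand amounts to shifting its fraction left until a leading one appears and lowering the exponent by the same amount. The sketch below assumes IEEE 754 style encoding, in which a subnormal has an effective exponent of `1 - bias`; the function name and its parameters are hypothetical.

```python
def normalize_subnormal(fraction, frac_bits, bias):
    """Return (unbiased exponent, mantissa) with the leading one at bit frac_bits.

    The represented value is mantissa * 2**(exponent - frac_bits).
    """
    if fraction == 0:
        raise ValueError("zero cannot be normalized")
    exponent = 1 - bias          # assumed IEEE 754 subnormal exponent
    mantissa = fraction
    while mantissa >> frac_bits == 0:
        mantissa <<= 1           # move the leading one up by one place
        exponent -= 1            # and compensate in the exponent
    return exponent, mantissa
```

For the smallest half-precision subnormal (fraction 1, 10 fraction bits, bias 15), the result is exponent -24 with mantissa 1024, which represents 2^-24 as expected.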

The multiplier of claim 1, wherein the multiplier is configured to perform the multiplication operation of the two floating-point numbers according to an operation mode, the operation mode indicating the data formats of the two floating-point numbers, the mantissa processing unit is configured to obtain the mantissa after the multiplication operation according to the operation mode and the mantissas of the two floating-point numbers, and the exponent processing unit is configured to obtain the exponent after the multiplication operation according to the operation mode and the exponents of the two floating-point numbers.

The multiplier of claim 24, wherein the normalization processing unit is further configured to perform normalization processing on at least one floating-point number in the two floating-point numbers according to the operation mode to obtain a corresponding exponent and mantissa.

The multiplier of claim 25, wherein the data format includes at least one of half-precision floating-point numbers, single-precision floating-point numbers, brain floating-point numbers, double-precision floating-point numbers, and custom floating-point numbers.
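
For reference, the standard formats listed above differ only in how many bits they give to the exponent and to the fraction. The sketch below records those widths and splits a raw bit pattern into its fields; the mode names in `FORMATS` and the `unpack` helper are hypothetical, brain floating-point is taken to mean bfloat16, and custom formats are omitted because the claim does not define them.

```python
# (exponent bits, fraction bits); every format also has one sign bit.
FORMATS = {
    "FP16": (5, 10),   # half precision
    "BF16": (8, 7),    # brain floating-point (bfloat16)
    "FP32": (8, 23),   # single precision
    "FP64": (11, 52),  # double precision
}


def unpack(bits, fmt):
    """Split a raw bit pattern into (sign, biased exponent, fraction)."""
    exp_bits, frac_bits = FORMATS[fmt]
    fraction = bits & ((1 << frac_bits) - 1)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    sign = (bits >> (exp_bits + frac_bits)) & 1
    return sign, exponent, fraction


def bias(fmt):
    """Exponent bias of the format: 2**(exp_bits - 1) - 1."""
    return (1 << (FORMATS[fmt][0] - 1)) - 1
```

A mode-aware multiplier could use such a table to decide how many bits to route to the exponent and mantissa processing units for each operand.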

The multiplier of claim 16, wherein the mantissa processing unit includes a bit-extension circuit configured to extend the number of bits of the mantissa of at least one of the first floating-point number and the second floating-point number.

The multiplier of claim 1, wherein each floating-point number further comprises a sign, and the multiplier further comprises: a sign processing unit configured to obtain the sign after the multiplication operation according to the signs of the two floating-point numbers.

The multiplier of claim 28, wherein the sign processing unit includes an exclusive-OR logic circuit, and the exclusive-OR logic circuit is configured to perform an exclusive-OR operation on the signs of the two floating-point numbers to obtain the sign after the multiplication operation.
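
The exclusive-OR rule for the sign can be written in one line. In this small sketch the function name `product_sign` is hypothetical, and sign bits are assumed to be 0 for positive and 1 for negative.

```python
def product_sign(sign_a, sign_b):
    """Sign bit of a product: 1 (negative) only when the input signs differ."""
    return sign_a ^ sign_b
```

Two negative operands (1, 1) therefore give 0, a positive product, while mixed signs give 1.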

The multiplier of claim 24, further comprising a regularization unit configured to: perform floating-point regularization processing on the mantissa and exponent after the multiplication operation to obtain a regularized exponent result and a regularized mantissa result, and use the regularized exponent result and the regularized mantissa result as the exponent after the multiplication operation and the mantissa after the multiplication operation.

The multiplier of claim 30, further comprising: a rounding unit configured to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain a rounded mantissa, and to use the rounded mantissa as the mantissa after the multiplication operation.
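
Regularization and rounding of the raw mantissa product can be sketched as follows. The assumptions are significant: both input mantissas carry an explicit leading one and are `frac_bits + 1` bits wide, the exponent is unbiased, and only round-to-nearest-even is modeled even though the claim allows a selectable rounding mode. The function name is hypothetical.

```python
def regularize_and_round(product, exponent, frac_bits):
    """Return (exponent, mantissa) with the mantissa in [2**frac_bits, 2**(frac_bits + 1)).

    `product` is the exact product of two (frac_bits + 1)-bit mantissas with
    leading ones; `exponent` is the sum of the two unbiased exponents.
    """
    shift = frac_bits
    if product >> (2 * frac_bits + 1):       # product >= 2.0: move the point
        shift += 1
        exponent += 1
    mantissa = product >> shift
    remainder = product & ((1 << shift) - 1)  # bits dropped by the shift
    half = 1 << (shift - 1)
    # Round to nearest, ties to even (assumed rounding mode).
    if remainder > half or (remainder == half and mantissa & 1):
        mantissa += 1
        if mantissa >> (frac_bits + 1):      # rounding overflowed: renormalize
            mantissa >>= 1
            exponent += 1
    return exponent, mantissa
```

For single precision, 1.5 times 1.5 (mantissa 0xC00000 for both) yields exponent 1 and mantissa 0x900000, which is 1.125 times 2, or 2.25.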

A method for performing a floating-point multiplication operation using a multiplier, comprising: obtaining, by a mantissa processing unit of the multiplier, the mantissa after the multiplication operation according to the mantissas of two floating-point numbers, wherein the mantissa processing unit includes a control circuit, and the control circuit is configured to call the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time; and obtaining, by an exponent processing unit of the multiplier, the exponent after the multiplication operation according to the exponents of the two floating-point numbers, wherein the exponent processing unit includes a second control circuit, and the second control circuit is configured to determine, according to the exponent bit width of one of the two floating-point numbers and one of the two bit widths supported by the exponent processing unit, or according to the exponent bit widths of the two floating-point numbers and the two bit widths supported by the exponent processing unit, to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation.

An integrated circuit chip comprising the multiplier of any one of claims 1 to 31.

A computing device, comprising the multiplier of any one of claims 1-31 or the integrated circuit chip of claim 33.