CN115437602A - Arbitrary-precision calculation accelerator, integrated circuit device, board card and method - Google Patents


Info

Publication number
CN115437602A
Authority
CN
China
Prior art keywords
vector
operands
vectors
unit
inner product
Prior art date
Legal status
Pending
Application number
CN202210990132.8A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd
Priority to CN202210990132.8A
Publication of CN115437602A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57 Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7821 Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to an arbitrary-precision computing device, a method and a computer-readable storage medium. A plurality of operands are read from an off-chip memory; the plurality of operands are split into a plurality of vectors, the plurality of vectors comprising a first vector and a second vector; the inner product of the first vector and the second vector is computed according to their lengths to obtain an inner product result; the inner product results are integrated into a calculation result of the plurality of operands; and the calculation result is stored to the off-chip memory.

Description

Arbitrary-precision calculation accelerator, integrated circuit device, board card and method
The present application is a divisional application of application No. 202111221317.4, filed on October 20, 2021 and entitled "Arbitrary-precision computing accelerator, integrated circuit device, board card and method".
Technical Field
The present invention relates generally to the field of computers. More particularly, the present invention relates to arbitrary precision computing accelerators, integrated circuit devices, boards and methods.
Background
Arbitrary-precision computation, which uses any number of bits to represent operands, is of crucial importance in many areas of technology, such as supernova simulation, climate simulation, atomic simulation, artificial intelligence and planetary orbit computation. These fields require processing data of hundreds, even thousands or millions, of bits, which far exceeds the hardware capabilities of conventional processors.
Even when the prior art uses high-bit-width processors, it cannot handle the variable lengths required by arbitrary-precision operations, because the optimal bit width varies greatly between algorithms, and subtle differences in bit width can result in significant cost differences. The prior art has also proposed many techniques for improving computational efficiency at the architecture level, chiefly valid-only computation (which performs only the essential operations, skipping or eliminating invalid computations such as sparse and duplicate data) and approximate computation (which replaces the original exact computation with less accurate data, such as low-bit-width or quantized data). However, for valid-only computation, finding duplicate data is difficult and expensive; and approximate computation plainly contradicts the goal of arbitrary-precision computation, which requires exact computation to achieve higher accuracy. Finally, these prior art techniques inevitably incur a large number of inefficient memory accesses.
Therefore, an efficient arbitrary-precision computation scheme is urgently needed.
Disclosure of Invention
To at least partially solve the technical problems mentioned in the background, the present invention provides an arbitrary-precision computing accelerator, an integrated circuit device, a board card and a method.
In one aspect, the present disclosure provides a processing unit for computing the inner product of a first vector and a second vector, comprising a conversion unit, a plurality of inner product units and a synthesis unit. The conversion unit generates a plurality of pattern vectors according to the length and bit width of the first vector. Each inner product unit, using the data vectors of the second vector in the length direction as indexes, accumulates specific pattern vectors among the plurality of pattern vectors to form a unit accumulation sequence. The synthesis unit sums the unit accumulation sequences to obtain an inner product result.
In another aspect, the present invention discloses an arbitrary-precision computation accelerator connected to an off-chip memory, the arbitrary-precision computation accelerator comprising a core memory agent, a core controller and a processing array. The core memory agent reads a plurality of operands from the off-chip memory. The core controller splits the plurality of operands into a plurality of vectors, the plurality of vectors including a first vector and a second vector. The processing array includes a plurality of processing units for computing the inner product of the first vector and the second vector according to their lengths to obtain an inner product result. The core controller integrates the inner product result into a calculation result of the plurality of operands, and the core memory agent stores the calculation result in the off-chip memory.
In another aspect, the present invention discloses an integrated circuit device comprising the arbitrary-precision computing accelerator described above, a processing device and an off-chip memory. The processing device controls the arbitrary-precision computing accelerator, and the off-chip memory comprises an LLC, through which the arbitrary-precision computing accelerator is connected to the processing device.
In another aspect, the present invention discloses a board card including the integrated circuit device.
In another aspect, the present invention discloses a method for computing the inner product of a first vector and a second vector, comprising: generating a plurality of pattern vectors according to the length and bit width of the first vector; using the data vectors of the second vector in the length direction as indexes, accumulating specific pattern vectors among the plurality of pattern vectors to form a plurality of unit accumulation sequences; and summing the plurality of unit accumulation sequences to obtain an inner product result.
In another aspect, the present invention discloses a method for arbitrary-precision computation, comprising: reading a plurality of operands from an off-chip memory; splitting the plurality of operands into a plurality of vectors, including a first vector and a second vector; computing the inner product of the first vector and the second vector according to their lengths to obtain an inner product result; integrating the inner product result into a calculation result of the plurality of operands; and storing the calculation result to the off-chip memory.
In another aspect, the present invention discloses a computer readable storage medium having stored thereon computer program code for performing the aforementioned method when the computer program code is executed by a processing device.
The invention provides a scheme for arbitrary-precision computation that processes different bit streams in parallel and deploys a complete bit-serial datapath to perform high-precision computation flexibly. The invention makes full use of a simple hardware configuration and reduces repeated calculation, thereby achieving arbitrary-precision computation with low energy consumption.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description, which proceeds with reference to the accompanying drawings. In the drawings, several embodiments of the invention are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts. Wherein:
fig. 1 is a structural diagram showing a board card of the embodiment of the present invention;
FIG. 2 is a block diagram illustrating an integrated circuit device of an embodiment of the invention;
FIG. 3 is a schematic diagram showing the internal structure of a computing device of an embodiment of the invention;
FIG. 4 is a schematic diagram illustrating an exemplary multiplication operation;
FIG. 5 is a schematic diagram illustrating a conversion unit of an embodiment of the invention;
FIG. 6 is a schematic diagram showing a generation unit of an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating an inner product unit of an embodiment of the invention;
FIG. 8 is a schematic diagram showing a synthesis unit of an embodiment of the invention;
FIG. 9 is a schematic diagram showing a full adder group of an embodiment of the invention;
FIG. 10 is a flow chart illustrating arbitrary-precision calculation according to another embodiment of the present invention; and
FIG. 11 is a flow chart showing the computation of the inner product of a first vector and a second vector according to another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
It should be understood that the terms "first", "second", "third" and "fourth", etc. in the claims, the description and the drawings of the present invention are used for distinguishing different objects and are not used for describing a particular order. The terms "comprises" and "comprising," when used in the description and claims of the present invention, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims of this application, the singular form of "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this application refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
The following detailed description of the embodiments of the invention refers to the accompanying drawings.
Arbitrary-precision calculations play a key role in many areas of science and technology. For example, the seemingly trivial equation $x^3 + y^3 + z^3 = 3$ may require a precision of 200 bits or more to solve with a computer; in Ising theory, the calculation of the integral requires more than 1000 bits of precision; and calculating the volume of a knot complement in hyperbolic space involves up to 60000 bits of precision. A tiny precision error may cause a great difference in the calculation result, so arbitrary-precision computation is a serious technical subject in the computer field.
The invention provides an efficient arbitrary-precision computing accelerator architecture organized around the inner-product form of computation; it highlights the intra-operand parallelism (intra-parallel) and inter-operand parallelism (inter-parallel) of the accelerator architecture and realizes the multiplication of operands.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present invention. As shown in fig. 1, the board card 10 includes a chip 101, which is a System-on-Chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence arithmetic unit supporting various deep learning and machine learning algorithms and meeting the intelligent processing requirements of fields such as computer vision, speech, natural language processing and data mining under complex scenarios. Deep learning in particular is widely applied in the field of cloud intelligence; one notable characteristic of cloud intelligence applications is the large input data size, which places high demands on the storage capacity and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 also includes a memory device 104 for storing data, which includes one or more memory units 105. The memory device 104 is connected to the control device 106 and the chip 101 through a bus for data transfer. The control device 106 in the board card 10 is configured to regulate the state of the chip 101. For this purpose, in one application scenario, the control device 106 may include a microcontroller unit (MCU).
Fig. 2 is a structural diagram of the combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combined processing device includes a computing device 201, a processing device 202, an off-chip memory 203, a communication node 204 and an interface device 205. In this embodiment, several integration schemes can coordinate the operation of the computing device 201, the processing device 202 and the off-chip memory 203: fig. 2A shows an LLC integration scheme, fig. 2B an SoC integration scheme, and fig. 2C an IO integration scheme.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a multi-core intelligent processor performing deep learning or machine learning computations; it may interact with the processing device 202 to jointly complete the user-specified operations. The computing device 201 includes the arbitrary-precision computing accelerator described above for processing linear computations, more particularly the operand multiplication operations used in convolution.
The processing device 202, as a general-purpose processor, performs basic control including, but not limited to, data handling and the starting and/or stopping of the computing device 201, as well as non-linear calculations. Depending on the implementation, the processing device 202 may be one or more types of processor: a central processing unit (CPU), a graphics processing unit (GPU), or another general-purpose and/or special-purpose processor including, but not limited to, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined according to actual needs. Viewed together, the computing device 201 and the processing device 202 form a heterogeneous multi-core structure.
The off-chip memory 203 stores data to be processed and data already processed. Its hierarchy can be divided into a first-level cache (L1), a second-level cache (L2), a third-level cache (L3, also called the LLC), and physical memory. The physical memory is DDR, typically 16 GB or more in size. When the computing device 201 or the processing device 202 wants to read data from the off-chip memory 203, L1 is usually accessed first since it is the fastest; if the data is not in L1, L2 is accessed; if not in L2, L3 is accessed; and if not in L3, finally DDR is accessed. The cache hierarchy of the off-chip memory 203 speeds up data access by keeping the most frequently accessed data in the caches. DDR is rather slow compared to cache: as the level increases (L1 → L2 → LLC → DDR), access latency grows higher and higher, but the memory space grows larger and larger.
The communication node 204 is a routing node or a router in a network-on-chip (NoC), and when the computing device 201 or the processing device 202 generates a data packet, the data packet is sent to the communication node 204 through a specific interface, and the communication node 204 reads address information in a header flit of the data packet and calculates an optimal routing path by using a specific routing algorithm, so as to establish a reliable transmission path to send the data packet to a destination node (e.g., the off-chip memory 203). Similarly, when the computing device 201 or the processing device 202 needs to read a packet from the off-chip memory 203, the communication node 204 also calculates the optimal routing path to route the packet from the off-chip memory 203 to the computing device 201 or the processing device 202.
The interface device 205 is the external input/output interface of the combined processing device. When the combined processing device exchanges information with an external device, because external devices vary in type and each has different requirements on the transmitted information, the interface device 205, according to the requirements of the sender and receiver of the data transmission, provides data buffering to resolve mismatches caused by the speed difference between the two sides, signal level conversion, information conversion logic to satisfy the respective formats, and timing control circuits to synchronize the work of sender and receiver and to provide address transcoding, among other tasks.
LLC integration in fig. 2A refers to the computing device 201 and the processing device 202 being in communication via the LLC, and SoC integration in fig. 2B is the integration of the computing device 201, the processing device 202, and the off-chip memory 203 via the communication node 204. The IO integration of fig. 2C is to integrate the computing device 201, the processing device 202 and the off-chip memory 203 through the interface device 205. These 3 integration modes are only examples, and the present invention is not limited to the integration mode.
This embodiment preferably selects the LLC integration scheme. The core of deep learning and machine learning is the convolution operator; the convolution operator is based on the inner product operation, and the inner product operation is a combination of multiplication and addition, so the main task of the computing device 201 is a large number of low-level operations such as multiplication and addition. When training and inference of a neural network model are executed, intensive interaction is required between the computing device 201 and the processing device 202; integrating them at the LLC and sharing data through the LLC achieves a lower interaction cost. Furthermore, since high-precision data may have millions of bits while L1 and L2 have limited capacity, interacting through L1 and L2 could run out of capacity. The computing device 201 therefore exploits the relatively large capacity of the LLC to cache high-precision data and save time on repeated accesses.
Fig. 3 shows a schematic diagram of an internal structure of the computing apparatus 201, which includes a core memory agent 301, a core controller 302, and a processing array 303.
The core memory agent 301 serves as the manager through which the computing device 201 accesses the off-chip memory 203. When the core memory agent 301 reads operands from the off-chip memory 203, the start address of each operand is set in the core memory agent 301, and the core memory agent 301 reads the plurality of operands serially, simultaneously and continuously by incrementing addresses, proceeding slice by slice from the low-order bits of the operands to the high-order bits. For example, when 3 operands need to be read, the lowest 512 bits of the first operand are read serially according to its start address, then the lowest 512 bits of the second operand, and then the lowest 512 bits of the third operand; after the lowest bits have been read, the addresses are incremented (by 512 bits) and the next-lowest 512 bits of each operand are read serially in turn, continuing in this manner until the highest bits of the 3 operands have been read. When the core memory agent 301 stores calculation results back to the off-chip memory 203, the results are sent in parallel: for example, if the core memory agent 301 needs to send 3 calculation results to the off-chip memory 203, the lowest-order bits of the 3 results are sent simultaneously, then the next-lowest-order bits, and so on until the highest-order bits of the 3 results are sent simultaneously. Typically, these operands are represented in matrix or vector form.
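As an illustration of this access pattern, the interleaved low-to-high read order can be modeled as below; this is a sketch only, in which the 512-bit word size follows the example above, byte addressing is assumed, and all function and variable names are invented:

```python
# Sketch of the interleaved read order described above (assumptions: a
# 512-bit memory word, byte addressing; names are illustrative).
def read_order(start_addrs, operand_bits, word_bits=512):
    """Yield (operand_index, byte_address): one low-order word of every
    operand is read before advancing all operands to the next word."""
    words = (operand_bits + word_bits - 1) // word_bits
    for w in range(words):                       # low-order words first
        for i, base in enumerate(start_addrs):   # operand after operand
            yield i, base + w * (word_bits // 8)

# Reading 3 operands of 2048 bits each: word 0 of operands 0, 1, 2,
# then word 1 of operands 0, 1, 2, and so on.
schedule = list(read_order([0x1000, 0x8000, 0xF000], 2048))
assert schedule[:3] == [(0, 0x1000), (1, 0x8000), (2, 0xF000)]
```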
The core controller 302 splits each operand into a plurality of data segments, that is, a plurality of vectors, based on the computing capability and the number of processing units in the processing array 303, so that the core memory agent 301 sends data to the processing array 303 in units of data segments.
The processing array 303 performs the multiplication of two operands. For example, the first operand may be split into 8 data segments $x_0$ to $x_7$, and the second operand into 4 data segments $y_0$ to $y_3$; when the first operand and the second operand are multiplied, the computation expands as shown in fig. 4. The processing array 303 performs inner product calculations on the split first and second operands, and shifts, aligns and sums the intermediate results 401, 402, 403 and 404 to obtain the calculation result of the multiplication.

For clarity, the data segments are collectively referred to as vectors, and the multiplication of two data segments is the inner product of two vectors (a first vector and a second vector), where the first vector comes from the first operand and the second vector from the second operand.

The processing array 303 includes a plurality of processing units 304 arranged in an array; 4 × 8 processing units 304 are shown in the figure by way of example, and the invention does not limit their number. Each processing unit 304 computes the inner product of the first vector and the second vector according to their lengths to obtain an inner product result. Finally, the core controller 302 controls the integration (or reduction) of the inner product results into a calculation result of the plurality of operands and sends it to the core memory agent 301, and the core memory agent 301 stores the calculation result in the off-chip memory 203.
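The expansion of fig. 4 can be modeled in software as follows. This is a minimal sketch under assumed parameters (4-bit segments, 8 segments for the first operand and 4 for the second, matching the example above); the function names are ours, not the accelerator's:

```python
# Minimal model of FIG. 4: split the operands into s-bit segments, form the
# intermediate results 401-404 (one per y_j), then shift, align and sum them
# (operation 405).
def split_segments(value, seg_bits, n_segs):
    mask = (1 << seg_bits) - 1
    return [(value >> (seg_bits * i)) & mask for i in range(n_segs)]  # low first

def multiply_by_segments(a, b, seg_bits=4, na=8, nb=4):
    xs = split_segments(a, seg_bits, na)         # x0 .. x7
    ys = split_segments(b, seg_bits, nb)         # y0 .. y3
    result = 0
    for j, y in enumerate(ys):                   # one intermediate result per y_j
        partial = sum(x * y << (seg_bits * i) for i, x in enumerate(xs))
        result += partial << (seg_bits * j)      # shift-align, then sum (405)
    return result

assert multiply_by_segments(0xDEADBEEF, 0xBEEF) == 0xDEADBEEF * 0xBEEF
```

Each `partial` corresponds to one of the intermediate results 401 to 404, and the outer shift and addition realize the alignment and summation 405.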
Specifically, the computing device 201 employs a recursive decomposition algorithm for control. When the computing device 201 receives an instruction from the processing device 202 to perform an arbitrary-precision calculation, the core controller 302 evenly splits the operands of the multiplication into a plurality of vectors and sends them to the processing array 303 for calculation; each processing unit 304 is responsible for the calculation of one set of vectors, for example the inner product of a first vector and a second vector. In this embodiment, each processing unit 304 further splits a set of vectors into smaller inner-product calculation units based on its own hardware resources to facilitate the inner product calculation. The computing device 201 employs multiple bit streams on the data path: each operand is imported from the core memory agent 301 into the processing array 303 at a rate of 1 bit per cycle, but multiple operands are transferred in parallel at the same time. After the computation finishes, the processing unit 304 sends the inner product result to the core memory agent 301 in a bit-serial manner.
As the core computing unit of the computing device 201, the main task of the processing unit 304 is inner product computation. The flow of the bit-indexed vector inner product in the processing unit 304 is divided into 3 stages: the first is the pattern generation stage, the second the pattern indexing stage, and the third the weighted synthesis stage.
Take a first vector $\vec{x}$ and a second vector $\vec{y}$ as an example, and assume that their sizes are $N \times p_x$ and $N \times p_y$ respectively, where $N$ is the length of the first vector $\vec{x}$ and the second vector $\vec{y}$ (more precisely, the number of row elements), $p_x$ is the bit width of the first vector $\vec{x}$, and $p_y$ is the bit width of the second vector $\vec{y}$. In this embodiment, to compute the inner product of the first vector $\vec{x}$ and the second vector $\vec{y}$, the first vector $\vec{x}$ is first transposed and then multiplied with the second vector $\vec{y}$, i.e. $(p_x \times N) \cdot (N \times p_y)$, to generate a $p_x \times p_y$ inner product.
This embodiment decomposes the second vector $\vec{y}$ as:

$$\vec{y} = K \cdot B_{col} \cdot C$$

where $K$ is a constant binary matrix of size $N \times 2^N$, $B_{col}$ is an index matrix of size $2^N \times p_y$, and $C$ is a weighting vector of length $p_y$.
The arrangement of elements of the first vector $\vec{x}$ in the length direction has $2^N$ possible patterns in total. Take $N = 2$ as an example, i.e. the first vector $\vec{x}$ has length 2. According to the length of the first vector $\vec{x}$, $K$ is divided into $2^N$ unit vectors, arranging all possible unit vectors of length 2; $K$ therefore has size $2 \times 2^2$ and covers all possibilities of combining two binary elements, namely the 4 possibilities $\binom{0}{0}$, $\binom{0}{1}$, $\binom{1}{0}$ and $\binom{1}{1}$, so the fixed form of $K$ is:

$$K = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \end{pmatrix}$$

In other words, once the lengths of the first vector $\vec{x}$ and the second vector $\vec{y}$ are determined, the size and element values of $K$ are determined.
$B_{col}$ is a one-hot matrix: each column has exactly one element equal to 1 and the remaining elements are 0, and which element is 1 depends on which column of $K$ the corresponding column of the second vector $\vec{y}$ matches. For ease of explanation, let the second vector $\vec{y}$ be, for example:

$$\vec{y} = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \end{pmatrix}$$

Comparing the second vector $\vec{y}$ with $K$, one finds that the first column $\binom{1}{1}$ of $\vec{y}$ is the fourth column of $K$, the second column $\binom{1}{0}$ of $\vec{y}$ is the third column of $K$, the third column $\binom{1}{1}$ of $\vec{y}$ is the fourth column of $K$, and the fourth column $\binom{0}{0}$ of $\vec{y}$ is the first column of $K$. Hence, when the second vector $\vec{y}$ is expressed by $K \cdot B_{col}$, $B_{col}$ is the $2^2 \times 4$ index matrix:

$$B_{col} = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \end{pmatrix}$$

In the first column of $B_{col}$ only the fourth element is 1, indicating that the first column of $\vec{y}$ is the fourth column of $K$; in the second column only the third element is 1, indicating that the second column of $\vec{y}$ is the third column of $K$; in the third column only the fourth element is 1, indicating that the third column of $\vec{y}$ is the fourth column of $K$; and in the fourth column only the first element is 1, indicating that the fourth column of $\vec{y}$ is the first column of $K$. In short, once $K$ is determined, the element values of $B_{col}$ are also determined.
$C$ is a weighting vector of length $p_y$, reflecting the bit significance of the second vector $\vec{y}$, i.e. its bit width. Since $p_y$ is 4, meaning the bit width of the second vector $\vec{y}$ is 4, $C$ is:

$$C = \begin{pmatrix} 2^3 \\ 2^2 \\ 2^1 \\ 2^0 \end{pmatrix}$$
This embodiment decomposes the second vector $\vec{y}$ in the manner described above so that the elements of $\vec{y}$ can be expressed by the two binary matrices $K$ and $B_{col}$. In other words, this embodiment converts the operation $\vec{x}^T \cdot \vec{y}$ into $\vec{x}^T \cdot K \cdot B_{col} \cdot C$.
The processing unit 304 performs the vector inner product based on the above conversion into $\vec{x}^T \cdot K \cdot B_{col} \cdot C$. In the pattern generation stage, the processing unit 304 obtains $\vec{x}^T \cdot K$, i.e. generates the pattern vectors $\vec{z}$. In the pattern indexing stage, the processing unit 304 computes $(\vec{x}^T \cdot K) \cdot B_{col}$. In the weighted synthesis stage, the processing unit 304 accumulates the indexed patterns according to the weights $C$. With this design, no matter how high the precision of the operands is, they can be converted into an index pattern before the inner product is performed, reducing duplicate calculations and avoiding the high-bandwidth requirement of arbitrary-precision computation.
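The decomposition can be checked numerically. The sketch below assumes the most-significant-first column ordering used in the example above and uses numpy purely for illustration:

```python
import numpy as np

# Numerical check of y = K . B_col . C for the N = 2, p_y = 4 example
# (bit ordering assumed most significant column first).
K = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])                  # all unit vectors of length 2

Y = np.array([[1, 1, 1, 0],                   # bit matrix of the second vector;
              [1, 0, 1, 0]])                  # its columns are K's cols 4, 3, 4, 1

B_col = np.zeros((4, 4), dtype=int)           # one-hot index matrix
for j in range(Y.shape[1]):
    match = np.flatnonzero((K == Y[:, [j]]).all(axis=0))[0]
    B_col[match, j] = 1                       # which column of K equals Y[:, j]

C = np.array([8, 4, 2, 1])                    # weighting vector (powers of two)

assert (K @ B_col == Y).all()                 # K . B_col reproduces the bits
x = np.array([5, 9])                          # an arbitrary first vector
assert x @ K @ B_col @ C == x @ Y @ C         # x^T.y == x^T.K.B_col.C
```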
Fig. 3 further shows a schematic of the structure of the processing unit 304. To implement the aforementioned 3 stages, the processing unit 304 includes a processing unit memory agent unit 305, a processing unit control unit 306, a conversion unit 307, a plurality of inner product units 308, and a synthesis unit 309.
The processing unit memory agent unit 305, as the interface through which the processing unit 304 accesses the core memory agent 301, receives the two vectors whose inner product is to be computed, such as the first vector $\vec{x}$ and the second vector $\vec{y}$. The processing unit control unit 306 coordinates and manages the operation of the various units in the processing unit 304.
The conversion unit 307 implements the pattern generation stage. It receives the first vector $\vec{x}$ from the processing unit memory agent unit 305 and implements the binary matrix $K$ in hardware, performing $\vec{x}^T \cdot K$ to generate the plurality of pattern vectors $\vec{z}$. Fig. 5 shows a schematic diagram of the conversion unit 307, which comprises $N$ bit-stream inputs 501, a generating component 502 and $2^N$ bit-stream outputs 503.
The $N$ bit-stream inputs 501 receive the $N$ data vectors corresponding to the length direction of the first vector $\vec{x}$. Fig. 5 shows a first vector $\vec{x}$ of length 4, comprising the 4 data vectors $x_0$, $x_1$, $x_2$, $x_3$, each of bit width $p_x$, i.e. each data vector has $p_x$ bits.

The generating component 502 is the core element that executes $\vec{x}^T \cdot K$. Corresponding to the $2^N$ unit vectors of $K$, the generating component 502 includes $2^N$ generating units, each simulating one unit vector, to generate the $2^N$ pattern vectors $\vec{z}$. As shown in fig. 5, the first vector $\vec{x}$ is split into the 4 data vectors $x_0$, $x_1$, $x_2$, $x_3$, which are input in parallel from the left side of the generating component 502. Since the inner product in binary is simply a bitwise addition, the generating component 502 directly implements all the unit vectors of $K$ in hardware and adds the bits of $x_0$, $x_1$, $x_2$, $x_3$ in sequence. In more detail, one bit of each of $x_0$, $x_1$, $x_2$, $x_3$ is input simultaneously in each cycle, aligned by bit position: the first cycle simultaneously inputs the lowest bits of $x_0$, $x_1$, $x_2$, $x_3$, the second cycle the next-lowest bits, and so on until the $p_x$-th cycle simultaneously inputs the most significant bits. The required bandwidth is thus only $N$ bits per cycle; in this example, only 4 bits per cycle.
Since the length of the first vector $\vec{x}$ is 4, the generating component 502 includes 16 generating units, simulating the 16 unit vectors of $K$, namely (0000), (0001), (0010), (0011), (0100), (0101), (0110), (0111), (1000), (1001), (1010), (1011), (1100), (1101), (1110) and (1111).
Fig. 6 shows a schematic diagram of the generating unit 504 for the unit vector (1011). Since it simulates the unit vector (1011), the generating unit 504 includes 3 bit registers 601, an adder 602 and a carry register 603. The 3 bit registers 601 receive and buffer the bit values of the data vectors selected by the simulated unit vector, i.e. the bit values of $x_0$, $x_1$ and $x_3$, while the bit value of $x_2$ is directly ignored, so that this structure implements:

$$z_{11} = x_0 + x_1 + x_3$$

The values in the registers 601 are sent to the adder 602 for accumulation; if a carry occurs after the accumulation, the carry value is buffered in the carry register 603 and added to the bits of $x_0$, $x_1$, $x_3$ input in the next cycle, and so on until the $p_x$-th cycle adds the most significant bits of the data vectors. Every generating unit follows the same design logic; based on the structure of the generating unit 504 for the unit vector (1011) in fig. 6, those skilled in the art can readily derive the structures of the other generating units without creative effort, so further description is omitted. Note that some generating units, such as those simulating the unit vectors (0000), (0001), (0010), (0100) and (1000), have at most one input per cycle, need no adder 602 or carry register 603, and produce no carry even during the addition.
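A behavioral sketch of such a generating unit follows, here for the unit vector (1011), i.e. $z_{11} = x_0 + x_1 + x_3$: one bit of each selected data vector is consumed per cycle and the carry register is modeled explicitly (names are illustrative):

```python
# Behavioral sketch of generating unit 504 (unit vector (1011)): each cycle
# one bit of x0, x1 and x3 arrives; they are added to the buffered carry,
# one result bit of z11 is emitted, and the remaining carry is kept.
def generation_unit_1011(x0_bits, x1_bits, x3_bits):
    carry = 0                                   # models carry register 603
    out = []
    for b0, b1, b3 in zip(x0_bits, x1_bits, x3_bits):   # LSB arrives first
        s = b0 + b1 + b3 + carry                # models adder 602
        out.append(s & 1)                       # one output bit per cycle
        carry = s >> 1
    while carry:                                # flush the remaining carry
        out.append(carry & 1)
        carry >>= 1
    return out                                  # bit-serial z11 = x0 + x1 + x3

def lsb_first_bits(v, width):
    return [(v >> i) & 1 for i in range(width)]

z11 = generation_unit_1011(lsb_first_bits(11, 4), lsb_first_bits(7, 4),
                           lsb_first_bits(13, 4))
assert sum(b << i for i, b in enumerate(z11)) == 11 + 7 + 13
```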
Returning to fig. 5, the $2^N$ bit-stream outputs 503 are connected to the outputs of the adders 602 of the generating units to output the $2^N$ pattern vectors $\vec{z}$. In fig. 5, since $N$ is 4, the 16 bit-stream outputs 503 output 16 pattern vectors in total. The length of each pattern vector is $p_x$ (if the most significant bits produce no carry) or $p_x + 1$ (if the most significant bits produce a carry). As can be seen from fig. 5, the pattern vectors $\vec{z}$ are all the addition combinations of $x_0$, $x_1$, $x_2$, $x_3$, namely:
$z_0 = 0$
$z_1 = x_0$
$z_2 = x_1$
$z_3 = x_0 + x_1$
$z_4 = x_2$
$z_5 = x_0 + x_2$
$z_6 = x_1 + x_2$
$z_7 = x_0 + x_1 + x_2$
$z_8 = x_3$
$z_9 = x_0 + x_3$
$z_{10} = x_1 + x_3$
$z_{11} = x_0 + x_1 + x_3$
$z_{12} = x_2 + x_3$
$z_{13} = x_0 + x_2 + x_3$
$z_{14} = x_1 + x_2 + x_3$
$z_{15} = x_0 + x_1 + x_2 + x_3$
The pattern vectors $\vec{z}$ are sent to the inner product units 308. Each inner product unit 308 corresponds to one processor core and implements the pattern indexing stage and the weighted synthesis stage; the invention does not limit the number of inner product units 308. An inner product unit 308 receives the second vector $\vec{y}$ from the processing unit memory agent unit 305, uses the data vectors of the second vector $\vec{y}$ in the length direction as indexes, selects from all the pattern vectors $\vec{z}$ the specific pattern vector corresponding to each index, and accumulates the specific pattern vectors, generating one bit of an intermediate result per cycle; the intermediate results generated over $p_x$ or $p_x + 1$ cycles form a unit accumulation sequence. This performs the operation $(\vec{x}^T \cdot K) \cdot B_{col} \cdot C$.
Fig. 7 shows a schematic diagram of the inner product unit 308 of this embodiment. To realize $(\vec{x}^T \cdot K) \cdot B_{col} \cdot C$, the inner product unit 308 includes $p_y$ multiplexers 701 and $p_y - 1$ serial full adders 702.
The $p_y$ multiplexers 701 implement the pattern indexing stage. Each multiplexer 701 receives all the pattern vectors $\vec{z}$ ($z_0$ to $z_{15}$) and, according to a same-bit data vector of the second vector $\vec{y}$ in the length direction, passes one specific pattern vector out of all the pattern vectors. Since the length of the second vector $\vec{y}$ is $N$, the second vector $\vec{y}$ can be decomposed into $N$ data vectors; as $N$ is 4, the second vector $\vec{y}$ decomposes into the 4 data vectors $y_0$, $y_1$, $y_2$, $y_3$, each of bit width $p_y$. Viewed by bit position, these data vectors can therefore be regrouped into $p_y$ same-bit data vectors. For example, the highest bits of the 4 data vectors $y_0$, $y_1$, $y_2$, $y_3$ form the highest-order same-bit data vector 703, their next-highest bits form the next-highest-order same-bit data vector 704, and so on; the lowest bits of $y_0$, $y_1$, $y_2$, $y_3$ form the lowest-order same-bit data vector 705.

A multiplexer 701 determines which unit vector of the binary matrix $K$ equals the input same-bit data vector and outputs the specific pattern vector corresponding to that unit vector. For example, the highest-order same-bit data vector 703 is input as the select signal of the first multiplexer; assuming the highest-order same-bit data vector 703 is (0101), which equals the unit vector 505 in fig. 5, the first multiplexer outputs the specific pattern vector $z_5$ corresponding to the unit vector 505. Likewise, the next-highest-order same-bit data vector 704 is input as the select signal of the second multiplexer; assuming it is (0010), which equals the unit vector 506 in fig. 5, the second multiplexer outputs the specific pattern vector $z_2$ corresponding to the unit vector 506. Finally, the lowest-order same-bit data vector 705 is input as the select signal of the $p_y$-th multiplexer; assuming it is (1110), which equals the unit vector 507 in fig. 5, the $p_y$-th multiplexer outputs the specific pattern vector $z_{14}$ corresponding to the unit vector 507. This completes the operation $(\vec{x}^T \cdot K) \cdot B_{col}$.
Serial full adder 702 implements the weighted synthesis stage. p is a radical of formula y -1 serial full adders 702 are connected serially in the manner shown, the receiving multiplexer 701 outputting specific mode vectors, and sequentially accumulating the specific mode vectors to obtain a unit accumulated sequence. It should be noted that, in order to accumulate and carry (if any) from the lower bit to the next bit so that the next bit can be correctly accumulated and carried, the specific mode vector corresponding to the lowest-order bit data vector 705 must be arranged and input to the outermost serial full adder 702, so that the specific mode vector corresponding to the lower-order bit data vector is preferentially accumulated, the specific mode vector corresponding to the higher-order bit data vector is arranged and input to the inner serial full adder 702, and the specific mode vector corresponding to the highest-order bit data vector 703 must be arranged and input to the innermost serial full adder 702, so that the specific mode vector corresponding to the higher-order bit data vector is accumulated more late, thereby ensuring the correctness of accumulation, i.e. according to p y Weighting vector C to reflect the second vector
Figure BDA0003803225030000151
To the power of (c). The unit accumulated sequence is
Figure BDA0003803225030000152
Further realize the weighting of C. Intermediate results 401, 402, 403 and 404 as in fig. 4 are obtained so far.
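The three stages of one inner product can be modeled compactly in software. The sketch below assumes $N = 4$ data vectors per segment, follows the index convention of the $z$ list above, and checks the result against a direct dot product:

```python
# Compact model of one processing unit: pattern generation (x^T . K),
# pattern indexing (the multiplexers 701) and weighted synthesis (the
# serial full adders 702), verified against a direct dot product.
def inner_product_unit(xs, ys, p_y):
    n = len(xs)
    # Pattern generation: z_j is the sum of the x_i whose bit i of j is set,
    # exactly the z_0 .. z_15 list above.
    z = [sum(x for i, x in enumerate(xs) if (j >> i) & 1) for j in range(2 ** n)]
    total = 0
    for bit in range(p_y - 1, -1, -1):     # highest-order same-bit vector first
        # Pattern indexing: the same-bit data vector selects one pattern vector.
        index = sum(((y >> bit) & 1) << i for i, y in enumerate(ys))
        # Weighted synthesis: each step doubles the running sum (weighting by C).
        total = (total << 1) + z[index]
    return total

xs, ys = [3, 14, 7, 9], [5, 12, 1, 15]     # four 4-bit digits each
assert inner_product_unit(xs, ys, p_y=4) == sum(x * y for x, y in zip(xs, ys))
```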
The synthesis unit 309 performs the summation calculation 405 of fig. 4. It receives the unit accumulation sequences from the inner product units 308; each of them is like one of the intermediate results 401, 402, 403 and 404 in fig. 4, already aligned within the inner product units 308. The synthesis unit 309 then sums the aligned unit accumulation sequences to obtain the inner product result of the first vector $\vec{x}$ and the second vector $\vec{y}$.
Fig. 8 shows a schematic diagram of the synthesis unit 309 of this embodiment. The synthesis unit 309 shown illustratively receives the outputs of 8 inner product units 308, i.e. the unit accumulation sequences 801 to 808. These are the intermediate results obtained after the first vector $\vec{x}$ and the second vector $\vec{y}$ are each split into 8 data segments and sent to the 8 inner product units 308 for inner product calculation. The synthesis unit 309 comprises 7 full adder groups 809 to 815. Since the lowest-order operation 816 and the highest-order operation 817 each involve only one intermediate result, they require no adder group: like $x_0 y_0$ (lowest order) and $x_7 y_3$ (highest order) in fig. 4, they are output directly without being added to other intermediate results. In other words, only the operations from the second-lowest to the second-highest order require full adder groups to perform the summation calculation 405 shown in fig. 4.
Fig. 9 shows a schematic diagram of the full adder groups 810 to 815. Each of the full adder groups 810 to 815 includes a first full adder 901 and a second full adder 902, which respectively include multiplexers 903 and 904; the inputs of multiplexer 903 are the adder's carry output and the value 0, and the inputs of multiplexer 904 are the adder's carry output and the value 1, the values 0 and 1 simulating, respectively, a previous digit whose summation did not carry and one whose summation did. The first full adder 901 therefore generates the intermediate-result sum for the case where the previous digit does not carry, and the second full adder 902 generates the intermediate-result sum for the case where it does. This structure need not wait for the previous digit's intermediate result to determine whether a carry occurs; by computing the carry and no-carry cases synchronously, this design reduces the operation latency. The full adder groups 810 to 815 further include a multiplexer 905: the two candidate intermediate-result sums are input to the multiplexer 905, which selects and outputs the carry or no-carry sum according to whether the calculation result of the previous digit actually carries. The accumulated output 818 is the inner product result of the first vector $\vec{x}$ and the second vector $\vec{y}$.
Returning to fig. 8, since the lowest-order operation cannot produce a carry, the second-lowest full adder group 809 includes only the first full adder 901 and generates its intermediate result directly, without needing a second full adder 902 or a multiplexer 905.
As described with reference to fig. 8 and fig. 9, when the synthesis unit 309 of this embodiment is to sum M unit accumulation sequences, M − 1 full adder groups are configured, comprising M − 1 first full adders 901, M − 2 second full adders 902 and M − 2 multiplexers 905.
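The carry-select behavior of fig. 9 can be sketched as follows; this is a behavioral model rather than the hardware itself, and the 4-bit width and the names are illustrative:

```python
# Behavioral sketch of one full adder group of FIG. 9: both the no-carry-in
# and the carry-in sums of two aligned intermediate results are computed in
# parallel, and the previous digit's actual carry selects between them.
def full_adder_group(a, b, carry_in, width=4):
    sum0 = a + b                    # first full adder 901 (assumes carry-in 0)
    sum1 = a + b + 1                # second full adder 902 (assumes carry-in 1)
    chosen = sum1 if carry_in else sum0        # multiplexer 905
    mask = (1 << width) - 1
    return chosen & mask, chosen >> width      # (digit, carry to next group)

digit, carry = full_adder_group(0b1011, 0b0110, carry_in=1)
assert (digit, carry) == (0b0010, 1)           # 11 + 6 + 1 = 18 = 0b1_0010
```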
In other cases, the synthesis unit 309 can flexibly enable or disable the operation of full adder groups. For example, when the first vector $\vec{x}$ and the second vector $\vec{y}$ generate fewer than M unit accumulation sequences, the appropriate number of full adder groups can be turned off, flexibly supporting the various possible split counts and widening the application scenarios of the synthesis unit 309.
Returning to fig. 3, after the synthesis unit 309 obtains the inner product result of the first vector $\vec{x}$ and the second vector $\vec{y}$, it sends the inner product result to the processing unit memory agent unit 305, which forwards it to the core memory agent 301. The core memory agent 301 integrates the inner product results of all the processing units 304 to generate a calculation result and sends it to the off-chip memory 203, completing the product operation of the first operand and the second operand.
Based on the above structure, the computing device 201 of this embodiment performs different numbers of inner product operations according to the lengths of the operands. Furthermore, the processing array 303 can share the indexes among the processing units 304 in the vertical direction and share the pattern vectors among the processing units 304 in the horizontal direction, so as to perform the operations efficiently.
For data path management, this embodiment employs a two-level architecture: the core memory agent 301 and the processing unit memory agent units 305. The start addresses of the operands in the LLC are recorded in the core memory agent 301, which reads multiple operands from the LLC simultaneously, sequentially and serially by self-incrementing the addresses. Since the source addresses grow automatically, the order of the data blocks is deterministic. The core controller 302 determines which processing units 304 receive the data blocks, and the processing unit control unit 306 determines which inner product units 308 receive them.
Another embodiment of the present invention is an arbitrary-precision calculation method, which can be implemented with the hardware structure of the foregoing embodiment. Fig. 10 shows a flowchart of this embodiment.
In step 1001, a plurality of operands are read from the off-chip memory. When reading operands from the off-chip memory, the start address of each operand is set in the core memory agent, and the core memory agent reads the plurality of operands simultaneously, continuously and serially by incrementing addresses, reading slice by slice from the low-order bits of the operands to the high-order bits.
In step 1002, the plurality of operands are split into a plurality of vectors, the plurality of vectors including a first vector and a second vector. The core controller splits each operand into a plurality of data segments, i.e. a plurality of vectors, based on the computing capability and the number of processing units in the processing array, so that the core memory agent sends data to the processing array in units of data segments.
In step 1003, the inner product of the first vector and the second vector is computed according to their lengths to obtain an inner product result. The processing array includes a plurality of processing units arranged in an array, each of which computes the inner product of a first vector and a second vector based on their lengths to obtain an inner product result. More specifically, this step performs the pattern generation stage first, then the pattern indexing stage, and finally the weighted synthesis stage.
Take a first vector $\vec{x}$ and a second vector $\vec{y}$ as an example, and assume that their sizes are $N \times p_x$ and $N \times p_y$ respectively, where $N$ is the length of the first vector $\vec{x}$ and the second vector $\vec{y}$, $p_x$ is the bit width of the first vector $\vec{x}$, and $p_y$ is the bit width of the second vector $\vec{y}$. This embodiment likewise decomposes the second vector $\vec{y}$ as:

$$\vec{y} = K \cdot B_{col} \cdot C$$

where $K$ is a constant binary matrix of size $N \times 2^N$, $B_{col}$ is an index matrix of size $2^N \times p_y$, and $C$ is a weighting vector of length $p_y$; the definitions of $K$, $B_{col}$ and $C$ are the same as in the previous embodiment and are not repeated. This embodiment decomposes the second vector $\vec{y}$ in the manner described above so that the elements of $\vec{y}$ can be expressed by the two binary matrices $K$ and $B_{col}$. In other words, this embodiment converts the operation $\vec{x}^T \cdot \vec{y}$ into $\vec{x}^T \cdot K \cdot B_{col} \cdot C$.
In the pattern generation stage, this embodiment obtains $\vec{x}^T \cdot K$, i.e. generates the pattern vectors $\vec{z}$. In the pattern indexing stage, this embodiment computes $(\vec{x}^T \cdot K) \cdot B_{col}$. In the weighted synthesis stage, the indexed patterns are accumulated according to the weights $C$. With this design, operands of arbitrarily high precision can be converted into an index pattern before the inner product is performed, reducing duplicate calculations and thus avoiding the high-bandwidth requirement of arbitrary-precision computation. FIG. 11 further shows a flowchart of computing the inner product of the first vector and the second vector.
In step 1101, a plurality of pattern vectors are generated based on the length and the bit width of the first vector. First, the N data vectors corresponding to the first vector $\vec{x}$ are received. Then, since $K$ has $2^N$ unit vectors and each unit vector is simulated by hardware, $2^N$ pattern vectors $\vec{P}$ are generated. Since the inner product operation in binary is in fact a bit-by-bit addition, the generating component of this embodiment directly adds all the unit vectors in $K$ with the data vectors of the first vector $\vec{x}$ bit by bit. In more detail, one bit of each data vector of $\vec{x}$ is input simultaneously every cycle: the first cycle inputs the least significant bits of the data vectors at the same time, the second cycle inputs the second least significant bits at the same time, and so on, until the $p_x$-th cycle inputs the most significant bits of the data vectors at the same time. The required bandwidth is therefore only N bits per cycle.
When simulating a unit vector, the bit values of the data vectors corresponding to that unit vector are first received and temporarily stored, and these bit values are accumulated. If the accumulation produces a carry, the carry value is temporarily stored in a carry register and added to the bit values of the data vectors input in the next cycle, and so on until the $p_x$-th cycle adds the most significant bit values of the data vectors.
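The following behavioural sketch of one generating unit (an assumption about behaviour for illustration, not the patent's circuit; simulate_unit is a hypothetical name) consumes one bit of each data vector per cycle, emits one result bit per cycle, and keeps the carry in a register between cycles:

```python
def simulate_unit(unit_bits, x, p_x):
    """Accumulate dot(unit_bits, x) bit-serially; unit_bits is one column of K
    (a 0/1 list of length N), x holds the N data vectors of the first vector."""
    carry = 0                 # carry register ("carry temporary storage")
    out_bits = []
    for t in range(p_x):      # cycle t: the t-th bits of the data vectors arrive
        s = carry + sum(u & ((xi >> t) & 1) for u, xi in zip(unit_bits, x))
        out_bits.append(s & 1)    # one result bit per cycle
        carry = s >> 1            # the rest is carried into the next cycle
    while carry:                  # flush remaining carry after the p_x-th cycle
        out_bits.append(carry & 1)
        carry >>= 1
    return sum(b << i for i, b in enumerate(out_bits))

x = [5, 3, 6]                               # N = 3 data vectors, p_x = 3 bits
assert simulate_unit([1, 0, 1], x, 3) == 5 + 6
```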
Finally, the accumulated result, i.e. the pattern vector $\vec{P}$, is received. In summary, the pattern vectors $\vec{P}$ combine all addition possibilities of the data vectors of the first vector $\vec{x}$.
In step 1102, specific pattern vectors among the plurality of pattern vectors are accumulated, with the data vectors of the second vector $\vec{y}$ in the length direction serving as indexes, to form a plurality of unit accumulation sequences. This step implements the pattern indexing stage and the weighted synthesis stage. With the data vectors of $\vec{y}$ as indexes, the corresponding specific pattern vector is selected from all the pattern vectors $\vec{P}$ according to each index, and the specific pattern vectors are accumulated to produce an intermediate result of one bit per cycle; over $p_x$ or $p_x + 1$ consecutive cycles these bits form a unit accumulation sequence. These operations carry out $(\vec{x}^T K)\,B_{col}$.
In more detail, all the pattern vectors $\vec{P}$ are indexed by the same-bit data vectors of the second vector $\vec{y}$ in the length direction, and the specific pattern vector is passed on. Since the length of the second vector $\vec{y}$ is N, $\vec{y}$ can be decomposed into N data vectors, each with a bit width of $p_y$; regrouping these data vectors by bit position yields $p_y$ same-bit data vectors.
Then, it is determined which unit vector of the binary matrix $K$ is identical to the input same-bit data vector, and the specific pattern vector corresponding to that unit vector is output. This carries out the operation $\vec{P}\,B_{col}$.
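A compact sketch of this matching step follows (illustrative only; index_patterns is a hypothetical helper). Each same-bit data vector of $\vec{y}$ is interpreted as the index of a column of $K$, and the corresponding pattern vector is selected; the final shift-and-sum line plays the role of the weighted synthesis by $C$:

```python
def index_patterns(y, p_y, patterns):
    """patterns[c] holds dot(c-th unit vector of K, x); return the pattern
    selected by the same-bit data vector of y at each bit position j."""
    selected = []
    for j in range(p_y):
        bits = [(yi >> j) & 1 for yi in y]           # same-bit data vector
        c = sum(b << r for r, b in enumerate(bits))  # matching column of K
        selected.append(patterns[c])
    return selected

x = [5, 3, 6]
patterns = [sum(xi for r, xi in enumerate(x) if (c >> r) & 1) for c in range(8)]
y = [2, 1, 3]
sel = index_patterns(y, 2, patterns)
assert sum(p << j for j, p in enumerate(sel)) == 5*2 + 3*1 + 6*3  # = x . y
```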
Finally, the specific pattern vectors are accumulated in sequence to obtain a unit accumulation sequence. It should be noted that the correctness of the accumulation must be ensured, i.e. the weight vector $C$ of length $p_y$ must reflect the powers of two of the second vector $\vec{y}$. The unit accumulation sequences thus realize the weighting by $C$ in $(\vec{x}^T K)\,B_{col}\,C$. Each unit accumulation sequence corresponds to one of the intermediate results 401, 402, 403 and 404 in FIG. 4, which have already been aligned.
In step 1103, the plurality of unit accumulation sequences are summed to obtain the inner product result. To achieve synchronous computation, this embodiment splits the first vector $\vec{x}$ and the second vector $\vec{y}$ into a plurality of data segments and performs the inner product calculations separately to obtain intermediate results. Since the lowest-order and highest-order positions each have only one intermediate result, no addition is needed there: as shown in FIG. 4, $x_0 y_0$ (lowest order) and $x_7 y_3$ (highest order) are output directly without being added to other intermediate results. In other words, only the positions from the second lowest to the second highest order require addition.
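The combination of segment-level intermediate results can be sketched as a schoolbook-style shift-and-add, consistent with the aligned intermediate results of FIG. 4 (an illustrative assumption; combine_segments is a hypothetical helper). The lowest and highest aligned positions receive exactly one contribution each, which is why they need no addition:

```python
def combine_segments(xs, ys, seg_bits):
    """xs, ys: lists of data segments, least significant segment first."""
    total = 0
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            total += (xi * yj) << ((i + j) * seg_bits)  # align, then add
    return total

xs = [0xB, 0x3]                 # 0x3B split into 4-bit segments
ys = [0x2, 0x1]                 # 0x12 split into 4-bit segments
assert combine_segments(xs, ys, 4) == 0x3B * 0x12
```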
This embodiment adopts a design that computes the carry and no-carry cases synchronously to reduce operation latency. The intermediate-result sums with and without carry-in are obtained at the same time, and one of them is then selected according to whether the calculation of the previous position produced a carry. The accumulated output is the inner product of the first vector $\vec{x}$ and the second vector $\vec{y}$.
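This is the classic carry-select idea; a minimal sketch under that reading (an assumption about behaviour, not the claimed circuit) computes both speculative sums and picks one once the lower block's carry is known:

```python
def carry_select_add(lo_a, lo_b, hi_a, hi_b, lo_bits):
    """Add two operands split into low/high blocks of lo_bits bits each."""
    lo_sum = lo_a + lo_b
    carry = lo_sum >> lo_bits               # carry out of the low block
    hi_if_0 = hi_a + hi_b                   # speculative sum, no carry in
    hi_if_1 = hi_a + hi_b + 1               # speculative sum, carry in
    hi_sum = hi_if_1 if carry else hi_if_0  # select once the carry is known
    return (hi_sum << lo_bits) | (lo_sum & ((1 << lo_bits) - 1))

a, b = 0x9C, 0x64  # 8-bit operands split into two 4-bit blocks
assert carry_select_add(a & 0xF, b & 0xF, a >> 4, b >> 4, 4) == a + b
```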
Returning to FIG. 10, in step 1004, the inner product result is integrated into the calculation result of the plurality of operands. The core controller controls the memory agent to integrate or reduce the inner product results into the calculation result of the plurality of operands and to send the calculation result to the kernel memory agent.
In step 1005, the calculation result is stored to the off-chip memory. The kernel memory agent sends the calculation result in parallel: the least significant bits of the calculation result are sent simultaneously first, then the second least significant bits, and so on until the most significant bits of the calculation result are sent simultaneously.
Another embodiment of the present invention is a computer-readable storage medium having stored thereon computer program code which, when executed by a processor, performs the method according to FIG. 10 or FIG. 11. In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. In this regard, when the aspects of the present invention are embodied in a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in the embodiments of the present invention. The memory may include, but is not limited to, a USB disk, a flash disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The present invention proposes a novel architecture for efficiently handling arbitrary-precision calculations. No matter how high the precision of the operands, the invention can decompose them and use indexing to process fixed-length bit streams in parallel, avoiding bit-level redundancy such as sparsity and repeated calculation; no high-bit-width hardware needs to be deployed, and the effects of flexible application and large-bit-width calculation are achieved.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an Internet-of-Things terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a visual terminal, an autonomous driving terminal, a means of transportation, a household appliance, and/or a medical device. The means of transportation include airplanes, ships and/or land vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners and/or electrocardiographs. The electronic device or apparatus of the present invention can also be applied to the fields of the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and the like. Furthermore, the electronic device or apparatus can be used in cloud, edge and terminal application scenarios related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, an electronic device or apparatus with high computing power according to the present invention may be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and that of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device, according to the hardware information of the terminal device and/or the edge device, to simulate the hardware resources of the terminal device and/or the edge device, thereby completing unified management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration.
It is noted that for the sake of simplicity, the present invention sets forth some methods and embodiments thereof as a series and combination of acts, but those skilled in the art will appreciate that the inventive arrangements are not limited by the order of acts described. Accordingly, persons skilled in the art may appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the invention. Further, those skilled in the art will appreciate that the described embodiments of the invention are capable of being practiced in other alternative embodiments that may involve fewer acts or modules than are necessary to practice one or more aspects of the invention. In addition, the description of some embodiments of the present invention is also focused on different schemes. In view of this, those skilled in the art will understand that portions of the present invention that are not described in detail in one embodiment may also refer to related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, one of ordinary skill in the art will appreciate that the several embodiments disclosed herein can be practiced in other ways not disclosed herein. For example, as for each unit in the foregoing embodiments of the electronic device or apparatus, the unit is split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of the connection relationships between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the scheme of the embodiment of the invention. In addition, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, as a specific hardware circuit, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), and may be, for example, a Resistive Random Access Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The above embodiments of the present invention are described in detail, and the principle and the implementation of the present invention are explained by applying specific embodiments, and the above description of the embodiments is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (8)

1. An arbitrary-precision computation accelerator coupled to off-chip memory, the arbitrary-precision computation accelerator comprising:
a kernel memory agent to read a plurality of operands from the off-chip memory;
a core controller to split the plurality of operands into a plurality of vectors, the plurality of vectors including a first vector and a second vector; and
a processing array comprising a plurality of processing elements to inner product the first vector and the second vector according to lengths of the first vector and the second vector to obtain an inner product result;
wherein the core controller integrates the inner product result into a calculation result of the plurality of operands, and the kernel memory agent stores the calculation result into the off-chip memory.
2. An arbitrary precision computation accelerator according to claim 1, wherein a start address of the plurality of operands is set in the kernel memory agent, and the kernel memory agent serially reads the plurality of operands by self-incrementing the address.
3. An arbitrary precision computation accelerator according to claim 2, wherein the kernel memory agent reads the plurality of operands bit by bit, from the lower bits of the plurality of operands successively to the higher bits.
4. An arbitrary precision computation accelerator according to claim 1, wherein the kernel memory agent sends the calculation result to the off-chip memory in parallel.
5. An arbitrary precision computation accelerator according to claim 1, wherein each processing element comprises:
a conversion unit to generate a plurality of pattern vectors according to the length and the bit width of the first vector;
a plurality of inner product units, each of which accumulates specific pattern vectors of the plurality of pattern vectors, with the data vectors of the second vector in the length direction as indexes, to form a unit accumulation sequence; and
a synthesis unit to sum the unit accumulation sequences to obtain the inner product result.
6. An integrated circuit device, comprising:
an arbitrary precision computation accelerator according to any of claims 1 to 5;
processing means to control the arbitrary precision computation accelerator; and
off-chip memory, including LLC;
wherein the arbitrary precision computation accelerator is in communication with the processing device via the LLC.
7. A board card comprising the integrated circuit device of claim 6.
8. An arbitrary precision calculation method comprising:
reading a plurality of operands from an off-chip memory;
splitting the plurality of operands into a plurality of vectors, the plurality of vectors comprising a first vector and a second vector;
inner-product the first vector and the second vector according to the lengths of the first vector and the second vector to obtain an inner-product result;
integrating the inner product result into a calculation result of the plurality of operands; and
storing the calculation result in the off-chip memory.
CN202210990132.8A 2021-10-20 2021-10-20 Arbitrary-precision calculation accelerator, integrated circuit device, board card and method Pending CN115437602A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210990132.8A CN115437602A (en) 2021-10-20 2021-10-20 Arbitrary-precision calculation accelerator, integrated circuit device, board card and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210990132.8A CN115437602A (en) 2021-10-20 2021-10-20 Arbitrary-precision calculation accelerator, integrated circuit device, board card and method
CN202111221317.4A CN114003198B (en) 2021-10-20 2021-10-20 Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202111221317.4A Division CN114003198B (en) 2021-10-20 2021-10-20 Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium

Publications (1)

Publication Number Publication Date
CN115437602A true CN115437602A (en) 2022-12-06

Family

ID=79923295

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210990132.8A Pending CN115437602A (en) 2021-10-20 2021-10-20 Arbitrary-precision calculation accelerator, integrated circuit device, board card and method
CN202111221317.4A Active CN114003198B (en) 2021-10-20 2021-10-20 Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202111221317.4A Active CN114003198B (en) 2021-10-20 2021-10-20 Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium

Country Status (2)

Country Link
CN (2) CN115437602A (en)
WO (1) WO2023065701A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118349213A (en) * 2024-06-14 2024-07-16 中昊芯英(杭州)科技有限公司 Data processing device, method, medium and computing equipment

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115437602A (en) * 2021-10-20 2022-12-06 中科寒武纪科技股份有限公司 Arbitrary-precision calculation accelerator, integrated circuit device, board card and method
CN115080916B (en) * 2022-07-14 2024-06-18 北京有竹居网络技术有限公司 Data processing method, device, electronic equipment and computer readable medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101082860A (en) * 2007-07-03 2007-12-05 浙江大学 Multiply adding up device
US8239438B2 (en) * 2007-08-17 2012-08-07 International Business Machines Corporation Method and apparatus for implementing a multiple operand vector floating point summation to scalar function
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
US10338919B2 (en) * 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
CN109213962B (en) * 2017-07-07 2020-10-09 华为技术有限公司 Operation accelerator
CN110110283A (en) * 2018-02-01 2019-08-09 北京中科晶上科技股份有限公司 A kind of convolutional calculation method
CN108388446A (en) * 2018-02-05 2018-08-10 上海寒武纪信息科技有限公司 Computing module and method
CN112711738A (en) * 2019-10-25 2021-04-27 安徽寒武纪信息科技有限公司 Computing device and method for vector inner product and integrated circuit chip
CN112084023A (en) * 2020-08-21 2020-12-15 安徽寒武纪信息科技有限公司 Data parallel processing method, electronic equipment and computer readable storage medium
CN112487750B (en) * 2020-11-30 2023-06-16 西安微电子技术研究所 Convolution acceleration computing system and method based on in-memory computing
CN115437602A (en) * 2021-10-20 2022-12-06 中科寒武纪科技股份有限公司 Arbitrary-precision calculation accelerator, integrated circuit device, board card and method

Also Published As

Publication number Publication date
WO2023065701A1 (en) 2023-04-27
CN114003198A (en) 2022-02-01
CN114003198B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
US11531540B2 (en) Processing apparatus and processing method with dynamically configurable operation bit width
US11836497B2 (en) Operation module and method thereof
CN109219821B (en) Arithmetic device and method
CN110163363B (en) Computing device and method
CN109522052B (en) Computing device and board card
CN109685201B (en) Operation method, device and related product
KR20190107766A (en) Computing device and method
CN114003198B (en) Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium
US11775808B2 (en) Neural network computation device and method
CN109711540B (en) Computing device and board card
CN112765540A (en) Data processing method and device and related products
CN109740730B (en) Operation method, device and related product
CN112801276B (en) Data processing method, processor and electronic equipment
CN111260070B (en) Operation method, device and related product
CN111047024A (en) Computing device and related product
WO2022001438A1 (en) Computing apparatus, integrated circuit chip, board card, device and computing method
CN118333068A (en) Matrix multiplying device
WO2022143799A1 (en) Integrated circuit apparatus for matrix multiplication operation, computing device, system, and method
CN115600657A (en) Processing device, equipment and method and related products thereof
CN115237371A (en) Computing device, data processing method and related product
CN112801278A (en) Data processing method, processor, chip and electronic equipment
CN114692845A (en) Data processing device, data processing method and related product
CN115438778A (en) Integrated circuit device for executing Winograd convolution
CN115438777A (en) Device for performing Winograd convolution forward transform on neuron data
CN114692850A (en) Device and board card for performing Winograd convolution forward conversion on neuron data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination