WO2023065701A1 - Inner product processing component, arbitrary-precision computing device and method, and readable storage medium - Google Patents

Inner product processing component, arbitrary-precision computing device and method, and readable storage medium

Info

Publication number
WO2023065701A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
unit
vectors
inner product
data
Prior art date
Application number
PCT/CN2022/100304
Other languages
French (fr)
Chinese (zh)
Inventor
赵永威
郝一帆
刘晨骁
承书尧
喻歆
陈天石
Original Assignee
寒武纪(西安)集成电路有限公司
Priority date: 2021-10-20 (claimed from Chinese application 202111221317.4)
Filing date
Publication date
Application filed by 寒武纪(西安)集成电路有限公司
Publication of WO2023065701A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7821Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the present invention relates generally to the field of computers. More specifically, the present invention relates to an inner product processing component, an arbitrary-precision computing device, a method, and a readable storage medium.
  • Arbitrary-precision computation uses an arbitrary number of digits to represent operands and is crucial in many technical fields, such as supernova simulation, climate simulation, atomic simulation, artificial intelligence, and planetary orbit calculation. These fields need to process data with hundreds, thousands, or even millions of bits; processing such a wide range of bit widths is far beyond the hardware capability of traditional processors.
  • the solutions of the present invention provide an inner product processing component, an arbitrary precision computing device, a method, and a readable storage medium.
  • the present invention discloses a processing unit for inner product of a first vector and a second vector, including: a conversion unit, a plurality of inner product units and a combination unit.
  • the conversion unit is used for generating multiple pattern vectors according to the length and bit width of the first vector.
  • Each inner product unit uses the data vectors along the length direction of the second vector as indices to accumulate specific pattern vectors among the plurality of pattern vectors, forming a unit accumulation sequence.
  • the synthesis unit is used to add up multiple unit accumulation sequences to obtain an inner product result.
  • the present invention discloses an arbitrary-precision computing accelerator connected to an off-chip memory.
  • the arbitrary-precision computing accelerator includes: a kernel memory agent, a kernel controller, and a processing array.
  • the kernel memory agent is used to read multiple operands from off-chip memory.
  • the core controller is used for splitting multiple operands into multiple vectors, and the multiple vectors include a first vector and a second vector.
  • the processing array includes a plurality of processing units for computing the inner product of the first vector and the second vector according to the lengths of the first vector and the second vector to obtain an inner product result.
  • the core controller integrates the inner product result into the calculation result of multiple operands, and the core memory agent stores the calculation result in the off-chip memory.
  • the present invention discloses an integrated circuit device including the aforementioned arbitrary precision computing accelerator, a processing device and an off-chip memory.
  • the processing device is used to control the arbitrary precision computing accelerator, and the off-chip memory includes LLC.
  • the arbitrary-precision computing accelerator communicates with the processing device through the LLC.
  • the present invention discloses a board including the aforementioned integrated circuit device.
  • the present invention discloses a method for computing the inner product of a first vector and a second vector, comprising: generating a plurality of pattern vectors according to the length and bit width of the first vector; using the data vectors along the length direction of the second vector as indices, accumulating specific pattern vectors among the plurality of pattern vectors to form a plurality of unit accumulation sequences; and summing the plurality of unit accumulation sequences to obtain an inner product result.
  • the present invention discloses an arbitrary-precision calculation method, including: reading multiple operands from an off-chip memory; splitting the multiple operands into multiple vectors, the multiple vectors including a first vector and a second vector; computing the inner product of the first vector and the second vector according to their lengths to obtain an inner product result; integrating the inner product result into a calculation result of the multiple operands; and storing the calculation result in the off-chip memory.
  • the present invention discloses a computer-readable storage medium, on which computer program codes for arbitrary precision calculations are stored, and when the computer program codes are executed by a processing device, the aforesaid methods are executed.
  • the present invention proposes a scheme for processing arbitrary-precision calculations, which processes different bit streams in parallel and deploys a complete bit-serial data path to perform high-precision calculations flexibly.
  • the invention makes full use of a simple hardware configuration, reduces repeated calculation, and thereby realizes arbitrary-precision calculation with low energy consumption.
  • Fig. 1 is a structural diagram showing the board card of an embodiment of the present invention.
  • 2A to 2C are structural diagrams illustrating an integrated circuit device according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram showing the internal structure of a computing device according to an embodiment of the present invention.
  • Figure 4 is a schematic diagram illustrating an exemplary multiplication operation.
  • FIG. 5 is a schematic diagram illustrating a conversion unit according to an embodiment of the present invention.
  • Fig. 6 is a schematic diagram showing a generation unit of an embodiment of the present invention.
  • Fig. 7 is a schematic diagram showing an inner product unit according to an embodiment of the present invention.
  • Figure 8 is a schematic diagram illustrating a synthesis unit of an embodiment of the present invention.
  • Fig. 9 is a schematic diagram showing a full adder group of an embodiment of the present invention.
  • FIG. 10 is a flowchart illustrating arbitrary precision calculations of another embodiment of the present invention.
  • FIG. 11 is a flow chart illustrating the inner product of the first vector and the second vector according to another embodiment of the present invention.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • the present invention proposes a high-efficiency arbitrary-precision computing accelerator architecture, which organizes computation around the inner product operation and exploits the intra-parallelism and inter-parallelism of the accelerator architecture to achieve multiplication of arbitrary-precision operands.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present invention.
  • the board card 10 includes a chip 101, which is a system-on-chip (System on Chip, SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements for the storage capacity and computing power of the platform.
  • the board 10 of this embodiment is suitable for cloud intelligence applications, featuring huge off-chip storage, on-chip storage, and powerful computing capability.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected to the control device 106 and the chip 101 through a bus for data transmission.
  • the control device 106 in the board 10 is configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing the combined processing device in the chip 101 of this embodiment.
  • the combined processing device includes a computing device 201 , a processing device 202 , an off-chip memory 203 , a communication node 204 and an interface device 205 .
  • there are several integration solutions that can be used to coordinate the work of the computing device 201, the processing device 202, and the off-chip memory 203: FIG. 2A shows an LLC integration solution, FIG. 2B shows an SoC integration solution, and FIG. 2C shows an IO integration solution.
  • the computing device 201 is configured to execute user-specified operations, and is mainly implemented as a multi-core intelligent processor for performing deep learning or machine learning calculations, which can interact with the processing device 202 to jointly complete user-specified operations.
  • the computing device 201 includes the aforementioned arbitrary-precision computing accelerator for processing linear computations, more specifically operand multiplication operations such as convolution.
  • the processing device 202 performs basic controls including but not limited to data transfer, starting and/or stopping of the computing device 201 , nonlinear calculation, and the like.
  • the processing device 202 may be one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU) or a graphics processing unit (GPU), including but not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components; their number can be determined according to actual needs.
  • the off-chip memory 203 is used to store the data to be processed and the processed data; its hierarchy can be divided into first-level cache (L1), second-level cache (L2), third-level cache (L3, also known as the last level cache, LLC), and physical memory.
  • the physical memory is DDR, usually 16 GB or larger.
  • L1 is the fastest, so L1 is accessed first; if the data is not in L1, L2 is accessed; if the data is not in L2, L3 is accessed; if the data is still not in L3, DDR is finally accessed.
  • the cache hierarchy of the off-chip memory 203 speeds up data access by storing the most frequently accessed data in the cache. DDR is slow compared with cache: as the level goes up (L1, L2, LLC, DDR), the access latency becomes higher and higher, but the storage space becomes larger and larger.
  • the communication node 204 is a routing node or router in a network-on-chip (NoC).
  • the communication node 204 reads the address information in the header flit of the data packet and uses a specific routing algorithm to calculate the best routing path, thereby establishing a reliable transmission path to send the data packet to the destination node (such as the off-chip memory 203).
  • in the reverse direction, the communication node 204 also calculates the optimal routing path and sends data packets from the off-chip memory 203 to the computing device 201 or the processing device 202.
  • the interface device 205 is the external input and output interface of the combination processing device.
  • the interface device 205 transmits information according to the requirements of the sender and receiver: it sets data buffers to resolve the mismatch caused by the speed difference between the two, sets signal level conversion, sets information conversion logic to meet the format requirements of each side, and sets timing control circuits to synchronize the work of the sender and receiver and to provide address transcoding and other tasks.
  • the LLC integration in FIG. 2A refers to the communication between the computing device 201 and the processing device 202 through LLC.
  • the SoC integration in FIG. 2B is to integrate the computing device 201 , the processing device 202 and the off-chip memory 203 through the communication node 204 .
  • the IO integration in FIG. 2C is to integrate the computing device 201 , the processing device 202 and the off-chip memory 203 through the interface device 205 .
  • This embodiment preferably chooses the LLC integration scheme. The core of deep learning and machine learning is the convolution operator; the basis of the convolution operator is the inner product operation, and the inner product operation is a combination of multiplication and addition. Therefore, the main task of the computing device 201 is to perform a large number of multiplications.
  • the computing device 201 and the processing device 202 need intensive interaction.
  • the computing device 201 and the processing device 202 are integrated at the LLC and share data through the LLC to achieve a lower interaction cost. Furthermore, since high-precision data may have millions of bits while the capacity of L1 and L2 is limited, interacting through L1 and L2 would lead to a problem of insufficient capacity.
  • the computing device 201 utilizes the relatively large capacity of the LLC to cache high-precision data to save time for repeated access.
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 , which includes a core memory agent 301 , a core controller 302 and a processing array 303 .
  • the kernel memory agent 301 serves as a management terminal for the computing device 201 to access the off-chip memory 203 .
  • the kernel memory agent 301 reads the operands from the off-chip memory 203.
  • the starting addresses of the operands are set in the kernel memory agent 301, which reads multiple operands simultaneously, continuously, and serially through self-incrementing addresses.
  • the reading order is from the lower bits of these operands to the higher bits, segment by segment.
  • taking three operands as an example, the agent serially reads the lowest 512 bits of the first operand, then the lowest 512 bits of the second operand, then the lowest 512 bits of the third operand; after the lowest segments have been read, the address is self-incremented (by 512 bits) and the next 512 bits of each operand are read serially in turn, and so on until the highest bits of the three operands are read.
  • when the kernel memory agent 301 stores the calculation results back to the off-chip memory 203, it sends them in parallel: the lowest bits of the three calculation results are sent at the same time, then the second-lowest bits at the same time, and so on until the highest bits of the three calculation results are sent at the same time.
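To make the described access pattern concrete, here is a minimal Python sketch of the chunked, low-to-high, operand-interleaved read order. The 512-bit chunk size and the three-operand example follow the text above; the function name and structure are illustrative assumptions, not the patent's implementation:

```python
def stream_chunks(operands, chunk_bits=512):
    # Read order: the lowest chunk of every operand first, then the
    # next-lowest chunk of every operand, and so on (low bits to high bits).
    max_bits = max(op.bit_length() for op in operands)
    mask = (1 << chunk_bits) - 1
    addr = 0  # self-incrementing address, counted in bits
    while addr < max_bits:
        for i, op in enumerate(operands):
            yield i, (op >> addr) & mask
        addr += chunk_bits  # self-increment by one 512-bit chunk

# Example: three operands are streamed chunk by chunk to the processing array.
ops = [(1 << 1500) - 3, 12345, (1 << 700) + 9]
for idx, chunk in stream_chunks(ops):
    pass  # each (operand, chunk) pair would be forwarded bit-serially
```

A symmetric loop over result words, rather than operands, would model the parallel write-back order described above.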
  • these operands are represented in the form of matrices or vectors.
  • based on the computing power and number of processing units in the processing array 303, the core controller 302 splits each operand into multiple data segments, that is, multiple vectors, so that the core memory agent 301 sends data to the processing array 303 in units of data segments.
  • the processing array 303 is used to perform the multiplication calculation of two operands.
  • for example, the first operand can be divided into 8 data segments x_0 to x_7, and the second operand can be divided into 4 data segments y_0 to y_3. When the multiplication operation is performed between the first operand and the second operand, the algorithm unfolds as shown in Figure 4.
  • the processing array 303 splits the first operand and the second operand, performs the inner product calculations respectively, and then shifts, aligns, and sums the intermediate results 401, 402, 403, and 404 to obtain the calculation result of the multiplication operation.
  • the above-mentioned data segments are collectively referred to as vectors below, and the multiplication of two data segments is the inner product of two vectors (the first vector and the second vector), where the first vector comes from the first operand and the second vector comes from the second operand.
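As a concrete illustration of the decomposition in Figure 4, the sketch below splits two operands into data segments, computes the per-segment products, and then shifts, aligns, and sums the intermediate results. The segment counts (8 and 4) follow the example above; the 64-bit segment width and all names are assumptions for illustration:

```python
SEG_BITS = 64  # assumed segment width, for illustration only

def split(op, n_segs, seg_bits=SEG_BITS):
    # Split an integer operand into n_segs segments, lowest segment first.
    mask = (1 << seg_bits) - 1
    return [(op >> (i * seg_bits)) & mask for i in range(n_segs)]

def multiply_by_segments(a, b, na=8, nb=4, seg_bits=SEG_BITS):
    xs, ys = split(a, na, seg_bits), split(b, nb, seg_bits)
    total = 0
    for j, yj in enumerate(ys):  # one intermediate result per y segment (cf. 401 to 404)
        partial = sum(xi * yj << (i * seg_bits) for i, xi in enumerate(xs))
        total += partial << (j * seg_bits)  # shift, align, and sum
    return total

a, b = (1 << 500) - 1, (1 << 250) + 7
assert multiply_by_segments(a, b) == a * b
```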
  • the processing array 303 includes a plurality of processing units 304 arranged in an array.
  • the figure shows 4 ⁇ 8 processing units 304 as an example, and the number of the processing units 304 is not limited in the present invention.
  • Each processing unit 304 is configured to inner product the first vector and the second vector according to the length of the first vector and the length of the second vector to obtain an inner product result.
  • the core controller 302 integrates or reduces the inner product results into the calculation result of the multiple operands and sends it to the core memory agent 301, and the core memory agent 301 stores the calculation result in the off-chip memory 203.
  • in control, the computing device 201 adopts a recursive decomposition algorithm: the core controller 302 evenly decomposes the operands of the multiplication into multiple vectors and sends them to the processing array 303 for calculation, and each processing unit 304 is responsible for the calculation of one group of vectors, such as the inner product of the first vector and the second vector.
  • each processing unit 304 further splits a group of vectors into smaller inner product calculation units based on its own hardware resources, so as to facilitate inner product calculation.
  • the computing device 201 adopts multiple bit streams on the data path; that is, each operand is streamed from the core memory agent 301 to the processing array 303 at a rate of 1 bit per cycle, with multiple operands transmitted in parallel at the same time. After the calculation is completed, each processing unit 304 sends its inner product result to the kernel memory agent 301 in a bit-serial manner.
  • the main task of the processing unit 304 is inner product calculation.
  • based on bit-indexed vector inner products, the processing unit 304 divides the process into three stages: the first stage is the pattern generation stage, the second stage is the pattern index stage, and the third stage is the weighted synthesis stage.
  • Take the first vector x and the second vector y as an example, and assume that the sizes of x and y are N×p_x and N×p_y, where N is the length of the first vector and the second vector (more specifically, the number of row elements), p_x is the bit width of x, and p_y is the bit width of y.
  • to compute the inner product of the first vector x and the second vector y, x is transposed and then multiplied with y, i.e. (p_x×N)·(N×p_y), to generate an inner product result of size p_x×p_y. The inner product is rewritten as x^T·y = x^T·K·B_col·C, where:
  • K is a fixed binary matrix with a size of N×2^N;
  • B_col is a binary matrix with a size of 2^N×p_y;
  • C is a weighting vector of length p_y.
  • K arranges all 2^N possible unit vectors of length N as its columns. For example, when N is 2, K is a binary matrix with a size of 2×2^2 that covers all possible combinations of elements of length 2.
  • a combination of elements of length 2 has 4 possibilities, so the fixed form of K is K = [0 0 1 1; 0 1 0 1], whose columns enumerate (0,0), (0,1), (1,0), and (1,1).
  • B_col is a one-hot matrix: each column has only one element equal to 1 and the rest equal to 0, and which element is 1 depends on which column of K the corresponding bit column of the second vector y equals.
  • exemplarily, the first vector x and the second vector y are set (their concrete values are given in the original figures) such that:
  • the first bit column of the second vector y is the fourth column of K;
  • the second bit column of the second vector y is the third column of K;
  • the third bit column of the second vector y is the fourth column of K;
  • the fourth bit column of the second vector y is the first column of K.
  • accordingly, only the fourth element of the first column of B_col is 1, indicating that the first bit column of y is the fourth column of K; only the third element of the second column of B_col is 1, indicating that the second bit column of y is the third column of K; only the fourth element of the third column of B_col is 1, indicating that the third bit column of y is the fourth column of K; and only the first element of the fourth column of B_col is 1, indicating that the fourth bit column of y is the first column of K. Written out, B_col = [0 0 0 1; 0 0 0 0; 0 1 0 0; 1 0 1 0]. To sum up, as long as K is determined, the element values of B_col are also determined.
  • C is the weighting vector of length p_y that reflects the power-of-two weight of each bit position of the second vector y, that is, its bit width. Since p_y is 4, C is C = (2^0, 2^1, 2^2, 2^3)^T, one weight per bit column from least significant to most significant.
  • This embodiment disassembles the second vector y in the above-mentioned way, so that each element of y can be represented by the two binary matrices K and B_col, i.e. y = K·B_col·C.
  • in other words, this embodiment converts the inner product operation x^T·y into the operation x^T·K·B_col·C.
  • the processing unit 304 is used to implement the vector inner product based on the aforementioned conversion.
  • in the pattern generation stage, the processing unit 304 computes all possibilities of x^T·K, generating the pattern vectors.
  • in the pattern index stage, the processing unit 304 computes Z·B_col, where Z denotes the pattern vectors; in the weighted synthesis stage, the processing unit 304 accumulates the indexed patterns according to the weights C.
  • Such a design enables operands, no matter how high their precision, to be converted into an index mode to perform inner products, reducing repeated calculation and avoiding the high bandwidth requirements of arbitrary-precision calculation.
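The conversion can be checked numerically. The following sketch is a minimal model of the math, not of the hardware; the least-significant-bit-first ordering of the bit columns of B_col and of the weights C is an assumption, and all names are illustrative. It builds K, derives B_col and C from a second vector y, and verifies that x^T·K·B_col·C equals the ordinary inner product:

```python
import numpy as np

N, px, py = 4, 8, 6                      # vector length and bit widths
rng = np.random.default_rng(0)
x = rng.integers(0, 1 << px, size=N)     # first vector: N elements of px bits
y = rng.integers(0, 1 << py, size=N)     # second vector: N elements of py bits

# K: N x 2^N fixed binary matrix whose columns enumerate all unit vectors.
K = np.array([[(c >> r) & 1 for c in range(1 << N)] for r in range(N)])

# Bit-decompose y: Ybits[r, j] is bit j of y[r] (least significant first).
Ybits = np.array([[(v >> j) & 1 for j in range(py)] for v in y])

# B_col: 2^N x py one-hot matrix; column j selects the column of K that
# equals the j-th bit column of y.
B_col = np.zeros((1 << N, py), dtype=int)
for j in range(py):
    col_index = sum(Ybits[r, j] << r for r in range(N))
    B_col[col_index, j] = 1

C = np.array([1 << j for j in range(py)])   # weights 2^0 .. 2^(py-1)

Z = x @ K                                   # pattern vectors (pattern generation)
assert Z @ B_col @ C == int(x @ y)          # pattern index + weighted synthesis
```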
  • FIG. 3 further shows a schematic structural diagram of the processing unit 304 .
  • the processing unit 304 includes a processing unit memory proxy unit 305 , a processing unit control unit 306 , a conversion unit 307 , a plurality of inner product units 308 and a synthesis unit 309 .
  • the processing unit memory proxy unit 305 is used as the interface for the processing unit 304 to access the kernel memory agent 301, so as to receive the two vectors whose inner product is required, such as the aforementioned first vector x and second vector y.
  • the processing unit control unit 306 is used to coordinate and manage the work of each unit in the processing unit 304 .
  • the conversion unit 307 is used to implement the pattern generation stage: it receives the first vector x from the processing unit memory proxy unit 305, implements the binary matrix K in hardware, and executes x^T·K to generate the multiple pattern vectors.
  • FIG. 5 shows a schematic diagram of the conversion unit 307, and the conversion unit 307 includes: N bit-stream input terminals 501, a generating component 502, and 2^N bit-stream output terminals 503.
  • the N bit-stream input terminals 501 correspond to the length N of the first vector x and respectively receive the N data vectors.
  • Figure 5 takes a first vector x of length 4 as illustration: x includes four data vectors x_0, x_1, x_2, and x_3, and the bit width of each data vector is p_x, that is, each data vector has p_x bits.
  • corresponding to the 2^N unit vectors of K, the generating component 502 includes 2^N generating units, each of which simulates one unit vector, so as to generate the 2^N pattern vectors respectively.
  • the first vector x is split into the four data vectors x_0, x_1, x_2, and x_3, which are input in parallel from the left side of the generating component 502. Since an inner product in binary is in fact an addition of individual bits, the generating component 502 directly simulates all unit vectors of K in hardware and adds the bits of x_0, x_1, x_2, and x_3 in sequence.
  • the same-position bits of x_0, x_1, x_2, and x_3 are input simultaneously in each cycle: for example, the least significant bits of x_0, x_1, x_2, and x_3 are input simultaneously in the first cycle, the next-lowest bits simultaneously in the second cycle, and so on until the most significant bits of x_0, x_1, x_2, and x_3 are input simultaneously in the p_x-th cycle.
  • in this way, the required bandwidth is only N bits per cycle, which in this example is only 4 bits per cycle.
  • the generating component 502 includes 16 generating units, respectively simulating the 16 unit vectors in K, and these unit vectors are (0000), (0001), (0010), (0011), (0100), (0101), (0110), (0111), (1000), (1001), (1010), (1011), (1100), (1101), (1110), and (1111).
  • FIG. 6 shows a schematic diagram of the generation unit 504 whose unit vector is (1011).
  • the generation unit 504 includes three element registers 601, an adder 602, and a carry register 603.
  • the three element registers 601 receive and temporarily store the bits of the data vectors selected by the simulated unit vector, that is, the bits of x_0, x_1, and x_3, while the bits of x_2 are directly ignored; this structure implements z_11 = x_0 + x_1 + x_3.
  • in each cycle, the values in the element registers 601 are sent to the adder 602 for accumulation. If a carry occurs after the accumulation, the carry value is temporarily stored in the carry register 603 and added together with the bits of x_0, x_1, and x_3 input in the next cycle, and so on until the p_x-th cycle adds the most significant bits of x_0, x_1, and x_3.
  • Each generating unit is designed according to the same technical logic. Based on the structure of the generation unit 504 with unit vector (1011) in FIG. 6, those skilled in the art can easily deduce the structure of the other generating units without creative work, so they are not detailed here.
  • some generation units do not need the adder 602 and the carry register 603, such as the generation units simulating the unit vectors (0000), (0001), (0010), (0100), and (1000); these units have at most one input in a cycle, so there is no addition and no carry.
  • the 2^N bit-stream output terminals 503 are respectively connected to the outputs of the adders 602 of the generation units to output the 2^N pattern vectors.
  • since N is 4, the 16 bit-stream output terminals 503 output 16 pattern vectors z_0 to z_15 in total.
  • the bit width of a pattern vector may be p_x (if the addition of the most significant bits produces no carry) or p_x+1 (if the addition of the most significant bits produces a carry).
  • the pattern vectors are all possible addition combinations of x_0, x_1, x_2, and x_3, namely: z_0 = 0, z_1 = x_0, z_2 = x_1, z_3 = x_0+x_1, z_4 = x_2, z_5 = x_0+x_2, z_6 = x_1+x_2, z_7 = x_0+x_1+x_2, z_8 = x_3, z_9 = x_0+x_3, z_10 = x_1+x_3, z_11 = x_0+x_1+x_3, z_12 = x_2+x_3, z_13 = x_0+x_2+x_3, z_14 = x_1+x_2+x_3, and z_15 = x_0+x_1+x_2+x_3.
  • each inner product unit 308 is equivalent to a processor core, realizing the pattern index stage and the weighted synthesis stage.
  • the present invention does not limit the number of inner product units 308.
  • the inner product unit 308 receives the second vector y from the processing unit memory proxy unit 305 and takes the data vectors along the length direction of y as indices.
  • according to each index, the corresponding specific pattern vector is selected from all pattern vectors; these specific pattern vectors are accumulated, generating a one-bit intermediate result in each cycle and forming a unit accumulation sequence over p_x or p_x+1 consecutive cycles. These operations implement Z·B_col together with the weighting by C.
  • FIG. 7 shows a schematic diagram of the inner product unit 308 of this embodiment.
  • the inner product unit 308 includes p_y multiplexers 701 and p_y-1 serial full adders 702.
  • Each multiplexer 701 receives all pattern vectors (z_0 to z_15) and, according to one same-position data vector along the length direction of the second vector y, lets one specific pattern vector among all pattern vectors pass. Since the length of y is N, y can be disassembled into N data vectors; since N is 4, y is disassembled into the 4 data vectors y_0, y_1, y_2, and y_3, and the bit width of each data vector is p_y, so viewed bit position by bit position these data vectors can be regrouped into p_y same-position data vectors.
  • the most significant bits of the four data vectors y_0, y_1, y_2, and y_3 form the highest same-position data vector 703, the next-highest bits form the next-highest same-position data vector 704, and so on, until the least significant bits of y_0, y_1, y_2, and y_3 form the lowest same-position data vector 705.
  • the multiplexer 701 judges which unit vector of the binary matrix K is identical to the input same-position data vector and outputs the specific pattern vector corresponding to that unit vector.
  • the highest same-position data vector 703 is input to the first multiplexer as the selection signal; assuming the vector 703 is (0101), which is identical to the unit vector 505 in Fig. 5, the first multiplexer outputs the specific pattern vector z_5 corresponding to the unit vector 505.
  • the next-highest same-position data vector 704 is input to the second multiplexer as the selection signal; assuming the vector 704 is (0010), which is identical to the unit vector 506 in FIG. 5, the second multiplexer outputs the specific pattern vector z_2 corresponding to the unit vector 506.
  • likewise, the lowest same-position data vector 705 is input to the p_y-th multiplexer as the selection signal; assuming the vector 705 is (1110), which is identical to the unit vector 507 in Figure 5, the p_y-th multiplexer outputs the specific pattern vector z_14 corresponding to the unit vector 507. This completes the pattern index operation Z·B_col.
  • The serial full adders 702 implement the weighted synthesis stage. The p_y-1 serial full adders 702 are connected in series as shown in the figure; they receive the specific pattern vectors output by the multiplexers 701 and accumulate them sequentially to obtain the unit accumulation sequence. Note that, so that the accumulation and any carry propagate correctly from a lower bit to the next bit, the specific pattern vector corresponding to the lowest same-position data vector 705 must be routed to the outermost serial full adder 702.
  • the outermost serial full adder 702 lets the specific pattern vectors corresponding to lower same-position data vectors be accumulated first, while the specific pattern vectors corresponding to higher same-position data vectors are routed to the inner serial full adders 702; the specific pattern vector corresponding to the highest same-position data vector 703 must be routed to the innermost serial full adder 702, so that pattern vectors of higher significance are accumulated later. This guarantees the correctness of the accumulation; that is, it realizes the weighting vector C of length p_y reflecting the power-of-two weights of the second vector y.
  • the unit accumulation sequence thus realizes the weighting by C on the basis of Z·B_col. At this point, the intermediate results 401, 402, 403, and 404 shown in FIG. 4 are obtained.
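Functionally, the pattern index and weighted synthesis stages behave as sketched below. This is a word-level model under stated assumptions: the multiplexer selection index is formed from the same-position bits of y_0..y_(N-1), and the serial full adders' low-to-high accumulation order is abstracted into the 2^j weights; names are illustrative:

```python
def inner_product_unit(z, y, N, py):
    # Pattern index stage: for bit position j, a multiplexer selects the
    # pattern vector indexed by the same-position bits of y_0..y_(N-1).
    # Weighted synthesis stage: the serial full adders accumulate the
    # selected patterns from the lowest position to the highest, which
    # is modeled here by the 2**j weights.
    acc = 0
    for j in range(py):
        index = sum(((y[r] >> j) & 1) << r for r in range(N))
        acc += z[index] << j
    return acc

# Pattern vectors: all addition combinations of x_0..x_3 (cf. the conversion unit).
x, y, N, py = [13, 7, 10, 6], [9, 3, 12, 5], 4, 4
z = [sum(x[r] for r in range(N) if (c >> r) & 1) for c in range(1 << N)]
assert inner_product_unit(z, y, N, py) == sum(a * b for a, b in zip(x, y))
```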
  • the combining unit 309 is used for performing the sum calculation 405 as shown in FIG. 4 .
  • The combining unit 309 receives the unit accumulation sequences from the inner product units 308; each unit accumulation sequence is like the intermediate results 401, 402, 403, and 404 in FIG. 4. These intermediate results have already been aligned in the inner product units 308, and the combining unit 309 sums the aligned unit accumulation sequences to obtain the inner product result of the first vector x and the second vector y.
  • FIG. 8 shows a schematic diagram of the synthesis unit 309 of this embodiment.
  • the synthesis unit 309 in the figure exemplarily receives the outputs of 8 inner product units 308, that is, the unit accumulation sequences 801 to 808. These unit accumulation sequences 801 to 808 are the intermediate results obtained after the first vector x and the second vector y are split into 8 groups of data segments whose inner product calculations are handed to the eight inner product units 308 respectively.
  • the synthesis unit 309 includes seven full adder groups 809 to 815. Since the lowest-bit operation 816 and the highest-bit operation 817 each involve only one intermediate result, they do not need an adder group, as shown in FIG. 8.
  • FIG. 9 shows a schematic diagram of the full adder groups 810 to 815.
  • each of the full adder groups 810 to 815 includes a first full adder 901 and a second full adder 902, and the first full adder 901 and the second full adder 902 include multiplexers 903 and 904 respectively.
  • the input of the multiplexer 903 is connected to the adder's carry output and the constant 0, and the input of the multiplexer 904 is connected to the adder's carry output and the constant 1; the constants 0 and 1 respectively simulate the cases where the sum of the previous digit's intermediate results produces no carry and produces a carry.
  • accordingly, the first full adder 901 generates the sum of the intermediate results assuming no carry from the previous digit, and the second full adder 902 generates the sum of the intermediate results assuming a carry from the previous digit.
  • such a structure avoids waiting for the intermediate result of the previous digit before deciding the carry.
  • the design of synchronously calculating the non-carry and carry can reduce the operation delay time.
  • each of the full adder groups 810 to 815 also includes a multiplexer 905; the two sums of intermediate results are input to the multiplexer 905, which selects and outputs either the carried sum of intermediate results or the uncarried sum of intermediate results according to whether the calculation result of the previous digit produces a carry.
  • the accumulated output 818 is the inner product result of the first vector x and the second vector y.
  • the next-lowest full adder group 809 only includes the first full adder 901, which directly generates the sum without a carry-in, so the second full adder 902 and the multiplexer 905 are not needed.
  • more generally, when the first vector x and the second vector y are split into M data segments, M-1 full adder groups are configured, including M-1 first full adders 901, M-2 second full adders 902, and M-2 multiplexers 905.
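The behavior of such a full adder group is that of a carry-select stage: both candidate sums are computed in parallel, and the actual carry from the previous digit later selects one. A minimal word-level sketch follows (names are illustrative, and the real hardware operates bit-serially):

```python
def full_adder_group(a, b, width):
    # Carry-select stage: add the two intermediate results both ways
    # (carry-in 0 via the first full adder, carry-in 1 via the second),
    # and return a selector that picks the right (sum, carry-out) once
    # the previous stage's carry is known (the multiplexer 905).
    mask = (1 << width) - 1
    s0, s1 = a + b, a + b + 1            # speculative sums, computed in parallel
    pair0 = (s0 & mask, s0 >> width)     # (sum, carry-out) if carry-in is 0
    pair1 = (s1 & mask, s1 >> width)     # (sum, carry-out) if carry-in is 1
    return lambda carry_in: pair1 if carry_in else pair0

mux = full_adder_group(0b1011, 0b0110, width=4)
assert mux(0) == (0b0001, 1)             # 11 + 6     = 17 -> sum 1, carry 1
assert mux(1) == (0b0010, 1)             # 11 + 6 + 1 = 18 -> sum 2, carry 1
```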
  • the synthesis unit 309 can flexibly enable or disable the full adder groups. For example, when the first vector x and the second vector y are split into fewer data segments, a specific number of full adder groups can be properly disabled, so as to flexibly support various possible split numbers and expand the application scenarios of the combining unit 309.
  • after the inner product result of the first vector x and the second vector y is obtained in the synthesis unit 309, the processing unit memory proxy unit 305 receives the inner product result and sends it to the kernel memory agent 301.
  • the kernel memory agent 301 integrates the inner product results of all processing units 304 to generate the calculation result, which is sent to the off-chip memory 203 to complete the product operation of the first operand and the second operand.
  • the computing device 201 of this embodiment performs different numbers of inner product operations according to the length of the operands. Further, the processing array 303 can share the indices among the processing units 304 in the vertical direction and share the pattern vectors among the processing units 304 in the horizontal direction, so as to operate efficiently.
  • this embodiment adopts a two-level architecture, that is, a core memory agent 301 and a processing component memory agent unit 305 .
  • the starting address of the operand in the LLC is recorded in the kernel memory agent 301, and the kernel memory agent 301 simultaneously, continuously, and serially reads multiple operands from the LLC by self-increasing the address.
  • the source address is self-increasing, so the order of data blocks is deterministic.
  • the core controller 302 determines which processing elements 304 receive the data blocks, and the processing element control unit 306 then determines which inner product units 308 receive the data blocks.
  • FIG. 10 shows a flowchart of this embodiment.
  • in step 1001, a plurality of operands are read from the off-chip memory.
  • the starting address of the operands is set in the kernel memory agent, and the kernel memory agent reads multiple operands simultaneously, continuously, and serially through self-increasing addresses.
  • the fetching method is to read from the lower bits of these operands to the higher bits one by one.
  • in step 1002, the multiple operands are split into multiple vectors, and the multiple vectors include a first vector and a second vector.
  • the core controller splits each operand into multiple data segments, that is, multiple vectors, so that the core memory agent sends data segments to the processing array.
  • in step 1003, the inner product of the first vector and the second vector is computed according to their lengths to obtain an inner product result.
  • the processing array includes a plurality of processing units arranged in an array, and each processing unit inner-products the first vector and the second vector according to the length of the first vector and the length of the second vector to obtain an inner product result.
  • the pattern generation stage is executed first, then the pattern index stage is executed, and finally the weighted synthesis stage is executed.
  • the inner product is likewise converted into x^T·K·B_col·C, where K is a fixed binary matrix with a size of N×2^N, B_col is a binary matrix with a size of 2^N×p_y, and C is a weighting vector of length p_y.
  • the implementations of K, B_col, and C are the same as in the aforementioned embodiment.
  • This embodiment disassembles the second vector y in the above-mentioned way so that each element of y can be represented by the two binary matrices K and B_col.
  • in other words, this embodiment converts the inner product operation x^T·y into the operation x^T·K·B_col·C.
  • in the pattern generation stage, this embodiment obtains all possibilities of x^T·K, generating the pattern vectors; during the pattern index stage, this embodiment computes Z·B_col; in the weighted synthesis stage, the indexed patterns are accumulated according to the weights C. Such a design enables operands, no matter how high their precision, to be converted into an index mode to perform inner products, reducing repeated calculation and avoiding the high bandwidth requirements of arbitrary-precision calculation.
  • FIG. 11 further shows a flowchart of the inner product of the first vector and the second vector.
  • in step 1101, a plurality of pattern vectors are generated according to the length and bit width of the first vector.
  • corresponding to the length N of the first vector x, N data vectors are received.
  • corresponding to the 2^N unit vectors of K, each unit vector is simulated in hardware to generate the 2^N pattern vectors respectively. Since an inner product in binary is in fact an addition of individual bits, the generating component of this embodiment directly simulates all unit vectors in K and adds the bits of the data vectors of the first vector x in sequence.
  • in each cycle, the same-position bits of the first vector's data vectors are input simultaneously: for example, the lowest bits of the data vectors are input simultaneously in the first cycle, the next-lowest bits simultaneously in the second cycle, and so on until the highest bits of the data vectors are input simultaneously in the p_x-th cycle.
  • in this way, the required bandwidth is only N bits per cycle.
  • the pattern vectors are all possible addition combinations of the data vectors of the first vector x.
  • in step 1102, the data vectors along the length direction of the second vector y are used as indices, and specific pattern vectors among the plurality of pattern vectors are accumulated to form a plurality of unit accumulation sequences.
  • This step implements the pattern indexing phase and the weighted synthesis phase.
  • with the data vectors along the length direction of the second vector y as indices, the corresponding specific pattern vector is selected from all pattern vectors according to each index; these specific pattern vectors are accumulated, generating a one-bit intermediate result in each cycle and forming a unit accumulation sequence over p_x or p_x+1 consecutive cycles. These operations implement Z·B_col together with the weighting by C.
  • more specifically, the same-position data vectors along the length direction of y let specific pattern vectors among all pattern vectors pass. Since the length of y is N, y can be disassembled into N data vectors, each with bit width p_y, so viewed bit position by bit position these data vectors can be regrouped into p_y same-position data vectors.
  • each unit accumulation sequence is like the intermediate results 401 , 402 , 403 and 404 in FIG. 4 , and these intermediate results have been aligned.
  • in step 1103, the plurality of unit accumulation sequences are summed to obtain an inner product result.
  • in this embodiment, the first vector x and the second vector y are split into multiple data segments, and the inner product calculations are performed respectively to obtain the intermediate results. Since the lowest-bit operation and the highest-bit operation each involve only one intermediate result, they need no addition; for example, x_0·y_0 (lowest position) and x_7·y_3 (highest position) in Figure 4 need not be added to other intermediate results and are output directly. In other words, only the positions from the second-lowest to the second-highest need to perform the addition operation.
  • This embodiment adopts the design of synchronously calculating the non-carry and carry to reduce the operation delay time.
  • the sums of the intermediate results without carry and with carry are obtained at the same time, and then the carried sum of intermediate results or the uncarried sum of intermediate results is selected according to whether the calculation result of the previous digit produces a carry.
  • the accumulated output is the inner product result of the first vector x and the second vector y.
  • in step 1004, the inner product results are integrated into the calculation result of the multiple operands.
  • the core controller integrates or reduces the inner product results into the calculation result of the multiple operands and sends it to the kernel memory agent.
  • in step 1005, the calculation result is stored in the off-chip memory.
  • the kernel memory agent sends calculation results in parallel, first sending the lowest bits of these calculation results at the same time, and then sending the second-lowest bits of these calculation results at the same time, and in this way until the highest bits of these calculation results are sent at the same time.
  • Another embodiment of the present invention is a computer-readable storage medium, on which is stored computer program code for calculation with arbitrary precision.
  • when the computer program code is run by a processing device, the method shown in FIG. 10 or FIG. 11 is executed.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer readable memory.
  • when the solution of the present invention is embodied in the form of a software product (such as a computer-readable storage medium), the software product can be stored in a memory and may include several instructions to make a computer device (such as a personal computer, a server, or a network device) execute some or all of the steps of the methods described in the embodiments of the present invention.
  • the aforementioned memory or medium storing the program code may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
  • the present invention proposes a novel architecture to efficiently handle arbitrary-precision calculations. No matter how high the precision of the operands, the present invention can disassemble the operands and use indices to process fixed-length bit streams in parallel, avoiding bit-level redundancy such as sparsity or repeated calculation; without configuring high-bit-width hardware, it achieves the effect of flexible application and large-bit-width calculation.
  • the electronic equipment or device of the present invention may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, Internet of Things terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • Said means of transport include airplanes, ships, and/or vehicles;
  • said household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • said medical equipment includes nuclear magnetic resonance instruments, ultrasound scanners, and/or electrocardiographs.
  • the electronic equipment or device of the present invention can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical treatment. Further, the electronic device or device of the present invention can also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as cloud, edge, and terminal.
  • electronic devices or devices with high computing power according to the solution of the present invention can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or Edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, completing unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-end integration.
  • the present invention expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present invention is not limited by the order of the described actions. Therefore, according to the disclosure or teaching of the present invention, those skilled in the art can understand that some of the steps may be performed in another order or at the same time. Further, those skilled in the art can understand that the embodiments described in the present invention can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily indispensable for realizing one or some solutions of the present invention. In addition, according to different schemes, the descriptions of some embodiments of the present invention have different emphases. In view of this, for the parts not described in detail in a certain embodiment of the present invention, reference may be made to the relevant descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present invention.
  • multiple units in this embodiment of the present invention may be integrated into one unit, or each unit exists physically independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which can be, for example, a variable resistance memory (Resistive Random Access Memory, RRAM), dynamic Random Access Memory (Dynamic Random Access Memory, DRAM), Static Random Access Memory (Static Random Access Memory, SRAM), Enhanced Dynamic Random Access Memory (Enhanced Dynamic Random Access Memory, EDRAM), High Bandwidth Memory (High Bandwidth Memory , HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM and RAM, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention relates to an arbitrary-precision computing device and method, and a computer readable storage medium. A core memory agent reads a plurality of operands from an off-chip memory; a core controller splits the plurality of operands into a plurality of vectors; a processing array comprises a plurality of processing components; each processing component performs inner product on a first vector and a second vector according to the lengths of the first vector and the second vector so as to obtain inner product results; the core controller integrates the inner product results into calculation results of the plurality of operands; the core memory agent stores the calculation results in the off-chip memory.

Description

Inner Product Processing Component, Arbitrary-Precision Computing Device and Method, and Readable Storage Medium

Cross-Reference to Related Applications

This disclosure claims priority to Chinese patent application No. 202111221317.4, filed on October 20, 2021 and entitled "Inner Product Processing Component, Arbitrary-Precision Computing Device, Method, and Readable Storage Medium".

Technical Field

The present invention relates generally to the field of computers, and more specifically to an inner product processing component, an arbitrary-precision computing device, a method, and a readable storage medium.

Background Art

Arbitrary-precision computation represents operands with an arbitrary number of bits. It is crucial in many technical fields, such as supernova simulation, climate simulation, atomic simulation, artificial intelligence, and planetary orbit computation. These fields need to process data of hundreds, thousands, or even millions of bits, a range of bit widths far beyond the hardware capabilities of traditional processors.

Even prior-art processors with high bit widths cannot handle the variable lengths required by arbitrary-precision operations, because the optimal bit width varies widely between algorithms, and small differences in bit width can lead to significant differences in cost. The prior art has also proposed many techniques for improving computational efficiency at the architecture level, mainly effectual-only computation and approximate computation: the former performs only the essential computations, skipping or eliminating ineffectual ones such as those on sparse or duplicated data, while the latter replaces the original accurate data with less accurate data, such as low-bit-width or quantized data. However, for effectual-only computation, finding duplicated data is difficult and expensive; and approximate computation plainly contradicts the purpose of arbitrary-precision computation, which requires exact computation to achieve high precision. Finally, these prior-art techniques all inevitably lead to a large number of inefficient memory accesses.

Therefore, an efficient arbitrary-precision computation scheme is urgently needed.
Summary of the Invention

In order to at least partly solve the technical problems mentioned in the background art, the present invention provides an inner product processing component, an arbitrary-precision computing device, a method, and a readable storage medium.

In one aspect, the present invention discloses a processing component for computing the inner product of a first vector and a second vector, comprising a conversion unit, a plurality of inner product units, and a synthesis unit. The conversion unit generates a plurality of pattern vectors according to the length and bit width of the first vector. Each inner product unit uses the data vectors along the length direction of the second vector as indices to accumulate specific pattern vectors among the plurality of pattern vectors, forming a unit accumulation sequence. The synthesis unit sums the plurality of unit accumulation sequences to obtain the inner product result.

In another aspect, the present invention discloses an arbitrary-precision computing accelerator connected to an off-chip memory, comprising a core memory agent, a core controller, and a processing array. The core memory agent reads a plurality of operands from the off-chip memory. The core controller splits the plurality of operands into a plurality of vectors, including a first vector and a second vector. The processing array comprises a plurality of processing components, each of which computes the inner product of the first vector and the second vector according to their lengths to obtain an inner product result. The core controller integrates the inner product results into the calculation results of the plurality of operands, and the core memory agent stores the calculation results in the off-chip memory.

In another aspect, the present invention discloses an integrated circuit device comprising the aforementioned arbitrary-precision computing accelerator, a processing device, and an off-chip memory. The processing device controls the arbitrary-precision computing accelerator, and the off-chip memory includes an LLC, through which the accelerator and the processing device communicate.

In another aspect, the present invention discloses a board card comprising the aforementioned integrated circuit device.

In another aspect, the present invention discloses a method for computing the inner product of a first vector and a second vector, comprising: generating a plurality of pattern vectors according to the length and bit width of the first vector; using the data vectors along the length direction of the second vector as indices, accumulating specific pattern vectors among the plurality of pattern vectors to form a plurality of unit accumulation sequences; and summing the plurality of unit accumulation sequences to obtain the inner product result.

In another aspect, the present invention discloses an arbitrary-precision computation method, comprising: reading a plurality of operands from an off-chip memory; splitting the plurality of operands into a plurality of vectors, including a first vector and a second vector; computing the inner product of the first vector and the second vector according to their lengths to obtain an inner product result; integrating the inner product results into the calculation results of the plurality of operands; and storing the calculation results in the off-chip memory.

In another aspect, the present invention discloses a computer-readable storage medium storing computer program code for arbitrary-precision computation; when the computer program code is run by a processing device, the aforementioned methods are executed.

The present invention proposes an arbitrary-precision computation scheme that processes different bit streams in parallel and deploys a complete bit-serial data path to perform high-precision computation flexibly and elastically. The invention makes full use of a simple hardware configuration and reduces repeated computation, thereby realizing arbitrary-precision computation with low energy consumption.

Brief Description of the Drawings

The above and other objects, features, and advantages of exemplary embodiments of the present invention will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present invention are shown by way of illustration and not limitation, and the same or corresponding reference numerals indicate the same or corresponding parts. In the drawings:
FIG. 1 is a structural diagram showing a board card according to an embodiment of the present invention;

[Corrected 30.08.2022 under Rule 91] FIGS. 2A to 2C are structural diagrams showing an integrated circuit device according to an embodiment of the present invention;

FIG. 3 is a schematic diagram showing the internal structure of a computing device according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating an exemplary multiplication operation;

FIG. 5 is a schematic diagram showing a conversion unit according to an embodiment of the present invention;

FIG. 6 is a schematic diagram showing a generation unit according to an embodiment of the present invention;

FIG. 7 is a schematic diagram showing an inner product unit according to an embodiment of the present invention;

FIG. 8 is a schematic diagram showing a synthesis unit according to an embodiment of the present invention;

FIG. 9 is a schematic diagram showing a full adder group according to an embodiment of the present invention;

FIG. 10 is a flowchart illustrating arbitrary-precision computation according to another embodiment of the present invention; and

FIG. 11 is a flowchart illustrating the inner product of a first vector and a second vector according to another embodiment of the present invention.
Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.

It should be understood that the terms "first", "second", "third", and "fourth" in the claims, description, and drawings of the present invention are used to distinguish different objects rather than to describe a specific order. The terms "comprise" and "include" used in the description and claims of the present invention indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terms used in the description of the present invention are for the purpose of describing specific embodiments only and are not intended to limit the present invention. As used in the description and claims of the present invention, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the description and claims of the present invention refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.

As used in this description and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context.

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Arbitrary-precision computation plays a key role in many fields of science and technology. For example, solving the seemingly trivial equation x^3 + y^3 + z^3 = 3 by computer requires more than 200 digits of precision; in Ising theory, computing integrals requires more than 1000 digits of precision; and computing the volume of a knot complement in hyperbolic space involves up to 60,000 digits of precision. A tiny precision error may lead to a huge difference in the computed result, so arbitrary-precision computation is a very serious technical topic in the computer field.
The present invention proposes an efficient arbitrary-precision computing accelerator architecture. It mainly follows the computational form of the inner product operation and exploits the intra-parallelism and inter-parallelism of the accelerator architecture to realize the multiplication of operands.
FIG. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present invention. As shown in FIG. 1, the board card 10 includes a chip 101, which is a system on chip (SoC) integrating one or more combined processing devices. A combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in cloud intelligence, where a notable feature is the large amount of input data, which places high demands on the storage capacity and computing power of the platform. The board card 10 of this embodiment is suitable for cloud intelligence applications and provides large off-chip storage, large on-chip storage, and powerful computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a wifi interface. Data to be processed can be transferred from the external device 103 to the chip 101 through the external interface device 102, and the calculation results of the chip 101 can be transferred back to the external device 103 via the external interface device 102. Depending on the application scenario, the external interface device 102 may take different interface forms, such as a PCIe interface.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to the control device 106 and the chip 101 through a bus for data transfer. The control device 106 on the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a micro controller unit (MCU).
FIG. 2 is a structural diagram showing the combined processing device in the chip 101 of this embodiment. As shown in FIG. 2, the combined processing device includes a computing device 201, a processing device 202, an off-chip memory 203, a communication node 204, and an interface device 205. In this embodiment, several integration schemes can be used to coordinate the work of the computing device 201, the processing device 202, and the off-chip memory 203: FIG. 2A shows an LLC integration scheme, FIG. 2B shows an SoC integration scheme, and FIG. 2C shows an IO integration scheme.
The computing device 201 is configured to execute user-specified operations and is mainly implemented as a multi-core intelligent processor for performing deep learning or machine learning computations; it can interact with the processing device 202 to jointly complete user-specified operations. The computing device 201 contains the aforementioned arbitrary-precision computing accelerator for processing linear computation, more specifically the operand multiplications that arise in operations such as convolution.
The processing device 202, as a general-purpose processor, performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201, and nonlinear computation. Depending on the implementation, the processing device 202 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number can be determined according to actual needs. When the computing device 201 and the processing device 202 are considered together, they are regarded as forming a heterogeneous multi-core structure.
The off-chip memory 203 stores data to be processed and processed data. Ordered by access latency from low to high, its hierarchy can be divided into the level-1 cache (L1), the level-2 cache (L2), the level-3 cache (L3, also called the LLC), and physical memory. The physical memory is DDR, usually 16 GB or larger. When the computing device 201 or the processing device 202 wants to read data from the off-chip memory 203, since L1 is the fastest, L1 is usually accessed first; if the data is not in L1, L2 is accessed next; if the data is not in L2 either, L3 is accessed; and if the data is still not in L3, DDR is finally accessed. The cache hierarchy of the off-chip memory 203 speeds up data access by keeping the most frequently accessed data in the caches. Compared with the caches, DDR is quite slow: as the level increases (L1→L2→LLC→DDR), the access latency grows, but so does the storage capacity.
The communication node 204 is a routing node or router in a network-on-chip (NoC). When the computing device 201 or the processing device 202 generates a data packet, the packet is sent to the communication node 204 through a specific interface; the communication node 204 reads the address information in the head flit of the packet and uses a specific routing algorithm to compute the best routing path, thereby establishing a reliable transmission path to deliver the packet to the destination node (for example, the off-chip memory 203). Likewise, when the computing device 201 or the processing device 202 needs to read a data packet from the off-chip memory 203, the communication node 204 also computes the best routing path and sends the packet from the off-chip memory 203 to the computing device 201 or the processing device 202.
The interface device 205 is the external input/output interface of the combined processing device. When the combined processing device exchanges information with external equipment, since external equipment varies widely and each type has different requirements on the transmitted information, the interface device 205, according to the requirements of the sender and receiver of the data transfer, sets up data buffering to resolve the mismatch caused by their speed difference, performs signal level conversion, sets up information conversion logic to satisfy the format requirements of each side, sets up timing control circuits to synchronize the work of the sender and receiver, and provides address transcoding and other tasks.
The LLC integration of FIG. 2A means that the computing device 201 and the processing device 202 communicate through the LLC; the SoC integration of FIG. 2B integrates the computing device 201, the processing device 202, and the off-chip memory 203 through the communication node 204; and the IO integration of FIG. 2C integrates the computing device 201, the processing device 202, and the off-chip memory 203 through the interface device 205. These three integration schemes are only examples, and the present invention does not limit the manner of integration.
This embodiment preferably adopts the LLC integration scheme. Since the core of deep learning and machine learning is the convolution operator, the basis of the convolution operator is the inner product operation, and the inner product operation is composed of multiplications and additions, the main task of the computing device 201 is a large number of low-level operations such as multiplications and additions. When performing training and inference of neural network models, the computing device 201 and the processing device 202 need intensive interaction; integrating them via the LLC and sharing data through the LLC achieves a low interaction cost. Furthermore, since high-precision data may have millions of bits while the capacities of L1 and L2 are limited, interacting through L1 and L2 would lead to insufficient capacity. The computing device 201 uses the relatively large capacity of the LLC to cache high-precision data, saving the time of repeated accesses.
FIG. 3 shows a schematic diagram of the internal structure of the computing device 201, which includes a core memory agent 301, a core controller 302, and a processing array 303.
The core memory agent 301 serves as the management endpoint through which the computing device 201 accesses the off-chip memory 203. When the core memory agent 301 reads operands from the off-chip memory 203, the starting addresses of the operands are set in the core memory agent 301, which reads multiple operands simultaneously, continuously, and serially by auto-incrementing the address. The reading proceeds in one pass from the low bits of the operands to the high bits. For example, when three operands need to be read, the lowest 512 bits of the first operand are read serially according to its starting address, then the lowest 512 bits of the second operand, then the lowest 512 bits of the third operand; after the lowest bits have been read, the address is auto-incremented (by 512 bits) and the next-lowest 512 bits of each operand are read serially, and so on until the highest bits of the three operands have been read. When the core memory agent 301 stores calculation results back to the off-chip memory 203, it sends them in parallel: for example, if the core memory agent 301 needs to send three calculation results to the off-chip memory 203, it sends the lowest bits of the three results simultaneously, then their next-lowest bits simultaneously, and so on until the highest bits of the three results have been sent simultaneously. Generally, these operands are represented in the form of matrices or vectors.
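For illustration only, the following Python sketch reproduces the fetch order just described; the 512-bit chunk size is taken from the example above, while the function and variable names are hypothetical and not part of the patented design:

    CHUNK_BITS = 512

    def read_order(operand_bit_lengths):
        """Yield (operand_index, chunk_index) pairs in the order the core
        memory agent would fetch them: chunk 0 of every operand first,
        then chunk 1 of every operand, and so on, from low bits to high."""
        max_chunks = max((n + CHUNK_BITS - 1) // CHUNK_BITS
                         for n in operand_bit_lengths)
        for chunk in range(max_chunks):
            for op in range(len(operand_bit_lengths)):
                if chunk * CHUNK_BITS < operand_bit_lengths[op]:
                    yield (op, chunk)

    # Three operands of 1536 bits each -> nine fetches, lowest 512 bits first:
    # [(0,0), (1,0), (2,0), (0,1), (1,1), (2,1), (0,2), (1,2), (2,2)]
    print(list(read_order([1536, 1536, 1536])))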
Based on the computing power and number of the processing components in the processing array 303, the core controller 302 controls the splitting of each operand into multiple data segments, that is, multiple vectors, so that the core memory agent 301 sends data to the processing array 303 in units of data segments.
The processing array 303 performs the multiplication of two operands. For example, a first operand can be split into eight data segments x0 to x7, and a second operand into four data segments y0 to y3; when the first operand is multiplied by the second operand, the computation expands as shown in FIG. 4. The processing array 303 splits the first and second operands, performs the inner product computations separately, and then shifts, aligns, and sums the intermediate results 401, 402, 403, and 404 to obtain the result of the multiplication.
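As a numerical illustration of the FIG. 4 expansion, the following minimal Python sketch splits two integers into fixed-width segments, forms the partial products, and shift-aligns and sums them; the segment width and operand values are illustrative assumptions, not the patented hardware:

    W = 8                      # assumed segment width in bits

    def split(value, n_segments):
        mask = (1 << W) - 1
        return [(value >> (W * i)) & mask for i in range(n_segments)]

    x, y = 0x1234567890ABCDEF, 0xCAFEBABE
    xs, ys = split(x, 8), split(y, 4)          # x0..x7 and y0..y3

    result = 0
    for j, yj in enumerate(ys):                # one "row" per y segment
        row = sum(xi * yj << (W * i) for i, xi in enumerate(xs))
        result += row << (W * j)               # shift-align, then accumulate

    assert result == x * y                     # matches the direct product

Each row corresponds to one of the intermediate results 401 to 404, and the final accumulation corresponds to the shift-align-sum step described above.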
For clarity of the technical solution, the above data segments are hereinafter uniformly represented as vectors; multiplying two data segments is the inner product of two vectors (a first vector and a second vector), where the first vector comes from the first operand and the second vector comes from the second operand.
The processing array 303 includes multiple processing components 304 arranged in an array; the figure shows 4×8 processing components 304 as an example, and the present invention does not limit their number. Each processing component 304 computes the inner product of the first vector and the second vector according to their lengths to obtain an inner product result. Finally, the core controller 302 controls the integration or reduction of the inner product results into the calculation results of the operands, which are sent to the core memory agent 301, and the core memory agent 301 stores the calculation results in the off-chip memory 203.
Specifically, the computing device 201 adopts recursive decomposition in its control. When the computing device 201 receives an instruction from the processing device 202 to perform an arbitrary-precision computation, the core controller 302 evenly splits the operands of the multiplication into multiple vectors and sends them to the processing array 303 for computation; each processing component 304 is responsible for the computation of one group of vectors, such as the inner product of a first vector and a second vector. In this embodiment, each processing component 304 further splits a group of vectors into smaller inner product computation units based on its own hardware resources, to facilitate the inner product computation. The computing device 201 adopts multiple bit streams on the data path: each operand is streamed from the core memory agent 301 to the processing components 304 at one bit per cycle, but multiple operands are transmitted in parallel at the same time. After the computation is finished, the processing components 304 send the inner product results to the core memory agent 301 in a bit-serial manner.
As the core computation unit of the computing device 201, the main task of the processing component 304 is inner product computation. The processing component 304 processes the bit-indexed vector inner product in three stages: the first stage is pattern generation, the second stage is pattern indexing, and the third stage is weighted synthesis.
Take the inner product of a first vector x and a second vector y as an example. Assume the sizes of x and y are N×p_x and N×p_y respectively, where N is the length of both vectors (more precisely, the number of row elements), p_x is the bit width of x, and p_y is the bit width of y. In this embodiment, to compute the inner product of x and y, x is first transposed and then multiplied with y, i.e. (p_x×N)·(N×p_y), to generate an inner product result of size p_x×p_y.
This embodiment decomposes the second vector y as:

    y = K · B_col · C

where K is a fixed binary matrix of size N×2^N, B_col is a binary matrix of size 2^N×p_y, and C is a p_y weighting vector.
The elements along the length direction of the first vector x can be arranged in 2^N patterns. Taking N as 2, i.e. the length of x is 2, K is divided into 2^N unit vectors according to the length of x, enumerating all possible unit vectors of length 2. K is therefore a binary matrix of size 2×2^2, covering all possibilities of combinations of two elements; there are four such length-2 combinations, (0,0), (1,0), (0,1), and (1,1), so K takes a fixed form (one enumeration order shown here):

    K = | 0 1 0 1 |
        | 0 0 1 1 |

In other words, once the lengths of the first vector x and the second vector y are determined, the size and element values of K are determined.
B_col is a one-hot matrix: each column has exactly one element equal to 1 and the rest equal to 0, and which element is 1 depends on which column of K the corresponding column of the second vector y matches. For ease of illustration, example values of the first vector x and the second vector y are set (the concrete values appear in the figures of the original application).

Comparing the second vector y with K, the first column of y is the fourth column of K, the second column of y is the third column of K, the third column of y is the fourth column of K, and the fourth column of y is the first column of K. Hence, when y is expressed as K·B_col, B_col is the following index matrix of size 2^2×4:

    B_col = | 0 0 0 1 |
            | 0 0 0 0 |
            | 0 1 0 0 |
            | 1 0 1 0 |

In the first column of B_col only the fourth element is 1, indicating that the first column of y is the fourth column of K; in the second column of B_col only the third element is 1, indicating that the second column of y is the third column of K; in the third column of B_col only the fourth element is 1, indicating that the third column of y is the fourth column of K; and in the fourth column of B_col only the first element is 1, indicating that the fourth column of y is the first column of K. In summary, once K is determined, the element values of B_col are determined as well.
C is the p_y weighting vector, reflecting the powers, i.e. the bit width, of the second vector y. Since p_y is 4, the powers of y go up to 4, so C is:

    C = (2^0, 2^1, 2^2, 2^3)^T
This embodiment decomposes the second vector y in the above manner, so that each element of y can be represented by the two binary matrices K and B_col. In other words, this embodiment converts the inner product operation x^T·y into the operation x^T·K·B_col·C.
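The decomposition can be checked numerically. The following NumPy sketch, for N = 2 and p_y = 4, uses the enumeration order of K shown above and illustrative values for x and y (the original application's concrete example values are not reproduced here); it derives B_col from the bit columns of y and verifies both the reconstruction of y and the converted inner product:

    import numpy as np

    K = np.array([[0, 1, 0, 1],
                  [0, 0, 1, 1]])               # all 2^N length-2 bit columns

    y = np.array([3, 2])                        # example pair of 4-bit values
    # Bit matrix of y: column j holds bit j of every element of y.
    bits = np.array([[(v >> j) & 1 for j in range(4)] for v in y])

    # B_col is one-hot: column j selects the column of K equal to bit column j.
    B_col = np.zeros((4, 4), dtype=int)
    for j in range(4):
        col = bits[:, j]
        idx = next(i for i in range(4) if np.array_equal(K[:, i], col))
        B_col[idx, j] = 1

    C = np.array([1, 2, 4, 8])                  # p_y weighting vector (powers of 2)
    assert np.array_equal(K @ B_col @ C, y)     # reconstruction of y holds

    x = np.array([5, 9])                        # first vector (values illustrative)
    assert x @ K @ B_col @ C == x @ y           # inner product via the decomposition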
The processing component 304 realizes the vector inner product x^T·K·B_col·C based on the above conversion. In the pattern generation stage, the processing component 304 obtains all possibilities of x^T·K, i.e. generates the pattern vectors z. In the pattern indexing stage, the processing component 304 computes (x^T·K)·B_col. In the weighted synthesis stage, the processing component 304 accumulates the indexed patterns according to the weights C. With this design, operands of any precision can be converted into indexed patterns for the inner product, reducing repeated computation and avoiding the high bandwidth requirements of arbitrary-precision computation.
FIG. 3 further shows a schematic structural diagram of the processing component 304. To realize the aforementioned three stages, the processing component 304 includes a processing component memory agent unit 305, a processing component control unit 306, a conversion unit 307, multiple inner product units 308, and a synthesis unit 309.
The processing component memory agent unit 305 serves as the interface through which the processing component 304 accesses the core memory agent 301, and receives the two vectors whose inner product is to be computed, for example the aforementioned first vector x and second vector y.
The processing component control unit 306 coordinates and manages the work of the units in the processing component 304.
The conversion unit 307 implements the pattern generation stage. It receives the first vector x from the processing component memory agent unit 305 and realizes the binary matrix K in hardware, executing x^T·K to generate the multiple pattern vectors z. FIG. 5 shows a schematic diagram of the conversion unit 307, which includes N bit stream inputs 501, a generation assembly 502, and 2^N bit stream outputs 503.
The N bit stream inputs 501 correspond to the length N of the first vector x and respectively receive N data vectors. FIG. 5 illustrates the case where the length of x is 4: x comprises four data vectors x0, x1, x2, x3, each of bit width p_x, that is, each data vector has p_x bits.
The generation assembly 502 is the core element that executes x^T·K. Corresponding to the 2^N unit vectors of K, the generation assembly 502 includes 2^N generation units, each of which simulates one unit vector, so as to generate the 2^N pattern vectors z. As shown in FIG. 5, the first vector x is split into the four data vectors x0, x1, x2, x3, which are input in parallel from the left side of the generation assembly 502. Since the inner product in binary is in fact the addition of the individual bits, the generation assembly 502 directly simulates all the unit vectors of K in hardware and adds the bits of x0, x1, x2, x3 in sequence. In more detail, in each cycle the same-position bits of x0, x1, x2, x3 are input simultaneously: for example, in the first cycle the lowest bits of x0, x1, x2, x3 are input simultaneously, in the second cycle their next-lowest bits, and so on until the highest bits of x0, x1, x2, x3 are input simultaneously in the p_x-th cycle. The required bandwidth is only N bits per cycle; in this example, only 4 bits per cycle.
In the case where the length of the first vector x is 4, the generation assembly 502 includes 16 generation units, which respectively simulate the 16 unit vectors of K: (0000), (0001), (0010), (0011), (0100), (0101), (0110), (0111), (1000), (1001), (1010), (1011), (1100), (1101), (1110), and (1111).
FIG. 6 shows a schematic diagram of the generation unit 504, whose unit vector is (1011). Taking the generation unit 504 as an example, it simulates the unit vector (1011), so it includes three element registers 601, an adder 602, and a carry register 603. The three element registers 601 receive and buffer the bit values of the data vectors corresponding to the simulated unit vector, namely the bit values of x0, x1, and x3, while the bit values of x2 are simply ignored. This structure realizes:

    z11 = x0 + x1 + x3
The values in the registers 601 are sent to the adder 602 for accumulation. If a carry occurs after the accumulation, the carry value is buffered in the carry register 603 and added to the bit values of x0, x1, x3 input in the next cycle, until the highest bits of x0, x1, x3 are added in the p_x-th cycle. Every generation unit is designed according to the same technical logic; based on the structure of the generation unit 504 realizing the unit vector (1011) in FIG. 6, those skilled in the art can easily derive the structures of the other generation units without creative effort, so they are not described again. Note in particular that some generation units do not need the adder 602 and the carry register 603, for example the generation units simulating the unit vectors (0000), (0001), (0010), (0100), and (1000): these units have only one input in a given cycle, so no addition occurs and no carry can arise.
Returning to FIG. 5, the 2^N bit stream outputs 503 are respectively connected to the outputs of the adders 602 of the generation units, to output the 2^N pattern vectors z. In FIG. 5, since N is 4, the 16 bit stream outputs 503 output 16 pattern vectors in total. The bit width of these pattern vectors may be p_x (if the addition of the highest bits produces no carry) or p_x+1 (if the addition of the highest bits produces a carry). As can be seen from FIG. 5, the pattern vectors are all possible addition combinations of x0, x1, x2, x3, namely:
z0 = 0
z1 = x0
z2 = x1
z3 = x0 + x1
z4 = x2
z5 = x0 + x2
z6 = x1 + x2
z7 = x0 + x1 + x2
z8 = x3
z9 = x0 + x3
z10 = x1 + x3
z11 = x0 + x1 + x3
z12 = x2 + x3
z13 = x0 + x2 + x3
z14 = x1 + x2 + x3
z15 = x0 + x1 + x2 + x3
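Functionally, the pattern vectors are exactly the subset sums of the data vectors, as the following Python sketch illustrates; the values of x are illustrative, and the indexing is chosen so that bit i of the index k selects xi, matching the list above:

    x = [5, 9, 2, 7]                      # x0, x1, x2, x3 (illustrative)
    N = len(x)

    # z_k is the sum of those x_i whose bit is set in k.
    z = [sum(x[i] for i in range(N) if (k >> i) & 1) for k in range(2 ** N)]

    assert z[0] == 0                      # z0 = 0
    assert z[3] == x[0] + x[1]            # z3 = x0 + x1
    assert z[11] == x[0] + x[1] + x[3]    # z11 = x0 + x1 + x3
    assert z[15] == sum(x)                # z15 = x0 + x1 + x2 + x3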
The pattern vectors z are sent to the inner product units 308. This embodiment has multiple inner product units 308, each equivalent to a processor core, which realize the pattern indexing stage and the weighted synthesis stage; the present invention does not limit the number of inner product units 308. An inner product unit 308 receives the second vector y from the processing component memory agent unit 305, uses the data vectors along the length direction of y as indices, selects for each index the corresponding specific pattern vector from all the pattern vectors z, and accumulates these specific pattern vectors, generating one bit of intermediate result per cycle and forming a unit accumulation sequence over p_x or p_x+1 consecutive cycles. This operation performs (x^T·K)·B_col·C.
FIG. 7 shows a schematic diagram of the inner product unit 308 of this embodiment. To realize (x^T·K)·B_col·C, the inner product unit 308 includes p_y multiplexers 701 and p_y-1 serial full adders 702.
The p_y multiplexers 701 realize the pattern indexing stage. Each multiplexer 701 receives all the pattern vectors z (z0 to z15) and, according to a same-position data vector along the length direction of the second vector y, lets a specific pattern vector among all the pattern vectors pass. Since the length of y is N, y can be decomposed into N data vectors; as N is 4, y decomposes into the four data vectors y0, y1, y2, y3, each of bit width p_y. Viewed bit position by bit position, these data vectors can therefore be decomposed into p_y same-position data vectors. For example, the highest bits of the four data vectors y0, y1, y2, y3 form the highest same-position data vector 703, their next-highest bits form the next-highest same-position data vector 704, and so on; their lowest bits form the lowest same-position data vector 705.

A multiplexer 701 determines which unit vector of the binary matrix K the input same-position data vector equals, and outputs the specific pattern vector corresponding to that unit vector. For example, the highest same-position data vector 703 is input to the first multiplexer as the select signal; supposing it is (0101), the same as the unit vector 505 in FIG. 5, the first multiplexer outputs the specific pattern vector z5 corresponding to the unit vector 505. As another example, the next-highest same-position data vector 704 is input to the second multiplexer as the select signal; supposing it is (0010), the same as the unit vector 506 in FIG. 5, the second multiplexer outputs the specific pattern vector z2 corresponding to the unit vector 506. Finally, the lowest same-position data vector 705 is input to the p_y-th multiplexer as the select signal; supposing it is (1110), the same as the unit vector 507 in FIG. 5, the p_y-th multiplexer outputs the specific pattern vector z14 corresponding to the unit vector 507. This completes the operation (x^T·K)·B_col.
The serial full adders 702 realize the weighted synthesis stage. The p_y-1 serial full adders 702 are connected in series as shown in the figure; they receive the specific pattern vectors output by the multiplexers 701 and accumulate them in order to obtain the unit accumulation sequence. Note in particular that, to accumulate from the low bit upward and propagate any carry to the next bit so that the next bit can be correctly accumulated and carried, the specific pattern vector corresponding to the lowest same-position data vector 705 must be fed to the outermost serial full adder 702, so that the specific pattern vectors corresponding to lower same-position data vectors are accumulated first; the specific pattern vector corresponding to a higher same-position data vector is fed to a more inward serial full adder 702, and the specific pattern vector corresponding to the highest same-position data vector 703 must be fed to the innermost serial full adder 702, so that the specific pattern vectors of higher same-position data vectors are accumulated later. Only in this way is the correctness of the accumulation ensured, namely weighting according to the p_y weighting vector C so as to reflect the powers of the second vector y. The unit accumulation sequence thus further realizes the weighting by C on top of (x^T·K)·B_col. At this point the intermediate results 401, 402, 403, and 404 of FIG. 4 are obtained.
The synthesis unit 309 performs the summation 405 of FIG. 4. It receives the unit accumulation sequences from the inner product units 308; each unit accumulation sequence corresponds to one of the intermediate results 401, 402, 403, and 404 in FIG. 4, and these intermediate results have already been aligned in the inner product units 308. The synthesis unit 309 then sums these aligned unit accumulation sequences to obtain the inner product result of the first vector x and the second vector y.
Fig. 8 shows a schematic diagram of the combining unit 309 of this embodiment. The combining unit 309 in the figure exemplarily receives the outputs of eight inner product units 308, namely the unit accumulation sequences 801 to 808. These are the intermediate results obtained after the first vector x and the second vector y are split into eight data segments whose inner products are computed by the eight inner product units 308 respectively. The combining unit 309 includes seven full adder groups 809 to 815. Since the lowest-bit operation 816 and the highest-bit operation 817 each involve only one intermediate result (like x_0y_0, the lowest bit, and x_7y_3, the highest bit, in Fig. 4), they need no adder group and can be output directly without being added to other intermediate results. In other words, only the operations from the second-lowest bit to the second-highest bit require full adder groups to perform the summation 405 of Fig. 4.
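Viewed at word level, the combining step reduces to a multi-operand addition of pre-aligned values; a small hedged illustration (the alignment offsets in the example are assumed, not taken from Fig. 4):

def combine(aligned_sequences):
    # each sequence arrives already shifted to its bit offset inside the
    # inner product unit, so only positions covered by two or more
    # sequences need full adder groups; lone end bits pass through
    return sum(aligned_sequences)

# e.g. two partials two bits apart: combine([0b1011, 0b0110 << 2]) == 35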
Fig. 9 shows a schematic diagram of the full adder groups 810 to 815. Each of the full adder groups 810 to 815 includes a first full adder 901 and a second full adder 902, which include multiplexers 903 and 904 respectively. The inputs of multiplexer 903 are the carry output of the adder and the value 0; the inputs of multiplexer 904 are the carry output of the adder and the value 1. The values 0 and 1 model, respectively, the cases where the summed intermediate results of the previous digit did not and did produce a carry, so the first full adder 901 generates the sum of intermediate results assuming no carry from the previous digit, and the second full adder 902 generates the sum assuming a carry from the previous digit. With this structure there is no need to wait for the previous digit's intermediate result before deciding whether to carry; computing the no-carry and carry cases simultaneously reduces the operation latency. The full adder groups 810 to 815 further include a multiplexer 905; both intermediate-result sums are input to multiplexer 905, which selects the carried or the uncarried sum depending on whether the previous digit's result produced a carry. The accumulated output 818 is the inner product of the first vector x and the second vector y.
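This is the carry-select principle; a minimal sketch of one digit position, following the structure of Fig. 9 under assumed names:

def full_add(a, b, cin):
    s = a + b + cin
    return s & 1, s >> 1                 # (sum bit, carry out)

def carry_select_digit(a, b, prev_carry):
    s0, c0 = full_add(a, b, 0)           # first full adder 901: assumes no carry in
    s1, c1 = full_add(a, b, 1)           # second full adder 902: assumes carry in
    # multiplexer 905 picks the precomputed case once prev_carry is known
    return (s1, c1) if prev_carry else (s0, c0)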
Returning to Fig. 8: since the lowest-bit operation cannot produce a carry, the full adder group 809 for the second-lowest bit includes only the first full adder 901 and directly generates the no-carry intermediate result, with no need for a second full adder 902 or multiplexer 905.
Per Figs. 8 and 9 and the related description, when the combining unit 309 of this embodiment is to sum M unit accumulation sequences, it is configured with M-1 full adder groups, comprising M-1 first full adders 901, M-2 second full adders 902 and M-2 multiplexers 905.
In other cases, the combining unit 309 can flexibly enable or disable the operation of the full adder groups. For example, when the first vector x and the second vector y produce fewer than M unit accumulation sequences, an appropriate number of full adder groups can be switched off, flexibly supporting any possible split count and broadening the application scenarios of the combining unit 309.
Returning to Fig. 3: after the combining unit 309 obtains the inner product of the first vector x and the second vector y, it sends the result to the processing component memory proxy unit 305, which forwards it to the core memory agent 301. The core memory agent 301 integrates the inner product results of all processing components 304 to generate the computation result and sends it to the off-chip memory 203, completing the product of the first operand and the second operand.
Based on the above structure, the computing device 201 of this embodiment performs different numbers of inner product operations according to the length of the operands. Further, the processing array 303 can share indexes among processing components 304 in the vertical direction and share pattern vectors among processing components 304 in the horizontal direction, so that operations proceed efficiently.
For data-path management, this embodiment adopts a two-level architecture: the core memory agent 301 and the processing component memory proxy units 305. The starting addresses of the operands in the LLC are recorded in the core memory agent 301, which reads multiple operands from the LLC simultaneously, continuously and serially by auto-incrementing the address. Because the source address is self-incrementing, the order of the data blocks is deterministic. The core controller 302 decides which processing components 304 receive the data blocks, and the processing component control units 306 in turn decide which inner product units 308 receive them.
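As a rough Python illustration of the deterministic order produced by auto-incremented serial reads (names and granularity are this sketch's assumptions, not the patent's):

def read_operand_blocks(base_addr, n_blocks):
    # the core memory agent walks one operand stream from its starting LLC
    # address, low-order block first, by auto-incrementing the source address
    addr = base_addr
    for _ in range(n_blocks):
        yield addr                       # block order is fixed by the increment
        addr += 1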
Another embodiment of the present invention is an arbitrary-precision computing method, which can be realized with the hardware structure of the foregoing embodiments. Fig. 10 shows a flowchart of this embodiment.
In step 1001, multiple operands are read from the off-chip memory. When reading the operands, their starting addresses are set in the core memory agent, and the core memory agent reads the operands simultaneously, continuously and serially by auto-incrementing the address, reading in one pass from the low-order bits of the operands toward the high-order bits.
In step 1002, the operands are split into multiple vectors, including a first vector and a second vector. Based on the computing capability and number of processing components in the processing array, the core controller directs that each operand be split into multiple data segments, i.e., multiple vectors, so that the core memory agent sends them to the processing array segment by segment.
In step 1003, the first vector and the second vector are inner-producted according to their lengths to obtain an inner product result. The processing array includes multiple processing components arranged as an array; each processing component computes the inner product of the first vector and the second vector according to their lengths. In more detail, this step first executes the pattern-generation stage, then the pattern-index stage, and finally the weighted-synthesis stage.
Take the inner product of the first vector x and the second vector y as an example, and assume their sizes are N×p_x and N×p_y respectively, where N is the length of both vectors, p_x is the bit width of x, and p_y is the bit width of y. This embodiment likewise decomposes the second vector y as:

y = K · B_col · C
where K is a fixed binary matrix of size N×2^N, B_col is a binary matrix of size 2^N×p_y, and C is the p_y weighting vector; K, B_col and C are defined as in the foregoing embodiment and are not repeated here. Decomposing the second vector y in this way lets every element of y be represented by the two binary matrices K and B_col. In other words, this embodiment converts the inner product x^T·y into the computation x^T·K·B_col·C.
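A hedged numeric check of this decomposition for small N and p_y; the sketch uses NumPy, and the matrix layout and big-endian bit convention below are its own assumptions:

import numpy as np

N, p_y = 3, 4
y = np.array([5, 2, 7])                  # second vector, p_y-bit elements

# K: fixed N x 2^N matrix; column j, read top to bottom, is the big-endian
# binary code of j (for N = 3, column 5 is (1, 0, 1))
K = np.array([[(j >> (N - 1 - i)) & 1 for j in range(2**N)] for i in range(N)])

# B_col: 2^N x p_y one-hot selector; column k marks the column of K equal
# to the bit-plane of weight 2^k taken across the elements of y
B_col = np.zeros((2**N, p_y), dtype=int)
for k in range(p_y):
    idx = 0
    for i in range(N):
        idx = (idx << 1) | ((int(y[i]) >> k) & 1)
    B_col[idx, k] = 1

C = np.array([1 << k for k in range(p_y)])   # weights 2^0 .. 2^(p_y-1)

assert np.array_equal(K @ B_col @ C, y)      # reconstructs y exactly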
In the pattern-generation stage, this embodiment obtains every possible value of x^T·K, i.e., it generates the pattern vectors z. In the pattern-index stage, this embodiment computes (x^T·K)·B_col. In the weighted-synthesis stage, the indexed patterns are then accumulated according to the weights C. This design lets operands of any precision be converted into the index mode for the inner product, reducing repeated computation and avoiding the high bandwidth demands of arbitrary-precision computation. Fig. 11 further shows a flowchart of the inner product of the first vector and the second vector.
In step 1101, multiple pattern vectors are generated according to the length and bit width of the first vector. First, corresponding to the length N of the first vector x, N data vectors are received. Then, since K has 2^N unit vectors, each unit vector is simulated in hardware to generate the 2^N pattern vectors z. Because a binary inner product is really just bitwise addition, the generation component of this embodiment directly simulates all the unit vectors in K and adds up the corresponding bits of the data vectors of the first vector x in sequence. In more detail, the same-weight bits of the data vectors of x are input together each cycle: the lowest bits of the data vectors in the first cycle, the second-lowest bits in the second cycle, and so on until the highest bits are input in cycle p_x. The required bandwidth is only N bits per cycle.
When simulating a unit vector, the bit values of the data vectors corresponding to that unit vector are first received and buffered, then accumulated. If the accumulation produces a carry, the carry value is held in the carry register and added to the bit values of the data vectors input in the next cycle, until the highest bits of the data vectors are added in cycle p_x.
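A bit-serial sketch of one such generation unit (illustrative names; each loop iteration corresponds to one hardware cycle):

def bit_serial_generate(bit_columns):
    # bit_columns[t] holds the weight-2^t bits of the data vectors selected
    # by the simulated unit vector; 'carry' plays the role of the carry register
    carry, out_bits = 0, []
    for col in bit_columns:
        s = sum(col) + carry
        out_bits.append(s & 1)           # one output bit of the pattern vector
        carry = s >> 1                   # the remainder waits for the next cycle
    while carry:                         # flush any carry left after cycle p_x
        out_bits.append(carry & 1)
        carry >>= 1
    return out_bits                      # little-endian bits of the accumulated sum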
Finally, the accumulated results are received; these are the pattern vectors z. In summary, the pattern vectors z are all possible combinations of additions of the data vectors of the first vector x.
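Equivalently, the 2^N pattern vectors are the subset sums of the elements of x; a word-level sketch, using the same big-endian column coding assumed in the earlier sketches:

def generate_patterns(x_elems):
    # patterns[j] = sum of the elements of x selected by the big-endian
    # binary code of j, i.e. the j-th entry of x^T * K
    N = len(x_elems)
    return [sum(x for i, x in enumerate(x_elems) if (j >> (N - 1 - i)) & 1)
            for j in range(2**N)]

# e.g. generate_patterns([3, 1, 2])[0b101] == 3 + 2 == 5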
In step 1102, with the data vectors of the second vector y along the length direction serving as indexes, specific pattern vectors among the multiple pattern vectors are accumulated to form multiple unit accumulation sequences. This step implements the pattern-index stage and the weighted-synthesis stage. Using the data vectors of y along the length direction as indexes, the corresponding specific pattern vector is selected from all the pattern vectors z for each index; these specific pattern vectors are accumulated, producing a one-bit intermediate result each cycle and forming a unit accumulation sequence over p_x or p_x+1 consecutive cycles. This computation carries out (x^T·K)·B_col·C.
In more detail, the parity data vectors of the second vector y along the length direction let specific pattern vectors among all the pattern vectors z pass through. Since the length of y is N, y can be decomposed into N data vectors, each of bit width p_y; viewed by bit position, these data vectors can therefore be rearranged into p_y parity data vectors.
Next, it is determined which unit vector of the binary matrix K is identical to the input parity data vector, and the specific pattern vector corresponding to that unit vector is output. This completes the (x^T·K)·B_col operation.
Finally, these specific pattern vectors are accumulated in order to obtain the unit accumulation sequence. Particular care must be taken to ensure the accumulation is correct, namely weighting by the p_y weighting vector C to reflect the powers of two of the second vector y. The unit accumulation sequence applies the weighting of C on top of (x^T·K)·B_col. Each unit accumulation sequence corresponds to one of the intermediate results 401, 402, 403 and 404 in Fig. 4, and these intermediate results are already aligned.
In step 1103, the multiple unit accumulation sequences are summed to obtain the inner product result. To enable synchronous computation, this embodiment splits the first vector x and the second vector y into multiple data segments whose inner products are computed separately, yielding the intermediate results. Since the lowest-bit operation and the highest-bit operation each involve only one intermediate result, like x_0y_0 (lowest bit) and x_7y_3 (highest bit) in Fig. 4, they need no addition and can be output directly. In other words, only the operations from the second-lowest bit to the second-highest bit require addition.
This embodiment computes the no-carry and carry cases synchronously to reduce operation latency. The carried and uncarried sums of intermediate results are obtained at the same time, and whichever matches the previous digit's carry outcome is selected for output. The accumulated output is the inner product of the first vector x and the second vector y.
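Putting steps 1101 to 1103 together for a single segment, an end-to-end sketch that reuses generate_patterns from the earlier illustration (names and bit conventions are the sketch's own, not the patent's):

def segment_inner_product(x_elems, y_elems, p_y):
    patterns = generate_patterns(x_elems)    # step 1101: all of x^T K
    N, acc = len(y_elems), 0
    for k in range(p_y):                     # one bit-plane of y per pass
        idx = 0
        for i in range(N):                   # big-endian parity data vector
            idx = (idx << 1) | ((y_elems[i] >> k) & 1)
        acc += patterns[idx] << k            # index via B_col, weight via C
    return acc

assert segment_inner_product([3, 1, 2], [5, 2, 7], 4) == 3*5 + 1*2 + 2*7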
Returning to Fig. 10: in step 1004, the inner product results are integrated into the computation results of the multiple operands. The core controller directs the memory agents to integrate or reduce the inner product results into the computation results of the operands, which are sent to the core memory agent.
In step 1005, the computation results are stored to the off-chip memory. The core memory agent sends the computation results in parallel: the lowest bits of all results are sent together first, then the second-lowest bits, and so on until the highest bits have been sent together.
Another embodiment of the present invention is a computer-readable storage medium storing computer program code for arbitrary-precision computation; when the computer program code is run by a processor, the method of Fig. 10 or Fig. 11 is executed. In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. If implemented as a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solution of the present invention is embodied in the form of a software product (for example a computer-readable storage medium), the software product may be stored in a memory and may include several instructions that cause a computer device (for example a personal computer, a server or a network device) to execute some or all of the steps of the methods described in the embodiments of the present invention. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media capable of storing program code.
The present invention proposes a novel architecture for efficiently handling arbitrary-precision computation. However high the precision of the operands, the present invention can decompose them and use indexes to process fixed-length bit streams in parallel, avoiding bit-level redundancy such as sparsity and repeated computation; flexible use and wide-bit-width computation are achieved without configuring wide-bit-width hardware.
Depending on the application scenario, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a visual terminal, an autonomous-driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include aircraft, ships and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners and/or electrocardiographs. The electronic device or apparatus of the present invention may also be applied to the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care and other fields. Further, the electronic device or apparatus of the present invention may also be used in cloud, edge and terminal application scenarios related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, electronic devices or apparatuses with high computing power according to the solution of the present invention may be applied to cloud devices (for example cloud servers), while electronic devices or apparatuses with low power consumption may be applied to terminal devices and/or edge devices (for example smartphones or webcams). In one or more embodiments, the hardware information of the cloud device and that of the terminal and/or edge device are mutually compatible, so that suitable hardware resources can be matched from the cloud device's hardware resources according to the hardware information of the terminal and/or edge device, to simulate the hardware resources of the terminal and/or edge device and accomplish unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
It should be noted that, for brevity, the present invention describes some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solution of the present invention is not limited by the order of the described actions. Accordingly, based on the disclosure or teaching of the present invention, those skilled in the art will understand that certain steps may be performed in another order or simultaneously. Further, those skilled in the art will understand that the embodiments described herein may be regarded as optional embodiments, in that the actions or modules involved are not necessarily required for realizing one or more solutions of the present invention. In addition, depending on the solution, the descriptions of some embodiments have different emphases. In view of this, those skilled in the art may refer to the relevant descriptions of other embodiments for parts not detailed in a given embodiment.
As to specific implementation, based on the disclosure and teaching of the present invention, those skilled in the art will understand that several embodiments disclosed herein may also be realized in ways not disclosed herein. For example, the units in the electronic device or apparatus embodiments described above are divided on the basis of logical function, and other divisions are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. As for the connections between different units or components, the connections discussed above in conjunction with the drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection using an interface, where the communication interface may support electrical, optical, acoustic, magnetic or other forms of signal transmission.
In the present invention, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of the present invention. Also, in some scenarios, multiple units of an embodiment of the present invention may be integrated into one unit, or each unit may exist physically on its own.
In other implementation scenarios, the above integrated units may also be realized in hardware, i.e., as concrete hardware circuits, which may include digital circuits and/or analog circuits. The physical realization of the hardware structure of a circuit may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors and memristors. In view of this, the various apparatuses described herein (for example computing apparatuses or other processing apparatuses) may be realized by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs and ASICs. Further, the aforementioned storage unit or storage apparatus may be any suitable storage medium (including magnetic or magneto-optical storage media), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high-bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM or a RAM.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to explain its principles and implementations; the descriptions of the above embodiments are intended only to aid understanding of the method and core idea of the present invention. Meanwhile, those of ordinary skill in the art, following the idea of the present invention, may make changes in the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (26)

  1. A processing component for computing the inner product of a first vector and a second vector, comprising:
    a conversion unit configured to generate multiple pattern vectors according to the length and bit width of the first vector;
    multiple inner product units, each inner product unit accumulating specific pattern vectors among the multiple pattern vectors, indexed by the data vectors of the second vector along the length direction, to form a unit accumulation sequence; and
    a combining unit configured to sum multiple unit accumulation sequences to obtain an inner product result.
  2. The processing component of claim 1, wherein when the length is N, the conversion unit generates 2^N pattern vectors, N being a positive integer.
  3. The processing component of claim 2, wherein the first vector is divided into N data vectors according to the length, and the conversion unit comprises:
    N bit-stream inputs for respectively receiving the N data vectors; and
    a generation component comprising 2^N generation units, each generation unit simulating one of the 2^N unit vectors corresponding to the length, the 2^N generation units respectively generating the 2^N pattern vectors.
  4. The processing component of claim 3, wherein each generation unit comprises:
    an element register for receiving and buffering the bit values of the data vectors corresponding to the simulated unit vector;
    an adder for accumulating the bit values; and
    a carry register for buffering the carry value resulting from the accumulation.
  5. The processing component of claim 4, wherein the conversion unit further comprises:
    2^N bit-stream outputs respectively connected to the outputs of the adders, to output the 2^N pattern vectors.
  6. The processing component of claim 5, wherein the 2^N pattern vectors are all possible combinations of additions of the data vectors.
  7. The processing component of claim 2 or 5, wherein the bit width of the 2^N pattern vectors is one of the bit width of the first vector and the bit width of the first vector plus one.
  8. The processing component of claim 2, wherein the bandwidth of the conversion unit is N bits per cycle.
  9. The processing component of claim 1, wherein each inner product unit comprises:
    multiple multiplexers that respectively receive the multiple pattern vectors and pass specific pattern vectors among them according to the parity data vectors of the second vector along the length direction; and
    multiple serial full adders for weighted synthesis of the specific pattern vectors to obtain the unit accumulation sequence.
  10. The processing component of claim 9, wherein the specific pattern vector is the pattern vector corresponding to the unit vector identical to the parity data vector.
  11. The processing component of claim 9, wherein the number of multiplexers equals the bit width of the second vector, and the number of serial full adders is the bit width of the second vector minus one.
  12. The processing component of claim 9, wherein the specific pattern vector corresponding to the lowest-bit parity data vector is input to the outermost serial full adder, and the specific pattern vector corresponding to the highest-bit parity data vector is input to the innermost serial full adder.
  13. The processing component of claim 1, wherein the combining unit comprises multiple full adder groups for performing, after the multiple unit accumulation sequences are aligned, the summation from the second-lowest bit to the second-highest bit.
  14. The processing component of claim 13, wherein each full adder group comprises a first full adder for generating an uncarried intermediate result.
  15. The processing component of claim 14, wherein each full adder group further comprises:
    a second full adder for generating a carried intermediate result; and
    a multiplexer for selecting, according to the intermediate result of the previous bit, one of the carried intermediate result and the uncarried intermediate result for output.
  16. The processing component of claim 15, wherein when there are M unit accumulation sequences, the number of full adder groups is M-1, the number of first full adders is M-1, the number of second full adders is M-2, and the number of multiplexers is M-2.
  17. An arbitrary-precision computing accelerator connected to an off-chip memory, the arbitrary-precision computing accelerator comprising:
    a core memory agent for reading multiple operands from the off-chip memory;
    a core controller for splitting the multiple operands into multiple vectors, the multiple vectors including a first vector and a second vector; and
    a processing array comprising multiple processing components, the processing components being configured to compute the inner product of the first vector and the second vector according to their lengths, to obtain an inner product result;
    wherein the core controller integrates the inner product results into computation results of the multiple operands, and the core memory agent stores the computation results to the off-chip memory.
  18. The arbitrary-precision computing accelerator of claim 17, wherein the starting addresses of the multiple operands are set in the core memory agent, and the core memory agent serially reads the multiple operands by auto-incrementing the address.
  19. The arbitrary-precision computing accelerator of claim 18, wherein the core memory agent reads the multiple operands in one pass from their low-order bits toward their high-order bits.
  20. The arbitrary-precision computing accelerator of claim 17, wherein the core memory agent sends the computation results to the off-chip memory in parallel.
  21. The arbitrary-precision computing accelerator of claim 17, wherein each processing component comprises:
    a conversion unit configured to generate multiple pattern vectors according to the length and bit width of the first vector;
    multiple inner product units, each inner product unit accumulating specific pattern vectors among the multiple pattern vectors, indexed by the data vectors of the second vector along the length direction, to form a unit accumulation sequence; and
    a combining unit configured to sum multiple unit accumulation sequences to obtain the inner product result.
  22. An integrated circuit device, comprising:
    the arbitrary-precision computing accelerator of any one of claims 17 to 21;
    a processing device for controlling the arbitrary-precision computing accelerator; and
    an off-chip memory including an LLC;
    wherein the arbitrary-precision computing accelerator and the processing device communicate through the LLC.
  23. A board card comprising the integrated circuit device of claim 22.
  24. A method for computing the inner product of a first vector and a second vector, comprising:
    generating multiple pattern vectors according to the length and bit width of the first vector;
    accumulating specific pattern vectors among the multiple pattern vectors, indexed by the data vectors of the second vector along the length direction, to form multiple unit accumulation sequences; and
    summing the multiple unit accumulation sequences to obtain an inner product result.
  25. An arbitrary-precision computing method, comprising:
    reading multiple operands from an off-chip memory;
    splitting the multiple operands into multiple vectors, the multiple vectors including a first vector and a second vector;
    computing the inner product of the first vector and the second vector according to their lengths, to obtain an inner product result;
    integrating the inner product result into computation results of the multiple operands; and
    storing the computation results to the off-chip memory.
  26. A computer-readable storage medium on which computer program code for arbitrary-precision computation is stored, the computer program code, when run by a processing device, performing the method of claim 24 or 25.
PCT/CN2022/100304 2021-10-20 2022-06-22 Inner product processing component, arbitrary-precision computing device and method, and readable storage medium WO2023065701A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111221317.4 2021-10-20
CN202111221317.4A CN114003198B (en) 2021-10-20 2021-10-20 Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2023065701A1 true WO2023065701A1 (en) 2023-04-27

Family

ID=79923295

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100304 WO2023065701A1 (en) 2021-10-20 2022-06-22 Inner product processing component, arbitrary-precision computing device and method, and readable storage medium

Country Status (2)

Country Link
CN (2) CN115437602A (en)
WO (1) WO2023065701A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115437602A (en) * 2021-10-20 2022-12-06 中科寒武纪科技股份有限公司 Arbitrary-precision calculation accelerator, integrated circuit device, board card and method
CN115080916B (en) * 2022-07-14 2024-06-18 北京有竹居网络技术有限公司 Data processing method, device, electronic equipment and computer readable medium
CN118349213B (en) * 2024-06-14 2024-09-27 中昊芯英(杭州)科技有限公司 Data processing device, method, medium and computing equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090049113A1 (en) * 2007-08-17 2009-02-19 Adam James Muff Method and Apparatus for Implementing a Multiple Operand Vector Floating Point Summation to Scalar Function
CN109165732A (en) * 2018-02-05 2019-01-08 上海寒武纪信息科技有限公司 Processing with Neural Network device and its method for executing the multiply-add instruction of vector
CN109213962A (en) * 2017-07-07 2019-01-15 华为技术有限公司 Arithmetic accelerator
CN114003198A (en) * 2021-10-20 2022-02-01 中科寒武纪科技股份有限公司 Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101082860A (en) * 2007-07-03 2007-12-05 浙江大学 Multiply adding up device
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
US10338919B2 (en) * 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
CN110110283A (en) * 2018-02-01 2019-08-09 北京中科晶上科技股份有限公司 A kind of convolutional calculation method
CN112711738A (en) * 2019-10-25 2021-04-27 安徽寒武纪信息科技有限公司 Computing device and method for vector inner product and integrated circuit chip
CN112084023A (en) * 2020-08-21 2020-12-15 安徽寒武纪信息科技有限公司 Data parallel processing method, electronic equipment and computer readable storage medium
CN112487750B (en) * 2020-11-30 2023-06-16 西安微电子技术研究所 Convolution acceleration computing system and method based on in-memory computing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090049113A1 (en) * 2007-08-17 2009-02-19 Adam James Muff Method and Apparatus for Implementing a Multiple Operand Vector Floating Point Summation to Scalar Function
CN109213962A (en) * 2017-07-07 2019-01-15 华为技术有限公司 Arithmetic accelerator
CN109165732A (en) * 2018-02-05 2019-01-08 上海寒武纪信息科技有限公司 Processing with Neural Network device and its method for executing the multiply-add instruction of vector
CN114003198A (en) * 2021-10-20 2022-02-01 中科寒武纪科技股份有限公司 Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium

Also Published As

Publication number Publication date
CN114003198A (en) 2022-02-01
CN114003198B (en) 2023-03-24
CN115437602A (en) 2022-12-06

Similar Documents

Publication Publication Date Title
CN109219821B (en) Arithmetic device and method
WO2023065701A1 (en) Inner product processing component, arbitrary-precision computing device and method, and readable storage medium
US11704125B2 (en) Computing device and method
CN109003132B (en) Advertisement recommendation method and related product
TWI795519B (en) Computing apparatus, machine learning computing apparatus, combined processing device, neural network chip, electronic device, board, and method for performing machine learning calculation
CN109522052B (en) Computing device and board card
US20200117614A1 (en) Computing device and method
CN110163361B (en) Computing device and method
CN109165041A (en) Processing with Neural Network device and its method for executing vector norm instruction
CN109032670A (en) Processing with Neural Network device and its method for executing vector duplicate instructions
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
US11775808B2 (en) Neural network computation device and method
WO2022134873A1 (en) Data processing device, data processing method, and related product
CN107368459B (en) Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication
CN112966729A (en) Data processing method and device, computer equipment and storage medium
CN112766473A (en) Arithmetic device and related product
WO2022001500A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
WO2022001497A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device and computing method
CN111260070B (en) Operation method, device and related product
WO2020108486A1 (en) Data processing apparatus and method, chip, and electronic device
CN112766471A (en) Arithmetic device and related product
CN111291884A (en) Neural network pruning method and device, electronic equipment and computer readable medium
WO2022143799A1 (en) Integrated circuit apparatus for matrix multiplication operation, computing device, system, and method
WO2022001496A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
CN113033788B (en) Data processor, method, device and chip

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22882324

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE