WO2023065701A1 - Inner product processing component, arbitrary-precision computing device and method, and readable storage medium - Google Patents

Inner product processing component, arbitrary-precision computing device and method, and readable storage medium

Info

Publication number
WO2023065701A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
unit
vectors
inner product
data
Prior art date
Application number
PCT/CN2022/100304
Other languages
French (fr)
Chinese (zh)
Inventor
赵永威
郝一帆
刘晨骁
承书尧
喻歆
陈天石
Original Assignee
寒武纪(西安)集成电路有限公司
Priority date: 2021-10-20 (claimed from Chinese application 202111221317.4)
Filing date
Publication date
Application filed by 寒武纪(西安)集成电路有限公司
Publication of WO2023065701A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7821Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the present invention relates generally to the field of computers. More specifically, the present invention relates to an inner product processing component, an arbitrary-precision computing device, a method, and a readable storage medium.
  • Arbitrary-precision computation uses an arbitrary number of digits to represent operands and is crucial in many technical fields, such as supernova simulation, climate simulation, atomic simulation, artificial intelligence, and planetary orbit calculation. These fields need to process data with hundreds, thousands, or even millions of bits; processing such a wide range of bit widths is far beyond the hardware capability of traditional processors.
  • the solutions of the present invention provide an inner product processing component, an arbitrary precision computing device, a method, and a readable storage medium.
  • the present invention discloses a processing unit for inner product of a first vector and a second vector, including: a conversion unit, a plurality of inner product units and a combination unit.
  • the conversion unit is used for generating multiple pattern vectors according to the length and bit width of the first vector.
  • Each inner product unit uses the data vectors along the length direction of the second vector as indices to accumulate specific pattern vectors among the plurality of pattern vectors, forming a unit accumulation sequence.
  • the synthesis unit is used to add up multiple unit accumulation sequences to obtain an inner product result.
  • the present invention discloses an arbitrary-precision computing accelerator connected to an off-chip memory.
  • the arbitrary-precision computing accelerator includes: a kernel memory agent, a kernel controller, and a processing array.
  • the kernel memory agent is used to read multiple operands from off-chip memory.
  • the core controller is used for splitting multiple operands into multiple vectors, and the multiple vectors include a first vector and a second vector.
  • the processing array includes a plurality of processing units for computing the inner product of the first vector and the second vector according to the lengths of the first vector and the second vector to obtain an inner product result.
  • the core controller integrates the inner product result into the calculation result of multiple operands, and the core memory agent stores the calculation result in the off-chip memory.
  • the present invention discloses an integrated circuit device including the aforementioned arbitrary precision computing accelerator, a processing device and an off-chip memory.
  • the processing device is used to control the arbitrary precision computing accelerator, and the off-chip memory includes LLC.
  • the arbitrary-precision computing accelerator communicates with the processing device through the LLC.
  • the present invention discloses a board including the aforementioned integrated circuit device.
  • the present invention discloses a method for computing the inner product of a first vector and a second vector, comprising: generating a plurality of pattern vectors according to the length and bit width of the first vector; using the data vectors along the length direction of the second vector as indices, accumulating specific pattern vectors among the plurality of pattern vectors to form a plurality of unit accumulation sequences; and summing the plurality of unit accumulation sequences to obtain an inner product result.
  • the present invention discloses an arbitrary-precision calculation method, including: reading multiple operands from an off-chip memory; splitting the multiple operands into multiple vectors, the multiple vectors including a first vector and a second vector; computing the inner product of the first vector and the second vector according to their lengths to obtain an inner product result; integrating the inner product result into a calculation result of the multiple operands; and storing the calculation result in the off-chip memory.
  • the present invention discloses a computer-readable storage medium, on which computer program codes for arbitrary precision calculations are stored, and when the computer program codes are executed by a processing device, the aforesaid methods are executed.
  • the present invention proposes a scheme for processing arbitrary-precision calculations, which processes different bit streams in parallel and deploys a complete bit-serial data path to perform high-precision calculations flexibly.
  • the invention makes full use of a simple hardware configuration, reduces repeated calculation, and thereby realizes arbitrary-precision calculation with low energy consumption.
  • Fig. 1 is a structural diagram showing the board card of an embodiment of the present invention.
  • 2A to 2C are structural diagrams illustrating an integrated circuit device according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram showing the internal structure of a computing device according to an embodiment of the present invention.
  • Figure 4 is a schematic diagram illustrating an exemplary multiplication operation.
  • FIG. 5 is a schematic diagram illustrating a conversion unit according to an embodiment of the present invention.
  • Fig. 6 is a schematic diagram showing a generation unit of an embodiment of the present invention.
  • Fig. 7 is a schematic diagram showing an inner product unit according to an embodiment of the present invention.
  • Figure 8 is a schematic diagram illustrating a synthesis unit of an embodiment of the present invention.
  • Fig. 9 is a schematic diagram showing a full adder group of an embodiment of the present invention.
  • FIG. 10 is a flowchart illustrating arbitrary precision calculations of another embodiment of the present invention.
  • FIG. 11 is a flow chart illustrating the inner product of the first vector and the second vector according to another embodiment of the present invention.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • the present invention proposes a high-efficiency arbitrary-precision computing accelerator architecture, which organizes computation around the inner product operation and exploits the intra-parallelism and inter-parallelism of the accelerator architecture to achieve multiplication of arbitrary-precision operands.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present invention.
  • the board card 10 includes a chip 101, which is a system-on-chip (System on Chip, SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements for the storage capacity and computing power of the platform.
  • the board 10 of this embodiment is suitable for cloud intelligence applications, featuring huge off-chip storage, on-chip storage, and powerful computing capability.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected to the control device 106 and the chip 101 through a bus for data transmission.
  • the control device 106 in the board 10 is configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing the combined processing device in the chip 101 of this embodiment.
  • the combined processing device includes a computing device 201 , a processing device 202 , an off-chip memory 203 , a communication node 204 and an interface device 205 .
  • there are several integration solutions that can be used to coordinate the work of the computing device 201, the processing device 202, and the off-chip memory 203: FIG. 2A shows an LLC integration solution, FIG. 2B shows an SoC integration solution, and FIG. 2C shows an IO integration solution.
  • the computing device 201 is configured to execute user-specified operations, and is mainly implemented as a multi-core intelligent processor for performing deep learning or machine learning calculations, which can interact with the processing device 202 to jointly complete user-specified operations.
  • the computing device 201 includes the aforementioned arbitrary-precision computing accelerator for processing linear computations, more specifically operand multiplication operations such as convolution.
  • the processing device 202 performs basic controls including but not limited to data transfer, starting and/or stopping of the computing device 201 , nonlinear calculation, and the like.
  • the processing device 202 may be one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU) or a graphics processing unit (GPU), including but not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components; their number can be determined according to actual needs.
  • the off-chip memory 203 is used to store the data to be processed and the processed data; its hierarchy can be divided into first-level cache (L1), second-level cache (L2), third-level cache (L3, also known as the last level cache, LLC), and physical memory.
  • the physical memory is DDR, usually 16 GB or larger.
  • L1 is the fastest, so L1 is accessed first; if the data is not in L1, L2 is accessed; if the data is not in L2, L3 is accessed; if the data is still not in L3, DDR is finally accessed.
  • the cache hierarchy of the off-chip memory 203 speeds up data access by storing the most frequently accessed data in the cache. DDR is slow compared with cache: as the level goes up (L1, L2, LLC, DDR), the access latency becomes higher and higher, but the storage space becomes larger and larger.
  • the communication node 204 is a routing node or router in a network-on-chip (NoC).
  • the communication node 204 reads the address information in the header flit of the data packet and uses a specific routing algorithm to calculate the best routing path, thereby establishing a reliable transmission path to send the data packet to the destination node (such as the off-chip memory 203).
  • in the reverse direction, the communication node 204 also calculates the optimal routing path and sends data packets from the off-chip memory 203 to the computing device 201 or the processing device 202.
  • the interface device 205 is the external input and output interface of the combination processing device.
  • the interface device 205 transmits information according to the requirements of the sender and receiver: it sets data buffers to resolve the mismatch caused by the speed difference between the two, sets signal level conversion, sets information conversion logic to meet the format requirements of each side, and sets timing control circuits to synchronize the work of the sender and receiver and to provide address transcoding and other tasks.
  • the LLC integration in FIG. 2A refers to the communication between the computing device 201 and the processing device 202 through LLC.
  • the SoC integration in FIG. 2B is to integrate the computing device 201 , the processing device 202 and the off-chip memory 203 through the communication node 204 .
  • the IO integration in FIG. 2C is to integrate the computing device 201 , the processing device 202 and the off-chip memory 203 through the interface device 205 .
  • This embodiment preferably chooses the LLC integration scheme. The core of deep learning and machine learning is the convolution operator; the basis of the convolution operator is the inner product operation, and the inner product operation is a combination of multiplication and addition. Therefore, the main task of the computing device 201 is to perform a large number of multiplications.
  • the computing device 201 and the processing device 202 need intensive interaction.
  • the computing device 201 and the processing device 202 are integrated at the LLC and share data through the LLC to achieve a lower interaction cost. Furthermore, since high-precision data may have millions of bits while the capacity of L1 and L2 is limited, interacting through L1 and L2 would lead to a problem of insufficient capacity.
  • the computing device 201 utilizes the relatively large capacity of the LLC to cache high-precision data to save time for repeated access.
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 , which includes a core memory agent 301 , a core controller 302 and a processing array 303 .
  • the kernel memory agent 301 serves as a management terminal for the computing device 201 to access the off-chip memory 203 .
  • the kernel memory agent 301 reads the operands from the off-chip memory 203.
  • the starting addresses of the operands are set in the kernel memory agent 301, which reads multiple operands simultaneously, continuously, and serially through self-incrementing addresses.
  • the reading order is from the lower bits of these operands to the higher bits, segment by segment.
  • taking three operands as an example, the agent serially reads the lowest 512 bits of the first operand, then the lowest 512 bits of the second operand, then the lowest 512 bits of the third operand; after the lowest segments have been read, the address is self-incremented (by 512 bits) and the next 512 bits of each operand are read serially in turn, and so on until the highest bits of the three operands are read.
  • when the kernel memory agent 301 stores the calculation results back to the off-chip memory 203, it sends them in parallel: the lowest bits of the three calculation results are sent at the same time, then the second-lowest bits at the same time, and so on until the highest bits of the three calculation results are sent at the same time.
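To make the described access pattern concrete, here is a minimal Python sketch of the chunked, low-to-high, operand-interleaved read order. The 512-bit chunk size and the three-operand example follow the text above; the function name and structure are illustrative assumptions, not the patent's implementation:

```python
def stream_chunks(operands, chunk_bits=512):
    # Read order: the lowest chunk of every operand first, then the
    # next-lowest chunk of every operand, and so on (low bits to high bits).
    max_bits = max(op.bit_length() for op in operands)
    mask = (1 << chunk_bits) - 1
    addr = 0  # self-incrementing address, counted in bits
    while addr < max_bits:
        for i, op in enumerate(operands):
            yield i, (op >> addr) & mask
        addr += chunk_bits  # self-increment by one 512-bit chunk

# Example: three operands are streamed chunk by chunk to the processing array.
ops = [(1 << 1500) - 3, 12345, (1 << 700) + 9]
for idx, chunk in stream_chunks(ops):
    pass  # each (operand, chunk) pair would be forwarded bit-serially
```

A symmetric loop over result words, rather than operands, would model the parallel write-back order described above.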
  • these operands are represented in the form of matrices or vectors.
  • based on the computing power and number of processing units in the processing array 303, the core controller 302 splits each operand into multiple data segments, that is, multiple vectors, so that the core memory agent 301 sends data to the processing array 303 in units of data segments.
  • the processing array 303 is used to perform the multiplication calculation of two operands.
  • for example, the first operand can be divided into 8 data segments x_0 to x_7, and the second operand can be divided into 4 data segments y_0 to y_3. When the multiplication operation is performed between the first operand and the second operand, the algorithm unfolds as shown in Figure 4.
  • the processing array 303 splits the first operand and the second operand, performs the inner product calculations respectively, and then shifts, aligns, and sums the intermediate results 401, 402, 403, and 404 to obtain the calculation result of the multiplication operation.
  • the above-mentioned data segments are collectively referred to as vectors below, and the multiplication of two data segments is the inner product of two vectors (the first vector and the second vector), where the first vector comes from the first operand and the second vector comes from the second operand.
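As a concrete illustration of the decomposition in Figure 4, the sketch below splits two operands into data segments, computes the per-segment products, and then shifts, aligns, and sums the intermediate results. The segment counts (8 and 4) follow the example above; the 64-bit segment width and all names are assumptions for illustration:

```python
SEG_BITS = 64  # assumed segment width, for illustration only

def split(op, n_segs, seg_bits=SEG_BITS):
    # Split an integer operand into n_segs segments, lowest segment first.
    mask = (1 << seg_bits) - 1
    return [(op >> (i * seg_bits)) & mask for i in range(n_segs)]

def multiply_by_segments(a, b, na=8, nb=4, seg_bits=SEG_BITS):
    xs, ys = split(a, na, seg_bits), split(b, nb, seg_bits)
    total = 0
    for j, yj in enumerate(ys):  # one intermediate result per y segment (cf. 401 to 404)
        partial = sum(xi * yj << (i * seg_bits) for i, xi in enumerate(xs))
        total += partial << (j * seg_bits)  # shift, align, and sum
    return total

a, b = (1 << 500) - 1, (1 << 250) + 7
assert multiply_by_segments(a, b) == a * b
```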
  • the processing array 303 includes a plurality of processing units 304 arranged in an array.
  • the figure shows 4 ⁇ 8 processing units 304 as an example, and the number of the processing units 304 is not limited in the present invention.
  • Each processing unit 304 is configured to inner product the first vector and the second vector according to the length of the first vector and the length of the second vector to obtain an inner product result.
  • the core controller 302 integrates or reduces the inner product results into the calculation result of the multiple operands and sends it to the core memory agent 301, and the core memory agent 301 stores the calculation result in the off-chip memory 203.
  • in control, the computing device 201 adopts a recursive decomposition algorithm: the core controller 302 evenly decomposes the operands of the multiplication into multiple vectors and sends them to the processing array 303 for calculation, and each processing unit 304 is responsible for the calculation of one group of vectors, such as the inner product of the first vector and the second vector.
  • each processing unit 304 further splits a group of vectors into smaller inner product calculation units based on its own hardware resources, so as to facilitate inner product calculation.
  • the computing device 201 adopts multiple bit streams on the data path; that is, each operand is streamed from the core memory agent 301 to the processing array 303 at a rate of 1 bit per cycle, with multiple operands transmitted in parallel at the same time. After the calculation is completed, each processing unit 304 sends its inner product result to the kernel memory agent 301 in a bit-serial manner.
  • the main task of the processing unit 304 is inner product calculation.
  • based on bit-indexed vector inner products, the processing unit 304 divides the process into three stages: the first stage is the pattern generation stage, the second stage is the pattern index stage, and the third stage is the weighted synthesis stage.
  • Take the first vector x and the second vector y as an example, and assume that the sizes of x and y are N×p_x and N×p_y, where N is the length of the first vector and the second vector (more specifically, the number of row elements), p_x is the bit width of x, and p_y is the bit width of y.
  • to compute the inner product of the first vector x and the second vector y, x is transposed and then multiplied with y, i.e. (p_x×N)·(N×p_y), to generate an inner product result of size p_x×p_y. The inner product is rewritten as x^T·y = x^T·K·B_col·C, where:
  • K is a fixed binary matrix with a size of N×2^N;
  • B_col is a binary matrix with a size of 2^N×p_y;
  • C is a weighting vector of length p_y.
  • K arranges all 2^N possible unit vectors of length N as its columns. For example, when N is 2, K is a binary matrix with a size of 2×2^2 that covers all possible combinations of elements of length 2.
  • a combination of elements of length 2 has 4 possibilities, so the fixed form of K is K = [0 0 1 1; 0 1 0 1], whose columns enumerate (0,0), (0,1), (1,0), and (1,1).
  • B_col is a one-hot matrix: each column has only one element equal to 1 and the rest equal to 0, and which element is 1 depends on which column of K the corresponding bit column of the second vector y equals.
  • exemplarily, the first vector x and the second vector y are set (their concrete values are given in the original figures) such that:
  • the first bit column of the second vector y is the fourth column of K;
  • the second bit column of the second vector y is the third column of K;
  • the third bit column of the second vector y is the fourth column of K;
  • the fourth bit column of the second vector y is the first column of K.
  • accordingly, only the fourth element of the first column of B_col is 1, indicating that the first bit column of y is the fourth column of K; only the third element of the second column of B_col is 1, indicating that the second bit column of y is the third column of K; only the fourth element of the third column of B_col is 1, indicating that the third bit column of y is the fourth column of K; and only the first element of the fourth column of B_col is 1, indicating that the fourth bit column of y is the first column of K. Written out, B_col = [0 0 0 1; 0 0 0 0; 0 1 0 0; 1 0 1 0]. To sum up, as long as K is determined, the element values of B_col are also determined.
  • C is the weighting vector of length p_y that reflects the power-of-two weight of each bit position of the second vector y, that is, its bit width. Since p_y is 4, C is C = (2^0, 2^1, 2^2, 2^3)^T, one weight per bit column from least significant to most significant.
  • This embodiment disassembles the second vector y in the above-mentioned way, so that each element of y can be represented by the two binary matrices K and B_col, i.e. y = K·B_col·C.
  • in other words, this embodiment converts the inner product operation x^T·y into the operation x^T·K·B_col·C.
  • the processing unit 304 is used to implement the vector inner product based on the aforementioned conversion.
  • in the pattern generation stage, the processing unit 304 computes all possibilities of x^T·K, generating the pattern vectors.
  • in the pattern index stage, the processing unit 304 computes Z·B_col, where Z denotes the pattern vectors; in the weighted synthesis stage, the processing unit 304 accumulates the indexed patterns according to the weights C.
  • Such a design enables operands, no matter how high their precision, to be converted into an index mode to perform inner products, reducing repeated calculation and avoiding the high bandwidth requirements of arbitrary-precision calculation.
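The conversion can be checked numerically. The following sketch is a minimal model of the math, not of the hardware; the least-significant-bit-first ordering of the bit columns of B_col and of the weights C is an assumption, and all names are illustrative. It builds K, derives B_col and C from a second vector y, and verifies that x^T·K·B_col·C equals the ordinary inner product:

```python
import numpy as np

N, px, py = 4, 8, 6                      # vector length and bit widths
rng = np.random.default_rng(0)
x = rng.integers(0, 1 << px, size=N)     # first vector: N elements of px bits
y = rng.integers(0, 1 << py, size=N)     # second vector: N elements of py bits

# K: N x 2^N fixed binary matrix whose columns enumerate all unit vectors.
K = np.array([[(c >> r) & 1 for c in range(1 << N)] for r in range(N)])

# Bit-decompose y: Ybits[r, j] is bit j of y[r] (least significant first).
Ybits = np.array([[(v >> j) & 1 for j in range(py)] for v in y])

# B_col: 2^N x py one-hot matrix; column j selects the column of K that
# equals the j-th bit column of y.
B_col = np.zeros((1 << N, py), dtype=int)
for j in range(py):
    col_index = sum(Ybits[r, j] << r for r in range(N))
    B_col[col_index, j] = 1

C = np.array([1 << j for j in range(py)])   # weights 2^0 .. 2^(py-1)

Z = x @ K                                   # pattern vectors (pattern generation)
assert Z @ B_col @ C == int(x @ y)          # pattern index + weighted synthesis
```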
  • FIG. 3 further shows a schematic structural diagram of the processing unit 304 .
  • the processing unit 304 includes a processing unit memory proxy unit 305 , a processing unit control unit 306 , a conversion unit 307 , a plurality of inner product units 308 and a synthesis unit 309 .
  • the processing unit memory proxy unit 305 is used as the interface for the processing unit 304 to access the kernel memory agent 301, so as to receive the two vectors whose inner product is required, such as the aforementioned first vector x and second vector y.
  • the processing unit control unit 306 is used to coordinate and manage the work of each unit in the processing unit 304 .
  • the conversion unit 307 is used to implement the pattern generation stage: it receives the first vector x from the processing unit memory proxy unit 305, implements the binary matrix K in hardware, and executes x^T·K to generate the multiple pattern vectors.
  • FIG. 5 shows a schematic diagram of the conversion unit 307, and the conversion unit 307 includes: N bit-stream input terminals 501, a generating component 502, and 2^N bit-stream output terminals 503.
  • the N bit-stream input terminals 501 correspond to the length N of the first vector x and respectively receive the N data vectors.
  • Figure 5 takes a first vector x of length 4 as illustration: x includes four data vectors x_0, x_1, x_2, and x_3, and the bit width of each data vector is p_x, that is, each data vector has p_x bits.
  • corresponding to the 2^N unit vectors of K, the generating component 502 includes 2^N generating units, each of which simulates one unit vector, so as to generate the 2^N pattern vectors respectively.
  • the first vector x is split into the four data vectors x_0, x_1, x_2, and x_3, which are input in parallel from the left side of the generating component 502. Since an inner product in binary is in fact an addition of individual bits, the generating component 502 directly simulates all unit vectors of K in hardware and adds the bits of x_0, x_1, x_2, and x_3 in sequence.
  • the same-position bits of x_0, x_1, x_2, and x_3 are input simultaneously in each cycle: for example, the least significant bits of x_0, x_1, x_2, and x_3 are input simultaneously in the first cycle, the next-lowest bits simultaneously in the second cycle, and so on until the most significant bits of x_0, x_1, x_2, and x_3 are input simultaneously in the p_x-th cycle.
  • in this way, the required bandwidth is only N bits per cycle, which in this example is only 4 bits per cycle.
  • the generating component 502 includes 16 generating units, respectively simulating the 16 unit vectors in K, and these unit vectors are (0000), (0001), (0010), (0011), (0100), (0101), (0110), (0111), (1000), (1001), (1010), (1011), (1100), (1101), (1110), and (1111).
  • FIG. 6 shows a schematic diagram of the generation unit 504 whose unit vector is (1011).
  • the generation unit 504 includes three element registers 601, an adder 602, and a carry register 603.
  • the three element registers 601 receive and temporarily store the bits of the data vectors selected by the simulated unit vector, that is, the bits of x_0, x_1, and x_3, while the bits of x_2 are directly ignored; this structure implements z_11 = x_0 + x_1 + x_3.
  • in each cycle, the values in the element registers 601 are sent to the adder 602 for accumulation. If a carry occurs after the accumulation, the carry value is temporarily stored in the carry register 603 and added together with the bits of x_0, x_1, and x_3 input in the next cycle, and so on until the p_x-th cycle adds the most significant bits of x_0, x_1, and x_3.
  • Each generating unit is designed according to the same technical logic. Based on the structure of the generation unit 504 with unit vector (1011) in FIG. 6, those skilled in the art can easily deduce the structure of the other generating units without creative work, so they are not detailed here.
  • some generation units do not need the adder 602 and the carry register 603, such as the generation units simulating the unit vectors (0000), (0001), (0010), (0100), and (1000); these units have at most one input in a cycle, so there is no addition and no carry.
  • the 2^N bit-stream output terminals 503 are respectively connected to the outputs of the adders 602 of the generation units to output the 2^N pattern vectors.
  • since N is 4, the 16 bit-stream output terminals 503 output 16 pattern vectors z_0 to z_15 in total.
  • the bit width of a pattern vector may be p_x (if the addition of the most significant bits produces no carry) or p_x+1 (if the addition of the most significant bits produces a carry).
  • the pattern vectors are all possible addition combinations of x_0, x_1, x_2, and x_3, namely: z_0 = 0, z_1 = x_0, z_2 = x_1, z_3 = x_0+x_1, z_4 = x_2, z_5 = x_0+x_2, z_6 = x_1+x_2, z_7 = x_0+x_1+x_2, z_8 = x_3, z_9 = x_0+x_3, z_10 = x_1+x_3, z_11 = x_0+x_1+x_3, z_12 = x_2+x_3, z_13 = x_0+x_2+x_3, z_14 = x_1+x_2+x_3, and z_15 = x_0+x_1+x_2+x_3.
  • each inner product unit 308 is equivalent to a processor core, realizing the pattern index stage and the weighted synthesis stage.
  • the present invention does not limit the number of inner product units 308.
  • the inner product unit 308 receives the second vector y from the processing unit memory proxy unit 305 and takes the data vectors along the length direction of y as indices.
  • according to each index, the corresponding specific pattern vector is selected from all pattern vectors; these specific pattern vectors are accumulated, generating a one-bit intermediate result in each cycle and forming a unit accumulation sequence over p_x or p_x+1 consecutive cycles. These operations implement Z·B_col together with the weighting by C.
  • FIG. 7 shows a schematic diagram of the inner product unit 308 of this embodiment.
  • the inner product unit 308 includes p_y multiplexers 701 and p_y-1 serial full adders 702.
  • Each multiplexer 701 receives all pattern vectors (z_0 to z_15) and, according to one same-position data vector along the length direction of the second vector y, lets one specific pattern vector among all pattern vectors pass. Since the length of y is N, y can be disassembled into N data vectors; since N is 4, y is disassembled into the 4 data vectors y_0, y_1, y_2, and y_3, and the bit width of each data vector is p_y, so viewed bit position by bit position these data vectors can be regrouped into p_y same-position data vectors.
  • the most significant bits of the four data vectors y_0, y_1, y_2, and y_3 form the highest same-position data vector 703, the next-highest bits form the next-highest same-position data vector 704, and so on, until the least significant bits of y_0, y_1, y_2, and y_3 form the lowest same-position data vector 705.
  • the multiplexer 701 judges which unit vector of the binary matrix K is identical to the input same-position data vector and outputs the specific pattern vector corresponding to that unit vector.
  • the highest same-position data vector 703 is input to the first multiplexer as the selection signal; assuming the vector 703 is (0101), which is identical to the unit vector 505 in Fig. 5, the first multiplexer outputs the specific pattern vector z_5 corresponding to the unit vector 505.
  • the next-highest same-position data vector 704 is input to the second multiplexer as the selection signal; assuming the vector 704 is (0010), which is identical to the unit vector 506 in FIG. 5, the second multiplexer outputs the specific pattern vector z_2 corresponding to the unit vector 506.
  • likewise, the lowest same-position data vector 705 is input to the p_y-th multiplexer as the selection signal; assuming the vector 705 is (1110), which is identical to the unit vector 507 in Figure 5, the p_y-th multiplexer outputs the specific pattern vector z_14 corresponding to the unit vector 507. This completes the pattern index operation Z·B_col.
  • The serial full adders 702 implement the weighted synthesis stage. The p_y-1 serial full adders 702 are connected in series as shown in the figure; they receive the specific pattern vectors output by the multiplexers 701 and accumulate them sequentially to obtain the unit accumulation sequence. Note that, so that the accumulation and any carry propagate correctly from a lower bit to the next bit, the specific pattern vector corresponding to the lowest same-position data vector 705 must be routed to the outermost serial full adder 702.
  • the outermost serial full adder 702 lets the specific pattern vectors corresponding to lower same-position data vectors be accumulated first, while the specific pattern vectors corresponding to higher same-position data vectors are routed to the inner serial full adders 702; the specific pattern vector corresponding to the highest same-position data vector 703 must be routed to the innermost serial full adder 702, so that pattern vectors of higher significance are accumulated later. This guarantees the correctness of the accumulation; that is, it realizes the weighting vector C of length p_y reflecting the power-of-two weights of the second vector y.
  • the unit accumulation sequence thus realizes the weighting by C on the basis of Z·B_col. At this point, the intermediate results 401, 402, 403, and 404 shown in FIG. 4 are obtained.
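Functionally, the pattern index and weighted synthesis stages behave as sketched below. This is a word-level model under stated assumptions: the multiplexer selection index is formed from the same-position bits of y_0..y_(N-1), and the serial full adders' low-to-high accumulation order is abstracted into the 2^j weights; names are illustrative:

```python
def inner_product_unit(z, y, N, py):
    # Pattern index stage: for bit position j, a multiplexer selects the
    # pattern vector indexed by the same-position bits of y_0..y_(N-1).
    # Weighted synthesis stage: the serial full adders accumulate the
    # selected patterns from the lowest position to the highest, which
    # is modeled here by the 2**j weights.
    acc = 0
    for j in range(py):
        index = sum(((y[r] >> j) & 1) << r for r in range(N))
        acc += z[index] << j
    return acc

# Pattern vectors: all addition combinations of x_0..x_3 (cf. the conversion unit).
x, y, N, py = [13, 7, 10, 6], [9, 3, 12, 5], 4, 4
z = [sum(x[r] for r in range(N) if (c >> r) & 1) for c in range(1 << N)]
assert inner_product_unit(z, y, N, py) == sum(a * b for a, b in zip(x, y))
```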
  • the combining unit 309 is used for performing the sum calculation 405 as shown in FIG. 4 .
  • The combining unit 309 receives the unit accumulation sequences from the inner product units 308; each unit accumulation sequence is like the intermediate results 401, 402, 403, and 404 in FIG. 4. These intermediate results have already been aligned in the inner product units 308, and the combining unit 309 sums the aligned unit accumulation sequences to obtain the inner product result of the first vector x and the second vector y.
  • FIG. 8 shows a schematic diagram of the synthesis unit 309 of this embodiment.
  • the synthesis unit 309 in the figure exemplarily receives the outputs of 8 inner product units 308, that is, the unit accumulation sequences 801 to 808. These unit accumulation sequences 801 to 808 are the intermediate results obtained after the first vector x and the second vector y are split into 8 groups of data segments whose inner product calculations are handed to the eight inner product units 308 respectively.
  • the synthesis unit 309 includes seven full adder groups 809 to 815. Since the lowest-bit operation 816 and the highest-bit operation 817 each involve only one intermediate result, they do not need an adder group, as shown in FIG. 8.
  • FIG. 9 shows a schematic diagram of the full adder groups 810 to 815.
  • each of the full adder groups 810 to 815 includes a first full adder 901 and a second full adder 902, and the first full adder 901 and the second full adder 902 include multiplexers 903 and 904 respectively.
  • the input of the multiplexer 903 is connected to the adder's carry output and the constant 0, and the input of the multiplexer 904 is connected to the adder's carry output and the constant 1; the constants 0 and 1 respectively simulate the cases where the sum of the previous digit's intermediate results produces no carry and produces a carry.
  • accordingly, the first full adder 901 generates the sum of the intermediate results assuming no carry from the previous digit, and the second full adder 902 generates the sum of the intermediate results assuming a carry from the previous digit.
  • such a structure avoids waiting for the intermediate result of the previous digit before deciding the carry.
  • the design of synchronously calculating the non-carry and carry can reduce the operation delay time.
  • each of the full adder groups 810 to 815 also includes a multiplexer 905; the two sums of intermediate results are input to the multiplexer 905, which selects and outputs either the carried sum of intermediate results or the uncarried sum of intermediate results according to whether the calculation result of the previous digit produces a carry.
  • the accumulated output 818 is the inner product result of the first vector x and the second vector y.
  • the next-lowest full adder group 809 only includes the first full adder 901, which directly generates the sum without a carry-in, so the second full adder 902 and the multiplexer 905 are not needed.
  • more generally, when the first vector x and the second vector y are split into M data segments, M-1 full adder groups are configured, including M-1 first full adders 901, M-2 second full adders 902, and M-2 multiplexers 905.
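The behavior of such a full adder group is that of a carry-select stage: both candidate sums are computed in parallel, and the actual carry from the previous digit later selects one. A minimal word-level sketch follows (names are illustrative, and the real hardware operates bit-serially):

```python
def full_adder_group(a, b, width):
    # Carry-select stage: add the two intermediate results both ways
    # (carry-in 0 via the first full adder, carry-in 1 via the second),
    # and return a selector that picks the right (sum, carry-out) once
    # the previous stage's carry is known (the multiplexer 905).
    mask = (1 << width) - 1
    s0, s1 = a + b, a + b + 1            # speculative sums, computed in parallel
    pair0 = (s0 & mask, s0 >> width)     # (sum, carry-out) if carry-in is 0
    pair1 = (s1 & mask, s1 >> width)     # (sum, carry-out) if carry-in is 1
    return lambda carry_in: pair1 if carry_in else pair0

mux = full_adder_group(0b1011, 0b0110, width=4)
assert mux(0) == (0b0001, 1)             # 11 + 6     = 17 -> sum 1, carry 1
assert mux(1) == (0b0010, 1)             # 11 + 6 + 1 = 18 -> sum 2, carry 1
```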
  • the synthesis unit 309 can flexibly enable or disable the full adder groups. For example, when the first vector x and the second vector y are split into fewer data segments, a specific number of full adder groups can be properly disabled, so as to flexibly support various possible split numbers and expand the application scenarios of the combining unit 309.
  • after the inner product result of the first vector x and the second vector y is obtained in the synthesis unit 309, the processing unit memory proxy unit 305 receives the inner product result and sends it to the kernel memory agent 301.
  • the kernel memory agent 301 integrates the inner product results of all processing units 304 to generate the calculation result, which is sent to the off-chip memory 203 to complete the product operation of the first operand and the second operand.
  • the computing device 201 of this embodiment performs different numbers of inner product operations according to the length of the operands. Further, the processing array 303 can share the indices among the processing units 304 in the vertical direction and share the pattern vectors among the processing units 304 in the horizontal direction, so as to operate efficiently.
  • this embodiment adopts a two-level architecture, that is, a core memory agent 301 and a processing component memory agent unit 305 .
  • the starting address of the operand in the LLC is recorded in the kernel memory agent 301, and the kernel memory agent 301 simultaneously, continuously, and serially reads multiple operands from the LLC by self-increasing the address.
  • the source address is self-increasing, so the order of data blocks is deterministic.
  • the core controller 302 determines which processing elements 304 receive the data blocks, and the processing element control unit 306 then determines which inner product units 308 receive the data blocks.
  • FIG. 10 shows a flowchart of this embodiment.
  • in step 1001, a plurality of operands are read from the off-chip memory.
  • the starting address of the operands is set in the kernel memory agent, and the kernel memory agent reads multiple operands simultaneously, continuously, and serially through self-increasing addresses.
  • the fetching method is to read from the lower bits of these operands to the higher bits one by one.
  • in step 1002, the multiple operands are split into multiple vectors, and the multiple vectors include a first vector and a second vector.
  • the core controller splits each operand into multiple data segments, that is, multiple vectors, so that the core memory agent sends data segments to the processing array.
  • in step 1003, the inner product of the first vector and the second vector is computed according to their lengths to obtain an inner product result.
  • the processing array includes a plurality of processing units arranged in an array, and each processing unit inner-products the first vector and the second vector according to the length of the first vector and the length of the second vector to obtain an inner product result.
  • the pattern generation stage is executed first, then the pattern index stage is executed, and finally the weighted synthesis stage is executed.
  • the inner product is likewise converted into x^T·K·B_col·C, where K is a fixed binary matrix with a size of N×2^N, B_col is a binary matrix with a size of 2^N×p_y, and C is a weighting vector of length p_y.
  • the implementations of K, B_col, and C are the same as in the aforementioned embodiment.
  • This embodiment disassembles the second vector y in the above-mentioned way so that each element of y can be represented by the two binary matrices K and B_col.
  • in other words, this embodiment converts the inner product operation x^T·y into the operation x^T·K·B_col·C.
  • in the pattern generation stage, this embodiment obtains all possibilities of x^T·K, generating the pattern vectors; during the pattern index stage, this embodiment computes Z·B_col; in the weighted synthesis stage, the indexed patterns are accumulated according to the weights C. Such a design enables operands, no matter how high their precision, to be converted into an index mode to perform inner products, reducing repeated calculation and avoiding the high bandwidth requirements of arbitrary-precision calculation.
  • FIG. 11 further shows a flowchart of the inner product of the first vector and the second vector.
  • in step 1101, a plurality of pattern vectors are generated according to the length and bit width of the first vector.
  • corresponding to the length N of the first vector x, N data vectors are received.
  • corresponding to the 2^N unit vectors of K, each unit vector is simulated in hardware to generate the 2^N pattern vectors respectively. Since an inner product in binary is in fact an addition of individual bits, the generating component of this embodiment directly simulates all unit vectors in K and adds the bits of the data vectors of the first vector x in sequence.
  • in each cycle, the same-position bits of the first vector's data vectors are input simultaneously: for example, the lowest bits of the data vectors are input simultaneously in the first cycle, the next-lowest bits simultaneously in the second cycle, and so on until the highest bits of the data vectors are input simultaneously in the p_x-th cycle.
  • in this way, the required bandwidth is only N bits per cycle.
  • the pattern vectors are all possible addition combinations of the data vectors of the first vector x.
  • in step 1102, the data vectors along the length direction of the second vector y are used as indices, and specific pattern vectors among the plurality of pattern vectors are accumulated to form a plurality of unit accumulation sequences.
  • This step implements the pattern indexing phase and the weighted synthesis phase.
  • with the data vectors along the length direction of the second vector y as indices, the corresponding specific pattern vector is selected from all pattern vectors according to each index; these specific pattern vectors are accumulated, generating a one-bit intermediate result in each cycle and forming a unit accumulation sequence over p_x or p_x+1 consecutive cycles. These operations implement Z·B_col together with the weighting by C.
  • more specifically, the same-position data vectors along the length direction of y let specific pattern vectors among all pattern vectors pass. Since the length of y is N, y can be disassembled into N data vectors, each with bit width p_y, so viewed bit position by bit position these data vectors can be regrouped into p_y same-position data vectors.
  • each unit accumulation sequence is like the intermediate results 401 , 402 , 403 and 404 in FIG. 4 , and these intermediate results have been aligned.
  • in step 1103, the plurality of unit accumulation sequences are summed to obtain an inner product result.
  • in this embodiment, the first vector x and the second vector y are split into multiple data segments, and the inner product calculations are performed respectively to obtain the intermediate results. Since the lowest-bit operation and the highest-bit operation each involve only one intermediate result, they need no addition; for example, x_0·y_0 (lowest position) and x_7·y_3 (highest position) in Figure 4 need not be added to other intermediate results and are output directly. In other words, only the positions from the second-lowest to the second-highest need to perform the addition operation.
  • This embodiment adopts the design of synchronously calculating the non-carry and carry to reduce the operation delay time.
  • the sums of the intermediate results without carry and with carry are obtained at the same time, and then the carried sum of intermediate results or the uncarried sum of intermediate results is selected according to whether the calculation result of the previous digit produces a carry.
  • the accumulated output is the inner product result of the first vector x and the second vector y.
  • in step 1004, the inner product results are integrated into the calculation result of the multiple operands.
  • the core controller integrates or reduces the inner product results into the calculation result of the multiple operands and sends it to the kernel memory agent.
  • in step 1005, the calculation result is stored in the off-chip memory.
  • the kernel memory agent sends calculation results in parallel, first sending the lowest bits of these calculation results at the same time, and then sending the second-lowest bits of these calculation results at the same time, and in this way until the highest bits of these calculation results are sent at the same time.
  • Another embodiment of the present invention is a computer-readable storage medium, on which is stored computer program code for calculation with arbitrary precision.
  • when the computer program code is run by a processing device, the method shown in FIG. 10 or FIG. 11 is executed.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer readable memory.
  • when the solution of the present invention is embodied in the form of a software product (such as a computer-readable storage medium), the software product can be stored in a memory and may include several instructions to make a computer device (such as a personal computer, a server, or a network device) execute some or all of the steps of the methods described in the embodiments of the present invention.
  • the aforementioned memory or medium storing the program code may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
  • the present invention proposes a novel architecture to efficiently handle arbitrary-precision calculations. No matter how high the precision of the operands, the present invention can disassemble the operands and use indices to process fixed-length bit streams in parallel, avoiding bit-level redundancy such as sparsity or repeated calculation; without configuring high-bit-width hardware, it achieves the effect of flexible application and large-bit-width calculation.
  • the electronic equipment or device of the present invention may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, Internet of Things terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • Said means of transport include airplanes, ships, and/or vehicles;
  • said household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • said medical equipment includes nuclear magnetic resonance instruments, ultrasound scanners, and/or electrocardiographs.
  • the electronic equipment or device of the present invention can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical treatment. Further, the electronic device or device of the present invention can also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as cloud, edge, and terminal.
  • electronic devices or devices with high computing power according to the solution of the present invention can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or Edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, completing unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-end integration.
  • the present invention expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present invention is not limited by the order of the described actions. Therefore, according to the disclosure or teaching of the present invention, those skilled in the art can understand that some of the steps may be performed in another order or at the same time. Further, those skilled in the art can understand that the embodiments described in the present invention can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily indispensable for realizing one or some solutions of the present invention. In addition, according to different schemes, the descriptions of some embodiments of the present invention have different emphases. In view of this, for the parts not described in detail in a certain embodiment of the present invention, reference may be made to the relevant descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present invention.
  • multiple units in this embodiment of the present invention may be integrated into one unit, or each unit exists physically independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which can be, for example, a variable resistance memory (Resistive Random Access Memory, RRAM), dynamic Random Access Memory (Dynamic Random Access Memory, DRAM), Static Random Access Memory (Static Random Access Memory, SRAM), Enhanced Dynamic Random Access Memory (Enhanced Dynamic Random Access Memory, EDRAM), High Bandwidth Memory (High Bandwidth Memory , HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM and RAM, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention relates to an arbitrary-precision computing device and method, and a computer readable storage medium. A core memory agent reads a plurality of operands from an off-chip memory; a core controller splits the plurality of operands into a plurality of vectors; a processing array comprises a plurality of processing components; each processing component performs inner product on a first vector and a second vector according to the lengths of the first vector and the second vector so as to obtain inner product results; the core controller integrates the inner product results into calculation results of the plurality of operands; the core memory agent stores the calculation results in the off-chip memory.

Description

Inner Product Processing Component, Arbitrary-Precision Computing Device and Method, and Readable Storage Medium

Cross-Reference to Related Applications

This disclosure claims priority to Chinese patent application No. 202111221317.4, filed on October 20, 2021 and entitled "Inner Product Processing Component, Arbitrary-Precision Computing Device, Method, and Readable Storage Medium".

Technical Field

The present invention relates generally to the field of computers, and more specifically to an inner product processing component, an arbitrary-precision computing device, a method, and a readable storage medium.

Background Art

Arbitrary-precision computation represents operands with an arbitrary number of bits. It is crucial in many technical fields, such as supernova simulation, climate simulation, atomic simulation, artificial intelligence, and planetary orbit computation. These fields need to process data of hundreds, thousands, or even millions of bits, a range of bit widths far beyond the hardware capabilities of traditional processors.

Even prior-art processors with high bit widths cannot handle the variable lengths required by arbitrary-precision operations, because the optimal bit width varies widely between algorithms, and small differences in bit width can lead to significant differences in cost. The prior art has also proposed many techniques for improving computational efficiency at the architecture level, mainly effectual-only computation and approximate computation: the former performs only the essential computations, skipping or eliminating ineffectual ones such as those on sparse or duplicated data, while the latter replaces the original accurate data with less accurate data, such as low-bit-width or quantized data. However, for effectual-only computation, finding duplicated data is difficult and expensive; and approximate computation plainly contradicts the purpose of arbitrary-precision computation, which requires exact computation to achieve high precision. Finally, these prior-art techniques all inevitably lead to a large number of inefficient memory accesses.

Therefore, an efficient arbitrary-precision computation scheme is urgently needed.
Summary of the Invention

In order to at least partly solve the technical problems mentioned in the background art, the present invention provides an inner product processing component, an arbitrary-precision computing device, a method, and a readable storage medium.

In one aspect, the present invention discloses a processing component for computing the inner product of a first vector and a second vector, comprising a conversion unit, a plurality of inner product units, and a synthesis unit. The conversion unit generates a plurality of pattern vectors according to the length and bit width of the first vector. Each inner product unit uses the data vectors along the length direction of the second vector as indices to accumulate specific pattern vectors among the plurality of pattern vectors, forming a unit accumulation sequence. The synthesis unit sums the plurality of unit accumulation sequences to obtain the inner product result.

In another aspect, the present invention discloses an arbitrary-precision computing accelerator connected to an off-chip memory, comprising a core memory agent, a core controller, and a processing array. The core memory agent reads a plurality of operands from the off-chip memory. The core controller splits the plurality of operands into a plurality of vectors, including a first vector and a second vector. The processing array comprises a plurality of processing components, each of which computes the inner product of the first vector and the second vector according to their lengths to obtain an inner product result. The core controller integrates the inner product results into the calculation results of the plurality of operands, and the core memory agent stores the calculation results in the off-chip memory.

In another aspect, the present invention discloses an integrated circuit device comprising the aforementioned arbitrary-precision computing accelerator, a processing device, and an off-chip memory. The processing device controls the arbitrary-precision computing accelerator, and the off-chip memory includes an LLC, through which the accelerator and the processing device communicate.

In another aspect, the present invention discloses a board card comprising the aforementioned integrated circuit device.

In another aspect, the present invention discloses a method for computing the inner product of a first vector and a second vector, comprising: generating a plurality of pattern vectors according to the length and bit width of the first vector; using the data vectors along the length direction of the second vector as indices, accumulating specific pattern vectors among the plurality of pattern vectors to form a plurality of unit accumulation sequences; and summing the plurality of unit accumulation sequences to obtain the inner product result.

In another aspect, the present invention discloses an arbitrary-precision computation method, comprising: reading a plurality of operands from an off-chip memory; splitting the plurality of operands into a plurality of vectors, including a first vector and a second vector; computing the inner product of the first vector and the second vector according to their lengths to obtain an inner product result; integrating the inner product results into the calculation results of the plurality of operands; and storing the calculation results in the off-chip memory.

In another aspect, the present invention discloses a computer-readable storage medium storing computer program code for arbitrary-precision computation; when the computer program code is run by a processing device, the aforementioned methods are executed.

The present invention proposes an arbitrary-precision computation scheme that processes different bit streams in parallel and deploys a complete bit-serial data path to perform high-precision computation flexibly and elastically. The invention makes full use of a simple hardware configuration and reduces repeated computation, thereby realizing arbitrary-precision computation with low energy consumption.

Brief Description of the Drawings

The above and other objects, features, and advantages of exemplary embodiments of the present invention will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present invention are shown by way of illustration and not limitation, and the same or corresponding reference numerals indicate the same or corresponding parts. In the drawings:
FIG. 1 is a structural diagram showing a board card according to an embodiment of the present invention;

[Corrected 30.08.2022 under Rule 91] FIGS. 2A to 2C are structural diagrams showing an integrated circuit device according to an embodiment of the present invention;

FIG. 3 is a schematic diagram showing the internal structure of a computing device according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating an exemplary multiplication operation;

FIG. 5 is a schematic diagram showing a conversion unit according to an embodiment of the present invention;

FIG. 6 is a schematic diagram showing a generation unit according to an embodiment of the present invention;

FIG. 7 is a schematic diagram showing an inner product unit according to an embodiment of the present invention;

FIG. 8 is a schematic diagram showing a synthesis unit according to an embodiment of the present invention;

FIG. 9 is a schematic diagram showing a full adder group according to an embodiment of the present invention;

FIG. 10 is a flowchart illustrating arbitrary-precision computation according to another embodiment of the present invention; and

FIG. 11 is a flowchart illustrating the inner product of a first vector and a second vector according to another embodiment of the present invention.
Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.

It should be understood that the terms "first", "second", "third", and "fourth" in the claims, description, and drawings of the present invention are used to distinguish different objects rather than to describe a specific order. The terms "comprise" and "include" used in the description and claims of the present invention indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terms used in the description of the present invention are for the purpose of describing specific embodiments only and are not intended to limit the present invention. As used in the description and claims of the present invention, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the description and claims of the present invention refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.

As used in this description and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context.

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Arbitrary-precision computation plays a key role in many fields of science and technology. For example, solving the seemingly trivial equation x^3 + y^3 + z^3 = 3 by computer requires more than 200 digits of precision; in Ising theory, computing integrals requires more than 1000 digits of precision; and computing the volume of a knot complement in hyperbolic space involves up to 60,000 digits of precision. A tiny precision error may lead to a huge difference in the computed result, so arbitrary-precision computation is a very serious technical topic in the computer field.
The present invention proposes an efficient arbitrary-precision computing accelerator architecture. It mainly follows the computational form of the inner product operation and exploits the intra-parallelism and inter-parallelism of the accelerator architecture to realize the multiplication of operands.
FIG. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present invention. As shown in FIG. 1, the board card 10 includes a chip 101, which is a system on chip (SoC) integrating one or more combined processing devices. A combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in cloud intelligence, where a notable feature is the large amount of input data, which places high demands on the storage capacity and computing power of the platform. The board card 10 of this embodiment is suitable for cloud intelligence applications and provides large off-chip storage, large on-chip storage, and powerful computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a wifi interface. Data to be processed can be transferred from the external device 103 to the chip 101 through the external interface device 102, and the calculation results of the chip 101 can be transferred back to the external device 103 via the external interface device 102. Depending on the application scenario, the external interface device 102 may take different interface forms, such as a PCIe interface.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to the control device 106 and the chip 101 through a bus for data transfer. The control device 106 on the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a micro controller unit (MCU).
FIG. 2 is a structural diagram showing the combined processing device in the chip 101 of this embodiment. As shown in FIG. 2, the combined processing device includes a computing device 201, a processing device 202, an off-chip memory 203, a communication node 204, and an interface device 205. In this embodiment, several integration schemes can be used to coordinate the work of the computing device 201, the processing device 202, and the off-chip memory 203: FIG. 2A shows an LLC integration scheme, FIG. 2B shows an SoC integration scheme, and FIG. 2C shows an IO integration scheme.
The computing device 201 is configured to execute user-specified operations and is mainly implemented as a multi-core intelligent processor for performing deep learning or machine learning computations; it can interact with the processing device 202 to jointly complete user-specified operations. The computing device 201 contains the aforementioned arbitrary-precision computing accelerator for processing linear computation, more specifically the operand multiplications that arise in operations such as convolution.
The processing device 202, as a general-purpose processor, performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201, and nonlinear computation. Depending on the implementation, the processing device 202 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number can be determined according to actual needs. When the computing device 201 and the processing device 202 are considered together, they are regarded as forming a heterogeneous multi-core structure.
The off-chip memory 203 stores data to be processed and processed data. Ordered by access latency from low to high, its hierarchy can be divided into the level-1 cache (L1), the level-2 cache (L2), the level-3 cache (L3, also called the LLC), and physical memory. The physical memory is DDR, usually 16 GB or larger. When the computing device 201 or the processing device 202 wants to read data from the off-chip memory 203, since L1 is the fastest, L1 is usually accessed first; if the data is not in L1, L2 is accessed next; if the data is not in L2 either, L3 is accessed; and if the data is still not in L3, DDR is finally accessed. The cache hierarchy of the off-chip memory 203 speeds up data access by keeping the most frequently accessed data in the caches. Compared with the caches, DDR is quite slow: as the level increases (L1→L2→LLC→DDR), the access latency grows, but so does the storage capacity.
The communication node 204 is a routing node or router in a network-on-chip (NoC). When the computing device 201 or the processing device 202 generates a data packet, the packet is sent to the communication node 204 through a specific interface; the communication node 204 reads the address information in the head flit of the packet and uses a specific routing algorithm to compute the best routing path, thereby establishing a reliable transmission path to deliver the packet to the destination node (for example, the off-chip memory 203). Likewise, when the computing device 201 or the processing device 202 needs to read a data packet from the off-chip memory 203, the communication node 204 also computes the best routing path and sends the packet from the off-chip memory 203 to the computing device 201 or the processing device 202.
The interface device 205 is the external input/output interface of the combined processing device. When the combined processing device exchanges information with external equipment, since external equipment varies widely and each type has different requirements on the transmitted information, the interface device 205, according to the requirements of the sender and receiver of the data transfer, sets up data buffering to resolve the mismatch caused by their speed difference, performs signal level conversion, sets up information conversion logic to satisfy the format requirements of each side, sets up timing control circuits to synchronize the work of the sender and receiver, and provides address transcoding and other tasks.
The LLC integration of FIG. 2A means that the computing device 201 and the processing device 202 communicate through the LLC; the SoC integration of FIG. 2B integrates the computing device 201, the processing device 202, and the off-chip memory 203 through the communication node 204; and the IO integration of FIG. 2C integrates the computing device 201, the processing device 202, and the off-chip memory 203 through the interface device 205. These three integration schemes are only examples, and the present invention does not limit the manner of integration.
This embodiment preferably adopts the LLC integration scheme. Since the core of deep learning and machine learning is the convolution operator, the basis of the convolution operator is the inner product operation, and the inner product operation is composed of multiplications and additions, the main task of the computing device 201 is a large number of low-level operations such as multiplications and additions. When performing training and inference of neural network models, the computing device 201 and the processing device 202 need intensive interaction; integrating them via the LLC and sharing data through the LLC achieves a low interaction cost. Furthermore, since high-precision data may have millions of bits while the capacities of L1 and L2 are limited, interacting through L1 and L2 would lead to insufficient capacity. The computing device 201 uses the relatively large capacity of the LLC to cache high-precision data, saving the time of repeated accesses.
FIG. 3 shows a schematic diagram of the internal structure of the computing device 201, which includes a core memory agent 301, a core controller 302, and a processing array 303.
The core memory agent 301 serves as the management endpoint through which the computing device 201 accesses the off-chip memory 203. When the core memory agent 301 reads operands from the off-chip memory 203, the starting addresses of the operands are set in the core memory agent 301, which reads multiple operands simultaneously, continuously, and serially by auto-incrementing the address. The reading proceeds in one pass from the low bits of the operands to the high bits. For example, when three operands need to be read, the lowest 512 bits of the first operand are read serially according to its starting address, then the lowest 512 bits of the second operand, then the lowest 512 bits of the third operand; after the lowest bits have been read, the address is auto-incremented (by 512 bits) and the next-lowest 512 bits of each operand are read serially, and so on until the highest bits of the three operands have been read. When the core memory agent 301 stores calculation results back to the off-chip memory 203, it sends them in parallel: for example, if the core memory agent 301 needs to send three calculation results to the off-chip memory 203, it sends the lowest bits of the three results simultaneously, then their next-lowest bits simultaneously, and so on until the highest bits of the three results have been sent simultaneously. Generally, these operands are represented in the form of matrices or vectors.
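For illustration only, the following Python sketch reproduces the fetch order just described; the 512-bit chunk size is taken from the example above, while the function and variable names are hypothetical and not part of the patented design:

    CHUNK_BITS = 512

    def read_order(operand_bit_lengths):
        """Yield (operand_index, chunk_index) pairs in the order the core
        memory agent would fetch them: chunk 0 of every operand first,
        then chunk 1 of every operand, and so on, from low bits to high."""
        max_chunks = max((n + CHUNK_BITS - 1) // CHUNK_BITS
                         for n in operand_bit_lengths)
        for chunk in range(max_chunks):
            for op in range(len(operand_bit_lengths)):
                if chunk * CHUNK_BITS < operand_bit_lengths[op]:
                    yield (op, chunk)

    # Three operands of 1536 bits each -> nine fetches, lowest 512 bits first:
    # [(0,0), (1,0), (2,0), (0,1), (1,1), (2,1), (0,2), (1,2), (2,2)]
    print(list(read_order([1536, 1536, 1536])))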
Based on the computing power and number of the processing components in the processing array 303, the core controller 302 controls the splitting of each operand into multiple data segments, that is, multiple vectors, so that the core memory agent 301 sends data to the processing array 303 in units of data segments.
The processing array 303 performs the multiplication of two operands. For example, a first operand can be split into eight data segments x0 to x7, and a second operand into four data segments y0 to y3; when the first operand is multiplied by the second operand, the computation expands as shown in FIG. 4. The processing array 303 splits the first and second operands, performs the inner product computations separately, and then shifts, aligns, and sums the intermediate results 401, 402, 403, and 404 to obtain the result of the multiplication.
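As a numerical illustration of the FIG. 4 expansion, the following minimal Python sketch splits two integers into fixed-width segments, forms the partial products, and shift-aligns and sums them; the segment width and operand values are illustrative assumptions, not the patented hardware:

    W = 8                      # assumed segment width in bits

    def split(value, n_segments):
        mask = (1 << W) - 1
        return [(value >> (W * i)) & mask for i in range(n_segments)]

    x, y = 0x1234567890ABCDEF, 0xCAFEBABE
    xs, ys = split(x, 8), split(y, 4)          # x0..x7 and y0..y3

    result = 0
    for j, yj in enumerate(ys):                # one "row" per y segment
        row = sum(xi * yj << (W * i) for i, xi in enumerate(xs))
        result += row << (W * j)               # shift-align, then accumulate

    assert result == x * y                     # matches the direct product

Each row corresponds to one of the intermediate results 401 to 404, and the final accumulation corresponds to the shift-align-sum step described above.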
For clarity of the technical solution, the above data segments are hereinafter uniformly represented as vectors; multiplying two data segments is the inner product of two vectors (a first vector and a second vector), where the first vector comes from the first operand and the second vector comes from the second operand.
The processing array 303 includes multiple processing components 304 arranged in an array; the figure shows 4×8 processing components 304 as an example, and the present invention does not limit their number. Each processing component 304 computes the inner product of the first vector and the second vector according to their lengths to obtain an inner product result. Finally, the core controller 302 controls the integration or reduction of the inner product results into the calculation results of the operands, which are sent to the core memory agent 301, and the core memory agent 301 stores the calculation results in the off-chip memory 203.
Specifically, the computing device 201 adopts recursive decomposition in its control. When the computing device 201 receives an instruction from the processing device 202 to perform an arbitrary-precision computation, the core controller 302 evenly splits the operands of the multiplication into multiple vectors and sends them to the processing array 303 for computation; each processing component 304 is responsible for the computation of one group of vectors, such as the inner product of a first vector and a second vector. In this embodiment, each processing component 304 further splits a group of vectors into smaller inner product computation units based on its own hardware resources, to facilitate the inner product computation. The computing device 201 adopts multiple bit streams on the data path: each operand is streamed from the core memory agent 301 to the processing components 304 at one bit per cycle, but multiple operands are transmitted in parallel at the same time. After the computation is finished, the processing components 304 send the inner product results to the core memory agent 301 in a bit-serial manner.
As the core computation unit of the computing device 201, the main task of the processing component 304 is inner product computation. The processing component 304 processes the bit-indexed vector inner product in three stages: the first stage is pattern generation, the second stage is pattern indexing, and the third stage is weighted synthesis.
Take the inner product of a first vector x and a second vector y as an example. Assume the sizes of x and y are N×p_x and N×p_y respectively, where N is the length of both vectors (more precisely, the number of row elements), p_x is the bit width of x, and p_y is the bit width of y. In this embodiment, to compute the inner product of x and y, x is first transposed and then multiplied with y, i.e. (p_x×N)·(N×p_y), to generate an inner product result of size p_x×p_y.
This embodiment decomposes the second vector y as:

    y = K · B_col · C

where K is a fixed binary matrix of size N×2^N, B_col is a binary matrix of size 2^N×p_y, and C is a p_y weighting vector.
The elements along the length direction of the first vector x can be arranged in 2^N patterns. Taking N as 2, i.e. the length of x is 2, K is divided into 2^N unit vectors according to the length of x, enumerating all possible unit vectors of length 2. K is therefore a binary matrix of size 2×2^2, covering all possibilities of combinations of two elements; there are four such length-2 combinations, (0,0), (1,0), (0,1), and (1,1), so K takes a fixed form (one enumeration order shown here):

    K = | 0 1 0 1 |
        | 0 0 1 1 |

In other words, once the lengths of the first vector x and the second vector y are determined, the size and element values of K are determined.
B_col is a one-hot matrix: each column has exactly one element equal to 1 and the rest equal to 0, and which element is 1 depends on which column of K the corresponding column of the second vector y matches. For ease of illustration, example values of the first vector x and the second vector y are set (the concrete values appear in the figures of the original application).

Comparing the second vector y with K, the first column of y is the fourth column of K, the second column of y is the third column of K, the third column of y is the fourth column of K, and the fourth column of y is the first column of K. Hence, when y is expressed as K·B_col, B_col is the following index matrix of size 2^2×4:

    B_col = | 0 0 0 1 |
            | 0 0 0 0 |
            | 0 1 0 0 |
            | 1 0 1 0 |

In the first column of B_col only the fourth element is 1, indicating that the first column of y is the fourth column of K; in the second column of B_col only the third element is 1, indicating that the second column of y is the third column of K; in the third column of B_col only the fourth element is 1, indicating that the third column of y is the fourth column of K; and in the fourth column of B_col only the first element is 1, indicating that the fourth column of y is the first column of K. In summary, once K is determined, the element values of B_col are determined as well.
C is the p_y weighting vector, reflecting the powers, i.e. the bit width, of the second vector y. Since p_y is 4, the powers of y go up to 4, so C is:

    C = (2^0, 2^1, 2^2, 2^3)^T
This embodiment decomposes the second vector y in the above manner, so that each element of y can be represented by the two binary matrices K and B_col. In other words, this embodiment converts the inner product operation x^T·y into the operation x^T·K·B_col·C.
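The decomposition can be checked numerically. The following NumPy sketch, for N = 2 and p_y = 4, uses the enumeration order of K shown above and illustrative values for x and y (the original application's concrete example values are not reproduced here); it derives B_col from the bit columns of y and verifies both the reconstruction of y and the converted inner product:

    import numpy as np

    K = np.array([[0, 1, 0, 1],
                  [0, 0, 1, 1]])               # all 2^N length-2 bit columns

    y = np.array([3, 2])                        # example pair of 4-bit values
    # Bit matrix of y: column j holds bit j of every element of y.
    bits = np.array([[(v >> j) & 1 for j in range(4)] for v in y])

    # B_col is one-hot: column j selects the column of K equal to bit column j.
    B_col = np.zeros((4, 4), dtype=int)
    for j in range(4):
        col = bits[:, j]
        idx = next(i for i in range(4) if np.array_equal(K[:, i], col))
        B_col[idx, j] = 1

    C = np.array([1, 2, 4, 8])                  # p_y weighting vector (powers of 2)
    assert np.array_equal(K @ B_col @ C, y)     # reconstruction of y holds

    x = np.array([5, 9])                        # first vector (values illustrative)
    assert x @ K @ B_col @ C == x @ y           # inner product via the decomposition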
The processing component 304 realizes the vector inner product x^T·K·B_col·C based on the above conversion. In the pattern generation stage, the processing component 304 obtains all possibilities of x^T·K, i.e. generates the pattern vectors z. In the pattern indexing stage, the processing component 304 computes (x^T·K)·B_col. In the weighted synthesis stage, the processing component 304 accumulates the indexed patterns according to the weights C. With this design, operands of any precision can be converted into indexed patterns for the inner product, reducing repeated computation and avoiding the high bandwidth requirements of arbitrary-precision computation.
FIG. 3 further shows a schematic structural diagram of the processing component 304. To realize the aforementioned three stages, the processing component 304 includes a processing component memory agent unit 305, a processing component control unit 306, a conversion unit 307, multiple inner product units 308, and a synthesis unit 309.
The processing component memory agent unit 305 serves as the interface through which the processing component 304 accesses the core memory agent 301, and receives the two vectors whose inner product is to be computed, for example the aforementioned first vector x and second vector y.
The processing component control unit 306 coordinates and manages the work of the units in the processing component 304.
The conversion unit 307 implements the pattern generation stage. It receives the first vector x from the processing component memory agent unit 305 and realizes the binary matrix K in hardware, executing x^T·K to generate the multiple pattern vectors z. FIG. 5 shows a schematic diagram of the conversion unit 307, which includes N bit stream inputs 501, a generation assembly 502, and 2^N bit stream outputs 503.
The N bit stream inputs 501 correspond to the length N of the first vector x and respectively receive N data vectors. FIG. 5 illustrates the case where the length of x is 4: x comprises four data vectors x0, x1, x2, x3, each of bit width p_x, that is, each data vector has p_x bits.
The generation assembly 502 is the core element that executes x^T·K. Corresponding to the 2^N unit vectors of K, the generation assembly 502 includes 2^N generation units, each of which simulates one unit vector, so as to generate the 2^N pattern vectors z. As shown in FIG. 5, the first vector x is split into the four data vectors x0, x1, x2, x3, which are input in parallel from the left side of the generation assembly 502. Since the inner product in binary is in fact the addition of the individual bits, the generation assembly 502 directly simulates all the unit vectors of K in hardware and adds the bits of x0, x1, x2, x3 in sequence. In more detail, in each cycle the same-position bits of x0, x1, x2, x3 are input simultaneously: for example, in the first cycle the lowest bits of x0, x1, x2, x3 are input simultaneously, in the second cycle their next-lowest bits, and so on until the highest bits of x0, x1, x2, x3 are input simultaneously in the p_x-th cycle. The required bandwidth is only N bits per cycle; in this example, only 4 bits per cycle.
In the case where the length of the first vector x is 4, the generation assembly 502 includes 16 generation units, which respectively simulate the 16 unit vectors of K: (0000), (0001), (0010), (0011), (0100), (0101), (0110), (0111), (1000), (1001), (1010), (1011), (1100), (1101), (1110), and (1111).
FIG. 6 shows a schematic diagram of the generation unit 504, whose unit vector is (1011). Taking the generation unit 504 as an example, it simulates the unit vector (1011), so it includes three element registers 601, an adder 602, and a carry register 603. The three element registers 601 receive and buffer the bit values of the data vectors corresponding to the simulated unit vector, namely the bit values of x0, x1, and x3, while the bit values of x2 are simply ignored. This structure realizes:

    z11 = x0 + x1 + x3
The values in the registers 601 are sent to the adder 602 for accumulation. If a carry occurs after the accumulation, the carry value is buffered in the carry register 603 and added to the bit values of x0, x1, x3 input in the next cycle, until the highest bits of x0, x1, x3 are added in the p_x-th cycle. Every generation unit is designed according to the same technical logic; based on the structure of the generation unit 504 realizing the unit vector (1011) in FIG. 6, those skilled in the art can easily derive the structures of the other generation units without creative effort, so they are not described again. Note in particular that some generation units do not need the adder 602 and the carry register 603, for example the generation units simulating the unit vectors (0000), (0001), (0010), (0100), and (1000): these units have only one input in a given cycle, so no addition occurs and no carry can arise.
Returning to FIG. 5, the 2^N bit stream outputs 503 are respectively connected to the outputs of the adders 602 of the generation units, to output the 2^N pattern vectors z. In FIG. 5, since N is 4, the 16 bit stream outputs 503 output 16 pattern vectors in total. The bit width of these pattern vectors may be p_x (if the addition of the highest bits produces no carry) or p_x+1 (if the addition of the highest bits produces a carry). As can be seen from FIG. 5, the pattern vectors are all possible addition combinations of x0, x1, x2, x3, namely:
z0 = 0
z1 = x0
z2 = x1
z3 = x0 + x1
z4 = x2
z5 = x0 + x2
z6 = x1 + x2
z7 = x0 + x1 + x2
z8 = x3
z9 = x0 + x3
z10 = x1 + x3
z11 = x0 + x1 + x3
z12 = x2 + x3
z13 = x0 + x2 + x3
z14 = x1 + x2 + x3
z15 = x0 + x1 + x2 + x3
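Functionally, the pattern vectors are exactly the subset sums of the data vectors, as the following Python sketch illustrates; the values of x are illustrative, and the indexing is chosen so that bit i of the index k selects xi, matching the list above:

    x = [5, 9, 2, 7]                      # x0, x1, x2, x3 (illustrative)
    N = len(x)

    # z_k is the sum of those x_i whose bit is set in k.
    z = [sum(x[i] for i in range(N) if (k >> i) & 1) for k in range(2 ** N)]

    assert z[0] == 0                      # z0 = 0
    assert z[3] == x[0] + x[1]            # z3 = x0 + x1
    assert z[11] == x[0] + x[1] + x[3]    # z11 = x0 + x1 + x3
    assert z[15] == sum(x)                # z15 = x0 + x1 + x2 + x3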
The pattern vectors z are sent to the inner product units 308. This embodiment has multiple inner product units 308, each equivalent to a processor core, which realize the pattern indexing stage and the weighted synthesis stage; the present invention does not limit the number of inner product units 308. An inner product unit 308 receives the second vector y from the processing component memory agent unit 305, uses the data vectors along the length direction of y as indices, selects for each index the corresponding specific pattern vector from all the pattern vectors z, and accumulates these specific pattern vectors, generating one bit of intermediate result per cycle and forming a unit accumulation sequence over p_x or p_x+1 consecutive cycles. This operation performs (x^T·K)·B_col·C.
FIG. 7 shows a schematic diagram of the inner product unit 308 of this embodiment. To realize (x^T·K)·B_col·C, the inner product unit 308 includes p_y multiplexers 701 and p_y-1 serial full adders 702.
The p_y multiplexers 701 realize the pattern indexing stage. Each multiplexer 701 receives all the pattern vectors z (z0 to z15) and, according to a same-position data vector along the length direction of the second vector y, lets a specific pattern vector among all the pattern vectors pass. Since the length of y is N, y can be decomposed into N data vectors; as N is 4, y decomposes into the four data vectors y0, y1, y2, y3, each of bit width p_y. Viewed bit position by bit position, these data vectors can therefore be decomposed into p_y same-position data vectors. For example, the highest bits of the four data vectors y0, y1, y2, y3 form the highest same-position data vector 703, their next-highest bits form the next-highest same-position data vector 704, and so on; their lowest bits form the lowest same-position data vector 705.

A multiplexer 701 determines which unit vector of the binary matrix K the input same-position data vector equals, and outputs the specific pattern vector corresponding to that unit vector. For example, the highest same-position data vector 703 is input to the first multiplexer as the select signal; supposing it is (0101), the same as the unit vector 505 in FIG. 5, the first multiplexer outputs the specific pattern vector z5 corresponding to the unit vector 505. As another example, the next-highest same-position data vector 704 is input to the second multiplexer as the select signal; supposing it is (0010), the same as the unit vector 506 in FIG. 5, the second multiplexer outputs the specific pattern vector z2 corresponding to the unit vector 506. Finally, the lowest same-position data vector 705 is input to the p_y-th multiplexer as the select signal; supposing it is (1110), the same as the unit vector 507 in FIG. 5, the p_y-th multiplexer outputs the specific pattern vector z14 corresponding to the unit vector 507. This completes the operation (x^T·K)·B_col.
The serial full adders 702 realize the weighted synthesis stage. The p_y-1 serial full adders 702 are connected in series as shown in the figure; they receive the specific pattern vectors output by the multiplexers 701 and accumulate them in order to obtain the unit accumulation sequence. Note in particular that, to accumulate from the low bit upward and propagate any carry to the next bit so that the next bit can be correctly accumulated and carried, the specific pattern vector corresponding to the lowest same-position data vector 705 must be fed to the outermost serial full adder 702, so that the specific pattern vectors corresponding to lower same-position data vectors are accumulated first; the specific pattern vector corresponding to a higher same-position data vector is fed to a more inward serial full adder 702, and the specific pattern vector corresponding to the highest same-position data vector 703 must be fed to the innermost serial full adder 702, so that the specific pattern vectors of higher same-position data vectors are accumulated later. Only in this way is the correctness of the accumulation ensured, namely weighting according to the p_y weighting vector C so as to reflect the powers of the second vector y. The unit accumulation sequence thus further realizes the weighting by C on top of (x^T·K)·B_col. At this point the intermediate results 401, 402, 403, and 404 of FIG. 4 are obtained.
The synthesis unit 309 performs the summation 405 of FIG. 4. It receives the unit accumulation sequences from the inner product units 308; each unit accumulation sequence corresponds to one of the intermediate results 401, 402, 403, and 404 in FIG. 4, and these intermediate results have already been aligned in the inner product units 308. The synthesis unit 309 then sums these aligned unit accumulation sequences to obtain the inner product result of the first vector x and the second vector y.
Fig. 8 shows a schematic diagram of the combining unit 309 of this embodiment. The combining unit 309 in the figure exemplarily receives the outputs of eight inner product units 308, namely the unit accumulation sequences 801 to 808. These are the intermediate results obtained after the first vector x and the second vector y are split into eight data segments whose inner products are computed by the eight inner product units 308 respectively. The combining unit 309 includes seven full adder groups 809 to 815. Since the lowest-bit operation 816 and the highest-bit operation 817 each involve only one intermediate result (like x_0y_0, the lowest bit, and x_7y_3, the highest bit, in Fig. 4), they need no adder group and can be output directly without being added to other intermediate results. In other words, only the operations from the second-lowest bit to the second-highest bit require full adder groups to perform the summation 405 of Fig. 4.
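Viewed at word level, the combining step reduces to a multi-operand addition of pre-aligned values; a small hedged illustration (the alignment offsets in the example are assumed, not taken from Fig. 4):

def combine(aligned_sequences):
    # each sequence arrives already shifted to its bit offset inside the
    # inner product unit, so only positions covered by two or more
    # sequences need full adder groups; lone end bits pass through
    return sum(aligned_sequences)

# e.g. two partials two bits apart: combine([0b1011, 0b0110 << 2]) == 35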
Fig. 9 shows a schematic diagram of the full adder groups 810 to 815. Each of the full adder groups 810 to 815 includes a first full adder 901 and a second full adder 902, which include multiplexers 903 and 904 respectively. The inputs of multiplexer 903 are the carry output of the adder and the value 0; the inputs of multiplexer 904 are the carry output of the adder and the value 1. The values 0 and 1 model, respectively, the cases where the summed intermediate results of the previous digit did not and did produce a carry, so the first full adder 901 generates the sum of intermediate results assuming no carry from the previous digit, and the second full adder 902 generates the sum assuming a carry from the previous digit. With this structure there is no need to wait for the previous digit's intermediate result before deciding whether to carry; computing the no-carry and carry cases simultaneously reduces the operation latency. The full adder groups 810 to 815 further include a multiplexer 905; both intermediate-result sums are input to multiplexer 905, which selects the carried or the uncarried sum depending on whether the previous digit's result produced a carry. The accumulated output 818 is the inner product of the first vector x and the second vector y.
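This is the carry-select principle; a minimal sketch of one digit position, following the structure of Fig. 9 under assumed names:

def full_add(a, b, cin):
    s = a + b + cin
    return s & 1, s >> 1                 # (sum bit, carry out)

def carry_select_digit(a, b, prev_carry):
    s0, c0 = full_add(a, b, 0)           # first full adder 901: assumes no carry in
    s1, c1 = full_add(a, b, 1)           # second full adder 902: assumes carry in
    # multiplexer 905 picks the precomputed case once prev_carry is known
    return (s1, c1) if prev_carry else (s0, c0)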
Returning to Fig. 8: since the lowest-bit operation cannot produce a carry, the full adder group 809 for the second-lowest bit includes only the first full adder 901 and directly generates the no-carry intermediate result, with no need for a second full adder 902 or multiplexer 905.
Per Figs. 8 and 9 and the related description, when the combining unit 309 of this embodiment is to sum M unit accumulation sequences, it is configured with M-1 full adder groups, comprising M-1 first full adders 901, M-2 second full adders 902 and M-2 multiplexers 905.
In other cases, the combining unit 309 can flexibly enable or disable the operation of the full adder groups. For example, when the first vector x and the second vector y produce fewer than M unit accumulation sequences, an appropriate number of full adder groups can be switched off, flexibly supporting any possible split count and broadening the application scenarios of the combining unit 309.
Returning to Fig. 3: after the combining unit 309 obtains the inner product of the first vector x and the second vector y, it sends the result to the processing component memory proxy unit 305, which forwards it to the core memory agent 301. The core memory agent 301 integrates the inner product results of all processing components 304 to generate the computation result and sends it to the off-chip memory 203, completing the product of the first operand and the second operand.
Based on the above structure, the computing device 201 of this embodiment performs different numbers of inner product operations according to the length of the operands. Further, the processing array 303 can share indexes among processing components 304 in the vertical direction and share pattern vectors among processing components 304 in the horizontal direction, so that operations proceed efficiently.
For data-path management, this embodiment adopts a two-level architecture: the core memory agent 301 and the processing component memory proxy units 305. The starting addresses of the operands in the LLC are recorded in the core memory agent 301, which reads multiple operands from the LLC simultaneously, continuously and serially by auto-incrementing the address. Because the source address is self-incrementing, the order of the data blocks is deterministic. The core controller 302 decides which processing components 304 receive the data blocks, and the processing component control units 306 in turn decide which inner product units 308 receive them.
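As a rough Python illustration of the deterministic order produced by auto-incremented serial reads (names and granularity are this sketch's assumptions, not the patent's):

def read_operand_blocks(base_addr, n_blocks):
    # the core memory agent walks one operand stream from its starting LLC
    # address, low-order block first, by auto-incrementing the source address
    addr = base_addr
    for _ in range(n_blocks):
        yield addr                       # block order is fixed by the increment
        addr += 1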
Another embodiment of the present invention is an arbitrary-precision computing method, which can be realized with the hardware structure of the foregoing embodiments. Fig. 10 shows a flowchart of this embodiment.
In step 1001, multiple operands are read from the off-chip memory. When reading the operands, their starting addresses are set in the core memory agent, and the core memory agent reads the operands simultaneously, continuously and serially by auto-incrementing the address, reading in one pass from the low-order bits of the operands toward the high-order bits.
In step 1002, the operands are split into multiple vectors, including a first vector and a second vector. Based on the computing capability and number of processing components in the processing array, the core controller directs that each operand be split into multiple data segments, i.e., multiple vectors, so that the core memory agent sends them to the processing array segment by segment.
In step 1003, the first vector and the second vector are inner-producted according to their lengths to obtain an inner product result. The processing array includes multiple processing components arranged as an array; each processing component computes the inner product of the first vector and the second vector according to their lengths. In more detail, this step first executes the pattern-generation stage, then the pattern-index stage, and finally the weighted-synthesis stage.
Take the inner product of the first vector x and the second vector y as an example, and assume their sizes are N×p_x and N×p_y respectively, where N is the length of both vectors, p_x is the bit width of x, and p_y is the bit width of y. This embodiment likewise decomposes the second vector y as:

y = K · B_col · C
where K is a fixed binary matrix of size N×2^N, B_col is a binary matrix of size 2^N×p_y, and C is the p_y weighting vector; K, B_col and C are defined as in the foregoing embodiment and are not repeated here. Decomposing the second vector y in this way lets every element of y be represented by the two binary matrices K and B_col. In other words, this embodiment converts the inner product x^T·y into the computation x^T·K·B_col·C.
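A hedged numeric check of this decomposition for small N and p_y; the sketch uses NumPy, and the matrix layout and big-endian bit convention below are its own assumptions:

import numpy as np

N, p_y = 3, 4
y = np.array([5, 2, 7])                  # second vector, p_y-bit elements

# K: fixed N x 2^N matrix; column j, read top to bottom, is the big-endian
# binary code of j (for N = 3, column 5 is (1, 0, 1))
K = np.array([[(j >> (N - 1 - i)) & 1 for j in range(2**N)] for i in range(N)])

# B_col: 2^N x p_y one-hot selector; column k marks the column of K equal
# to the bit-plane of weight 2^k taken across the elements of y
B_col = np.zeros((2**N, p_y), dtype=int)
for k in range(p_y):
    idx = 0
    for i in range(N):
        idx = (idx << 1) | ((int(y[i]) >> k) & 1)
    B_col[idx, k] = 1

C = np.array([1 << k for k in range(p_y)])   # weights 2^0 .. 2^(p_y-1)

assert np.array_equal(K @ B_col @ C, y)      # reconstructs y exactly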
In the pattern-generation stage, this embodiment obtains every possible value of x^T·K, i.e., it generates the pattern vectors z. In the pattern-index stage, this embodiment computes (x^T·K)·B_col. In the weighted-synthesis stage, the indexed patterns are then accumulated according to the weights C. This design lets operands of any precision be converted into the index mode for the inner product, reducing repeated computation and avoiding the high bandwidth demands of arbitrary-precision computation. Fig. 11 further shows a flowchart of the inner product of the first vector and the second vector.
In step 1101, multiple pattern vectors are generated according to the length and bit width of the first vector. First, corresponding to the length N of the first vector x, N data vectors are received. Then, since K has 2^N unit vectors, each unit vector is simulated in hardware to generate the 2^N pattern vectors z. Because a binary inner product is really just bitwise addition, the generation component of this embodiment directly simulates all the unit vectors in K and adds up the corresponding bits of the data vectors of the first vector x in sequence. In more detail, the same-weight bits of the data vectors of x are input together each cycle: the lowest bits of the data vectors in the first cycle, the second-lowest bits in the second cycle, and so on until the highest bits are input in cycle p_x. The required bandwidth is only N bits per cycle.
When simulating a unit vector, the bit values of the data vectors corresponding to that unit vector are first received and buffered, then accumulated. If the accumulation produces a carry, the carry value is held in the carry register and added to the bit values of the data vectors input in the next cycle, until the highest bits of the data vectors are added in cycle p_x.
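A bit-serial sketch of one such generation unit (illustrative names; each loop iteration corresponds to one hardware cycle):

def bit_serial_generate(bit_columns):
    # bit_columns[t] holds the weight-2^t bits of the data vectors selected
    # by the simulated unit vector; 'carry' plays the role of the carry register
    carry, out_bits = 0, []
    for col in bit_columns:
        s = sum(col) + carry
        out_bits.append(s & 1)           # one output bit of the pattern vector
        carry = s >> 1                   # the remainder waits for the next cycle
    while carry:                         # flush any carry left after cycle p_x
        out_bits.append(carry & 1)
        carry >>= 1
    return out_bits                      # little-endian bits of the accumulated sum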
Finally, the accumulated results are received; these are the pattern vectors z. In summary, the pattern vectors z are all possible combinations of additions of the data vectors of the first vector x.
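Equivalently, the 2^N pattern vectors are the subset sums of the elements of x; a word-level sketch, using the same big-endian column coding assumed in the earlier sketches:

def generate_patterns(x_elems):
    # patterns[j] = sum of the elements of x selected by the big-endian
    # binary code of j, i.e. the j-th entry of x^T * K
    N = len(x_elems)
    return [sum(x for i, x in enumerate(x_elems) if (j >> (N - 1 - i)) & 1)
            for j in range(2**N)]

# e.g. generate_patterns([3, 1, 2])[0b101] == 3 + 2 == 5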
In step 1102, with the data vectors of the second vector y along the length direction serving as indexes, specific pattern vectors among the multiple pattern vectors are accumulated to form multiple unit accumulation sequences. This step implements the pattern-index stage and the weighted-synthesis stage. Using the data vectors of y along the length direction as indexes, the corresponding specific pattern vector is selected from all the pattern vectors z for each index; these specific pattern vectors are accumulated, producing a one-bit intermediate result each cycle and forming a unit accumulation sequence over p_x or p_x+1 consecutive cycles. This computation carries out (x^T·K)·B_col·C.
In more detail, the parity data vectors of the second vector y along the length direction let specific pattern vectors among all the pattern vectors z pass through. Since the length of y is N, y can be decomposed into N data vectors, each of bit width p_y; viewed by bit position, these data vectors can therefore be rearranged into p_y parity data vectors.
Next, it is determined which unit vector of the binary matrix K is identical to the input parity data vector, and the specific pattern vector corresponding to that unit vector is output. This completes the (x^T·K)·B_col operation.
Finally, these specific pattern vectors are accumulated in order to obtain the unit accumulation sequence. Particular care must be taken to ensure the accumulation is correct, namely weighting by the p_y weighting vector C to reflect the powers of two of the second vector y. The unit accumulation sequence applies the weighting of C on top of (x^T·K)·B_col. Each unit accumulation sequence corresponds to one of the intermediate results 401, 402, 403 and 404 in Fig. 4, and these intermediate results are already aligned.
In step 1103, the multiple unit accumulation sequences are summed to obtain the inner product result. To enable synchronous computation, this embodiment splits the first vector x and the second vector y into multiple data segments whose inner products are computed separately, yielding the intermediate results. Since the lowest-bit operation and the highest-bit operation each involve only one intermediate result, like x_0y_0 (lowest bit) and x_7y_3 (highest bit) in Fig. 4, they need no addition and can be output directly. In other words, only the operations from the second-lowest bit to the second-highest bit require addition.
This embodiment computes the no-carry and carry cases synchronously to reduce operation latency. The carried and uncarried sums of intermediate results are obtained at the same time, and whichever matches the previous digit's carry outcome is selected for output. The accumulated output is the inner product of the first vector x and the second vector y.
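Putting steps 1101 to 1103 together for a single segment, an end-to-end sketch that reuses generate_patterns from the earlier illustration (names and bit conventions are the sketch's own, not the patent's):

def segment_inner_product(x_elems, y_elems, p_y):
    patterns = generate_patterns(x_elems)    # step 1101: all of x^T K
    N, acc = len(y_elems), 0
    for k in range(p_y):                     # one bit-plane of y per pass
        idx = 0
        for i in range(N):                   # big-endian parity data vector
            idx = (idx << 1) | ((y_elems[i] >> k) & 1)
        acc += patterns[idx] << k            # index via B_col, weight via C
    return acc

assert segment_inner_product([3, 1, 2], [5, 2, 7], 4) == 3*5 + 1*2 + 2*7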
Returning to Fig. 10: in step 1004, the inner product results are integrated into the computation results of the multiple operands. The core controller directs the memory agents to integrate or reduce the inner product results into the computation results of the operands, which are sent to the core memory agent.
In step 1005, the computation results are stored to the off-chip memory. The core memory agent sends the computation results in parallel: the lowest bits of all results are sent together first, then the second-lowest bits, and so on until the highest bits have been sent together.
Another embodiment of the present invention is a computer-readable storage medium storing computer program code for arbitrary-precision computation; when the computer program code is run by a processor, the method of Fig. 10 or Fig. 11 is executed. In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. If implemented as a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solution of the present invention is embodied in the form of a software product (for example a computer-readable storage medium), the software product may be stored in a memory and may include several instructions that cause a computer device (for example a personal computer, a server or a network device) to execute some or all of the steps of the methods described in the embodiments of the present invention. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media capable of storing program code.
The present invention proposes a novel architecture for efficiently handling arbitrary-precision computation. However high the precision of the operands, the present invention can decompose them and use indexes to process fixed-length bit streams in parallel, avoiding bit-level redundancy such as sparsity and repeated computation; flexible use and wide-bit-width computation are achieved without configuring wide-bit-width hardware.
Depending on the application scenario, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a visual terminal, an autonomous-driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include aircraft, ships and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound scanners and/or electrocardiographs. The electronic device or apparatus of the present invention may also be applied to the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care and other fields. Further, the electronic device or apparatus of the present invention may also be used in cloud, edge and terminal application scenarios related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, electronic devices or apparatuses with high computing power according to the solution of the present invention may be applied to cloud devices (for example cloud servers), while electronic devices or apparatuses with low power consumption may be applied to terminal devices and/or edge devices (for example smartphones or webcams). In one or more embodiments, the hardware information of the cloud device and that of the terminal and/or edge device are mutually compatible, so that suitable hardware resources can be matched from the cloud device's hardware resources according to the hardware information of the terminal and/or edge device, to simulate the hardware resources of the terminal and/or edge device and accomplish unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
It should be noted that, for brevity, the present invention describes some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solution of the present invention is not limited by the order of the described actions. Accordingly, based on the disclosure or teaching of the present invention, those skilled in the art will understand that certain steps may be performed in another order or simultaneously. Further, those skilled in the art will understand that the embodiments described herein may be regarded as optional embodiments, in that the actions or modules involved are not necessarily required for realizing one or more solutions of the present invention. In addition, depending on the solution, the descriptions of some embodiments have different emphases. In view of this, those skilled in the art may refer to the relevant descriptions of other embodiments for parts not detailed in a given embodiment.
As to specific implementation, based on the disclosure and teaching of the present invention, those skilled in the art will understand that several embodiments disclosed herein may also be realized in ways not disclosed herein. For example, the units in the electronic device or apparatus embodiments described above are divided on the basis of logical function, and other divisions are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. As for the connections between different units or components, the connections discussed above in conjunction with the drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection using an interface, where the communication interface may support electrical, optical, acoustic, magnetic or other forms of signal transmission.
In the present invention, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of the present invention. Also, in some scenarios, multiple units of an embodiment of the present invention may be integrated into one unit, or each unit may exist physically on its own.
In other implementation scenarios, the above integrated units may also be realized in hardware, i.e., as concrete hardware circuits, which may include digital circuits and/or analog circuits. The physical realization of the hardware structure of a circuit may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors and memristors. In view of this, the various apparatuses described herein (for example computing apparatuses or other processing apparatuses) may be realized by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs and ASICs. Further, the aforementioned storage unit or storage apparatus may be any suitable storage medium (including magnetic or magneto-optical storage media), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high-bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM or a RAM.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to explain its principles and implementations; the descriptions of the above embodiments are intended only to aid understanding of the method and core idea of the present invention. Meanwhile, those of ordinary skill in the art, following the idea of the present invention, may make changes in the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (26)

  1. A processing component for computing the inner product of a first vector and a second vector, comprising:
    a conversion unit configured to generate multiple pattern vectors according to the length and bit width of the first vector;
    multiple inner product units, each inner product unit accumulating specific pattern vectors among the multiple pattern vectors, indexed by the data vectors of the second vector along the length direction, to form a unit accumulation sequence; and
    a combining unit configured to sum multiple unit accumulation sequences to obtain an inner product result.
  2. The processing component of claim 1, wherein when the length is N, the conversion unit generates 2^N pattern vectors, N being a positive integer.
  3. The processing component of claim 2, wherein the first vector is divided into N data vectors according to the length, and the conversion unit comprises:
    N bit-stream inputs for respectively receiving the N data vectors; and
    a generation component comprising 2^N generation units, each generation unit simulating one of the 2^N unit vectors corresponding to the length, the 2^N generation units respectively generating the 2^N pattern vectors.
  4. The processing component of claim 3, wherein each generation unit comprises:
    an element register for receiving and buffering the bit values of the data vectors corresponding to the simulated unit vector;
    an adder for accumulating the bit values; and
    a carry register for buffering the carry value resulting from the accumulation.
  5. The processing component of claim 4, wherein the conversion unit further comprises:
    2^N bit-stream outputs respectively connected to the outputs of the adders, to output the 2^N pattern vectors.
  6. The processing component of claim 5, wherein the 2^N pattern vectors are all possible combinations of additions of the data vectors.
  7. The processing component of claim 2 or 5, wherein the bit width of the 2^N pattern vectors is one of the bit width of the first vector and the bit width of the first vector plus one.
  8. The processing component of claim 2, wherein the bandwidth of the conversion unit is N bits per cycle.
  9. The processing component of claim 1, wherein each inner product unit comprises:
    multiple multiplexers that respectively receive the multiple pattern vectors and pass specific pattern vectors among them according to the parity data vectors of the second vector along the length direction; and
    multiple serial full adders for weighted synthesis of the specific pattern vectors to obtain the unit accumulation sequence.
  10. The processing component of claim 9, wherein the specific pattern vector is the pattern vector corresponding to the unit vector identical to the parity data vector.
  11. The processing component of claim 9, wherein the number of multiplexers equals the bit width of the second vector, and the number of serial full adders is the bit width of the second vector minus one.
  12. The processing component of claim 9, wherein the specific pattern vector corresponding to the lowest-bit parity data vector is input to the outermost serial full adder, and the specific pattern vector corresponding to the highest-bit parity data vector is input to the innermost serial full adder.
  13. The processing component of claim 1, wherein the combining unit comprises multiple full adder groups for performing, after the multiple unit accumulation sequences are aligned, the summation from the second-lowest bit to the second-highest bit.
  14. The processing component of claim 13, wherein each full adder group comprises a first full adder for generating an uncarried intermediate result.
  15. The processing component of claim 14, wherein each full adder group further comprises:
    a second full adder for generating a carried intermediate result; and
    a multiplexer for selecting, according to the intermediate result of the previous bit, one of the carried intermediate result and the uncarried intermediate result for output.
  16. The processing component of claim 15, wherein when there are M unit accumulation sequences, the number of full adder groups is M-1, the number of first full adders is M-1, the number of second full adders is M-2, and the number of multiplexers is M-2.
  17. An arbitrary-precision computing accelerator connected to an off-chip memory, the arbitrary-precision computing accelerator comprising:
    a core memory agent for reading multiple operands from the off-chip memory;
    a core controller for splitting the multiple operands into multiple vectors, the multiple vectors including a first vector and a second vector; and
    a processing array comprising multiple processing components, the processing components being configured to compute the inner product of the first vector and the second vector according to their lengths, to obtain an inner product result;
    wherein the core controller integrates the inner product results into computation results of the multiple operands, and the core memory agent stores the computation results to the off-chip memory.
  18. The arbitrary-precision computing accelerator of claim 17, wherein the starting addresses of the multiple operands are set in the core memory agent, and the core memory agent serially reads the multiple operands by auto-incrementing the address.
  19. The arbitrary-precision computing accelerator of claim 18, wherein the core memory agent reads the multiple operands in one pass from their low-order bits toward their high-order bits.
  20. The arbitrary-precision computing accelerator of claim 17, wherein the core memory agent sends the computation results to the off-chip memory in parallel.
  21. The arbitrary-precision computing accelerator of claim 17, wherein each processing component comprises:
    a conversion unit configured to generate multiple pattern vectors according to the length and bit width of the first vector;
    multiple inner product units, each inner product unit accumulating specific pattern vectors among the multiple pattern vectors, indexed by the data vectors of the second vector along the length direction, to form a unit accumulation sequence; and
    a combining unit configured to sum multiple unit accumulation sequences to obtain the inner product result.
  22. An integrated circuit device, comprising:
    the arbitrary-precision computing accelerator of any one of claims 17 to 21;
    a processing device for controlling the arbitrary-precision computing accelerator; and
    an off-chip memory including an LLC;
    wherein the arbitrary-precision computing accelerator and the processing device communicate through the LLC.
  23. A board card comprising the integrated circuit device of claim 22.
  24. A method for computing the inner product of a first vector and a second vector, comprising:
    generating multiple pattern vectors according to the length and bit width of the first vector;
    accumulating specific pattern vectors among the multiple pattern vectors, indexed by the data vectors of the second vector along the length direction, to form multiple unit accumulation sequences; and
    summing the multiple unit accumulation sequences to obtain an inner product result.
  25. An arbitrary-precision computing method, comprising:
    reading multiple operands from an off-chip memory;
    splitting the multiple operands into multiple vectors, the multiple vectors including a first vector and a second vector;
    computing the inner product of the first vector and the second vector according to their lengths, to obtain an inner product result;
    integrating the inner product result into computation results of the multiple operands; and
    storing the computation results to the off-chip memory.
  26. A computer-readable storage medium on which computer program code for arbitrary-precision computation is stored, the computer program code, when run by a processing device, performing the method of claim 24 or 25.
PCT/CN2022/100304 2021-10-20 2022-06-22 Inner product processing component, arbitrary-precision computing device and method, and readable storage medium WO2023065701A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111221317.4 2021-10-20
CN202111221317.4A CN114003198B (en) 2021-10-20 2021-10-20 Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2023065701A1 true WO2023065701A1 (en) 2023-04-27

Family

ID=79923295

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100304 WO2023065701A1 (en) 2021-10-20 2022-06-22 Inner product processing component, arbitrary-precision computing device and method, and readable storage medium

Country Status (2)

Country Link
CN (2) CN115437602A (en)
WO (1) WO2023065701A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115437602A (en) * 2021-10-20 2022-12-06 中科寒武纪科技股份有限公司 Arbitrary-precision calculation accelerator, integrated circuit device, board card and method
CN115080916B (en) * 2022-07-14 2024-06-18 北京有竹居网络技术有限公司 Data processing method, device, electronic equipment and computer readable medium
CN118349213B (en) * 2024-06-14 2024-09-27 中昊芯英(杭州)科技有限公司 Data processing device, method, medium and computing equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090049113A1 (en) * 2007-08-17 2009-02-19 Adam James Muff Method and Apparatus for Implementing a Multiple Operand Vector Floating Point Summation to Scalar Function
CN109165732A (en) * 2018-02-05 2019-01-08 上海寒武纪信息科技有限公司 Processing with Neural Network device and its method for executing the multiply-add instruction of vector
CN109213962A (en) * 2017-07-07 2019-01-15 华为技术有限公司 Arithmetic accelerator
CN114003198A (en) * 2021-10-20 2022-02-01 中科寒武纪科技股份有限公司 Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101082860A (en) * 2007-07-03 2007-12-05 浙江大学 Multiply adding up device
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
US10338919B2 (en) * 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
CN110110283A (en) * 2018-02-01 2019-08-09 北京中科晶上科技股份有限公司 A kind of convolutional calculation method
CN112711738A (en) * 2019-10-25 2021-04-27 安徽寒武纪信息科技有限公司 Computing device and method for vector inner product and integrated circuit chip
CN112084023A (en) * 2020-08-21 2020-12-15 安徽寒武纪信息科技有限公司 Data parallel processing method, electronic equipment and computer readable storage medium
CN112487750B (en) * 2020-11-30 2023-06-16 西安微电子技术研究所 Convolution acceleration computing system and method based on in-memory computing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090049113A1 (en) * 2007-08-17 2009-02-19 Adam James Muff Method and Apparatus for Implementing a Multiple Operand Vector Floating Point Summation to Scalar Function
CN109213962A (en) * 2017-07-07 2019-01-15 华为技术有限公司 Arithmetic accelerator
CN109165732A (en) * 2018-02-05 2019-01-08 上海寒武纪信息科技有限公司 Processing with Neural Network device and its method for executing the multiply-add instruction of vector
CN114003198A (en) * 2021-10-20 2022-02-01 中科寒武纪科技股份有限公司 Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium

Also Published As

Publication number Publication date
CN114003198A (en) 2022-02-01
CN114003198B (en) 2023-03-24
CN115437602A (en) 2022-12-06

Similar Documents

Publication Publication Date Title
CN109219821B (en) Arithmetic device and method
WO2023065701A1 (en) Inner product processing component, arbitrary-precision computing device and method, and readable storage medium
US11704125B2 (en) Computing device and method
CN109003132B (en) Advertisement recommendation method and related product
TWI795519B (en) Computing apparatus, machine learning computing apparatus, combined processing device, neural network chip, electronic device, board, and method for performing machine learning calculation
CN109522052B (en) Computing device and board card
US20200117614A1 (en) Computing device and method
CN110163361B (en) Computing device and method
CN109165041A (en) Processing with Neural Network device and its method for executing vector norm instruction
CN109032670A (en) Processing with Neural Network device and its method for executing vector duplicate instructions
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
US11775808B2 (en) Neural network computation device and method
WO2022134873A1 (en) Data processing device, data processing method, and related product
CN107368459B (en) Scheduling method of reconfigurable computing structure based on arbitrary dimension matrix multiplication
CN112966729A (en) Data processing method and device, computer equipment and storage medium
CN112766473A (en) Arithmetic device and related product
WO2022001500A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
WO2022001497A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device and computing method
CN111260070B (en) Operation method, device and related product
WO2020108486A1 (en) Data processing apparatus and method, chip, and electronic device
CN112766471A (en) Arithmetic device and related product
CN111291884A (en) Neural network pruning method and device, electronic equipment and computer readable medium
WO2022143799A1 (en) Integrated circuit apparatus for matrix multiplication operation, computing device, system, and method
WO2022001496A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
CN113033788B (en) Data processor, method, device and chip

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22882324

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE