WO2022001439A1 - Computing apparatus, integrated circuit chip, board and computing method - Google Patents


Info

Publication number
WO2022001439A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing circuits
data
processing
circuit
instruction
Prior art date
Application number
PCT/CN2021/094468
Other languages
French (fr)
Chinese (zh)
Inventor
刘少礼
陶劲桦
刘道福
周聖元
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海寒武纪信息科技有限公司
Publication of WO2022001439A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30145 Instruction analysis, e.g. decoding, instruction word fields

Definitions

  • This disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to a computing device, an integrated circuit chip, a board, and a method of performing computing operations using the aforementioned computing device.
  • Existing artificial intelligence operations often involve large amounts of data processing, such as convolution operations and image processing. As the amount of data increases, the computation and storage involved in data operations such as matrix operations grow sharply with the data size.
  • For such operations, a general-purpose processor such as a central processing unit ("CPU") or a graphics processing unit ("GPU") is usually used.
  • However, general-purpose processors often have high power consumption due to their general-purpose design and high device redundancy, resulting in limited performance.
  • the existing operation processing circuit usually adopts a fixed hardware architecture.
  • When the data scale expands or the data format changes, such a circuit may not only be unable to support a certain type of operation, but its operation performance during the operation may also be severely limited, or the operation may even become impossible.
  • the present disclosure provides a solution that supports multiple types of operations, improves operation efficiency, and saves operation cost and overhead. Specifically, the present disclosure provides the aforementioned solutions in the following aspects.
  • the present disclosure provides a computing device comprising a control circuit and a plurality of processing circuits, wherein: the control circuit is configured to obtain an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and the plurality of processing circuits are configured to be connected in a one-dimensional or multi-dimensional array and to perform multi-threaded operations according to the received parsed instruction.
  • the present disclosure provides an integrated circuit chip comprising the computing device of the various embodiments described above and below.
  • the present disclosure provides a board including the aforementioned integrated circuit chip.
  • the present disclosure provides a method of performing an arithmetic operation using a computing device, wherein the computing device includes a control circuit and a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, the method comprising: using the control circuit to obtain an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and using the one or more processing circuits to perform multi-threaded operations according to the parsed instruction.
  • FIG. 1 is an overall architecture diagram illustrating a computing device according to an embodiment of the present disclosure
  • FIG. 2 is an exemplary specific architecture diagram illustrating a computing device according to an embodiment of the present disclosure
  • FIG. 3 is an example block diagram illustrating a single type of processing circuit array of a computing device according to an embodiment of the present disclosure
  • FIG. 4 is an example block diagram illustrating various types of processing circuit arrays of a computing device according to an embodiment of the present disclosure
  • FIGS. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure
  • FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure
  • FIGS. 7a, 7b, 7c and 7d are schematic diagrams illustrating various loop structures of processing circuits according to embodiments of the present disclosure
  • FIGS. 8a, 8b and 8c are schematic diagrams illustrating further various loop structures of processing circuits according to embodiments of the present disclosure
  • FIGS. 9a, 9b, 9c and 9d are schematic diagrams illustrating data splicing operations performed by a pre-processing circuit according to an embodiment of the present disclosure
  • FIGS. 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by a post-processing circuit according to an embodiment of the present disclosure
  • FIG. 11 is a simplified flowchart illustrating a method of using a computing device to perform an arithmetic operation according to an embodiment of the present disclosure
  • FIG. 12 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 13 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
  • FIG. 1 is a general architectural diagram illustrating a computing device 100 according to an embodiment of the present disclosure.
  • the computing device 100 of the present disclosure may include a control circuit 102 and a plurality of processing circuits 104 .
  • the control circuit may be configured to obtain and parse the instruction, and may send the parsed instruction to one or more of the plurality of processing circuits.
  • the fetched instructions may include one or more opcodes, and each opcode may represent one or more specific operations to be performed by one or more processing circuits.
  • Each opcode can be represented in any suitable form.
  • an opcode can be represented by an English abbreviation such as "ADD” or "MUL” to express that what is to be performed is an "addition” or “multiplication” operation.
  • the operation code can also be represented by an English abbreviation, such as "AM", whose literal meaning does not directly indicate the specific operation.
  • the opcode may include or involve different types of operations, for example arithmetic operations such as addition or multiplication, logical operations, comparison operations, or table lookup operations, or any combination of the foregoing types of operations.
  • each opcode may correspond to one or more microinstructions obtained in the process of parsing the instruction.
  • the parsed instruction of the present disclosure may include one or more micro-instructions corresponding to an opcode in the instruction to indicate one or more specific operations to be performed by the processing circuit.
  • the control circuit 102 may be configured to acquire instruction identification information in the instruction, and to send the parsed instruction to the one or more processing circuits, among the plurality of processing circuits, that are identified by the instruction identification information.
  • the parsed instruction here may be an instruction decoded by the control circuit, or a parsed instruction that has not yet been decoded by the control circuit.
  • a corresponding decoding circuit may be included in the processing circuit to decode the parsed instruction, for example, to obtain a plurality of micro-instructions.
  • in the process of parsing the instruction, the control circuit may be configured to decode the instruction and, according to the decoding result and the operating states of the plurality of processing circuits, send the parsed instruction to one or more of the plurality of processing circuits.
  • in some scenarios, the plurality of processing circuits may all support the same, non-dedicated types of operations. Therefore, in order to improve the utilization and operation efficiency of the processing circuits, the parsed instruction may be sent to processing circuits whose occupancy is low or which are in an idle state.
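  • As an illustration only, the dispatch behavior described above can be sketched in a few lines of Python; the instruction fields, the busy flag and the helper names below are assumptions made for this sketch rather than anything defined by the disclosure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ParsedInstruction:
    opcode: str                                           # e.g. "ADD", "MUL" or an abbreviation such as "AM"
    micro_instructions: List[str] = field(default_factory=list)
    target_ids: List[int] = field(default_factory=list)   # optional instruction identification information

@dataclass
class ProcessingCircuit:
    pc_id: int
    busy: bool = False                                    # simplified stand-in for the circuit's operating state

def dispatch(parsed: ParsedInstruction, circuits: List[ProcessingCircuit]) -> List[ProcessingCircuit]:
    """Send a parsed instruction either to the circuits named by the instruction
    identification information or, when none are named, to idle circuits."""
    if parsed.target_ids:
        targets = [c for c in circuits if c.pc_id in parsed.target_ids]
    else:
        targets = [c for c in circuits if not c.busy]
    for c in targets:
        c.busy = True                                     # the selected circuits now execute the instruction
    return targets

circuits = [ProcessingCircuit(i) for i in range(4)]
circuits[1].busy = True
print([c.pc_id for c in dispatch(ParsedInstruction("ADD"), circuits)])  # -> [0, 2, 3]
```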
  • the plurality of processing circuits 104 may be configured to be connected in a one-dimensional or multi-dimensional array structure and to perform multi-threaded operations according to the received parsed instructions.
  • the plurality of processing circuits may be configured to receive and execute the parsed instructions in a single instruction multithreading ("SIMT") fashion.
  • the multi-dimensional array may comprise a two-dimensional array and/or a three-dimensional array (as shown in FIGS. 5 and 6 ).
  • each processing circuit in the aforementioned one-dimensional or multi-dimensional array can be connected to other processing circuits in a specified direction and a predetermined spacing pattern within a certain range.
  • multiple processing circuits may be logically connected in series to form one or more closed loops (as shown in Figures 7 and 8).
  • connection mode between the multiple processing circuits may be a hard-wired connection through a hardware structure.
  • connection manner between the multiple processing circuits may also be a logical connection manner configured according to parsed instructions, such as microinstructions.
  • FIG. 2 is a diagram illustrating an example specific architecture of a computing device 200 according to an embodiment of the present disclosure.
  • the computing device 200 not only includes the control circuit 102 and the plurality of processing circuits 104 of the computing device 100 in FIG. 1, but also further shows several circuits included in each processing circuit as well as a number of additional components. Since the functions of the control circuit and the processing circuits have been described in detail above with reference to FIG. 1, they will not be repeated below.
  • the processing circuit 104 may include a logic operation circuit 1041, which may be configured to perform logical operations according to the parsed instruction and the received data when performing the multi-threaded operation, for example performing on the received data logical operations such as AND or NOT, shift operations, or comparison operations.
  • the processing circuit 104 may also include an arithmetic operation circuit 1043, which may be configured to perform arithmetic operations, such as linear operations such as addition, subtraction, or multiplication.
  • the processing circuit 104 may also include a storage circuit 1042 including a data storage circuit and/or a predicate storage circuit, wherein the data storage circuit may be configured to store at least one of the operational data (e.g., pixels) and the intermediate operation results of the processing circuit.
  • the predicate storage circuit may be configured to store, for each of the processing circuits, the serial number of its predicate storage circuit and the predicate information obtained using the parsed instruction.
  • the storage circuit 1042 may be implemented by using a memory such as a register or a static random access memory (“SRAM”) according to actual needs.
  • the predicate storage circuit may include a 1-bit register for storing predicate information.
  • the predicate storage circuit in the processing circuit may include 32 1-bit registers sequentially numbered from 00000 to 11111.
  • the processing circuit can read the predicate information in the register corresponding to the serial number "00101" according to the register serial number "00101" specified in the received parsed instruction.
  • the predicate storage circuit may be configured to update the predicate information according to the parsed instruction.
  • the predicate information may be directly updated according to the configuration information in the parsed instruction, or the configuration information may be acquired according to the configuration information storage address provided in the parsed instruction, so as to update the predicate information.
  • the predicate storage circuit may also update the predicate information according to the comparison result of each of the processing circuits, which is a form of operation result in the context of the present disclosure.
  • the predicate information may be updated by comparing input data received by the processing circuit with data stored in its data storage circuit. When the input data is greater than the stored data, the predicate information of the processing circuit is set to 1; conversely, when the input data is smaller than the stored data, the predicate information is set to 0, or its original value is kept unchanged.
  • each processing circuit may determine, according to the information in the parsed instruction, whether to execute the operation of the parsed instruction. Further, each processing circuit may be configured to obtain the corresponding predicate information according to the serial number of the predicate storage circuit in the parsed instruction, and to determine from that predicate information whether to execute the parsed instruction. For example, when the value of the predicate information read by the processing circuit according to the serial number of the predicate storage circuit specified in the parsed instruction is 1, the processing circuit executes the parsed instruction.
  • For example, executing the parsed instruction may cause the processing circuit to read the data pointed to in the instruction and store the read data into its data storage circuit. Conversely, when the value of the predicate information read by the processing circuit according to the serial number of the predicate storage circuit specified in the parsed instruction is 0, the processing circuit does not execute the parsed instruction.
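  • The predicate mechanism just described can be illustrated with a minimal Python sketch, assuming 32 one-bit predicate registers per processing circuit and a parsed instruction that names the register to consult; the class and field names are illustrative, not taken from the disclosure.

```python
class ProcessingCircuit:
    def __init__(self, pc_id: int):
        self.pc_id = pc_id
        self.predicate_regs = [0] * 32      # 32 one-bit registers, serial numbers 0b00000..0b11111
        self.data_store = []                # simplified data storage circuit

    def update_predicate(self, reg_no: int, input_value: int, stored_value: int) -> None:
        # Comparison-based update: set to 1 when the input is greater than the
        # stored data, otherwise clear it (one of the update policies described above).
        self.predicate_regs[reg_no] = 1 if input_value > stored_value else 0

    def execute(self, parsed_instruction: dict, data) -> bool:
        # The parsed instruction names the predicate register to read, e.g. "00101" -> register 5.
        reg_no = int(parsed_instruction["predicate_reg"], 2)
        if self.predicate_regs[reg_no] == 1:
            if parsed_instruction["op"] == "load":        # predicate set: perform the operation
                self.data_store.append(data)
            return True
        return False                                      # predicate is 0: the instruction is skipped

pc = ProcessingCircuit(0)
pc.update_predicate(0b00101, input_value=9, stored_value=3)   # sets predicate register "00101" to 1
pc.execute({"predicate_reg": "00101", "op": "load"}, data=7)  # the data 7 is stored
```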
  • the computing device 200 of the present disclosure may also include data processing circuitry 106 , which may include at least one of pre-processing circuitry 1061 and post-processing circuitry 1062 .
  • the preprocessing circuit 1061 may be configured to perform a preprocessing operation (described later in conjunction with FIG. 7b ) on the operation data before the processing circuit performs the operation, such as performing a data splicing or data placement operation.
  • the post-processing circuit 1062 may be configured to perform a post-processing operation on the result of the operation after the processing circuit performs the operation, such as performing a data restoration or data compression operation.
  • the computing device 200 may further include a main storage circuit 108, which can not only receive and store data from the control circuit as input data for the processing circuits, but can also be used to store and transfer data exchanged among the processing circuits.
  • the main storage circuit 108 may be further divided into at least one of a main storage module 1081 and a main cache module 1082 according to the storage method or the characteristics of the stored data.
  • the main storage module 1081 may be configured to store data (eg, input pixels) to be performed operations in the processing circuit and operation results (eg, output pixels) after performing operations.
  • the main cache module 1082 may be configured to cache intermediate operation results after operations performed in the plurality of processing circuits.
  • the main storage circuit can not only provide internal storage, but can also support data interaction with storage devices outside the computing device of the present disclosure, for example exchanging data with an external storage device through direct memory access ("DMA").
  • FIG. 3 is an example block diagram illustrating a single type of processing circuit array of a computing device according to an embodiment of the present disclosure.
  • the computing device shown includes not only the control circuit 102, the main storage circuit 108 and the data processing circuit 106 described above, but also a plurality of processing circuits 104 of the same type.
  • The multiple processing circuits of the same type may be arranged through physical connections to form a two-dimensional array.
  • the plurality of processing circuits of the present disclosure may be divided according to type for performing different types of data processing operations.
  • the plurality of processing circuits may be divided into first type processing circuits and second type processing circuits (as shown in FIG. 4).
  • the first type of processing circuit may be adapted to perform at least one of arithmetic operations and logical operations
  • the second type of processing circuit may be adapted to perform at least one of comparison operations and table lookup operations .
  • FIG. 4 is an example block diagram illustrating various types of processing circuit arrays of a computing device according to an embodiment of the present disclosure.
  • the computing device shown in FIG. 4 includes a control circuit 102 , a main storage circuit 108 and a plurality of processing circuits 104 of different types.
  • the computing device may also include the data processing circuit 106 shown in FIGS. 2 and 3.
  • the computing device architecture shown in FIG. 4 is similar to that shown in FIGS. 2 and 3, so the technical details of the computing device 200 described in conjunction with FIGS. 2 and 3 are also applicable to the computing device shown in FIG. 4.
  • the plurality of processing circuits of the present disclosure may include, for example, a plurality of first-type processing circuits and a plurality of second-type processing circuits (as shown in the figure, processing circuits with different background colors are of different types).
  • the plurality of processing circuits may be arranged through physical connections to form a two-dimensional array. It can be understood that the arrangement of the two types of processing circuits shown in FIG. 4 is merely exemplary and not limiting, and other arrangements may be conceived by those skilled in the art based on the teachings of the present disclosure.
  • a plurality of processing circuits of the first type may be arranged on the left and right sides of the array, and a plurality of processing circuits of the second type may be arranged in the middle area of the array.
  • a plurality of first type processing circuits may be arranged in the middle area of the array, and a plurality of second type processing circuits may be arranged in the surrounding areas of the array.
  • a plurality of first-type processing circuits and second-type processing circuits may also be interspersed in an array.
  • the types of processing circuits disclosed in the present disclosure may not be limited to the two shown in the figures, but may have more types of processing circuits to implement different types of computing operations.
  • the first-type processing circuits are the processing circuits 104 shown with a light background in the figure
  • M and N are each a positive integer greater than 0.
  • the first type of processing circuit can be used to perform arithmetic operations and logical operations, which may include, for example, linear operations such as addition, subtraction and multiplication, comparison operations, and logical operations such as AND and OR, or any combination of the aforementioned types of operations.
  • the processing circuit array has a total of (M*2+M*2+N*2+8) second-type processing circuits (the processing circuits 104 shown with a dark background in the figure).
  • the second type of processing circuit may be used to perform non-linear operations such as comparison operations, table lookup operations or shift operations on the received data.
  • the storage circuits in the first type of processing circuit and in the second type of processing circuit may have different storage sizes and storage modes.
  • the predicate storage circuit in the first type of processing circuit may utilize a plurality of numbered registers to store predicate information.
  • the first-type processing circuit can access the predicate information in the register of the corresponding number according to the register number specified in the received parsed instruction.
  • the second type of processing circuit may store the predicate information in a static random access memory ("SRAM").
  • the second type of processing circuit may determine the storage address of the predicate information in the static random access memory ("SRAM") according to the offset of the predicate information specified in the received parsed instruction, and may perform predetermined read or write operations on the predicate information at that storage address.
  • FIGS. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure.
  • the multiple processing circuits of the present disclosure may be connected in a one-dimensional or multi-dimensional array topology.
  • the multi-dimensional array may be a two-dimensional array, and each processing circuit located in the two-dimensional array may be connected, in at least one of its row direction, column direction or diagonal direction, with the remaining one or more processing circuits in the same row, the same column or the same diagonal in a predetermined two-dimensional interval pattern.
  • the predetermined two-dimensional spacing pattern may be associated with the number of processing circuits spaced in the connection.
  • Figures 5a to 5c exemplarily show the topology of various forms of two-dimensional arrays between a plurality of processing circuits.
  • processing circuits may be connected to form a simple two-dimensional array. Specifically, one processing circuit is used as the center of the two-dimensional array, and one processing circuit is connected in each of the four horizontal and vertical directions relative to it, thereby forming a two-dimensional array with three rows and three columns. Further, since the processing circuit located at the center of the two-dimensional array is directly connected with the adjacent processing circuits in the previous and next columns of the same row and with the adjacent processing circuits in the previous and next rows of the same column, the number of spaced processing circuits (abbreviated as the "interval number") is 0.
  • in the two-dimensional Torus array, each processing circuit is connected to its adjacent processing circuits in the preceding and following rows and in the preceding and following columns, that is, the interval number for the connections to adjacent processing circuits is always 0.
  • the first processing circuit of each row or column in the two-dimensional Torus array is also connected to the last processing circuit of that row or column, and the interval number of the processing circuits connected end to end in each row or column is 2.
  • the processing circuits with four rows and four columns may also be connected to form a two-dimensional array in which the number of intervals between adjacent processing circuits is 0, and the number of intervals between non-adjacent processing circuits is 1.
  • adjacent processing circuits in the same row or the same column are directly connected, that is, with an interval number of 0, while non-adjacent processing circuits in the same row or the same column are connected with an interval number of 1.
  • the processing circuits in the diagonal direction may also be connected with different interval numbers.
  • a three-dimensional Torus array builds on the two-dimensional Torus array and uses a spacing pattern similar to that between rows and columns for the inter-layer connections. For example, the processing circuits in the same row and the same column of adjacent layers are first directly connected, that is, with an interval number of 0. Next, the processing circuits of the first layer and the last layer in the same column are connected, that is, with an interval number of 2. Finally, a three-dimensional Torus array with four layers, four rows and four columns can be formed.
  • connection relationship of other multi-dimensional arrays of processing circuits can be formed on the basis of two-dimensional arrays by adding new dimensions and increasing the number of processing circuits.
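  • As a purely illustrative sketch of the interval patterns above, the following Python function lists the links of one processing circuit in a two-dimensional Torus array: the adjacent circuits in the same row and column (interval number 0) plus the wrap-around links that join the first and last circuit of each row and column.

```python
def torus_neighbors(r: int, c: int, rows: int, cols: int):
    """Neighbors of the processing circuit at (r, c) in a 2D Torus array:
    adjacent circuits in the same row and column, with the modulo wrap-around
    providing the end-to-end link of each row and column."""
    return {
        ((r - 1) % rows, c),   # previous row (wraps to the last row)
        ((r + 1) % rows, c),   # next row
        (r, (c - 1) % cols),   # previous column (wraps to the last column)
        (r, (c + 1) % cols),   # next column
    }

# In a 4x4 Torus, circuit (0, 0) is also linked to (3, 0) and (0, 3) through the
# wrap-around connections, i.e. the end-to-end links with interval number 2.
print(sorted(torus_neighbors(0, 0, 4, 4)))  # -> [(0, 1), (0, 3), (1, 0), (3, 0)]
```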
  • the solutions of the present disclosure may also configure logical connections to processing circuits by using configuration instructions.
  • the disclosed solution may selectively connect some processing circuits or selectively bypass some processing circuits through configuration instructions, so as to form one or more groups of logically connected processing circuits.
  • a logical connection can also be adjusted according to actual operation requirements (eg, data type conversion).
  • the solutions of the present disclosure can configure the connection of the processing circuits, including, for example, configuring into a matrix or configuring into one or more closed computing loops.
  • FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure.
  • FIGS. 6a to 6d show still further exemplary connection relationships of multi-dimensional arrays formed by a plurality of processing circuits, in addition to those shown in FIGS. 5a to 5d.
  • the technical details described in conjunction with Figs. 5a to 5d also apply to the content shown in Figs. 6a to 6d.
  • in FIG. 6a, the two-dimensional array of processing circuits includes a central processing circuit located at the center of the array and three processing circuits connected to it in each of the four directions along the same row and the same column. Accordingly, the interval numbers of the connections between the central processing circuit and the remaining processing circuits are 0, 1 and 2, respectively.
  • in FIG. 6b, the two-dimensional array of processing circuits includes a central processing circuit located at the center of the array, three processing circuits in the two opposite directions of the same row, and processing circuits in the two opposite directions of the same column. Accordingly, the interval numbers between the central processing circuit and the processing circuits in the same row are 0 and 2, respectively, while the interval numbers between the central processing circuit and the processing circuits in the same column are all 0.
  • a multi-dimensional array formed by a plurality of processing circuits may be a three-dimensional array composed of multiple layers, wherein each layer of the three-dimensional array may comprise a two-dimensional array of processing circuits arranged along its row and column directions. Further, each processing circuit located in the three-dimensional array may be connected, in a predetermined three-dimensional spacing pattern, with the remaining one or more processing circuits in the same row, the same column, the same diagonal, or on different layers. Further, the predetermined three-dimensional spacing pattern and the number of mutually spaced processing circuits in a connection may be related to the number of spaced layers. The connection modes of the three-dimensional array will be further described below with reference to FIG. 6c and FIG. 6d.
  • Figure 6c shows a multi-layer, multi-row and multi-column three-dimensional array formed by connecting a plurality of processing circuits.
  • Taking the processing circuit located at the l-th layer, the r-th row and the c-th column (denoted (l, r, c)) as an example, it is located at the center of the array and is connected with the processing circuit at the previous column (l, r, c-1) and the processing circuit at the next column (l, r, c+1) in the same layer and the same row, with the processing circuit at the previous row (l, r-1, c) and the processing circuit at the next row (l, r+1, c) in the same layer and the same column, and with the processing circuit at the previous layer (l-1, r, c) and the processing circuit at the next layer (l+1, r, c) in the same row and the same column of different layers.
  • FIG. 6d shows a three-dimensional array in which the interval number of the connections between processing circuits in the row direction, the column direction and the layer direction is 1 in every case. Taking the processing circuit located at the center of the array, (l, r, c), as an example, it is connected with the processing circuits at (l, r, c-2) and (l, r, c+2), which are separated from it by one column in the same layer and the same row, with the processing circuits at (l, r-2, c) and (l, r+2, c), which are separated from it by one row in the same layer and the same column, and with the processing circuits at (l-2, r, c) and (l+2, r, c), which are separated from it by one layer in the same row and the same column. Further, the processing circuits at (l, r, c-3) and (l, r, c-1), which are in the same layer and the same row and separated by one column, are connected to each other, and the processing circuits at (l, r, c+1) and (l, r, c+3) are connected to each other. Similarly, the processing circuits at (l, r-3, c) and (l, r-1, c) in the same layer and the same column are connected to each other, and the processing circuits at (l, r+1, c) and (l, r+3, c) are connected to each other. Likewise, the processing circuits at (l-3, r, c) and (l-1, r, c) in the same row and the same column and separated by one layer are connected to each other, and the processing circuits at (l+1, r, c) and (l+3, r, c) are connected to each other.
  • The connection relationships of multi-dimensional arrays formed by a plurality of processing circuits have been exemplarily described above; different loop structures formed by a plurality of processing circuits will be further exemplarily described below with reference to FIGS. 7-8.
  • FIGS. 7a, 7b, 7c and 7d are schematic diagrams respectively illustrating various loop structures of processing circuits according to embodiments of the present disclosure.
  • a plurality of processing circuits can not only be connected in a physical connection relationship, but also can be configured to be connected in a logical relationship according to the received parsed instruction.
  • the plurality of processing circuits may be configured to be connected using the logical connection relationship to form a closed loop.
  • the four adjacent processing circuits are sequentially numbered "0, 1, 2 and 3".
  • the four processing circuits are sequentially connected in a clockwise direction from processing circuit 0, and processing circuit 3 is connected with processing circuit 0, so that the four processing circuits are connected in series to form a closed loop (referred to as "looping" for short).
  • the number of intervals between processing circuits is 0 or 2
  • the number of intervals between processing circuits 0 and 1 is 0, and the number of intervals between processing circuits 3 and 0 is 2.
  • the physical addresses of the four processing circuits in the illustrated loop may be 0-1-2-3, while their logical addresses are also 0-1-2-3. It should be noted that the connection sequence shown in FIG. 7a is only exemplary and non-limiting; those skilled in the art may also connect the four processing circuits in series in a counterclockwise direction, according to actual computing needs, to form the closed loop.
  • a plurality of processing circuits may be combined into a processing circuit group to represent one data. For example, suppose a processing circuit can handle 8-bit data. When 32-bit data needs to be processed, four processing circuits can be combined into a processing circuit group, so that four 8-bit data can be connected to form a 32-bit data. Further, one processing circuit group formed by the aforementioned four 8-bit processing circuits can serve as one processing circuit 104 shown in FIG. 7b, so that higher bit-width arithmetic operations can be supported.
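  • The bit-width grouping just described can be sketched as follows: four 8-bit lanes jointly representing one 32-bit value. Treating lane 0 as the least-significant byte is an assumption of this example, not something fixed by the disclosure.

```python
def combine_lanes(lanes):
    """Combine the 8-bit values held by a group of processing circuits into one
    wider word (lane 0 is taken as the least-significant byte in this sketch)."""
    word = 0
    for i, byte in enumerate(lanes):
        word |= (byte & 0xFF) << (8 * i)
    return word

def split_word(word, n_lanes=4):
    """Split a 32-bit value back into the 8-bit pieces handled by each circuit."""
    return [(word >> (8 * i)) & 0xFF for i in range(n_lanes)]

assert combine_lanes([0x78, 0x56, 0x34, 0x12]) == 0x12345678
assert split_word(0x12345678) == [0x78, 0x56, 0x34, 0x12]
```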
  • FIG. 7b shows a layout of processing circuits similar to that of FIG. 7a, but the interval numbers of the connections between the processing circuits in FIG. 7b differ from those in FIG. 7a.
  • FIG. 7b shows four processing circuits numbered sequentially 0, 1, 2 and 3 in a clockwise direction. Starting from processing circuit 0, processing circuit 1, processing circuit 3 and processing circuit 2 are connected in sequence, and processing circuit 2 is connected back to processing circuit 0, thereby forming a closed loop in series. It can be seen from this loop that the interval number of the processing circuits shown in FIG. 7b is 0 or 1; for example, the interval between processing circuits 0 and 1 is 0, and the interval between processing circuits 1 and 3 is 1.
  • the physical addresses of the four processing circuits in the illustrated closed loop may be 0-1-2-3, while the logical addresses may be 0-1-3-2. Therefore, when data of high bit width needs to be split to be allocated to different processing circuits, the data sequence can be rearranged and allocated according to the logical addresses of the processing circuits.
  • the pre-processing circuit can rearrange the input data according to the physical addresses and logical addresses of the plurality of processing circuits, so as to meet the requirements of the data operation. Assuming that four sequentially arranged processing circuits 0 to 3 are connected as shown in FIG. 7a, since the physical and logical addresses of the connections are both 0-1-2-3, the pre-processing circuit can transmit the input data (for example, pixel data) aa0, aa1, aa2 and aa3 to the corresponding processing circuits in sequence. In contrast, when the processing circuits are connected as shown in FIG. 7b, the pre-processing circuit needs to rearrange the input data aa0, aa1, aa2 and aa3 into aa0-aa1-aa3-aa2 for transmission to the corresponding processing circuits.
  • the solution of the present disclosure can ensure the correctness of the data operation sequence.
  • the post-processing circuit described in conjunction with FIG. 2 can be used to restore and adjust the order of the operation output results to bb0-bb1-bb2-bb3, so as to ensure consistency of arrangement between the input data and the output result data.
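  • The rearrangement and restoration just described can be sketched as follows for the FIG. 7b loop; logical_of_physical records which logical address sits at each physical position, and the function names are hypothetical.

```python
def rearrange_for_loop(data, logical_of_physical):
    """Pre-processing: physical slot i receives the element whose logical index
    is logical_of_physical[i]."""
    return [data[logical] for logical in logical_of_physical]

def restore_from_loop(outputs, logical_of_physical):
    """Post-processing: put each physical slot's result back at its logical position."""
    restored = [None] * len(outputs)
    for phys, logical in enumerate(logical_of_physical):
        restored[logical] = outputs[phys]
    return restored

# FIG. 7b: physical order 0-1-2-3 carries logical addresses 0-1-3-2.
assert rearrange_for_loop(["aa0", "aa1", "aa2", "aa3"], [0, 1, 3, 2]) == ["aa0", "aa1", "aa3", "aa2"]
assert restore_from_loop(["bb0", "bb1", "bb3", "bb2"], [0, 1, 3, 2]) == ["bb0", "bb1", "bb2", "bb3"]
```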
  • Figures 7c and 7d show that more processing circuits are arranged and connected in different ways, respectively, to form a closed loop.
  • the 16 processing circuits 104 numbered in the order of 0, 1 . . . 15, starting from processing circuit 0, are sequentially connected and combined every two processing circuits to form a processing circuit group.
  • processing circuit 0 is connected with processing circuit 1 to form a processing circuit group . . .
  • the processing circuit 14 is connected with the processing circuit 15 to form one processing circuit group, and finally eight processing circuit groups are formed.
  • the eight processing circuit groups can also be connected in a manner similar to the aforementioned processing circuits, including connection according to, for example, predetermined logical addresses, so as to form a closed loop of the processing circuit groups.
  • the plurality of processing circuits 104 are connected in an irregular or non-uniform manner to form a closed loop.
  • the interval number between the processing circuits can be 0 or 3 to form a closed loop; for example, processing circuit 0 can be connected with processing circuit 1 (interval number 0) and with processing circuit 4 (interval number 3), respectively.
  • the processing circuit of the present disclosure may be spaced by different numbers of processing circuits so as to be connected in a closed loop.
  • any number of intermediate intervals can also be selected for dynamic configuration, thereby connecting into a closed loop.
  • the connection of the plurality of processing circuits may be a hard connection formed by hardware, or may be a soft connection configured by software.
  • FIGS. 8a, 8b and 8c are schematic diagrams illustrating further various loop structures of processing circuits according to embodiments of the present disclosure.
  • multiple processing circuits may form a closed loop, and each processing circuit in the closed loop may be configured with a respective logical address.
  • the pre-processing circuit described in conjunction with FIG. 2 can be configured to split the operational data accordingly, according to the type of the operational data (such as 32-bit, 16-bit or 8-bit data) and the logical addresses, and to transfer the multiple pieces of sub-data obtained after the splitting to the corresponding processing circuits in the loop for subsequent operations.
  • the upper diagram of FIG. 8a shows that four processing circuits are connected to form a closed loop, and the physical addresses (which may also be referred to as physical coordinates in the context of this disclosure) of the four processing circuits in right-to-left order can be represented as 0-1-2-3.
  • the lower diagram of Figure 8a shows that the logical addresses of the four processing circuits in the aforementioned loop are represented as 0-3-1-2 in order from right to left.
  • the processing circuit with the logical address "3" shown in the lower diagram of Fig. 8a has the physical address "1" shown in the upper diagram of Fig. 8a.
  • assume that the granularity of the operational data is the lower 128 bits of the input data, such as the original sequence "15, 14, ... 2, 1, 0" in the figure (each number corresponds to 8 bits of data), and that the logical addresses of these 16 pieces of 8-bit data are numbered 0 to 15 from low to high. Further, according to the logical addresses shown in the lower diagram of FIG. 8a, the pre-processing circuit can use the logical addresses to encode or arrange the data differently for different data types.
  • the four groups of logical addresses (3,2,1,0), (7,6,5,4), (11,10,9,8) and (15,14,13,12) can represent the 0th to 3rd 32-bit data, respectively.
  • for 32-bit operational data, the pre-processing circuit can transmit the 0th 32-bit data to the processing circuit whose logical address is "0" (corresponding physical address "0"), the 1st 32-bit data to the processing circuit whose logical address is "1" (corresponding physical address "2"), the 2nd 32-bit data to the processing circuit whose logical address is "2" (corresponding physical address "3"), and the 3rd 32-bit data to the processing circuit whose logical address is "3" (corresponding physical address "1").
  • the mapping relationship between the logical address and the physical address of the final data is (15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)-> (11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0).
  • the eight groups of logical addresses (1,0), (3,2), (5,4), (7,6), (9,8), (11,10), (13,12) and (15,14) can represent the 0th to 7th 16-bit data, respectively.
  • for 16-bit operational data, the pre-processing circuit can transfer the 0th and 4th 16-bit data to the processing circuit whose logical address is "0" (corresponding physical address "0"), the 1st and 5th 16-bit data to the processing circuit whose logical address is "1" (corresponding physical address "2"), the 2nd and 6th 16-bit data to the processing circuit whose logical address is "2" (corresponding physical address "3"), and the 3rd and 7th 16-bit data to the processing circuit whose logical address is "3" (corresponding physical address "1").
  • accordingly, the mapping relationship between the logical addresses and the physical addresses of the final data is: (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (13,12,5,4,11,10,3,2,15,14,7,6,9,8,1,0).
  • for 8-bit operational data, the pre-processing circuit can transmit the 0th, 4th, 8th and 12th 8-bit data to the processing circuit whose logical address is "0" (corresponding physical address "0"); the 1st, 5th, 9th and 13th 8-bit data to the processing circuit whose logical address is "1" (corresponding physical address "2"); the 2nd, 6th, 10th and 14th 8-bit data to the processing circuit whose logical address is "2" (corresponding physical address "3"); and the 3rd, 7th, 11th and 15th 8-bit data to the processing circuit whose logical address is "3" (corresponding physical address "1").
  • accordingly, the mapping relationship between the logical addresses and the physical addresses of the final data is: (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (14,10,6,2,13,9,5,1,15,11,7,3,12,8,4,0).
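  • The distribution pattern of FIG. 8a can be reproduced with a short sketch: distribute deals the fixed-width words round-robin over the logical addresses, and layout prints the bytes in the order used by the figure (highest physical position first, bytes within a circuit from high to low). Both helpers and the round-robin rule are inferred from the mappings listed above, not taken verbatim from the disclosure.

```python
def distribute(num_bytes, word_bytes, logical_of_physical):
    """Deal the words (word k = bytes k*word_bytes .. k*word_bytes+word_bytes-1)
    round-robin over the logical addresses, then return, for each physical
    processing circuit, the byte numbers it ends up holding."""
    n = len(logical_of_physical)
    per_logical = [[] for _ in range(n)]
    for k in range(num_bytes // word_bytes):                  # word k -> logical circuit k % n
        per_logical[k % n].extend(range(k * word_bytes, (k + 1) * word_bytes))
    return [per_logical[logical] for logical in logical_of_physical]

def layout(per_physical):
    """Figure order: highest physical slot first, bytes inside each slot high to low."""
    return [b for slot in reversed(per_physical) for b in reversed(slot)]

log_of_phys = [0, 3, 1, 2]                      # FIG. 8a: physical 0-1-2-3 carry logical 0-3-1-2
print(layout(distribute(16, 4, log_of_phys)))   # 32-bit: [11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0]
print(layout(distribute(16, 2, log_of_phys)))   # 16-bit: [13,12,5,4,11,10,3,2,15,14,7,6,9,8,1,0]
print(layout(distribute(16, 1, log_of_phys)))   # 8-bit:  [14,10,6,2,13,9,5,1,15,11,7,3,12,8,4,0]
```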
  • FIG. 8b shows that eight sequentially numbered processing circuits 0 to 7 are connected to form a closed loop, and the physical addresses of the eight processing circuits are 0-1-2-3-4-5-6-7.
  • the lower diagram of Fig. 8b shows that the logical addresses of the aforementioned eight processing circuits are 0-7-1-6-2-5-3-4.
  • the processing circuit with the physical address "6" shown in the upper diagram of Fig. 8b corresponds to the logical address "3" shown in the lower diagram of Fig. 8b.
  • the pre-processing circuit rearranges the data and then transmits it to the corresponding processing circuits. The operation is similar to that of FIG. 8a, so the technical solution described in conjunction with FIG. 8a is also applicable to FIG. 8b, and the data rearrangement process will not be repeated here.
  • the connection relationship of the processing circuits shown in FIG. 8b is similar to that shown in FIG. 8a, but the eight processing circuits shown in FIG. 8b are twice the number of processing circuits shown in FIG. 8a.
  • the granularity of the operational data described in conjunction with FIG. 8b may be twice that of the operational data described in conjunction with FIG. 8a.
  • the granularity of the operational data in this example can be the lower 256 bits of the input data, for example the original data sequence "31, 30, ..., 1, 0" in the figure, where each number corresponds to an 8-bit length.
  • the figures also show the arrangement results of the data in the looped processing circuits.
  • when the data bit width of the operation is 32 bits, one 32-bit data in the processing circuit whose logical address is "1" is (7, 6, 5, 4), and the corresponding physical address of this processing circuit is "2".
  • when the data bit width of the operation is 16 bits, the two 16-bit data in the processing circuit whose logical address is "3" are (23, 22, 7, 6), and the corresponding physical address of this processing circuit is "6".
  • when the data bit width of the operation is 8 bits, the four 8-bit data in the processing circuit whose logical address is "6" are (30, 22, 14, 6), and the corresponding physical address of this processing circuit is "3".
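  • As a cross-check, the same round-robin sketch used for FIG. 8a above (repeated here so that the snippet runs on its own) reproduces the 16-bit entry described for FIG. 8b; the helper and the dealing rule remain assumptions inferred from the listed arrangements.

```python
def distribute(num_bytes, word_bytes, logical_of_physical):
    n = len(logical_of_physical)
    per_logical = [[] for _ in range(n)]
    for k in range(num_bytes // word_bytes):                  # word k -> logical circuit k % n
        per_logical[k % n].extend(range(k * word_bytes, (k + 1) * word_bytes))
    return [per_logical[logical] for logical in logical_of_physical]

# FIG. 8b: physical 0..7 carry the logical addresses 0-7-1-6-2-5-3-4 (right to left).
log_of_phys_8b = [0, 7, 1, 6, 2, 5, 3, 4]
slot = distribute(32, 2, log_of_phys_8b)[6]    # 16-bit words over the lower 256 bits; physical "6"
print(list(reversed(slot)))                    # -> [23, 22, 7, 6], i.e. logical address "3"
```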
  • FIG. 8c shows that twenty multi-type processing circuits numbered in the order of 0, 1 . . . 19 are connected to form a closed loop (the numbers shown in the figure are the physical addresses of the processing circuits). Sixteen processing circuits numbered from 0 to 15 are first type processing circuits, and four processing circuits numbered from 16 to 19 are second type processing circuits. Similarly, the physical address of each of the twenty processing circuits has a mapping relationship with the logical address of the corresponding processing circuit shown in the lower figure of FIG. 8c.
  • FIG. 8c also shows the result of operating the aforementioned original data for different data types supported by the processing circuit.
  • when the data bit width of the operation is 32 bits, one 32-bit data in the processing circuit whose logical address is "1" is (7, 6, 5, 4), and the corresponding physical address of this processing circuit is "2".
  • when the data bit width of the operation is 16 bits, the two 16-bit data in the processing circuit whose logical address is "11" are (63, 62, 23, 22), and the corresponding physical address of this processing circuit is "9".
  • when the data bit width of the operation is 8 bits, the four 8-bit data in the processing circuit whose logical address is "17" are (77, 57, 37, 17), and the corresponding physical address of this processing circuit is "18".
  • FIGS. 9a, 9b, 9c and 9d are schematic diagrams illustrating data splicing operations performed by a pre-processing circuit according to an embodiment of the present disclosure.
  • the pre-processing circuit described in the present disclosure in conjunction with FIG. 2 can also be configured to select a data splicing mode from a plurality of data splicing modes according to the parsed instruction, so as to perform a splicing operation on the two input data.
  • the solution of the present disclosure divides and numbers the two data to be spliced according to the minimum data unit, and then extracts different minimum data units of the data based on specified rules, so as to form different data splicing modes.
  • the minimum data unit here can simply be 1-bit data, or can be 2, 4, 8, 16 or 32 bits in length.
  • the scheme of the present disclosure can either extract alternately in units of the minimum data unit, or extract in multiples of the minimum data unit, for example alternately extracting from the two data partial data of two, one or three minimum data units at a time as a group.
  • the input data are In1 and In2, and when each square in the figure represents a minimum data unit, both input data have a bit width length of 8 minimum data units.
  • the minimum data unit may represent different numbers of bits. For example, for data with a bit width of 8 bits, the minimum data unit represents 1-bit data, and for data with a bit width of 16 bits, the minimum data unit represents 2-bit data. For another example, for data with a bit width of 32 bits, the minimum data unit represents 4-bit data.
  • the two input data In1 and In2 to be spliced are each composed of eight minimum data units sequentially numbered 1, 2, . . . , 8 from right to left.
  • Data splicing is performed according to a parity-interleaving principle, with unit numbers taken from small to large, In1 before In2, and odd-numbered units before even-numbered units.
  • when the data bit width of the operation is 8 bits, the data In1 and In2 each represent one 8-bit data, and each minimum data unit represents 1-bit data (i.e., one square represents 1-bit data).
  • in this case, the minimum data units numbered 1, 3, 5 and 7 of the data In1 are first extracted and arranged in the low-order positions.
  • the data In1 and In2 each represent a 16-bit data, and each minimum data unit at this time represents 2-bit data (ie, a square represents a 2-bit data).
  • the minimum data units numbered 1, 2, 5 and 6 of the data In1 can be extracted first and arranged in the low-order positions. Then, the minimum data units numbered 1, 2, 5 and 6 of the data In2 are arranged in sequence. Similarly, the minimum data units numbered 3, 4, 7 and 8 of the data In1 and then of the data In2 are arranged in sequence, so as to finally form new data of 32 bits (or two pieces of 16-bit data) composed of the 16 minimum data units, as shown in the second row of squares in FIG. 9b.
  • the data In1 and In2 each represent one 32-bit data, and each minimum data unit represents 4-bit data (i.e., one square represents 4-bit data).
  • according to the bit width of the data and the aforementioned interleaving and splicing principle, the minimum data units numbered 1, 2, 3 and 4 of the data In1 and then of the data In2 can be extracted first and arranged in the low-order positions. Then, the minimum data units numbered 5, 6, 7 and 8 of the data In1 and then of the data In2 are extracted and arranged in sequence, thereby splicing to form new data of 64 bits (or two pieces of 32-bit data) composed of the final 16 minimum data units.
  • Exemplary data splicing manners of the present disclosure are described above in conjunction with FIGS. 9a-9c. However, it can be understood that in some computing scenarios, data splicing does not involve the above-mentioned interleaved arrangement, but is only a simple arrangement of the two data while keeping their original data positions unchanged, as shown in FIG. 9d. It can be seen from FIG. 9d that the two data In1 and In2 are not interleaved as in FIGS. 9a-9c; instead, the last minimum data unit of the data In1 and the first minimum data unit of In2 are simply concatenated to obtain new data with an increased (e.g., doubled) bit width. In some scenarios, the solution of the present disclosure can also perform group splicing based on data attributes. For example, neuron data or weight data belonging to the same feature map can be formed into a group and then arranged to form a continuous part of the spliced data.
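  • A minimal sketch of one of the splicing modes above, following the FIG. 9b description (units 1, 2, 5 and 6 of In1, then of In2, then units 3, 4, 7 and 8 of each). The group lists passed in are taken from that description; other splicing modes would simply pass different groups, and the function itself is illustrative rather than part of the disclosed hardware.

```python
def splice(in1, in2, groups):
    """For each group of minimum-data-unit numbers (1-based), append first the
    units of in1 and then the units of in2; index 0 of the result is the
    low-order end, as in the figures."""
    out = []
    for group in groups:
        out += [in1[i - 1] for i in group]
        out += [in2[i - 1] for i in group]
    return out

in1 = [f"a{i}" for i in range(1, 9)]   # the eight minimum data units of In1, low to high
in2 = [f"b{i}" for i in range(1, 9)]   # the eight minimum data units of In2, low to high
print(splice(in1, in2, [(1, 2, 5, 6), (3, 4, 7, 8)]))
# -> ['a1', 'a2', 'a5', 'a6', 'b1', 'b2', 'b5', 'b6', 'a3', 'a4', 'a7', 'a8', 'b3', 'b4', 'b7', 'b8']
```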
  • FIGS. 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by post-processing circuits according to embodiments of the present disclosure.
  • the compressing operation may include filtering the data with a mask or compressing by comparing a given threshold with the size of the data.
  • for the compression operation, the data can be divided and numbered in minimum data units as described previously. Similar to the description in connection with FIGS. 9a-9d, the minimum data unit may be, for example, 1-bit data, or 2, 4, 8, 16 or 32 bits in length. Exemplary descriptions of different data compression modes are given below in conjunction with FIGS. 10a to 10c.
  • the original data consists of eight squares (ie, eight minimum data units) sequentially numbered 1, 2..., 8 from right to left, assuming that each minimum data unit can represent 1 bit data.
  • the post-processing circuit may filter the original data by using the mask to perform the data compression operation.
  • the bit width of the mask corresponds to the number of minimum data units of the original data. For example, if the aforementioned original data has 8 minimum data units, the bit width of the mask is 8 bits; the minimum data unit numbered 1 corresponds to the lowest bit of the mask, the minimum data unit numbered 2 corresponds to the next-lowest bit, and so on, with the minimum data unit numbered 8 corresponding to the most significant bit of the mask.
  • the compression principle may be set to extract the smallest data unit in the original data corresponding to the data bit whose mask is "1".
  • the numbers of the smallest data units corresponding to the mask value "1" are 1, 2, 5, and 8.
  • the minimum data units numbered 1, 2, 5 and 8 can be extracted and arranged in order from low to high to form new compressed data, as shown in the second row of Figure 10a.
  • Fig. 10b shows the original data similar to Fig. 10a, and it can be seen from the second row of Fig. 10b that the data sequence passed through the post-processing circuit maintains the original data arrangement order and content. It will thus be appreciated that the data compression of the present disclosure may also include a disabled mode or a non-compressed mode so that no compression operation is performed when the data passes through the post-processing circuit.
  • the original data consists of eight squares arranged in sequence; the number above each square indicates its number, in order 1, 2, ..., 8 from right to left, and it is assumed that each minimum data unit is 8-bit data. Further, the number in each square represents the decimal value of that minimum data unit. Taking the minimum data unit numbered 1 as an example, its decimal value is "8", and the corresponding 8-bit data is "00001000".
  • the compression principle can be set to extract all the smallest data units in the original data that are greater than or equal to the threshold "8".
  • the smallest data units numbered 1, 4, 7 and 8 can be extracted. Then, arrange all the extracted minimum data units in descending order of numbers to obtain the final data result, as shown in the second row in Figure 10c.
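  • Both compression modes can be sketched in a few lines. The element values below are made up so that, as in the FIG. 10a and FIG. 10c descriptions, the units numbered 1, 2, 5 and 8 survive the mask and the units numbered 1, 4, 7 and 8 survive the threshold; keeping the survivors in their original low-to-high order is an assumption of this sketch.

```python
def mask_compress(units, mask):
    """Keep the minimum data units whose corresponding mask bit is 1;
    bit 0 of the mask gates the unit numbered 1 (the lowest unit)."""
    return [u for i, u in enumerate(units) if (mask >> i) & 1]

def threshold_compress(units, threshold):
    """Keep the minimum data units whose value is >= the threshold."""
    return [u for u in units if u >= threshold]

units = [8, 3, 1, 12, 2, 5, 9, 15]          # units numbered 1..8, low to high (illustrative values)
print(mask_compress(units, 0b10010011))      # mask selects units 1, 2, 5 and 8 -> [8, 3, 2, 15]
print(threshold_compress(units, 8))          # units with value >= 8            -> [8, 12, 9, 15]
```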
  • FIG. 11 is a simplified flow diagram illustrating a method 1100 of performing computational operations using a computing device, which may have the hardware architecture described in connection with FIGS. 1-4, according to an embodiment of the present disclosure.
  • the method 1100 may utilize the control circuit to obtain an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits.
  • the control circuit may determine, according to the instruction identification information in the instruction, one or more processing circuits that are to perform the operation, and send the parsed instruction to those one or more of the plurality of processing circuits so that they perform the corresponding operation specified by the parsed instruction.
  • in the process of parsing the instruction, the control circuit may perform a decoding operation on the instruction and send the parsed instruction to one or more of the plurality of processing circuits according to the decoding result.
  • for example, the control circuit can send the parsed instruction to processing circuits with low occupancy or in an idle state according to the operating states of the multiple processing circuits.
  • the parsed instruction may also be a parsed instruction that has not yet been decoded by the control circuit.
  • in this case, the one or more processing circuits may include corresponding decoding circuits to decode the received parsed instruction, for example to generate multiple micro-instructions, so that the one or more processing circuits can perform subsequent operations according to the micro-instructions.
  • at step 1120, the method 1100 may utilize the one or more processing circuits to perform multi-threaded operations according to the parsed instruction.
  • the plurality of processing circuits may be configured to receive and execute the parsed instructions in a single instruction multithreading ("SIMT") fashion.
  • the plurality of processing circuits may be connected in a one-dimensional or multi-dimensional array topology, and the processing circuits connected in series through such connections may form one or more closed loops.
  • a plurality of processing circuits may determine whether to execute the operation specified by the parsed instruction according to the received information (eg, predicate information) in the parsed instruction.
  • FIG. 12 is a structural diagram illustrating a combined processing apparatus 1200 according to an embodiment of the present disclosure.
  • the combined processing device 1200 includes a computing processing device 1202, an interface device 1204, other processing devices 1206, and a storage device 1208.
  • one or more computing devices 1210 may be included in the computing processing device, and the computing devices may be configured to perform the operations described herein in conjunction with FIG. 1 to FIG. 11 .
  • the computing processing devices of the present disclosure may be configured to perform user-specified operations.
  • the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor.
  • one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core.
  • when multiple computing devices are implemented as artificial intelligence processor cores or as parts of the hardware structure of artificial intelligence processor cores, the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user.
  • other processing devices of the present disclosure may include central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), artificial intelligence processors and other general-purpose and/or special-purpose processors.
  • processors may include, but are not limited to, digital signal processors (DSP), application specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof can be determined according to actual needs.
  • the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when computing processing devices and other processing devices are considered together, the two can be viewed as forming a heterogeneous multi-core structure.
  • the other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a related computing device for artificial intelligence such as neural network operations) and external data and control, performing basic controls including, but not limited to, data movement and starting and/or stopping the computing device.
  • other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
  • the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices.
  • the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device.
  • the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the computing and processing device chip.
  • the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
  • the combined processing device of the present disclosure may also include a storage device.
  • the storage device is connected to the computing processing device and the other processing device, respectively.
  • a storage device may be used to store data of the computing processing device and/or the other processing device.
  • the data may be data that cannot be fully stored in an internal or on-chip storage device of a computing processing device or other processing device.
  • the present disclosure also discloses a chip (eg, chip 1302 shown in FIG. 13 ).
  • the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 12 .
  • the chip can be connected with other related components through an external interface device (such as the external interface device 1306 shown in FIG. 13 ).
  • the relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface.
  • in some application scenarios, other processing units (such as video codecs) and/or interface modules (such as DRAM interfaces) may also be integrated on the chip.
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 13 .
  • FIG. 13 is a schematic structural diagram illustrating a board 1300 according to an embodiment of the present disclosure.
  • the board includes a storage device 1304 for storing data, which includes one or more storage units 1310 .
  • the storage device can be connected and data transferred with the control device 1308 and the chip 1302 described above through, for example, a bus.
  • the board also includes an external interface device 1306, which is configured to relay or transfer data between the chip (or a chip in a chip package structure) and an external device 1312 (such as a server or a computer).
  • the data to be processed can be transmitted to the chip by an external device through an external interface device.
  • the calculation result of the chip may be transmitted back to the external device via the external interface device.
  • the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
  • control device in the board of the present disclosure may be configured to regulate the state of the chip.
  • control device may include a single-chip microcomputer (Micro Controller Unit, MCU) for regulating the working state of the chip.
  • an electronic device or apparatus may include one or more of the above-mentioned boards, one or more of the above-mentioned chips and/or one or more of the above-mentioned combined processing devices.
  • the electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound machines and/or electrocardiographs.
  • the electronic equipment or device of the present disclosure can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites and medical care. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as the cloud, the edge and the terminal.
  • the electronic device or device with high computing power according to the solution of the present disclosure can be applied to a cloud device (e.g., a cloud server), while the electronic device or device with low power consumption can be applied to a terminal device and/or an edge device (such as a smartphone or a camera).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, depending on the solution, the present disclosure also focuses its description on some embodiments. In view of this, those skilled in the art can understand that the parts not described in detail in a certain embodiment of the present disclosure may also be found in the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (eg, a computer-readable storage medium), the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • the aforementioned memory may include, but is not limited to, a U disk, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, a CD, or other media that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which can be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, or the like.
  • Clause 1 A computing device comprising a control circuit and a plurality of processing circuits, wherein: the control circuit is configured to obtain and parse an instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and the plurality of processing circuits are configured to be connected in a one-dimensional or multi-dimensional array structure and to perform multi-threaded operations in accordance with the received parsed instructions.
  • Clause 2 The computing device of clause 1, wherein the control circuit is configured to: send the parsed instruction to one or more of the plurality of processing circuits according to the instruction identification information.
  • Clause 3 The computing device of clause 1, wherein the control circuit is configured to: send the parsed instruction to one or more of the plurality of processing circuits according to the result of the decoding and the operating states of the plurality of processing circuits.
  • Clause 4 The computing device of clause 1, wherein the plurality of processing circuits are divided into multiple types of processing circuits to perform different types of data processing.
  • Clause 5 The computing device of clause 1, wherein the plurality of processing circuits are divided into first type processing circuits and second type processing circuits, wherein the first type processing circuits are adapted to perform at least one of an arithmetic operation and a logical operation, and the second type processing circuits are adapted to perform at least one of a comparison operation and a table look-up operation.
  • Clause 6 The computing device of clause 1, wherein the multidimensional array is a two-dimensional array, and the processing circuits located in the two-dimensional array are connected, in at least one of the row, column or diagonal directions thereof, with the remaining one or more of the processing circuits in the same row, the same column or the same diagonal in a predetermined two-dimensional interval pattern.
  • Clause 7 The computing device of clause 6, wherein the predetermined two-dimensional spacing pattern is associated with a number of processing circuits spaced in the connection.
  • Clause 8 The computing device of clause 1, wherein the multidimensional array is a three-dimensional array composed of a plurality of layers, wherein each layer includes a two-dimensional array of a plurality of the processing circuits arranged in row, column and diagonal directions, and wherein: the processing circuits located in the three-dimensional array are connected, in at least one of the row, column, diagonal and layer directions, with the remaining one or more processing circuits in the same row, the same column, the same diagonal or a different layer in a predetermined three-dimensional interval pattern.
  • Clause 9 The computing device of clause 8, wherein the predetermined three-dimensional spacing pattern is associated with a number of spacings and layers of spacing between processing circuits to be connected.
  • Clause 10 The computing device of any of clauses 6-9, wherein the plurality of processing circuits are configured to be connected by logical connections to form one or more closed loops.
  • Clause 11 The computing device of clause 10, wherein the plurality of processing circuits are configured to determine from the parsed instructions whether to connect by logical connections to form one or more closed loops.
  • Clause 12 The computing device of clause 1, wherein a plurality of the processing circuits are configured to form at least one group of processing circuits to process data according to a bit width of the received data.
  • Clause 13 The computing device of clause 12, wherein when a plurality of the processing circuit groups are formed to process data, the plurality of processing circuit groups are connected by logical connections according to the parsed instructions to form one or more closed loops.
  • Clause 14 The computing device of clause 1, wherein each of the processing circuits comprises: a logic operation circuit configured to perform a logic operation according to the parsed instruction and the received data when performing the multi-threaded operation; and a storage circuit including a data storage circuit, wherein the data storage circuit is configured to store at least one of operation data and intermediate operation results of the processing circuit.
  • Clause 15 The computing device of clause 14, wherein the storage circuit further comprises a predicate storage circuit, wherein the predicate storage circuit is configured to store the predicate storage circuit number and predicate information of each of the processing circuits, obtained using the parsed instruction.
  • Clause 16 The computing device of clause 15, wherein the predicate storage circuit is further configured to: update the predicate information according to the operation result of each of the processing circuits.
  • Clause 17 The computing device of clause 15, wherein each of the processing circuits is configured to: determine, according to the predicate information, whether the processing circuit executes the parsed instruction.
  • Clause 18 The computing device of clause 1, wherein the processing circuit further comprises an arithmetic operation circuit configured to perform arithmetic operation operations.
  • Clause 19 The computing device of clause 1, further comprising: a data processing circuit comprising at least one of a pre-processing circuit and a post-processing circuit, wherein the pre-processing circuit is configured to perform a pre-processing operation on the operation data before the processing circuit performs an operation, and the post-processing circuit is configured to perform a post-processing operation on the operation result after the processing circuit performs the operation.
  • Clause 20 The computing device of clause 19, wherein each of the plurality of processing circuits in the closed loop is configured with a respective logical address, and the pre-processing circuit is configured to divide the operation data according to the operation type of the data and the logical addresses, and to transmit the multiple pieces of sub-data obtained after the division respectively to the corresponding processing circuits in the loop for operation.
  • Clause 21 The computing device of Clause 19, wherein the pre-processing circuit is further configured to select a data splicing mode from a plurality of data splicing modes according to the parsed instruction to perform a splicing operation on the two pieces of input data.
  • Clause 22 The computing device of clause 21, wherein the post-processing circuit is further configured to perform a compression operation on the data, the compression operation comprising filtering the data with a mask or filtering by comparing a given threshold with the data size.
  • Clause 23 The computing device of clause 1, further comprising:
  • a main storage circuit including at least one of a main storage module and a main cache module, wherein the main storage module is configured to store the data used for performing operations in the processing circuits and the operation results after the operations are performed, and the main cache module is configured to cache the intermediate operation results after the operations are performed in the processing circuits.
  • Clause 24 The computing device of any of clauses 1-9 or 11-23, wherein the plurality of processing circuits are configured to receive and execute the parsed instructions in a SIMT manner.
  • Clause 27 A method of performing an arithmetic operation using a computing device, wherein the computing device includes a control circuit and a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, the method comprising: utilizing the control circuit to obtain and parse an instruction, and to send the parsed instruction to one or more of the plurality of processing circuits; and utilizing the one or more processing circuits to perform multi-threaded operations in accordance with the parsed instruction.
  • Clause 28 The method of clause 27, wherein in parsing the instruction, the method utilizes the control circuit to perform:
  • the parsed instruction is sent to one or more of the plurality of processing circuits according to the instruction identification information.
  • Clause 29 The method of clause 27, wherein in parsing the instruction, the method utilizes the control circuit to perform:
  • the parsed instruction is sent to one or more of the plurality of processing circuits according to the result of the decoding and the operating state of the plurality of processing circuits.
  • Clause 30 The method of clause 27, comprising dividing the plurality of processing circuits into multiple types of processing circuits to perform different types of data processing.
  • Clause 31 The method of clause 30, wherein dividing the plurality of processing circuits into a plurality of types of processing circuits comprises dividing the plurality of processing circuits into a first type of processing circuits and a second type of processing circuits, wherein the first type of processing circuits are adapted to perform at least one of an arithmetic operation and a logical operation, and the second type of processing circuits are adapted to perform at least one of a comparison operation and a table look-up operation.
  • Clause 32 The method of clause 27, wherein the multidimensional array is a two-dimensional array, and the method comprises connecting the processing circuits located in the two-dimensional array, in at least one of the row, column or diagonal directions thereof, with the remaining one or more of the processing circuits in the same row, column or diagonal in a predetermined two-dimensional interval pattern.
  • Clause 33 The method of clause 32, wherein the predetermined two-dimensional spacing pattern is associated with a number of processing circuits spaced in the connection.
  • Clause 35 The method of clause 34, wherein the predetermined three-dimensional spacing pattern is associated with a number of spacings and layers of spacing between processing circuits to be connected.
  • Clause 36 The method of any of clauses 32-35, comprising connecting the plurality of processing circuits through logical connections to form one or more closed loops.
  • Clause 37 The method of clause 36, wherein the method comprises determining from the parsed instructions whether to connect the plurality of processing circuits by logical connections to form one or more closed loops.
  • Clause 38 The method of clause 27, wherein a plurality of said processing circuits are configured to form at least one group of processing circuits to process data according to a bit width of the received data.
  • Clause 39 The method of clause 38, wherein when a plurality of the processing circuit groups are formed to process data, the method comprises connecting the plurality of processing circuit groups by logical connections according to the parsed instructions , to form one or more closed loops.
  • Clause 40 The method of clause 27, wherein each of the processing circuits includes a logic operation circuit and a storage circuit, wherein the storage circuit includes a data storage circuit, and wherein the method comprises, when performing the multi-threaded operation, using the logic operation circuit to perform a logic operation according to the parsed instruction and the received data, and using the data storage circuit to store at least one of operation data and intermediate operation results of the processing circuit.
  • Clause 41 The method of clause 40, wherein the storage circuit further comprises a predicate storage circuit, wherein the method comprises using the predicate storage circuit to store the predicate storage circuit number and predicate information of each of the processing circuits, obtained using the parsed instruction.
  • Clause 42 The method of clause 41, further comprising utilizing the predicate storage circuit to perform the following steps:
  • the predicate information is updated according to the operation result of each of the processing circuits.
  • Clause 43 The method of clause 41, wherein whether the processing circuit executes the parsed instruction is determined according to the predicate information.
  • Clause 44 The method of clause 27, wherein the processing circuit further comprises an arithmetic operation circuit, the method comprising utilizing the arithmetic operation circuit to perform an arithmetic operation operation.
  • Clause 45 The method of clause 34, wherein the computing device further comprises a data processing circuit comprising at least one of a pre-processing circuit and a post-processing circuit, wherein the method comprises, before the processing circuit performs an operation, The preprocessing circuit is used to perform a preprocessing operation on the operation data, and after the processing circuit performs the operation, the postprocessing circuit is used to perform a postprocessing operation on the operation result.
  • Clause 46 The method of clause 45, wherein each of the plurality of processing circuits in the closed loop is configured with a respective logical address, the method comprising utilizing the pre-processing circuit to divide the operation data according to the operation type of the data and the logical addresses, and to transmit the multiple pieces of sub-data obtained after the division respectively to the corresponding processing circuits in the loop for operation.
  • Clause 47 The method of clause 45, wherein the method further comprises utilizing the pre-processing circuit to select a data splicing mode from a plurality of data splicing modes according to the parsed instruction, to perform a splicing operation on the two pieces of input data.
  • Clause 48 The method of clause 47, wherein the method further comprises using the post-processing circuit to perform a compression operation on the data, the compression operation comprising filtering the data using a mask or filtering by comparing a given threshold with the data size.
  • Clause 49 The method of clause 27, wherein the computing device further comprises a main storage circuit comprising at least one of a main storage module and a main cache module, wherein the method comprises using the main storage module to store the data used for performing operations in the processing circuits and the operation results after the operations are performed, and using the main cache module to cache the intermediate operation results after the operations are performed in the processing circuits.
  • Clause 50 The method of any of clauses 27-49, wherein the method comprises utilizing the plurality of processing circuits to receive and execute the parsed instructions in a SIMT manner.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

A computing apparatus, an integrated circuit chip, a board, and a method for executing arithmetic operations using the described computing apparatus. The computing apparatus may be included in a combined processing apparatus, and the combined processing apparatus may further comprise a universal interconnecting interface and other processing apparatuses. The computing apparatus interacts with the other processing apparatuses to jointly complete a computing operation designated by a user. The combined processing apparatus may further comprise a storage apparatus, and the storage apparatus is respectively connected to the computing apparatus and the other processing apparatuses and is used for storing data of the computing apparatus and the other processing apparatuses.

Description

Computing device, integrated circuit chip, board and computing method
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to the Chinese patent application No. 2020106181203, filed on June 30, 2020 and entitled "Computing Device, Integrated Circuit Chip, Board Card, and Computing Method", which is hereby incorporated by reference in its entirety.
Technical Field
This disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to a computing device, an integrated circuit chip, a board, and a method of performing computing operations using the aforementioned computing device.
Background
Existing artificial intelligence operations often include a large number of data operations, such as convolution operations and image processing. As the amount of data increases, the amount of computation and storage involved in data operations such as matrix operations increases sharply with the size of the data. In existing computing methods, a general-purpose processor such as a central processing unit ("CPU") or a graphics processing unit ("GPU") is usually used for computing. However, general-purpose processors often have high power consumption overhead due to their general-purpose features and high device redundancy, which limits their performance.
In addition, existing operation processing circuits usually adopt a fixed hardware architecture. When the data scale expands or the data format changes, such a circuit may not only be unable to support a certain type of operation, but its computing performance may also be severely limited during operation, or it may even become inoperable.
Summary of the Invention
In order to at least solve the above-mentioned defects in the prior art, the present disclosure provides a solution that supports multiple types of operations, improves operation efficiency, and saves operation cost and overhead. Specifically, the present disclosure provides the aforementioned solution in the following aspects.
In a first aspect, the present disclosure provides a computing device comprising a control circuit and a plurality of processing circuits, wherein: the control circuit is configured to obtain an instruction and parse the instruction, and to send the parsed instruction to one or more of the plurality of processing circuits; and the plurality of processing circuits are configured to be connected in a one-dimensional or multi-dimensional array structure and to perform multi-threaded operations in accordance with the received parsed instruction.
In a second aspect, the present disclosure provides an integrated circuit chip comprising the computing device of the foregoing and later-described embodiments.
In a third aspect, the present disclosure provides a board including the aforementioned integrated circuit chip.
In a fourth aspect, the present disclosure provides a method of performing an arithmetic operation using a computing device, wherein the computing device includes a control circuit and a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, the method comprising: utilizing the control circuit to obtain an instruction and parse the instruction, and to send the parsed instruction to one or more of the plurality of processing circuits; and utilizing the one or more processing circuits to perform multi-threaded operations according to the parsed instruction.
By using the computing device, integrated circuit chip, board and method disclosed herein, it is possible to overcome the operational limitations of a fixed hardware architecture, to improve the efficiency of data processing and computation in various data processing fields, including the field of artificial intelligence, and to reduce the power consumption overhead and cost of data operations.
Description of the Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and like or corresponding reference numerals refer to like or corresponding parts, wherein:
FIG. 1 is an overall architecture diagram illustrating a computing device according to an embodiment of the present disclosure;
FIG. 2 is an exemplary specific architecture diagram illustrating a computing device according to an embodiment of the present disclosure;
FIG. 3 is an example structural diagram illustrating an array of a single type of processing circuits of a computing device according to an embodiment of the present disclosure;
FIG. 4 is an example structural diagram illustrating an array of multiple types of processing circuits of a computing device according to an embodiment of the present disclosure;
FIGS. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure;
FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure;
FIGS. 7a, 7b, 7c and 7d are schematic diagrams illustrating various loop structures of processing circuits according to embodiments of the present disclosure;
FIGS. 8a, 8b and 8c are schematic diagrams illustrating further loop structures of processing circuits according to embodiments of the present disclosure;
FIGS. 9a, 9b, 9c and 9d are schematic diagrams illustrating data splicing operations performed by a pre-processing circuit according to an embodiment of the present disclosure;
FIGS. 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by a post-processing circuit according to an embodiment of the present disclosure;
FIG. 11 is a simplified flowchart illustrating a method of performing computing operations using a computing device according to an embodiment of the present disclosure;
FIG. 12 is a structural diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure; and
FIG. 13 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present disclosure.
The specific embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
FIG. 1 is an overall architecture diagram illustrating a computing device 100 according to an embodiment of the present disclosure. As shown in FIG. 1, the computing device 100 of the present disclosure may include a control circuit 102 and a plurality of processing circuits 104. In performing data processing, the control circuit may be configured to obtain an instruction and parse the instruction, and may send the parsed instruction to one or more of the plurality of processing circuits.
According to the solution of the present disclosure, the obtained instruction may include one or more opcodes, and each opcode may represent one or more specific operations to be performed by one or more processing circuits. Each opcode may be represented in any suitable form. For example, an opcode may be represented by an English abbreviation such as "ADD" or "MUL" to express that the operation to be performed is an "addition" or "multiplication" operation. Alternatively, an opcode may also be represented by an English abbreviation such as "AM" whose specific operation cannot be determined literally. Depending on the application scenario, the opcode may include or involve different types of operations, for example arithmetic operations such as addition or multiplication, logical operations, comparison operations, or table lookup operations, or any combination of the foregoing types of operations. Further, in the present disclosure, each opcode may correspond to one or more micro-instructions obtained in the process of parsing the instruction. Thus, the parsed instruction of the present disclosure may include one or more micro-instructions corresponding to an opcode in the instruction to indicate one or more specific operations to be performed by the processing circuit.
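Purely as an illustrative assumption (the mnemonics and the micro-instruction split below are examples chosen for this sketch and are not defined by the disclosure), the decomposition of an opcode into micro-instructions could be modelled as:

```python
# Hypothetical mapping from opcodes to the micro-instructions a parsed instruction may carry.
MICRO_OPS = {
    "ADD": ["load_operands", "add", "store_result"],
    "MUL": ["load_operands", "multiply", "store_result"],
    "AM":  ["load_operands", "add", "multiply", "store_result"],  # mnemonic whose meaning is not literal
}

def parse_opcode(opcode: str):
    """Return the micro-instructions corresponding to an opcode of the parsed instruction."""
    return MICRO_OPS.get(opcode, [])

print(parse_opcode("ADD"))   # ['load_operands', 'add', 'store_result']
```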
In one embodiment, in the process of parsing the instruction, the control circuit 102 may be configured to obtain instruction identification information in the instruction, and to send the parsed instruction to one or more of the plurality of processing circuits according to the instruction identification information, wherein one or more processing circuits are identified in the instruction identification information. Further, depending on the application scenario, the parsed instruction here may be an instruction decoded by the control circuit, or a parsed instruction that has not been decoded by the control circuit. When the parsed instruction is a parsed instruction that has not been decoded by the control circuit, the processing circuit may include a corresponding decoding circuit to decode the parsed instruction, for example, to obtain a plurality of micro-instructions.
In another embodiment, in the process of parsing the instruction, the control circuit may be configured to decode the instruction, and to send the parsed instruction to one or more of the plurality of processing circuits according to the result of the decoding and the operating states of the plurality of processing circuits. In this embodiment, the multiple processing circuits may all support non-specific operations of the same type. Therefore, in order to improve the utilization rate and operation efficiency of the processing circuits, the parsed instruction may be sent to processing circuits whose occupancy is low or which are in an idle state.
In one or more embodiments, the plurality of processing circuits 104 may be configured to be connected in a one-dimensional or multi-dimensional array structure and to perform multi-threaded operations according to the received parsed instruction. In one embodiment, the plurality of processing circuits may be configured to receive and execute the parsed instruction in a single instruction, multiple threads ("SIMT") fashion. In another embodiment, when the multiple processing circuits are configured to be connected in a multi-dimensional array structure, the multi-dimensional array may include a two-dimensional array and/or a three-dimensional array (as shown in FIGS. 5 and 6). Further, each processing circuit in the aforementioned one-dimensional or multi-dimensional array can be connected, within a certain range, to other processing circuits in a specified direction and with a predetermined interval pattern. In addition, multiple processing circuits may be connected in series through logical connections to form one or more closed loops (as shown in FIGS. 7 and 8).
In different application scenarios, the connection between the multiple processing circuits may be a hard-wired connection implemented by the hardware structure. Additionally or alternatively, the connection between the multiple processing circuits may also be a logical connection configured according to parsed instructions, such as micro-instructions. Through the above hard-wired or logical connections, various topologies of the processing circuit array can be formed so as to be suitable for performing the corresponding data processing operations.
FIG. 2 is an exemplary specific architecture diagram illustrating a computing device 200 according to an embodiment of the present disclosure. As can be seen from FIG. 2, the computing device 200 not only includes the control circuit 102 and the plurality of processing circuits 104 of the computing device 100 in FIG. 1, but also further shows a plurality of circuits included in each processing circuit, as well as a number of additional components. Since the functions of the control circuit and the processing circuits have been described in detail above with reference to FIG. 1, they will not be repeated below.
As shown in FIG. 2, the processing circuit 104 may include a logic operation circuit 1041, which may be configured to perform a logic operation according to the parsed instruction and the received data when performing the multi-threaded operation, for example, to perform logical operations such as AND, OR and NOT, shift operations or comparison operations on the received data. In addition to performing the necessary logical operations, the processing circuit 104 may also include an arithmetic operation circuit 1043, which may be configured to perform arithmetic operations, such as linear operations like addition, subtraction or multiplication.
In one embodiment, the processing circuit 104 may also include a storage circuit 1042, which includes a data storage circuit and/or a predicate storage circuit, wherein the data storage circuit may be configured to store at least one of the operation data (for example, pixels) of the processing circuit and intermediate operation results. Further, the predicate storage circuit may be configured to store the predicate storage circuit number and predicate information of each of the processing circuits, obtained using the parsed instruction. In a specific storage application, the storage circuit 1042 may be implemented by a memory such as a register or a static random access memory ("SRAM") according to actual needs.
In an application scenario, the predicate storage circuit may include a number a of 1-bit registers for storing predicate information. Further, the numbers of the a 1-bit registers can be represented by a b-bit binary number, where b >= log2(a). For example, the predicate storage circuit in a processing circuit may include 32 1-bit registers sequentially numbered from 00000 to 11111. Thus, the processing circuit can read the predicate information in the register numbered "00101" according to the register number "00101" specified in the received parsed instruction.
In one embodiment, the predicate storage circuit may be configured to update the predicate information according to the parsed instruction. For example, the predicate information may be updated directly according to configuration information in the parsed instruction, or the configuration information may be obtained according to a configuration information storage address provided in the parsed instruction, so as to update the predicate information. In the course of the processing circuit performing an operation, the predicate storage circuit may also update the predicate information according to the comparison result of each processing circuit (which, in the context of the present disclosure, is a form of operation result). For example, the predicate information may be updated by comparing the input data received by the processing circuit with the data stored in its data storage circuit. When the input data is greater than the stored data, the predicate information of the processing circuit is set to 1. Conversely, when the input data is smaller than the stored data, the predicate information is set to 0, or its original value is kept unchanged.
Before performing an operation, each processing circuit may determine, according to the information in the parsed instruction, whether it should execute the operation of the parsed instruction. Further, each of the processing circuits may be configured to obtain the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit number in the parsed instruction, and to determine, according to the predicate information, whether the processing circuit executes the parsed instruction. For example, when the value of the predicate information read by the processing circuit according to the predicate storage circuit number specified in the parsed instruction is 1, the processing circuit executes the parsed instruction; for example, the processing circuit may read the data pointed to by the instruction and store the read data into the data storage circuit of the processing circuit. Conversely, when the value of the predicate information read by the processing circuit according to the predicate storage circuit number specified in the parsed instruction is 0, the processing circuit does not execute the parsed instruction.
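A minimal sketch of the predicate storage behaviour described in the preceding paragraphs (the class and method names are illustrative assumptions; only the register numbering and the update-by-comparison rule are taken from the text above):

```python
class PredicateStorage:
    """A number a of 1-bit predicate registers, addressed by a b-bit number with b >= log2(a)."""
    def __init__(self, num_regs: int = 32):
        self.regs = [0] * num_regs

    def read(self, reg_number: str) -> int:
        return self.regs[int(reg_number, 2)]            # e.g. "00101" -> register 5

    def write(self, reg_number: str, value: int) -> None:
        self.regs[int(reg_number, 2)] = value

    def update_by_comparison(self, reg_number: str, input_data: int, stored_data: int) -> None:
        # Set the predicate to 1 when the input data is greater than the stored data;
        # otherwise set it to 0 (the text also allows leaving it unchanged).
        self.write(reg_number, 1 if input_data > stored_data else 0)

def maybe_execute(pred: PredicateStorage, reg_number: str, action) -> bool:
    """Execute `action` only when the predicate register named by the parsed instruction holds 1."""
    if pred.read(reg_number) == 1:
        action()
        return True
    return False

pred = PredicateStorage()
pred.update_by_comparison("00101", input_data=9, stored_data=4)   # 9 > 4, so the predicate becomes 1
maybe_execute(pred, "00101", lambda: print("processing circuit executes the parsed instruction"))
```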
In one embodiment, the computing device 200 of the present disclosure may also include a data processing circuit 106, which may include at least one of a pre-processing circuit 1061 and a post-processing circuit 1062. The pre-processing circuit 1061 may be configured to perform a pre-processing operation (described later in conjunction with FIG. 7b) on the operation data before the processing circuit performs an operation, such as a data splicing or data placement operation. The post-processing circuit 1062 may be configured to perform a post-processing operation on the operation result after the processing circuit performs the operation, such as a data restoration or data compression operation.
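The concrete splicing and compression modes are those of FIGS. 9 and 10; purely as a hedged illustration (the two modes below are assumptions made for this sketch, not the modes defined by the disclosure), a pre-processing step that selects a splicing mode for two pieces of input data could look like:

```python
def splice(a, b, mode):
    """Combine two pieces of input data according to a splicing mode selected by the parsed instruction."""
    if mode == "concat":              # place the elements of b after those of a
        return list(a) + list(b)
    if mode == "interleave":          # alternate elements taken from a and b
        out = []
        for x, y in zip(a, b):
            out.extend([x, y])
        return out
    raise ValueError(f"unknown splicing mode: {mode}")

print(splice([1, 2, 3], [4, 5, 6], "interleave"))   # [1, 4, 2, 5, 3, 6]
```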
In order to realize data transfer and storage, the computing device 200 may further include a main storage circuit 108, which can receive and store data from the control circuit as input data of the processing circuits, and can also be used to transfer and store data between the multiple processing circuits. In some application scenarios, the main storage circuit 108 may be further divided into at least one of a main storage module 1081 and a main cache module 1082 according to the storage method or the characteristics of the stored data. The main storage module 1081 may be configured to store the data (for example, input pixels) on which operations are to be performed in the processing circuits and the operation results (for example, output pixels) after the operations are performed. The main cache module 1082 may be configured to cache the intermediate operation results after operations are performed in the plurality of processing circuits. In some application scenarios, the main storage circuit not only provides internal storage but also supports data interaction with storage devices outside the computing device of the present disclosure; for example, it can exchange data with an external storage device through direct memory access ("DMA").
FIG. 3 is an example structural diagram illustrating an array of a single type of processing circuits of a computing device according to an embodiment of the present disclosure. As shown in FIG. 3, the computing device not only includes the control circuit 102, the main storage circuit 108, the data processing circuit 106 and a plurality of processing circuits 104 of the same type shown in FIG. 2, but further shows that a plurality of processing circuits of the same type may be arranged through physical connections to form a two-dimensional array. Since the functions of the control circuit, the main storage circuit, the data processing circuit and the processing circuits have been described in detail above with reference to FIG. 2, they will not be repeated here.
As previously mentioned, the plurality of processing circuits of the present disclosure may be divided according to type so as to perform different types of data processing operations. For example, the plurality of processing circuits may be divided into first type processing circuits and second type processing circuits (as shown in FIG. 4). In an application scenario, the first type processing circuits may be adapted to perform at least one of arithmetic operations and logical operations, while the second type processing circuits may be adapted to perform at least one of comparison operations and table lookup operations.
FIG. 4 is an example structural diagram illustrating an array of multiple types of processing circuits of a computing device according to an embodiment of the present disclosure. The computing device shown in FIG. 4 includes a control circuit 102, a main storage circuit 108 and a plurality of processing circuits 104 of different types. Optionally, the computing device may also include a data processing circuit 106 as shown in FIGS. 2 and 3. In view of this, the computing device architecture shown in FIG. 4 is similar to the architectures shown in FIGS. 2 and 3, so the technical details of the computing device 200 described in conjunction with FIGS. 2 and 3 also apply to the computing device shown in FIG. 4.
As can be seen from FIG. 4, the plurality of processing circuits of the present disclosure may include, for example, a plurality of first type processing circuits and a plurality of second type processing circuits (in the figure, processing circuits with different background colors are of different types). The plurality of processing circuits may be arranged through physical connections to form a two-dimensional array. It can be understood that the arrangement of the two types of processing circuits shown in FIG. 4 is merely exemplary and not limiting, and those skilled in the art may conceive of other arrangements based on the teachings of the present disclosure. For example, a plurality of first type processing circuits may be arranged on the left and right sides of the array, and a plurality of second type processing circuits may be arranged in the middle area of the array. For another example, a plurality of first type processing circuits may be arranged in the middle area of the array, and a plurality of second type processing circuits may be arranged around the array. For yet another example, a plurality of first type and second type processing circuits may also be interspersed at intervals in the array. Depending on the computing scenario, the types of processing circuits of the present disclosure are not limited to the two shown in the figure; there may be more types of processing circuits to implement different types of operations.
As shown in the figure, the two-dimensional array contains M rows and N columns (denoted M*N) of first-type processing circuits (the processing circuits 104 with a light background in the figure), where M and N are positive integers greater than 0. The first-type processing circuits may be used to perform arithmetic and logical operations, which may include, for example, linear operations such as addition, subtraction and multiplication, comparison operations, nonlinear operations such as AND, OR and NOT, or any combination of the foregoing. Further, on the left and right sides of the periphery of the M*N array of first-type processing circuits there are two columns each, for a total of (M*2+M*2) second-type processing circuits, and on the lower side of the periphery there are two rows, for a total of (N*2+8) second-type processing circuits; that is, the processing circuit array has a total of (M*2+M*2+N*2+8) second-type processing circuits (the processing circuits 104 with a dark background in the figure). In one embodiment, the second-type processing circuits may be used to perform nonlinear operations such as comparison operations, table lookup operations or shift operations on the received data.
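By way of illustration only, the following short Python sketch recomputes the circuit counts stated above from M and N; the function name and the layout assumptions (two extra columns of second-type circuits on each side, and two extra rows spanning the widened array of N+4 columns) are taken from this paragraph and are not part of any claimed implementation.

```python
def second_type_count(m: int, n: int) -> int:
    """Count the dark-background (second-type) circuits surrounding an
    M*N array of first-type circuits: two columns of M circuits on each
    side, plus two rows spanning the widened array of N + 4 columns."""
    side_columns = 2 * m + 2 * m      # M*2 + M*2
    bottom_rows = 2 * (n + 4)         # N*2 + 8
    return side_columns + bottom_rows

# Example: a 4 x 4 array of first-type circuits
assert second_type_count(4, 4) == 4 * 2 + 4 * 2 + 4 * 2 + 8   # 32
```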
In some application scenarios, the storage circuits used by the first-type and second-type processing circuits may have different storage scales and storage schemes. For example, the predicate storage circuit in a first-type processing circuit may store predicate information in a plurality of numbered registers. Further, the first-type processing circuit may access the predicate information in the correspondingly numbered register according to the register number specified in the received parsed instruction. As another example, a second-type processing circuit may store the predicate information in a static random access memory ("SRAM"). Specifically, the second-type processing circuit may determine the storage address of the predicate information in the SRAM according to the offset of the location of the predicate information specified in the received parsed instruction, and may perform predetermined read or write operations on the predicate information at that storage address.
FIGS. 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of a plurality of processing circuits according to embodiments of the present disclosure. The plurality of processing circuits of the present disclosure may be connected in a one-dimensional or multi-dimensional array topology. When a plurality of processing circuits are connected in a multi-dimensional array, the multi-dimensional array may be a two-dimensional array, and a processing circuit located in the two-dimensional array may be connected, in at least one of its row direction, column direction or diagonal direction, with the remaining one or more processing circuits in the same row, the same column or the same diagonal in a predetermined two-dimensional interval pattern. The predetermined two-dimensional interval pattern may be associated with the number of processing circuits spaced apart in the connection. FIGS. 5a to 5c exemplarily show various forms of two-dimensional array topologies between a plurality of processing circuits.
As shown in FIG. 5a, five processing circuits (each represented by a box) are connected to form a simple two-dimensional array. Specifically, one processing circuit serves as the center of the two-dimensional array, and one processing circuit is connected in each of the four horizontal and vertical directions relative to it, thereby forming a two-dimensional array of three rows and three columns. Further, since the processing circuit at the center of the two-dimensional array is directly connected to the adjacent processing circuits in the preceding and following columns of the same row and to the adjacent processing circuits in the preceding and following rows of the same column, the number of spaced processing circuits (the "interval number" for short) is 0.
As shown in FIG. 5b, processing circuits in four rows and four columns may be connected to form a two-dimensional torus array, in which each processing circuit is connected to its adjacent processing circuits in the preceding and following rows and in the preceding and following columns, that is, with an interval number of 0 for adjacent connections. Further, the first processing circuit of each row or column of the two-dimensional torus array is also connected to the last processing circuit of that row or column, and the interval number between these end-to-end connected processing circuits of each row or column is 2.
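As a purely illustrative aid, the sketch below enumerates the connections of a 4x4 two-dimensional torus such as the one in FIG. 5b; the coordinate convention and the helper name torus_neighbors are assumptions introduced for this example and do not describe the actual hardware wiring.

```python
def torus_neighbors(rows: int, cols: int):
    """Return a dict mapping (r, c) to its four torus neighbors.

    Adjacent circuits are directly connected (interval number 0); the
    wrap-around edge links the first and last circuit of every row and
    column, which in a 4x4 array skips two circuits (interval number 2).
    """
    links = {}
    for r in range(rows):
        for c in range(cols):
            links[(r, c)] = [
                ((r - 1) % rows, c),   # previous row
                ((r + 1) % rows, c),   # next row
                (r, (c - 1) % cols),   # previous column
                (r, (c + 1) % cols),   # next column
            ]
    return links

# In the 4x4 torus, circuit (0, 0) is linked to (3, 0) and (0, 3) by wrap-around.
assert (3, 0) in torus_neighbors(4, 4)[(0, 0)]
```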
As shown in FIG. 5c, processing circuits in four rows and four columns may also be connected to form a two-dimensional array in which the interval number between adjacent processing circuits is 0 and the interval number between non-adjacent processing circuits is 1. Specifically, in this two-dimensional array, adjacent processing circuits in the same row or column are directly connected, that is, with an interval number of 0, while non-adjacent processing circuits in the same row or column are connected with an interval number of 1. It can be seen that, when a plurality of processing circuits are connected to form a two-dimensional array, the processing circuits in the same row or column may be connected with different interval numbers, as shown in FIGS. 5b and 5c. Similarly, in some scenarios, processing circuits in the diagonal direction may also be connected with different interval numbers.
As shown in FIG. 5d, four two-dimensional torus arrays such as the one shown in FIG. 5b may be arranged as four layers at predetermined intervals and connected to form a three-dimensional torus array. On the basis of the two-dimensional torus array, the three-dimensional torus array uses an interval pattern similar to that between rows and between columns for its inter-layer connections. For example, the processing circuits in the same row and column of adjacent layers are first connected directly, that is, with an interval number of 0. Then the processing circuits in the same row and column of the first layer and the last layer are connected, that is, with an interval number of 2. A three-dimensional torus array of four layers, four rows and four columns is thus finally formed.
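Extending the previous sketch under the same assumptions, the following illustrative helper adds the layer direction with the same wrap-around rule, matching the four-layer torus just described; it is an illustration only, not the disclosed hardware wiring.

```python
def torus3d_neighbors(layers: int, rows: int, cols: int):
    """Return a dict mapping (l, r, c) to its six torus neighbors.

    Layers are linked with the same wrap-around rule as rows and
    columns: adjacent layers are directly connected, and the first and
    last layers are connected end to end.
    """
    links = {}
    for l in range(layers):
        for r in range(rows):
            for c in range(cols):
                links[(l, r, c)] = [
                    ((l - 1) % layers, r, c), ((l + 1) % layers, r, c),   # layer direction
                    (l, (r - 1) % rows, c), (l, (r + 1) % rows, c),       # row direction
                    (l, r, (c - 1) % cols), (l, r, (c + 1) % cols),       # column direction
                ]
    return links

# A 4x4x4 torus: circuit (0, 1, 1) wraps around to layer 3 in the layer direction.
assert (3, 1, 1) in torus3d_neighbors(4, 4, 4)[(0, 1, 1)]
```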
From the above examples, those skilled in the art will appreciate that the connection relationships of other multi-dimensional arrays of processing circuits can be formed on the basis of a two-dimensional array by adding new dimensions and increasing the number of processing circuits. In some application scenarios, the solution of the present disclosure may also configure logical connections for the processing circuits by using configuration instructions. In other words, although hard-wired connections may exist between processing circuits, the solution of the present disclosure may, through configuration instructions, selectively connect some processing circuits or selectively bypass some processing circuits so as to form one or more logical connections. In some embodiments, the aforementioned logical connections may further be adjusted according to the requirements of the actual operation (for example, a data type conversion). Further, for different computing scenarios, the solution of the present disclosure may configure the connections of the processing circuits, including, for example, configuring them into a matrix or into one or more closed computing loops.
FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further connection relationships of a plurality of processing circuits according to embodiments of the present disclosure. As can be seen from the figures, FIGS. 6a to 6d show further exemplary connection relationships of the multi-dimensional arrays formed by the plurality of processing circuits shown in FIGS. 5a to 5d. In view of this, the technical details described in conjunction with FIGS. 5a to 5d also apply to the content shown in FIGS. 6a to 6d.
As shown in FIG. 6a, the processing circuits of the two-dimensional array include a central processing circuit located at the center of the array and three processing circuits connected to it in each of the four directions along its row and column. Accordingly, the interval numbers of the connections between the central processing circuit and the remaining processing circuits are 0, 1 and 2, respectively. As shown in FIG. 6b, the processing circuits of the two-dimensional array include a central processing circuit located at the center of the array, three processing circuits in the two opposite directions along its row, and one processing circuit in each of the two opposite directions along its column. Accordingly, the interval numbers of the connections between the central processing circuit and the processing circuits in the same row are 0 and 2, respectively, while the interval numbers of the connections with the processing circuits in the same column are all 0.
As shown above in conjunction with FIG. 5d, the multi-dimensional array formed by a plurality of processing circuits may be a three-dimensional array composed of a plurality of layers. Each layer of the three-dimensional array may include a two-dimensional array of a plurality of the processing circuits arranged along its row and column directions. Further, a processing circuit located in the three-dimensional array may be connected, in at least one of its row direction, column direction, diagonal direction and layer direction, with the remaining one or more processing circuits in the same row, the same column, the same diagonal or a different layer in a predetermined three-dimensional interval pattern. Further, the predetermined three-dimensional interval pattern and the number of mutually spaced processing circuits in the connection may be related to the number of spaced layers. The connection manner of the three-dimensional array will be further described below in conjunction with FIGS. 6c and 6d.
FIG. 6c shows a three-dimensional array of multiple layers, rows and columns formed by connecting a plurality of processing circuits. Taking the processing circuit located at layer l, row r and column c (denoted (l, r, c)) as an example, it is located at the center of the array and is connected, respectively, to the processing circuits at the preceding column (l, r, c-1) and the following column (l, r, c+1) of the same layer and row, to the processing circuits at the preceding row (l, r-1, c) and the following row (l, r+1, c) of the same layer and column, and to the processing circuits at the preceding layer (l-1, r, c) and the following layer (l+1, r, c) of the same row and column. Further, the interval numbers of the connections between the processing circuit at (l, r, c) and the other processing circuits in the row, column and layer directions are all 0.
FIG. 6d shows a three-dimensional array in which the interval numbers of the connections between the processing circuits in the row, column and layer directions are all 1. Taking the processing circuit at the center of the array, (l, r, c), as an example, it is connected to the processing circuits at (l, r, c-2) and (l, r, c+2), each one column away in the same layer and row, and to the processing circuits at (l, r-2, c) and (l, r+2, c), each one row away in the same layer and column. Further, it is connected to the processing circuits at (l-2, r, c) and (l+2, r, c), each one layer away in the same row and column. Similarly, among the remaining processing circuits, those at (l, r, c-3) and (l, r, c-1), one column apart in the same layer and row, are connected to each other, and those at (l, r, c+1) and (l, r, c+3) are connected to each other. Next, the processing circuits at (l, r-3, c) and (l, r-1, c), one row apart in the same layer and column, are connected to each other, as are those at (l, r+1, c) and (l, r+3, c). In addition, the processing circuits at (l-3, r, c) and (l-1, r, c), one layer apart in the same row and column, are connected to each other, and those at (l+1, r, c) and (l+3, r, c) are connected to each other.
The connection relationships of the multi-dimensional arrays formed by a plurality of processing circuits have been exemplarily described above. Different loop structures formed by a plurality of processing circuits will be further exemplarily described below in conjunction with FIGS. 7 and 8.
FIGS. 7a, 7b, 7c and 7d are schematic diagrams respectively illustrating various loop structures of processing circuits according to embodiments of the present disclosure. Depending on the application scenario, a plurality of processing circuits may not only be connected according to their physical connection relationships, but may also be configured, according to a received parsed instruction, to be connected in a logical relationship. The plurality of processing circuits may be configured to be connected using the logical connection relationship so as to form a closed loop.
As shown in FIG. 7a, four adjacent processing circuits are sequentially numbered 0, 1, 2 and 3. Starting from processing circuit 0, the four processing circuits are connected in sequence in the clockwise direction, and processing circuit 3 is connected to processing circuit 0, so that the four processing circuits are connected in series to form a closed loop ("looping" for short). In this loop, the interval number between processing circuits is 0 or 2; for example, the interval number between processing circuits 0 and 1 is 0, while the interval number between processing circuits 3 and 0 is 2. Further, the physical addresses of the four processing circuits in the illustrated loop may be 0-1-2-3, and their logical addresses are likewise 0-1-2-3. It should be noted that the connection order shown in FIG. 7a is merely exemplary and not limiting; those skilled in the art may also, according to actual computing needs, connect the four processing circuits in series in the counterclockwise direction to form a closed loop.
In some practical scenarios, when the data bit width supported by a single processing circuit cannot meet the bit width requirement of the operational data, a plurality of processing circuits may be combined into a processing circuit group to represent one data item. For example, suppose one processing circuit can process 8-bit data. When 32-bit data needs to be processed, four processing circuits can be combined into one processing circuit group, so that four pieces of 8-bit data are concatenated to form one piece of 32-bit data. Further, a processing circuit group formed by the aforementioned four 8-bit processing circuits can serve as one of the processing circuits 104 shown in FIG. 7b, thereby supporting operations of higher bit widths.
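The lane splitting described above can be pictured with the following minimal Python sketch; the helper names and the least-significant-slice-first ordering are assumptions introduced for illustration only.

```python
def split_into_lanes(value: int, lanes: int = 4, lane_bits: int = 8):
    """Split one wide operand into per-circuit slices, least significant
    slice first, so that a group of narrow circuits can jointly hold it."""
    mask = (1 << lane_bits) - 1
    return [(value >> (i * lane_bits)) & mask for i in range(lanes)]

def join_lanes(slices, lane_bits: int = 8):
    """Reassemble the slices held by a processing circuit group."""
    return sum(s << (i * lane_bits) for i, s in enumerate(slices))

x = 0x12345678
assert split_into_lanes(x) == [0x78, 0x56, 0x34, 0x12]
assert join_lanes(split_into_lanes(x)) == x
```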
As can be seen from FIG. 7b, the layout of the processing circuits shown there is similar to that of FIG. 7a, but the interval numbers of the connections between the processing circuits differ from those of FIG. 7a. FIG. 7b shows four processing circuits sequentially numbered 0, 1, 2 and 3 that, starting from processing circuit 0 in the clockwise direction, are connected in the order processing circuit 1, processing circuit 3 and processing circuit 2, with processing circuit 2 connected back to processing circuit 0, thereby forming a closed loop in series. As can be seen from this loop, the interval numbers of the processing circuits shown in FIG. 7b are 0 or 1; for example, the interval between processing circuits 0 and 1 is 0, while the interval between processing circuits 1 and 3 is 1. Further, the physical addresses of the four processing circuits in the illustrated closed loop may be 0-1-2-3, while their logical addresses are 0-1-3-2. Therefore, when data of a high bit width needs to be split for distribution to different processing circuits, the data order can be rearranged and allocated according to the logical addresses of the processing circuits.
The above splitting and rearranging operations may be performed by the pre-processing circuit described in conjunction with FIG. 2. In particular, the pre-processing circuit may rearrange the input data according to the physical and logical addresses of the plurality of processing circuits so as to satisfy the requirements of the data operation. Assume that four sequentially arranged processing circuits 0 to 3 are connected as shown in FIG. 7a. Since both the physical addresses and the logical addresses of the connection are 0-1-2-3, the pre-processing circuit may transfer the input data (for example, pixel data) aa0, aa1, aa2 and aa3 to the corresponding processing circuits in sequence. However, when the aforementioned four processing circuits are connected as shown in FIG. 7b, their physical addresses remain 0-1-2-3 while their logical addresses become 0-1-3-2; in this case the pre-processing circuit needs to rearrange the input data aa0, aa1, aa2 and aa3 into aa0-aa1-aa3-aa2 for transfer to the corresponding processing circuits. Based on this rearrangement of the input data, the solution of the present disclosure can guarantee the correctness of the data operation order. Similarly, if the order of the four operation output results (for example, pixel data) obtained above is bb0-bb1-bb3-bb2, the post-processing circuit described in conjunction with FIG. 2 may be used to restore the order of the operation output results to bb0-bb1-bb2-bb3, so as to guarantee consistency of arrangement between the input data and the output result data.
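A minimal sketch of this reorder step follows, assuming the data items are modeled as a Python list and that logical_order[p] gives the logical address of the circuit at physical position p (for FIG. 7b this order is 0-1-3-2); the function names are hypothetical and not part of the disclosure.

```python
def scatter_by_logical_address(items, logical_order):
    """Place item k on the circuit that carries logical address k.

    `logical_order[p]` is the logical address of the circuit at physical
    position p, so physical slot p receives items[logical_order[p]].
    """
    return [items[logical_order[p]] for p in range(len(items))]

def gather_by_logical_address(slots, logical_order):
    """Inverse step used on the outputs: restore the original item order."""
    restored = [None] * len(slots)
    for p, logical in enumerate(logical_order):
        restored[logical] = slots[p]
    return restored

order = [0, 1, 3, 2]   # logical addresses of the FIG. 7b loop, by physical position
sent = scatter_by_logical_address(["aa0", "aa1", "aa2", "aa3"], order)
assert sent == ["aa0", "aa1", "aa3", "aa2"]
assert gather_by_logical_address(["bb0", "bb1", "bb3", "bb2"], order) == ["bb0", "bb1", "bb2", "bb3"]
```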
FIGS. 7c and 7d show larger numbers of processing circuits arranged and connected in different ways to form closed loops. As shown in FIG. 7c, the 16 processing circuits 104 numbered sequentially 0, 1, ..., 15 are, starting from processing circuit 0, connected and combined two at a time in sequence to form processing circuit groups. For example, as shown in the figure, processing circuit 0 is connected with processing circuit 1 to form one processing circuit group, and so on, until processing circuit 14 is connected with processing circuit 15 to form one processing circuit group, finally yielding eight processing circuit groups. Further, these eight processing circuit groups may also be connected in a manner similar to the aforementioned connection of individual processing circuits, including connection according to, for example, predetermined logical addresses, so as to form a closed loop of processing circuit groups.
As shown in FIG. 7d, the plurality of processing circuits 104 are connected in an irregular, or non-uniform, manner to form a closed loop. Specifically, FIG. 7d shows that the processing circuits may be connected with interval numbers of 0 or 3 to form a closed loop; for example, processing circuit 0 may be connected to processing circuit 1 (interval number 0) and to processing circuit 4 (interval number 3), respectively.
As can be seen from the above description in conjunction with FIGS. 7a, 7b, 7c and 7d, the processing circuits of the present disclosure may be spaced apart by different numbers of processing circuits and connected into a closed loop. When the total number of processing circuits changes, any number of intermediate intervals may also be selected for dynamic configuration, thereby connecting the circuits into a closed loop. A plurality of processing circuits may also be combined into processing circuit groups and connected into a closed loop of processing circuit groups. In addition, the connections between the plurality of processing circuits may be hard connections implemented in hardware, or soft connections configured by software.
FIGS. 8a, 8b and 8c are schematic diagrams illustrating further loop structures of processing circuits according to embodiments of the present disclosure. As shown in conjunction with FIG. 6, a plurality of processing circuits may form a closed loop, and each processing circuit in the closed loop may be configured with its own logical address. Further, the pre-processing circuit described in conjunction with FIG. 2 may be configured to split the operational data accordingly, based on the type of the operational data (for example, 32-bit, 16-bit or 8-bit data) and the logical addresses, and to transfer the resulting sub-data to the corresponding processing circuits in the loop for subsequent operations.
The upper diagram of FIG. 8a shows four processing circuits connected to form a closed loop, with the physical addresses (which may also be referred to as physical coordinates in the context of the present disclosure) of the four processing circuits, in order from right to left, represented as 0-1-2-3. The lower diagram of FIG. 8a shows that the logical addresses of the four processing circuits in the aforementioned loop, in order from right to left, are represented as 0-3-1-2. For example, the processing circuit with logical address "3" in the lower diagram of FIG. 8a has the physical address "1" shown in the upper diagram of FIG. 8a.
In some application scenarios, assume that the granularity of the operational data is the lower 128 bits of the input data, for example the original sequence "15, 14, ..., 2, 1, 0" in the figure (each number corresponding to 8 bits of data), and that the logical addresses of these 16 pieces of 8-bit data are numbered 0 to 15 from low to high. Further, according to the logical addresses shown in the lower diagram of FIG. 8a, the pre-processing circuit may encode or arrange the data with different logical addresses depending on the data type.
When the data bit width operated on by the processing circuits is 32 bits, the four groups of numbers with logical addresses (3,2,1,0), (7,6,5,4), (11,10,9,8) and (15,14,13,12) may represent the 0th to 3rd pieces of 32-bit data, respectively. The pre-processing circuit may transfer the 0th piece of 32-bit data to the processing circuit with logical address "0" (corresponding physical address "0"), the 1st piece of 32-bit data to the processing circuit with logical address "1" (corresponding physical address "2"), the 2nd piece of 32-bit data to the processing circuit with logical address "2" (corresponding physical address "3"), and the 3rd piece of 32-bit data to the processing circuit with logical address "3" (corresponding physical address "1"). Through this rearrangement, the data satisfy the subsequent operation requirements of the processing circuits. The mapping between the logical addresses and the physical addresses of the final data is therefore (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0).
When the data bit width operated on by the processing circuits is 16 bits, the eight groups of numbers with logical addresses (1,0), (3,2), (5,4), (7,6), (9,8), (11,10), (13,12) and (15,14) may represent the 0th to 7th pieces of 16-bit data, respectively. The pre-processing circuit may transfer the 0th and 4th pieces of 16-bit data to the processing circuit with logical address "0" (corresponding physical address "0"), the 1st and 5th pieces of 16-bit data to the processing circuit with logical address "1" (corresponding physical address "2"), the 2nd and 6th pieces of 16-bit data to the processing circuit with logical address "2" (corresponding physical address "3"), and the 3rd and 7th pieces of 16-bit data to the processing circuit with logical address "3" (corresponding physical address "1"). The mapping between the logical addresses and the physical addresses of the final data is therefore:
(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (13,12,5,4,11,10,3,2,15,14,7,6,9,8,1,0).
When the data bit width operated on by the processing circuits is 8 bits, the 16 numbers with logical addresses 0 to 15 may represent the 0th to 15th pieces of 8-bit data, respectively. According to the connection shown in FIG. 8a, the pre-processing circuit may transfer the 0th, 4th, 8th and 12th pieces of 8-bit data to the processing circuit with logical address "0" (corresponding physical address "0"); the 1st, 5th, 9th and 13th pieces of 8-bit data to the processing circuit with logical address "1" (corresponding physical address "2"); the 2nd, 6th, 10th and 14th pieces of 8-bit data to the processing circuit with logical address "2" (corresponding physical address "3"); and the 3rd, 7th, 11th and 15th pieces of 8-bit data to the processing circuit with logical address "3" (corresponding physical address "1"). The mapping between the logical addresses and the physical addresses of the final data is therefore (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (14,10,6,2,13,9,5,1,15,11,7,3,12,8,4,0).
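The three mappings quoted above for FIG. 8a can be reproduced with the following illustrative Python sketch. It assumes the distribution rule described in the preceding paragraphs (element k is placed on the circuit with logical address k modulo the number of circuits) and models each minimum unit as one byte number; the helper name and parameters are hypothetical. Under the same assumptions, calling ring_layout with total_bytes=32 and the FIG. 8b order [0, 7, 1, 6, 2, 5, 3, 4] is consistent with the per-circuit contents quoted below for that figure (for example, bytes 6, 14, 22 and 30 land on the circuit with logical address 6).

```python
def ring_layout(element_bytes: int, logical_order, total_bytes: int = 16):
    """Reproduce the byte layouts of FIG. 8a.

    Byte i of the input has number i; element k (of width `element_bytes`)
    goes to the circuit with logical address k % len(logical_order), where
    `logical_order[p]` is the logical address of the circuit at physical
    position p.  The returned list is the byte sequence ordered by physical
    position, highest byte first, matching the mappings quoted in the text.
    """
    circuits = len(logical_order)
    per_circuit = {addr: [] for addr in logical_order}
    for k in range(total_bytes // element_bytes):
        element = list(range(k * element_bytes, (k + 1) * element_bytes))
        per_circuit[k % circuits].extend(element)
    layout = []
    for p in range(circuits):
        layout.extend(per_circuit[logical_order[p]])
    return layout[::-1]   # list from the highest byte down, as written in the text

order = [0, 3, 1, 2]   # logical addresses of the FIG. 8a loop, by physical position
assert ring_layout(4, order) == [11, 10, 9, 8, 7, 6, 5, 4, 15, 14, 13, 12, 3, 2, 1, 0]
assert ring_layout(2, order) == [13, 12, 5, 4, 11, 10, 3, 2, 15, 14, 7, 6, 9, 8, 1, 0]
assert ring_layout(1, order) == [14, 10, 6, 2, 13, 9, 5, 1, 15, 11, 7, 3, 12, 8, 4, 0]
```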
The upper diagram of FIG. 8b shows eight sequentially numbered processing circuits 0 to 7 connected to form a closed loop, with the physical addresses of the eight processing circuits being 0-1-2-3-4-5-6-7. The lower diagram of FIG. 8b shows that the logical addresses of the aforementioned eight processing circuits are 0-7-1-6-2-5-3-4. For example, the processing circuit with physical address "6" in the upper diagram of FIG. 8b corresponds to the logical address "3" shown in the lower diagram of FIG. 8b.
For the different data types shown in FIG. 8b, the operation by which the pre-processing circuit rearranges the data and transfers it to the corresponding processing circuits is similar to that of FIG. 8a, so the technical solution described in conjunction with FIG. 8a also applies to FIG. 8b and the above data rearrangement process is not repeated here. Further, the connection relationship of the processing circuits shown in FIG. 8b is similar to that shown in FIG. 8a, but FIG. 8b shows eight processing circuits, twice the number shown in FIG. 8a. Thus, in application scenarios operating on different data types, the granularity of the operational data described in conjunction with FIG. 8b may be twice that described in conjunction with FIG. 8a. Therefore, whereas the granularity of the input data in the previous example was its lower 128 bits, the granularity of the operational data in this example may be the lower 256 bits of the input data, for example the original data sequence "31, 30, ..., 2, 1, 0" shown in the figure, where each number corresponds to a length of 8 bits.
For the above original data sequence, the figure also shows the arrangement of the data in the looped processing circuits when the data bit widths operated on by the processing circuits are 32 bits, 16 bits and 8 bits, respectively. For example, when the operated data bit width is 32 bits, the single piece of 32-bit data in the processing circuit with logical address "1" is (7,6,5,4), and the corresponding physical address of that processing circuit is "2". When the operated data bit width is 16 bits, the two pieces of 16-bit data in the processing circuit with logical address "3" are (23,22,7,6), and the corresponding physical address of that processing circuit is "6". When the operated data bit width is 8 bits, the four pieces of 8-bit data in the processing circuit with logical address "6" are (30,22,14,6), and the corresponding physical address of that processing circuit is "3".
The description above in conjunction with FIGS. 8a and 8b addressed data operations of different data types for the case where a plurality of processing circuits of a single type (such as the first-type processing circuits shown in FIG. 3) are connected to form a closed loop. Data operations of different data types are further described below in conjunction with FIG. 8c, for the case where a plurality of processing circuits of different types (such as the first-type and second-type processing circuits shown in FIG. 4) are connected to form a closed loop.
The upper diagram of FIG. 8c shows twenty processing circuits of multiple types, numbered sequentially 0, 1, ..., 19, connected to form a closed loop (the numbers shown in the figure are the physical addresses of the processing circuits). The sixteen processing circuits numbered 0 to 15 are first-type processing circuits, and the four processing circuits numbered 16 to 19 are second-type processing circuits. Similarly, the physical address of each of the twenty processing circuits has a mapping relationship with the logical address of the corresponding processing circuit shown in the lower diagram of FIG. 8c.
Further, when operating on different data types, for example on the original sequence of eighty 8-bit values shown in the figure, FIG. 8c also shows the result of operating on that original data for the different data types supported by the processing circuits. For example, when the operated data bit width is 32 bits, the single piece of 32-bit data in the processing circuit with logical address "1" is (7,6,5,4), and the corresponding physical address of that processing circuit is "2". When the operated data bit width is 16 bits, the two pieces of 16-bit data in the processing circuit with logical address "11" are (63,62,23,22), and the corresponding physical address of that processing circuit is "9". When the operated data bit width is 8 bits, the four pieces of 8-bit data in the processing circuit with logical address "17" are (77,57,37,17), and the corresponding physical address of that processing circuit is "18".
FIGS. 9a, 9b, 9c and 9d are schematic diagrams illustrating data splicing operations performed by the pre-processing circuit according to embodiments of the present disclosure. As mentioned above, the pre-processing circuit described in conjunction with FIG. 2 may also be configured to select one data splicing mode from a plurality of data splicing modes according to the parsed instruction, so as to perform a splicing operation on two pieces of input data. Regarding the plurality of data splicing modes, in one embodiment the solution of the present disclosure divides and numbers the two pieces of data to be spliced by minimum data unit, and then extracts different minimum data units of the data based on specified rules so as to form different data splicing modes. For example, the extraction and placement may be performed, for example alternately, based on the parity of the numbers or on whether a number is an integer multiple of a specified value, thereby forming different data splicing patterns. Depending on the computing scenario (for example, the data bit width), the minimum data unit here may simply be 1 bit of data, or may be 2, 4, 8, 16 or 32 bits in length. Further, when extracting the differently numbered parts of the two pieces of data, the solution of the present disclosure may extract alternately by single minimum data unit, or by multiples of the minimum data unit, for example alternately extracting from the two pieces of data partial data of two or three minimum data units at a time as a group and splicing group by group.
Based on the above description of the data splicing modes, the data splicing modes of the present disclosure are exemplarily explained below with specific examples in conjunction with FIGS. 9a to 9c. In the figures shown, the input data are In1 and In2, and when each square in the figure represents one minimum data unit, both pieces of input data have a bit width of eight minimum data units. As mentioned above, for data of different bit widths the minimum data unit may represent a different number of bits. For example, for data with a bit width of 8 bits the minimum data unit represents 1 bit of data, while for data with a bit width of 16 bits the minimum data unit represents 2 bits of data. As another example, for data with a bit width of 32 bits the minimum data unit represents 4 bits of data.
As shown in FIG. 9a, the two pieces of input data In1 and In2 to be spliced each consist of eight minimum data units numbered 1, 2, ..., 8 from right to left. The data are spliced according to an odd-even interleaving principle: from the smallest number to the largest, In1 before In2, and odd numbers before even numbers. Specifically, when the operated data bit width is 8 bits, In1 and In2 each represent one piece of 8-bit data, and each minimum data unit represents 1 bit of data (that is, one square represents 1 bit). According to the bit width of the data and the aforementioned splicing principle, the minimum data units of In1 numbered 1, 3, 5 and 7 are first extracted and arranged in sequence at the low positions. Next, the four odd-numbered minimum data units of In2 are arranged in sequence. Similarly, the minimum data units of In1 numbered 2, 4, 6 and 8 and the four even-numbered minimum data units of In2 are then arranged in sequence. Finally, the 16 minimum data units are spliced into one piece of new 16-bit data or two pieces of new 8-bit data, as shown by the second row of squares in FIG. 9a.
As shown in FIG. 9b, when the data bit width is 16 bits, In1 and In2 each represent one piece of 16-bit data, and each minimum data unit now represents 2 bits of data (that is, one square represents 2 bits). According to the bit width of the data and the aforementioned interleaved splicing principle, the minimum data units of In1 numbered 1, 2, 5 and 6 may first be extracted and arranged in sequence at the low positions. Then the minimum data units of In2 numbered 1, 2, 5 and 6 are arranged in sequence. Similarly, the minimum data units of In1 numbered 3, 4, 7 and 8 and the identically numbered minimum data units of In2 are arranged in sequence, so that the final 16 minimum data units are spliced into one piece of new 32-bit data or two pieces of new 16-bit data, as shown by the second row of squares in FIG. 9b.
As shown in FIG. 9c, when the data bit width is 32 bits, In1 and In2 each represent one piece of 32-bit data, and each minimum data unit represents 4 bits of data (that is, one square represents 4 bits). According to the bit width of the data and the aforementioned interleaved splicing principle, the minimum data units of In1 numbered 1, 2, 3 and 4 and then the identically numbered minimum data units of In2 may first be extracted and arranged in sequence at the low positions. Then the minimum data units of In1 numbered 5, 6, 7 and 8 and the identically numbered minimum data units of In2 are extracted and arranged in sequence, so that the final 16 minimum data units are spliced into one piece of new 64-bit data or two pieces of new 32-bit data.
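The three interleaving patterns of FIGS. 9a to 9c can be summarized by the following illustrative Python sketch, in which each minimum data unit is modeled as a list element and the group size 1, 2 or 4 corresponds to the 8-bit, 16-bit and 32-bit cases respectively; the helper name is an assumption for this example.

```python
def interleave_splice(in1, in2, group: int):
    """Splice two 8-unit operands in the interleaved patterns of FIGS. 9a-9c.

    The units of each operand (index 0 = unit number 1, the low position)
    are cut into blocks of `group` units; the even-indexed blocks of In1,
    then of In2, are placed at the low positions, followed by the
    odd-indexed blocks of In1 and of In2.
    """
    blocks1 = [in1[i:i + group] for i in range(0, len(in1), group)]
    blocks2 = [in2[i:i + group] for i in range(0, len(in2), group)]
    out = []
    for blocks in (blocks1, blocks2):
        for b in blocks[0::2]:        # even-indexed blocks first
            out.extend(b)
    for blocks in (blocks1, blocks2):
        for b in blocks[1::2]:        # then the odd-indexed blocks
            out.extend(b)
    return out

a = [f"a{i}" for i in range(1, 9)]    # units of In1, numbered 1..8 from the low position
b = [f"b{i}" for i in range(1, 9)]    # units of In2
# FIG. 9a pattern (group = 1): a1 a3 a5 a7 b1 b3 b5 b7 a2 a4 a6 a8 b2 b4 b6 b8
assert interleave_splice(a, b, 1)[:8] == ["a1", "a3", "a5", "a7", "b1", "b3", "b5", "b7"]
```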
The exemplary data splicing manners of the present disclosure have been described above in conjunction with FIGS. 9a to 9c. It can be understood, however, that in some computing scenarios the data splicing does not involve the above interleaved arrangement, but is merely a simple arrangement of the two pieces of data with their original data positions kept unchanged, as shown for example in FIG. 9d. As can be seen from FIG. 9d, the two pieces of data In1 and In2 are not interleaved as in FIGS. 9a to 9c; instead, the last minimum data unit of In1 and the first minimum data unit of In2 are simply placed end to end, thereby obtaining a new data type with an increased (for example, doubled) bit width. In some scenarios, the solution of the present disclosure may also perform grouped splicing based on data attributes. For example, neuron data or weight data belonging to the same feature map may be formed into a group and then arranged so as to constitute a contiguous part of the spliced data.
FIGS. 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by the post-processing circuit according to embodiments of the present disclosure. The compression operation may include filtering the data with a mask, or compressing by comparing the data against a given threshold. For the data compression operation, the data may be divided and numbered by minimum data unit as described above. Similar to what was described in conjunction with FIGS. 9a to 9d, the minimum data unit may be, for example, 1 bit of data, or 2, 4, 8, 16 or 32 bits in length. Exemplary descriptions of the different data compression modes are given below in conjunction with FIGS. 10a to 10c.
As shown in FIG. 10a, the original data consists of eight squares (that is, eight minimum data units) arranged in sequence and numbered 1, 2, ..., 8 from right to left, and it is assumed that each minimum data unit represents 1 bit of data. When the data compression operation is performed according to a mask, the post-processing circuit may filter the original data using the mask to perform the compression. In one embodiment, the bit width of the mask corresponds to the number of minimum data units of the original data. For example, since the aforementioned original data has 8 minimum data units, the mask is 8 bits wide, with the minimum data unit numbered 1 corresponding to the lowest bit of the mask, the minimum data unit numbered 2 corresponding to the next lowest bit, and so on, up to the minimum data unit numbered 8 corresponding to the highest bit of the mask. In one application scenario, when the 8-bit mask is "10010011", the compression principle may be set to extract the minimum data units of the original data that correspond to the mask bits equal to "1". For example, the numbers of the minimum data units corresponding to mask value "1" are 1, 2, 5 and 8. Accordingly, the minimum data units numbered 1, 2, 5 and 8 can be extracted and arranged in order from the lowest number to the highest to form the new compressed data, as shown in the second row of FIG. 10a.
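A minimal sketch of the mask-based compression just described follows, assuming the units are modeled as list elements and the mask is written as a bit string with its highest bit first; the helper name is hypothetical.

```python
def mask_compress(units, mask: str):
    """Keep only the minimum data units whose mask bit is '1'.

    `units[0]` is unit number 1 and corresponds to the lowest mask bit,
    so the mask string (MSB first, e.g. "10010011") is scanned from its
    right end.  Kept units stay in ascending number order.
    """
    kept_numbers = [i + 1 for i, bit in enumerate(reversed(mask)) if bit == "1"]
    return [units[n - 1] for n in kept_numbers]

units = [f"u{i}" for i in range(1, 9)]          # units numbered 1..8
assert mask_compress(units, "10010011") == ["u1", "u2", "u5", "u8"]
```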
FIG. 10b shows original data similar to that of FIG. 10a, and as can be seen from the second row of FIG. 10b, the data sequence passing through the post-processing circuit retains its original arrangement order and content. It can thus be understood that the data compression of the present disclosure may also include a disabled mode, or non-compression mode, in which no compression operation is performed when the data passes through the post-processing circuit.
As shown in FIG. 10c, the original data consists of eight squares arranged in sequence; the number above each square indicates its number, running 1, 2, ..., 8 from right to left, and it is assumed that each minimum data unit may be 8 bits of data. Further, the number inside each square represents the decimal value of that minimum data unit. Taking the minimum data unit numbered 1 as an example, its decimal value is "8", corresponding to the 8-bit data "00001000". When the data compression operation is performed according to a threshold, assuming the threshold is the decimal value "8", the compression principle may be set to extract all minimum data units of the original data that are greater than or equal to the threshold "8". Accordingly, the minimum data units numbered 1, 4, 7 and 8 can be extracted. All the extracted minimum data units are then arranged in order from the lowest number to the highest to obtain the final data result, as shown in the second row of FIG. 10c.
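A minimal sketch of the threshold-based compression, under the assumption that the unit values are held in a Python list in unit-number order; apart from the value 8 of unit 1 taken from the text, the example values are placeholders chosen so that units 1, 4, 7 and 8 pass the threshold, as in FIG. 10c.

```python
def threshold_compress(values, threshold: int):
    """Keep the minimum data units whose value is >= threshold,
    preserving their ascending number order (values[0] is unit number 1)."""
    return [v for v in values if v >= threshold]

# Hypothetical contents for the eight units of FIG. 10c; only unit 1's
# value of 8 is taken from the text, the rest are placeholders.
values = [8, 3, 1, 12, 5, 7, 9, 20]
assert threshold_compress(values, 8) == [8, 12, 9, 20]   # units 1, 4, 7 and 8
```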
FIG. 11 is a simplified flowchart illustrating a method 1100 of performing operations using a computing device according to an embodiment of the present disclosure, where the computing device may have the hardware architecture described in conjunction with FIGS. 1 to 4.
As shown in FIG. 11, at step 1110 the method 1100 may use the control circuit to obtain an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits. In one embodiment, the control circuit may determine, according to instruction identification information in the instruction, the one or more processing circuits that are to perform the operation, and send the parsed instruction to one or more of the plurality of processing circuits so as to perform the corresponding operation specified by the parsed instruction.
In one or more embodiments, in the process of parsing the instruction, the control circuit may perform a decoding operation on the instruction and, according to the decoding result, send the parsed instruction to one or more of the plurality of processing circuits. When a plurality of processing circuits all support the same, non-specific type of operation, the control circuit may, according to the operating states of the plurality of processing circuits, send the parsed instruction to processing circuits whose utilization is low or which are idle. Further, depending on the application scenario, the parsed instruction may also be a parsed instruction that has not been decoded by the control circuit, in which case the one or more processing circuits may include corresponding decoding circuits to decode the received parsed instruction, for example to generate a plurality of micro-instructions, so that the one or more processing circuits can perform subsequent operations according to the micro-instructions.
Next, the flow may proceed to step 1120, where the method 1100 may use the one or more processing circuits to perform multi-threaded operations according to the parsed instruction. In one embodiment, the plurality of processing circuits may be configured to receive and execute the parsed instruction in a single instruction, multiple threads ("SIMT") manner. In another embodiment, the plurality of processing circuits may be connected in a one-dimensional or multi-dimensional array topology, and the arrays of processing circuits connected in series through such connections may form one or more closed loops. In yet another embodiment, the plurality of processing circuits may determine, according to information in the received parsed instruction (for example, predicate information), whether to perform the operation specified by the parsed instruction.
FIG. 12 is a structural diagram illustrating a combined processing apparatus 1200 according to an embodiment of the present disclosure. As shown in FIG. 12, the combined processing apparatus 1200 includes a computing processing apparatus 1202, an interface apparatus 1204, other processing apparatuses 1206 and a storage apparatus 1208. Depending on the application scenario, the computing processing apparatus may include one or more computing apparatuses 1210, which may be configured to perform the operations described herein in conjunction with FIGS. 1 to 11.
In different embodiments, the computing processing apparatus of the present disclosure may be configured to perform user-specified operations. In exemplary applications, the computing processing apparatus may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing apparatuses included in the computing processing apparatus may be implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core. When a plurality of computing apparatuses are implemented as artificial intelligence processor cores or parts of the hardware structure of an artificial intelligence processor core, the computing processing apparatus of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure.
In exemplary operation, the computing processing apparatus of the present disclosure may interact with other processing apparatuses through the interface apparatus to jointly complete a user-specified operation. Depending on the implementation, the other processing apparatuses of the present disclosure may include one or more types of general-purpose and/or special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU) and an artificial intelligence processor. These processors may include, but are not limited to, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number may be determined according to actual needs. As mentioned above, the computing processing apparatus of the present disclosure, considered alone, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing apparatus and the other processing apparatuses are considered together, the two may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing devices may serve as an interface between the computing processing device of the present disclosure (which may be embodied as an operation device related to artificial intelligence operations such as neural network operations) and external data and control, performing basic control including, but not limited to, data movement and starting and/or stopping the computing device. In other embodiments, the other processing devices may also cooperate with the computing processing device to jointly complete computing tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and the other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write the input data into an on-chip storage device (or memory) of the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or optionally, the interface device may also read data from the storage device of the computing processing device and transmit it to the other processing devices.
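The three transfer paths just described can be illustrated informally as follows. This is only a behavioral sketch under stated assumptions; the class InterfaceDevice and its method names are made up for this example and do not describe an actual hardware interface.

```python
# Illustrative sketch of the interface device's three transfer paths:
# other device -> on-chip data storage, other device -> on-chip control cache,
# and on-chip data storage -> other device. All names are assumptions.
class InterfaceDevice:
    def __init__(self, on_chip_data: dict, control_cache: list):
        self.on_chip_data = on_chip_data      # on-chip storage of the computing processing device
        self.control_cache = control_cache    # on-chip control cache

    def write_input_data(self, name: str, data: list) -> None:
        # Another processing device pushes operands into on-chip storage.
        self.on_chip_data[name] = list(data)

    def write_control(self, instruction: str) -> None:
        # Control instructions are queued in the on-chip control cache.
        self.control_cache.append(instruction)

    def read_result(self, name: str) -> list:
        # Results flow back from on-chip storage to the other processing device.
        return list(self.on_chip_data.get(name, []))

iface = InterfaceDevice(on_chip_data={}, control_cache=[])
iface.write_input_data("x", [1, 2, 3])
iface.write_control("CONV x -> y")
print(iface.read_result("x"))   # [1, 2, 3]
```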
Additionally or optionally, the combined processing apparatus of the present disclosure may further include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing devices, respectively. In one or more embodiments, the storage device may be used to store data of the computing processing device and/or the other processing devices, for example, data that cannot be fully held in the internal or on-chip storage of the computing processing device or the other processing devices.
In some embodiments, the present disclosure also discloses a chip (e.g., the chip 1302 shown in FIG. 13). In one implementation, the chip is a system on chip (SoC) that integrates one or more combined processing apparatuses as shown in FIG. 12. The chip may be connected to other related components through an external interface device (such as the external interface device 1306 shown in FIG. 13). The related components may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. In some application scenarios, other processing units (e.g., a video codec) and/or interface modules (e.g., a DRAM interface) may also be integrated on the chip. In some embodiments, the present disclosure also discloses a chip package structure that includes the above-mentioned chip. In some embodiments, the present disclosure further discloses a board that includes the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 13.
FIG. 13 is a schematic structural diagram of a board 1300 according to an embodiment of the present disclosure. As shown in FIG. 13, the board includes a storage device 1304 for storing data, which includes one or more storage units 1310. The storage device may be connected to, and exchange data with, the control device 1308 and the above-described chip 1302 through, for example, a bus. Further, the board also includes an external interface device 1306 configured for data relay or transfer between the chip (or a chip in a chip package structure) and an external device 1312 (e.g., a server or a computer). For example, data to be processed may be transferred from the external device to the chip through the external interface device. For another example, a computation result of the chip may be transmitted back to the external device via the external interface device. Depending on the application scenario, the external interface device may take different interface forms; for example, it may adopt a standard PCIe interface.
In one or more embodiments, the control device in the board of the present disclosure may be configured to regulate the state of the chip. To this end, in one application scenario, the control device may include a micro controller unit (MCU) for regulating the working state of the chip.
From the above description in conjunction with FIG. 12 and FIG. 13, those skilled in the art will understand that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips, and/or one or more of the above-mentioned combined processing apparatuses.
Depending on the application scenario, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a visual terminal, an autonomous driving terminal, a means of transport, a household appliance, and/or a medical device. The means of transport includes an airplane, a ship, and/or a vehicle; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound scanner, and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical care. Further, the electronic device or apparatus of the present disclosure may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as the cloud, the edge, and terminals. In one or more embodiments, an electronic device or apparatus with high computing power according to the solutions of the present disclosure may be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the solutions of the present disclosure are not limited by the order of the described actions. Therefore, based on the disclosure or teachings of the present disclosure, those skilled in the art will understand that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art will understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for implementing one or more of the solutions of the present disclosure. In addition, depending on the solution, the descriptions of different embodiments in the present disclosure have different emphases. In view of this, those skilled in the art will understand that, for parts not described in detail in one embodiment of the present disclosure, reference may be made to the related descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teachings of the present disclosure, those skilled in the art will understand that the several embodiments disclosed herein may also be implemented in other ways not disclosed herein. For example, the units in the foregoing electronic device or apparatus embodiments are divided herein on the basis of logical functions, and there may be other division manners in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As far as the connection relationships between different units or components are concerned, the connections discussed above in conjunction with the accompanying drawings may be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections utilizing interfaces, where the communication interfaces may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit may physically exist separately.
In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solutions of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include several instructions for causing a computer device (e.g., a personal computer, a server, or a network device) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, various media capable of storing program code, such as a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
In other implementation scenarios, the above integrated units may also be implemented in the form of hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of a circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (e.g., the computing device or other processing devices) may be implemented by appropriate hardware processors, such as a CPU, GPU, FPGA, DSP, or ASIC. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM.
The foregoing may be better understood in light of the following clauses:
Clause 1. A computing device, comprising a control circuit and a plurality of processing circuits, wherein:
the control circuit is configured to acquire an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and
the plurality of processing circuits are configured to be connected in a one-dimensional or multi-dimensional array structure and to perform multi-threaded operations according to the received parsed instruction.
Clause 2. The computing device according to clause 1, wherein, in parsing the instruction, the control circuit is configured to:
acquire instruction identification information in the instruction; and
send the parsed instruction to one or more of the plurality of processing circuits according to the instruction identification information.
Clause 3. The computing device according to clause 1, wherein, in parsing the instruction, the control circuit is configured to:
decode the instruction; and
send the parsed instruction to one or more of the plurality of processing circuits according to a result of the decoding and operating states of the plurality of processing circuits.
Clause 4. The computing device according to clause 1, wherein the plurality of processing circuits are divided into multiple types of processing circuits so as to perform different types of data processing.
Clause 5. The computing device according to clause 1, wherein the plurality of processing circuits are divided into first-type processing circuits and second-type processing circuits, wherein the first-type processing circuits are adapted to perform at least one of an arithmetic operation and a logic operation, and the second-type processing circuits are adapted to perform at least one of a comparison operation and a table lookup operation.
Clause 6. The computing device according to clause 1, wherein the multi-dimensional array is a two-dimensional array, and a processing circuit located in the two-dimensional array is connected, in at least one of its row direction, column direction, and diagonal direction, to the remaining one or more processing circuits in the same row, the same column, or the same diagonal in a predetermined two-dimensional spacing pattern.
Clause 7. The computing device according to clause 6, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
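As an informal illustration of the two-dimensional spacing pattern referred to in clauses 6 and 7, consider a single row of N processing circuits where each circuit is connected to the circuit a fixed interval away, wrapping around the row. This is only a behavioral sketch; the indexing scheme and the function name are assumptions introduced for this example.

```python
# Informal sketch of an interval-based connection along one row: each of N
# processing circuits is linked to the circuit `interval` positions away
# (wrapping around), which partitions the row into gcd(N, interval) closed
# loops. Names are assumptions for this example.
from math import gcd

def interval_loops(num_circuits: int, interval: int) -> list:
    visited = set()
    loops = []
    for start in range(num_circuits):
        if start in visited:
            continue
        loop, node = [], start
        while node not in visited:
            visited.add(node)
            loop.append(node)
            node = (node + interval) % num_circuits   # skips `interval - 1` circuits
        loops.append(loop)
    return loops

print(interval_loops(8, 1))   # one loop: [[0, 1, 2, 3, 4, 5, 6, 7]]
print(interval_loops(8, 2))   # two loops of four circuits each
assert len(interval_loops(8, 2)) == gcd(8, 2)
```

The same idea extends to the column, diagonal, and layer directions mentioned in the following clauses, with the interval (and, for three dimensions, the number of skipped layers) selecting which circuits become neighbors.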
Clause 8. The computing device according to clause 1, wherein the multi-dimensional array is a three-dimensional array composed of multiple layers, each layer comprising a two-dimensional array of a plurality of the processing circuits arranged along a row direction, a column direction, and a diagonal direction, wherein:
a processing circuit located in the three-dimensional array is connected, in at least one of its row direction, column direction, diagonal direction, and layer direction, to the remaining one or more processing circuits in the same row, the same column, the same diagonal, or a different layer in a predetermined three-dimensional spacing pattern.
Clause 9. The computing device according to clause 8, wherein the predetermined three-dimensional spacing pattern is associated with the number of circuits spaced apart and the number of layers spaced apart between the processing circuits to be connected.
Clause 10. The computing device according to any one of clauses 6-9, wherein the plurality of processing circuits are configured to be connected by logical connections so as to form one or more closed loops.
Clause 11. The computing device according to clause 10, wherein the plurality of processing circuits are configured to determine, according to the parsed instruction, whether to be connected by logical connections so as to form one or more closed loops.
Clause 12. The computing device according to clause 1, wherein a plurality of the processing circuits are configured to form at least one processing circuit group according to the bit width of the received data, so as to process the data.
Clause 13. The computing device according to clause 12, wherein, when multiple processing circuit groups are formed to process data, the multiple processing circuit groups are connected by logical connections according to the parsed instruction so as to form one or more closed loops.
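The grouping by bit width described in clauses 12 and 13 can be sketched informally as follows. The lane width, the helper names, and the ring-style linking of groups are assumptions made for this example, not the claimed implementation.

```python
# Informal sketch: form processing circuit groups sized to the data bit width,
# then logically chain the groups into one closed loop. Assumes each circuit
# handles LANE_WIDTH bits; all names and the lane width are illustrative.
LANE_WIDTH = 8   # bits handled by one processing circuit (assumed)

def form_groups(circuit_ids: list, data_width: int) -> list:
    per_group = -(-data_width // LANE_WIDTH)          # ceiling division
    return [circuit_ids[i:i + per_group]
            for i in range(0, len(circuit_ids) - per_group + 1, per_group)]

def loop_links(groups: list) -> list:
    # Logically connect group i to group i+1, and the last group back to the
    # first, forming one closed loop of groups.
    return [(i, (i + 1) % len(groups)) for i in range(len(groups))]

groups = form_groups(list(range(16)), data_width=32)   # 4 circuits per group
print(groups)               # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(loop_links(groups))   # [(0, 1), (1, 2), (2, 3), (3, 0)]
```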
Clause 14. The computing device according to clause 1, wherein each of the processing circuits comprises:
a logic operation circuit configured to perform a logic operation according to the parsed instruction and the received data when performing the multi-threaded operation; and
a storage circuit comprising a data storage circuit, wherein the data storage circuit is configured to store at least one of operation data and an intermediate operation result of the processing circuit.
Clause 15. The computing device according to clause 14, wherein the storage circuit further comprises a predicate storage circuit, wherein the predicate storage circuit is configured to store a predicate storage circuit serial number and predicate information of each of the processing circuits, acquired by using the parsed instruction.
Clause 16. The computing device according to clause 15, wherein the predicate storage circuit is further configured to:
update the predicate information according to the parsed instruction; or
update the predicate information according to an operation result of each of the processing circuits.
Clause 17. The computing device according to clause 15, wherein each of the processing circuits is configured to:
acquire the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit serial number in the parsed instruction; and
determine, according to the predicate information, whether the processing circuit executes the parsed instruction.
Clause 18. The computing device according to clause 1, wherein the processing circuit further comprises an arithmetic operation circuit configured to perform an arithmetic operation.
Clause 19. The computing device according to clause 8, further comprising:
a data handling circuit comprising at least one of a pre-processing circuit and a post-processing circuit, wherein the pre-processing circuit is configured to perform a pre-processing operation on operation data before the processing circuit performs an operation, and the post-processing circuit is configured to perform a post-processing operation on an operation result after the processing circuit performs the operation.
Clause 20. The computing device according to clause 19, wherein each of the plurality of processing circuits in the closed loop is configured with a respective logical address, and the pre-processing circuit is configured to split the operation data according to the type and logical address of the operation data and to transfer the multiple pieces of sub-data obtained from the splitting to the corresponding processing circuits in the loop for operation.
Clause 21. The computing device according to clause 19, wherein the pre-processing circuit is further configured to select one data splicing mode from multiple data splicing modes according to the parsed instruction, so as to perform a splicing operation on two pieces of input data.
Clause 22. The computing device according to clause 21, wherein the post-processing circuit is further configured to perform a compression operation on data, the compression operation comprising filtering the data by using a mask or filtering the data by comparing the data with a given threshold.
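The two compression (filtering) modes named in clause 22 can be sketched informally as follows; the function names are assumptions for this example, and the sketch is not the disclosed post-processing circuit.

```python
# Informal sketch of the two compression modes: keeping elements selected by a
# mask, or keeping elements that pass a threshold comparison.
def compress_by_mask(data: list, mask: list) -> list:
    # Keep only elements whose mask bit is set.
    return [x for x, keep in zip(data, mask) if keep]

def compress_by_threshold(data: list, threshold: float) -> list:
    # Keep only elements larger than the given threshold.
    return [x for x in data if x > threshold]

values = [3, -1, 7, 0, 5]
print(compress_by_mask(values, [1, 0, 1, 0, 1]))   # [3, 7, 5]
print(compress_by_threshold(values, 2))            # [3, 7, 5]
```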
Clause 23. The computing device according to clause 1, further comprising:
a main storage circuit comprising at least one of a main storage module and a main cache module, wherein the main storage module is configured to store data to be used by the processing circuits in performing operations and operation results obtained after the operations are performed, and the main cache module is configured to cache intermediate operation results obtained after the processing circuits perform operations.
Clause 24. The computing device according to any one of clauses 1-9 or 11-23, wherein the plurality of processing circuits are configured to receive and execute the parsed instruction in a SIMT manner.
Clause 25. An integrated circuit chip, comprising the computing device according to any one of clauses 1-24.
Clause 26. A board, comprising the integrated circuit chip according to clause 25.
Clause 27. A method of performing operations by using a computing device, wherein the computing device comprises a control circuit and a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, the method comprising:
using the control circuit to acquire an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and
using the one or more processing circuits to perform multi-threaded operations according to the parsed instruction.
Clause 28. The method according to clause 27, wherein, in parsing the instruction, the method uses the control circuit to:
acquire instruction identification information in the instruction; and
send the parsed instruction to one or more of the plurality of processing circuits according to the instruction identification information.
Clause 29. The method according to clause 27, wherein, in parsing the instruction, the method uses the control circuit to:
decode the instruction; and
send the parsed instruction to one or more of the plurality of processing circuits according to a result of the decoding and operating states of the plurality of processing circuits.
Clause 30. The method according to clause 27, comprising dividing the plurality of processing circuits into multiple types of processing circuits so as to perform different types of data processing.
Clause 31. The method according to clause 27, wherein dividing the plurality of processing circuits into multiple types of processing circuits comprises dividing the plurality of processing circuits into first-type processing circuits and second-type processing circuits, wherein the first-type processing circuits are adapted to perform at least one of an arithmetic operation and a logic operation, and the second-type processing circuits are adapted to perform at least one of a comparison operation and a table lookup operation.
Clause 32. The method according to clause 27, wherein the multi-dimensional array is a two-dimensional array, and the method comprises connecting a processing circuit located in the two-dimensional array, in at least one of its row direction, column direction, and diagonal direction, to the remaining one or more processing circuits in the same row, the same column, or the same diagonal in a predetermined two-dimensional spacing pattern.
Clause 33. The method according to clause 32, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
Clause 34. The method according to clause 27, wherein the multi-dimensional array is a three-dimensional array composed of multiple layers, each layer comprising a two-dimensional array of a plurality of the processing circuits arranged along a row direction, a column direction, and a diagonal direction, the method comprising:
connecting a processing circuit located in the three-dimensional array, in at least one of its row direction, column direction, diagonal direction, and layer direction, to the remaining one or more processing circuits in the same row, the same column, the same diagonal, or a different layer in a predetermined three-dimensional spacing pattern.
Clause 35. The method according to clause 34, wherein the predetermined three-dimensional spacing pattern is associated with the number of circuits spaced apart and the number of layers spaced apart between the processing circuits to be connected.
Clause 36. The method according to any one of clauses 32-35, comprising connecting the plurality of processing circuits by logical connections so as to form one or more closed loops.
Clause 37. The method according to clause 36, wherein the method comprises determining, according to the parsed instruction, whether to connect the plurality of processing circuits by logical connections so as to form one or more closed loops.
Clause 38. The method according to clause 27, wherein a plurality of the processing circuits are configured to form at least one processing circuit group according to the bit width of the received data, so as to process the data.
Clause 39. The method according to clause 38, wherein, when multiple processing circuit groups are formed to process data, the method comprises connecting the multiple processing circuit groups by logical connections according to the parsed instruction so as to form one or more closed loops.
Clause 40. The method according to clause 27, wherein each of the processing circuits comprises a logic operation circuit and a storage circuit, and the storage circuit comprises a data storage circuit, wherein the method comprises, when performing the multi-threaded operation, using the logic operation circuit to perform a logic operation according to the parsed instruction and the received data, and using the data storage circuit to store at least one of operation data and an intermediate operation result of the processing circuit.
Clause 41. The method according to clause 40, wherein the storage circuit further comprises a predicate storage circuit, and the method comprises using the predicate storage circuit to store a predicate storage circuit serial number and predicate information of each of the processing circuits, acquired by using the parsed instruction.
Clause 42. The method according to clause 41, further comprising using the predicate storage circuit to perform the following steps:
updating the predicate information according to the parsed instruction; or
updating the predicate information according to an operation result of each of the processing circuits.
Clause 43. The method according to clause 41, further comprising using each of the processing circuits to perform the following steps:
acquiring the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit serial number in the parsed instruction; and
determining, according to the predicate information, whether the processing circuit executes the parsed instruction.
Clause 44. The method according to clause 27, wherein the processing circuit further comprises an arithmetic operation circuit, and the method comprises using the arithmetic operation circuit to perform an arithmetic operation.
Clause 45. The method according to clause 34, wherein the computing device further comprises a data handling circuit comprising at least one of a pre-processing circuit and a post-processing circuit, wherein the method comprises, before the processing circuit performs an operation, using the pre-processing circuit to perform a pre-processing operation on operation data, and, after the processing circuit performs the operation, using the post-processing circuit to perform a post-processing operation on an operation result.
Clause 46. The method according to clause 45, wherein each of the plurality of processing circuits in the closed loop is configured with a respective logical address, and the method comprises using the pre-processing circuit to split the operation data according to the type and logical address of the operation data and to transfer the multiple pieces of sub-data obtained from the splitting to the corresponding processing circuits in the loop for operation.
Clause 47. The method according to clause 45, wherein the method further comprises using the pre-processing circuit to select one data splicing mode from multiple data splicing modes according to the parsed instruction, so as to perform a splicing operation on two pieces of input data.
Clause 48. The method according to clause 47, wherein the method further comprises using the post-processing circuit to perform a compression operation on data, the compression operation comprising filtering the data by using a mask or filtering the data by comparing the data with a given threshold.
Clause 49. The method according to clause 27, wherein the computing device further comprises a main storage circuit comprising at least one of a main storage module and a main cache module, wherein the method comprises using the main storage module to store data to be used by the processing circuits in performing operations and operation results obtained after the operations are performed, and using the main cache module to cache intermediate operation results obtained after the processing circuits perform operations.
Clause 50. The method according to any one of clauses 27-49, wherein the method comprises using the plurality of processing circuits to receive and execute the parsed instruction in a SIMT manner.
Although multiple embodiments of the present disclosure have been shown and described herein, it will be apparent to those skilled in the art that such embodiments are provided by way of example only. Those skilled in the art may conceive of numerous modifications, changes, and substitutions without departing from the ideas and spirit of the present disclosure. It should be understood that various alternatives to the embodiments of the present disclosure described herein may be employed in practicing the present disclosure. The appended claims are intended to define the scope of protection of the present disclosure and therefore to cover equivalents and alternatives within the scope of these claims.

Claims (34)

  1. A computing device, comprising a control circuit and a plurality of processing circuits, wherein:
    the control circuit is configured to acquire an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and
    the plurality of processing circuits are configured to be connected in a one-dimensional or multi-dimensional array structure and to perform multi-threaded operations according to the received parsed instruction.
  2. The computing device according to claim 1, wherein, in parsing the instruction, the control circuit is configured to:
    acquire instruction identification information in the instruction; and
    send the parsed instruction to one or more of the plurality of processing circuits according to the instruction identification information;
    or
    in parsing the instruction, the control circuit is configured to:
    decode the instruction; and
    send the parsed instruction to one or more of the plurality of processing circuits according to a result of the decoding and operating states of the plurality of processing circuits.
  3. The computing device according to claim 1, wherein the plurality of processing circuits are divided into multiple types of processing circuits so as to perform different types of data processing.
  4. The computing device according to claim 1, wherein the plurality of processing circuits are divided into first-type processing circuits and second-type processing circuits, wherein the first-type processing circuits are adapted to perform at least one of an arithmetic operation and a logic operation, and the second-type processing circuits are adapted to perform at least one of a comparison operation and a table lookup operation.
  5. The computing device according to claim 1, wherein the multi-dimensional array is a two-dimensional array, and a processing circuit located in the two-dimensional array is connected, in at least one of its row direction, column direction, and diagonal direction, to the remaining one or more processing circuits in the same row, the same column, or the same diagonal in a predetermined two-dimensional spacing pattern, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
  6. The computing device according to claim 1, wherein the multi-dimensional array is a three-dimensional array composed of multiple layers, each layer comprising a two-dimensional array of a plurality of the processing circuits arranged along a row direction, a column direction, and a diagonal direction, wherein:
    a processing circuit located in the three-dimensional array is connected, in at least one of its row direction, column direction, diagonal direction, and layer direction, to the remaining one or more processing circuits in the same row, the same column, the same diagonal, or a different layer in a predetermined three-dimensional spacing pattern, wherein the predetermined three-dimensional spacing pattern is associated with the number of circuits spaced apart and the number of layers spaced apart between the processing circuits to be connected.
  7. The computing device according to claim 5 or 6, wherein the plurality of processing circuits are configured to determine, according to the parsed instruction, whether to be connected by logical connections.
  8. The computing device according to claim 1, wherein a plurality of the processing circuits are configured to form at least one processing circuit group according to the bit width of the received data, so as to process the data.
  9. The computing device according to claim 8, wherein, when multiple processing circuit groups are formed to process data, the multiple processing circuit groups are connected by logical connections according to the parsed instruction so as to form one or more closed loops.
  10. The computing device according to claim 1, wherein each of the processing circuits comprises:
    a logic operation circuit configured to perform a logic operation according to the parsed instruction and the received data when performing the multi-threaded operation; and
    a storage circuit comprising a data storage circuit and a predicate storage circuit, wherein the data storage circuit is configured to store at least one of operation data and an intermediate operation result of the processing circuit, and the predicate storage circuit is configured to store a predicate storage circuit serial number and predicate information of each of the processing circuits, acquired by using the parsed instruction.
  11. The computing device according to claim 10, wherein the predicate storage circuit is further configured to:
    update the predicate information according to the parsed instruction; or
    update the predicate information according to an operation result of each of the processing circuits.
  12. The computing device according to claim 10, wherein each of the processing circuits is configured to:
    acquire the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit serial number in the parsed instruction; and
    determine, according to the predicate information, whether the processing circuit executes the parsed instruction.
  13. The computing device according to claim 6, further comprising:
    a data handling circuit comprising at least one of a pre-processing circuit and a post-processing circuit, wherein the pre-processing circuit is configured to perform a pre-processing operation on operation data before the processing circuit performs an operation, and the post-processing circuit is configured to perform a post-processing operation on an operation result after the processing circuit performs the operation.
  14. The computing device according to claim 13, wherein each of the plurality of processing circuits in the closed loop is configured with a respective logical address, and the pre-processing circuit is configured to perform at least one of the following:
    splitting the operation data according to the type and logical address of the operation data and transferring the multiple pieces of sub-data obtained from the splitting to the corresponding processing circuits in the loop for operation; and
    selecting one data splicing mode from multiple data splicing modes according to the parsed instruction, so as to perform a splicing operation on two pieces of input data.
  15. The computing device according to claim 14, wherein the post-processing circuit is further configured to perform a compression operation on data, the compression operation comprising filtering the data by using a mask or filtering the data by comparing the data with a given threshold.
  16. The computing device according to any one of claims 1-15, wherein the plurality of processing circuits are configured to receive and execute the parsed instruction in a SIMT manner.
  17. An integrated circuit chip, comprising the computing device according to any one of claims 1-16.
  18. A board, comprising the integrated circuit chip according to claim 17.
  19. A method of performing computing operations by using a computing device, wherein the computing device comprises a control circuit and a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, the method comprising:
    using the control circuit to acquire an instruction, parse the instruction, and send the parsed instruction to one or more of the plurality of processing circuits; and
    using the one or more processing circuits to perform multi-threaded operations according to the parsed instruction.
  20. The method according to claim 19, wherein, in parsing the instruction, the method uses the control circuit to:
    acquire instruction identification information in the instruction; and
    send the parsed instruction to one or more of the plurality of processing circuits according to the instruction identification information;
    or
    in parsing the instruction, uses the control circuit to:
    decode the instruction; and
    send the parsed instruction to one or more of the plurality of processing circuits according to a result of the decoding and operating states of the plurality of processing circuits.
  21. The method according to claim 19, comprising dividing the plurality of processing circuits into multiple types of processing circuits so as to perform different types of data processing.
  22. The method according to claim 19, wherein dividing the plurality of processing circuits into multiple types of processing circuits comprises dividing the plurality of processing circuits into first-type processing circuits and second-type processing circuits, wherein the first-type processing circuits are adapted to perform at least one of an arithmetic operation and a logic operation, and the second-type processing circuits are adapted to perform at least one of a comparison operation and a table lookup operation.
  23. The method according to claim 19, wherein the multi-dimensional array is a two-dimensional array, and the method comprises connecting a processing circuit located in the two-dimensional array, in at least one of its row direction, column direction, and diagonal direction, to the remaining one or more processing circuits in the same row, the same column, or the same diagonal in a predetermined two-dimensional spacing pattern, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
  24. The method according to claim 19, wherein the multi-dimensional array is a three-dimensional array composed of multiple layers, each layer comprising a two-dimensional array of a plurality of the processing circuits arranged along a row direction, a column direction, and a diagonal direction, the method comprising:
    connecting a processing circuit located in the three-dimensional array, in at least one of its row direction, column direction, diagonal direction, and layer direction, to the remaining one or more processing circuits in the same row, the same column, the same diagonal, or a different layer in a predetermined three-dimensional spacing pattern, wherein the predetermined three-dimensional spacing pattern is associated with the number of circuits spaced apart and the number of layers spaced apart between the processing circuits to be connected.
  25. The method according to claim 23 or 24, wherein whether to connect the plurality of processing circuits by logical connections is determined according to the parsed instruction.
  26. The method according to claim 19, wherein a plurality of the processing circuits are formed into at least one processing circuit group according to the bit width of the received data, so as to process the data.
  27. The method according to claim 26, wherein, when multiple processing circuit groups are formed to process data, the method comprises connecting the multiple processing circuit groups by logical connections according to the parsed instruction so as to form one or more closed loops.
  28. The method according to claim 19, wherein each of the processing circuits comprises a logic operation circuit and a storage circuit, and the storage circuit comprises a data storage circuit and a predicate storage circuit, the method comprising, when performing the multi-threaded operation, using the logic operation circuit to perform a logic operation according to the parsed instruction and the received data, using the data storage circuit to store at least one of operation data and an intermediate operation result of the processing circuit, and using the predicate storage circuit to store a predicate storage circuit serial number and predicate information of each of the processing circuits, acquired by using the parsed instruction.
  29. The method according to claim 28, further comprising using the predicate storage circuit to perform the following steps:
    updating the predicate information according to the parsed instruction; or
    updating the predicate information according to an operation result of each of the processing circuits.
  30. The method according to claim 28, further comprising using each of the processing circuits to perform the following steps:
    acquiring the predicate information corresponding to the predicate storage circuit according to the predicate storage circuit serial number in the parsed instruction; and
    determining, according to the predicate information, whether the processing circuit executes the parsed instruction.
  31. The method according to claim 24, wherein the computing device further comprises a data handling circuit comprising at least one of a pre-processing circuit and a post-processing circuit, wherein the method further comprises, before the processing circuit performs an operation, using the pre-processing circuit to perform a pre-processing operation on operation data, and, after the processing circuit performs the operation, using the post-processing circuit to perform a post-processing operation on an operation result.
  32. The method according to claim 31, comprising configuring each of the plurality of processing circuits in the closed loop with a respective logical address, and using the pre-processing circuit to perform at least one of the following:
    splitting the operation data according to the type and logical address of the operation data and transferring the multiple pieces of sub-data obtained from the splitting to the corresponding processing circuits in the loop for operation; and
    selecting one data splicing mode from multiple data splicing modes according to the parsed instruction, so as to perform a splicing operation on two pieces of input data.
  33. The method according to claim 32, further comprising using the post-processing circuit to perform a compression operation on data, the compression operation comprising filtering the data by using a mask or filtering the data by comparing the data with a given threshold.
  34. The method according to any one of claims 19-33, comprising using the plurality of processing circuits to receive and execute the parsed instruction in a SIMT manner.
PCT/CN2021/094468 2020-06-30 2021-05-18 Computing apparatus, integrated circuit chip, board and computing method WO2022001439A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010618120.3 2020-06-30
CN202010618120.3A CN113867790A (en) 2020-06-30 2020-06-30 Computing device, integrated circuit chip, board card and computing method

Publications (1)

Publication Number Publication Date
WO2022001439A1 true WO2022001439A1 (en) 2022-01-06

Family

ID=78981876

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/094468 WO2022001439A1 (en) 2020-06-30 2021-05-18 Computing apparatus, integrated circuit chip, board and computing method

Country Status (2)

Country Link
CN (1) CN113867790A (en)
WO (1) WO2022001439A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1261966A (en) * 1997-06-30 2000-08-02 博普斯公司 Manifold array processor
CN103020890A (en) * 2012-12-17 2013-04-03 中国科学院半导体研究所 Visual processing device based on multi-layer parallel processing
US20200201612A1 (en) * 2015-04-23 2020-06-25 Google Llc Compiler for translating between a virtual image processor instruction set architecture (isa) and target hardware having a two-dimensional shift array structure
US20190304054A1 (en) * 2017-04-24 2019-10-03 Intel Corporation Compute optimization mechanism
CN110163349A (en) * 2018-02-12 2019-08-23 上海寒武纪信息科技有限公司 A kind of calculation method and device of network model
US20200201932A1 (en) * 2019-12-28 2020-06-25 Intel Corporation Apparatuses, methods, and systems for instructions of a matrix operations accelerator

Also Published As

Publication number Publication date
CN113867790A (en) 2021-12-31

Similar Documents

Publication Publication Date Title
US10949496B2 (en) Dimension shuffling using matrix processors
CN109189473A (en) Processing with Neural Network device and its method for executing vector exchange instruction
CN109032670A (en) Processing with Neural Network device and its method for executing vector duplicate instructions
CN111860807B (en) Fractal calculation device, fractal calculation method, integrated circuit and board card
WO2023045445A1 (en) Data processing device, data processing method, and related product
CN110059797B (en) Computing device and related product
CN111353598A (en) Neural network compression method, electronic device and computer readable medium
Huang et al. IECA: An in-execution configuration CNN accelerator with 30.55 GOPS/mm² area efficiency
CN112686379A (en) Integrated circuit device, electronic equipment, board card and calculation method
CN110059809B (en) Computing device and related product
WO2022001439A1 (en) Computing apparatus, integrated circuit chip, board and computing method
WO2022001457A1 (en) Computing apparatus, chip, board card, electronic device and computing method
WO2022001499A1 (en) Computing apparatus, chip, board card, electronic device and computing method
WO2022001500A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
WO2022001454A1 (en) Integrated computing apparatus, integrated circuit chip, board card, and computing method
WO2022001498A1 (en) Computing apparatus, integrated circuit chip, board, electronic device and computing method
WO2022001456A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device and computing method
CN114692844A (en) Data processing device, data processing method and related product
CN112766471A (en) Arithmetic device and related product
WO2022134872A1 (en) Data processing apparatus, data processing method and related product
WO2022111013A1 (en) Device supporting multiple access modes, method and readable storage medium
CN112395002B (en) Operation method, device, computer equipment and storage medium
CN112394990A (en) Floating point to half precision floating point instruction processing device and method and related products
JP2023532573A (en) Computing Devices, Integrated Circuit Chips, Board Cards, Electronics and Computing Methods
CN111291884A (en) Neural network pruning method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21832835

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21832835

Country of ref document: EP

Kind code of ref document: A1