CN117093263A - Processor, chip, board card and method - Google Patents

Processor, chip, board card and method

Info

Publication number
CN117093263A
CN117093263A
Authority
CN
China
Prior art keywords
operator
circuit
memory
processor
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210513914.2A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202210513914.2A
Publication of CN117093263A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/3004 Arrangements for executing specific machine instructions to perform operations on memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The present disclosure discloses a processor, a chip, a board card, and a corresponding method. The processor may be included as a computing device in a combined processing device, which may also include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further include a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. The scheme of the present disclosure provides a hardware architecture for executing fused operation instructions, which can improve the processing efficiency of a machine.

Description

Processor, chip, board card and method
Technical Field
The present disclosure relates generally to the field of processors. More particularly, the present disclosure relates to a processor that supports fusion operations, as well as a chip, a board, and a method for performing fusion operations using such a processor.
Background
Conventional processors, such as central processing units (CPUs), graphics processing units (GPUs), and digital signal processors (DSPs), can complete only one operation per instruction when performing data operations. For example, an add instruction completes one addition, result = a + b, and a multiply instruction completes one multiplication, result = a × b. When a more complex operation is required, for example result = (a + b) × c, two operation instructions are needed: the first instruction completes tmp = a + b and the second instruction completes result = tmp × c.
Such a scheme leads to frequent memory reads and writes when processing fused operations. In addition, each instruction can use only one arithmetic unit, so the resources of the arithmetic units cannot be fully utilized and computing power is wasted. A processor capable of efficiently supporting fusion operations is therefore needed.
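As an informal illustration only (the variable names and counters below are assumptions, not part of this disclosure), the following Python sketch contrasts instruction-by-instruction execution, which writes every intermediate result back to memory, with a fused evaluation that keeps the intermediate result inside the operator:

```python
# Hypothetical sketch of the memory-traffic difference described above.
# "memory" stands in for the on-chip memory; reads and writes are counted explicitly.
memory = {"a": 2, "b": 3, "c": 4}
reads = writes = 0

def load(name):
    global reads
    reads += 1
    return memory[name]

def store(name, value):
    global writes
    writes += 1
    memory[name] = value

# Conventional execution of result = (a + b) * c: two instructions, and the
# intermediate tmp is written back to memory and then read again.
store("tmp", load("a") + load("b"))        # add instruction: 2 reads, 1 write
store("result", load("tmp") * load("c"))   # mul instruction: 2 reads, 1 write
print(reads, writes)                       # 4 reads, 2 writes

# Fused execution: the intermediate result never touches memory.
reads = writes = 0
store("result", (load("a") + load("b")) * load("c"))  # 3 reads, 1 write
print(reads, writes)
```

The same counting argument is generalized to N chained operators later in the description.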
Disclosure of Invention
To at least partially solve one or more of the technical problems mentioned in the background, the solution of the present disclosure provides a processor, a chip, a board, and a method for performing a fusion operation using the processor.
In a first aspect, the present disclosure discloses a processor comprising a memory, a controller, and an operator, wherein: the memory is used for storing source data and a final operation result required by operation; the controller is used for decoding a fusion operation instruction and controlling the memory and the arithmetic unit to execute the fusion operation instruction, wherein the fusion operation instruction at least indicates to execute fusion operation comprising a plurality of operators on source data; and the arithmetic unit is used for acquiring source data from the memory under the control of the controller to execute the fusion operation and writing a final operation result back to the memory, wherein in the fusion operation, operation data required by an operator comprise intermediate operation results from the arithmetic unit.
In a second aspect, the present disclosure provides a chip comprising a processor of any one of the embodiments of the first aspect described above.
In a third aspect, the present disclosure provides a board comprising the chip of any one of the embodiments of the second aspect.
In a fourth aspect, the present disclosure provides a method of performing a fusion operation by a processor of any of the embodiments of the first aspect described above.
The processor, chip, board, and method provided by the embodiments of the present disclosure offer a hardware architecture that effectively supports fusion operations, together with schemes for executing fusion operations on that architecture, so that read/write operations on the memory and the corresponding memory power consumption can be reduced. Further, in some scenarios, the arithmetic circuits in the processor may be scheduled in a pipelined manner, so that the resources of the arithmetic circuits are fully utilized and the processing efficiency of the machine is improved through parallel operation.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts, and in which:
FIG. 1 illustrates a block diagram of a board of an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a combination processing device according to an embodiment of the present disclosure;
FIG. 3 illustrates an internal structural schematic diagram of a computing device of an embodiment of the present disclosure;
FIG. 4 shows a schematic block diagram of a processor according to another embodiment of the present disclosure;
FIG. 5 shows an example of a chain operation structure and a multi-branch operation structure in a fusion operation;
FIGS. 6a-6b illustrate process flows of a processor for a chain operation structure according to embodiments of the present disclosure; and
FIG. 7 illustrates an exemplary process in which a processor controls a fusion operation in a pipelined manner according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," and the like, as may appear in the claims, specification and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context.
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. A combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in the field of cloud intelligence; a notable characteristic of cloud intelligence applications is the large volume of input data and the high demands placed on the storage and computing capabilities of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, having large off-chip storage, large on-chip storage, and strong computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface means 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combined processing means 20 comprises computing means 201, interface means 202, processing means 203 and storage means 204.
The computing device 201 is configured to perform user-specified operations, primarily implemented as a single-core smart processor or as a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively accomplish the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control, including but not limited to data handling and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure may be regarded, on its own, as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they may be regarded as forming a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM, typically DDR memory, generally 16 GB or larger in size, and stores data for the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal structure of the computing device 201. The computing device 301 is configured to process input data in fields such as computer vision, speech, natural language, and data mining, and includes three modules: a control module 31 (also referred to as a controller), an operation module 32 (also referred to as an operator), and a storage module 33 (also referred to as a memory).
The control module 31 is used to coordinate and control the operation of the operation module 32 and the storage module 33 to complete deep learning tasks, and includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 fetches instructions from the processing device 203, and the instruction decode unit 312 decodes the fetched instructions and sends the decoding results, as control information, to the operation module 32 and the storage module 33.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 322 is responsible for the core computation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transfer related data, and includes a neuron storage unit (NRAM) 331, a weight storage unit (WRAM) 332, and a direct memory access module (DMA) 333. The NRAM 331 stores input neurons, output neurons, and intermediate results of computation; the WRAM 332 stores the convolution kernels, i.e., the weights, of the deep learning network; the DMA 333 is coupled to the DRAM 204 via the bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204.
Based on the foregoing hardware environment, in one aspect, the disclosed embodiments provide a processor that includes a memory, a controller, and an operator to support fused arithmetic operations, and may be implemented as the aforementioned computing device 201.
As mentioned in the background, existing processor solutions have several problems in handling fusion operations.
First, the memory is read and written excessively, resulting in high read/write power consumption. For example, to complete a fusion operation of N consecutive operators, such as result = ((IN0 op0 IN1) op1 IN2) … opN-1 INN, where IN0 … INN represent the input data, i.e., the source data required by the operators, and op0 … opN-1 represent the operators, N computation instructions are needed, together with 2N memory read operations and N memory write operations; that is, each computation instruction must read two operands (2 read operations) and write back one result (1 write operation).
Second, the resources of the arithmetic circuits in the processor cannot be fully utilized. In a processor such as a CPU, GPU, or DSP, there are generally various kinds of arithmetic circuits; for example, a multiplication circuit performs multiplication, an addition circuit performs addition, a comparison circuit performs comparison, and so on. When instructions are executed according to the prior art, each instruction can use only one arithmetic circuit, and the remaining arithmetic circuits are idle. For example, if there are M arithmetic circuits, then M-1 arithmetic circuits are idle each time an instruction is executed, resulting in a significant waste of processor computing power.
In view of this, in the processor according to embodiments of the present disclosure, when the processor is configured to perform a fusion operation, the memory is configured to store the source data required for the operation and the final operation result; the controller is configured to decode a fusion operation instruction and control the memory and the operator to execute the fusion operation instruction, where the fusion operation instruction at least indicates that a fusion operation comprising a plurality of operators is to be performed on the source data; and the operator is configured, under the control of the controller, to obtain the source data from the memory, execute the fusion operation, and write the final operation result back to the memory. In the fusion operation, the operation data required by an operator includes intermediate operation results produced within the operator (arithmetic unit).
As can be seen from the above scheme, when the operation data required by an operator in the fusion operation involves non-source data, for example an intermediate operation result produced by another operator, the operator obtains that intermediate result directly rather than from the memory. In other words, intermediate operation results of operators are not written back to the memory; only the final operation result is written back. Since write-back operations for intermediate results are eliminated, and the corresponding read operations by subsequent operations are eliminated as well, the read/write traffic to the memory is reduced, and the excessive memory power consumption caused by frequently reading and writing the memory is avoided.
When the operation data required by an operator in the fusion operation involves source data, the source data may be conveyed in various ways. In one example, the source data may be sent directly from the memory to the operator. In another example, the source data may be forwarded within the operator after being read out of the memory. Detailed examples are described later.
Fig. 4 shows a schematic block diagram of a processor 400 according to another embodiment of the present disclosure. As shown in fig. 4, the processor 400 may be used to execute fusion operation instructions and may include a controller 41, an operator 42, a memory 43, and a connection circuit 44. The functions of the controller 41, the operator 42, and the memory 43 are similar to those of the control module 31, the operation module 32, and the storage module 33 in fig. 3 and are not repeated here.
In some embodiments, when the controller 41 receives a fusion operation instruction, it decodes the instruction. Based on the decoding result, the controller 41 generates a read/write control signal for reading/writing the memory and sends it to the memory 43, and also generates an operation control signal and sends it to the operator 42. When the memory 43 receives the read/write control signal, it sends the corresponding data to the operator 42. The operator 42 receives the data from the memory and the operation control signal from the controller, completes the operation, and writes the final operation result back to the memory.
The arithmetic unit 42 may include a plurality of arithmetic circuits, which may be of the same type or different types, so as to perform corresponding operations. The types of arithmetic circuits may include, for example, but not limited to, multiplication circuits, addition circuits, comparison circuits, shift circuits, revolution circuits, transcendental function circuits, and various fusion arithmetic circuits, such as multiplication and addition circuits, and the like. Several arithmetic circuits 421-427 are shown by way of example in fig. 4.
The plurality of arithmetic circuits 421 to 427 of the operator 42 exchange data with one another, and with the memory 43, via the connection circuit 44. Through the connection circuit 44, any one of the arithmetic circuits 421 to 427 can receive data from any other arithmetic circuit and from the memory 43, and can send its operation result to any other arithmetic circuit and to the memory 43.
The connection circuit 44 may be implemented in various forms. In one example, the connection circuit 44 may be a cross bus (crossbar), also known as a cross-matrix switch. For example, when the operator 42 includes M arithmetic circuits, the cross bus may be an (M+1) × (M+1) cross-matrix switch. In another example, the connection circuit 44 may be a full-mesh circuit, with a direct connection between any two nodes (arithmetic circuits or the memory).
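For intuition only, the connectivity such a connection circuit provides can be pictured as an any-to-any graph over the arithmetic circuits and the memory. The Python sketch below is an assumption-laden illustration (the node names CC0, CC1, ... and the helper build_crossbar are invented for this example, not taken from the disclosure):

```python
# Hypothetical sketch of the any-to-any connectivity provided by a crossbar-style
# connection circuit: M arithmetic circuits plus the memory form (M+1) nodes,
# and any node can forward data to any other node in one hop.

from typing import Dict, List

def build_crossbar(num_circuits: int) -> Dict[str, List[str]]:
    nodes = [f"CC{i}" for i in range(num_circuits)] + ["memory"]
    # Full connectivity: every node can reach every other node directly.
    return {src: [dst for dst in nodes if dst != src] for src in nodes}

xbar = build_crossbar(num_circuits=3)
assert "memory" in xbar["CC0"]      # a circuit can write results to memory
assert "CC2" in xbar["CC0"]         # ... or forward them to another circuit
print(xbar["CC1"])                  # ['CC0', 'CC2', 'memory']
```

Whether the circuit is realized as a crossbar or a full-mesh network, the property that matters for the description below is simply that any arithmetic circuit can receive data from, and send results to, any other arithmetic circuit or the memory.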
If the arithmetic circuits were connected in a fixed order, for example with a multiplication circuit always feeding an addition circuit, then only fusion operations in which multiplication precedes addition could be handled, which would limit the fusion operations the processor can support. With the connection circuit, by contrast, data interaction between any two of the arithmetic circuits, and between any arithmetic circuit and the memory, is supported, so intermediate results of a fusion operation can be passed between arithmetic circuits. Furthermore, since any two circuits can exchange data, the order of the operators in a fusion operation is not restricted when the fusion operation is processed; as long as the arithmetic circuits can execute the individual operators, a fusion operation of any combination of operators can be supported.
Fusion operations are typically combined operations comprising a plurality of operators, the form of which may be varied. Roughly, the combination of operators can be divided into two categories: chain operation structure and multi-branch operation structure.
Fig. 5 shows an example of a chain operation structure and a multi-branch operation structure in a fusion operation.
As shown, the chain operation structure 510 is a structure in which multiple operators 511-51n are connected in series along a single chain. In a chain structure, the output of the previous operator serves as an input of the next operator, so a data dependency exists between adjacent operators. The chain operation structure as a whole may also be referred to as a single-branch operation structure. A chain operation structure can generally be expressed as result = ((IN0 op0 IN1) op1 IN2) … opN-1 INN, where IN0 … INN represent the input data, i.e., the source data required by the operators, and op0 … opN-1 represent the operators. As a specific example, the calculation result = max((a + b) × c, d) can be expanded into the chain operation structure result = ((a + b) × c) max d, which involves three operators in sequence: addition, multiplication, and comparison.
A multi-branch operation structure is one in which the fusion operation contains multiple operation branches; the branches have no dependency on one another, and the outputs of several branches may be provided to the same operator.
Two exemplary multi-branch operation structures are shown in fig. 5. The multi-branch operation structure 520 includes two operation branches, one containing the operator 521 and the other containing the operator 522. The outputs of these two operation branches are provided as inputs to the operator 523, which outputs the final result. The multi-branch operation structure 520 may represent, for example, the specific operation result = (a + b) × (c + d), in which the two additions are the two operation branches and there is no data dependency between them.
In the multi-branch operation structure 530, the output of the 1st operator 531 is provided simultaneously to the 2nd operator 532 and the 3rd operator 533, and the outputs of the 2nd and 3rd operators are both provided to the 4th operator 534. The multi-branch operation structure 530 thus also includes two operation branches (the 2nd operator 532 and the 3rd operator 533), but both receive as input the output of the same preceding operator 531. The multi-branch operation structure 530 may represent, for example, the specific operation result = max((a + b) × c, (a + b) × d), in which the two multiplications are the two operation branches; there is no data dependency between them, but both depend on the result of the preceding operator 531.
It will be appreciated that the above is merely an example; multi-branch operation structures may also be more complex, for example with nested branches, multiple occurrences of branching, or multiple operators in each branch, and the embodiments of the present disclosure are not limited in this respect. It will also be appreciated that if the operators of a multi-branch portion are combined and regarded as a single operator, a multi-branch operation structure can be converted into a chain operation structure. The processors of the disclosed embodiments may support fusion operations of any operation structure.
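As an informal aid (not taken from the disclosure; the tuple encoding and function names are assumptions for illustration), a fusion operation can be modeled as a small expression tree, and a chain structure is simply a tree that degenerates into a single path. The sketch below encodes both example operations above:

```python
# Hypothetical model of fusion-operation structures as nested tuples
# ("op", left, right), where leaves are names of source data.

def chain_structure():
    # result = ((a + b) * c) max d -- chain structure: each operator feeds the next
    return ("max", ("*", ("+", "a", "b"), "c"), "d")

def multi_branch_structure():
    # result = (a + b) * (c + d) -- two independent branches feeding one operator
    return ("*", ("+", "a", "b"), ("+", "c", "d"))

def evaluate(node, data):
    if isinstance(node, str):          # leaf: fetch source data
        return data[node]
    op, lhs, rhs = node
    l, r = evaluate(lhs, data), evaluate(rhs, data)
    return {"+": l + r, "*": l * r, "max": max(l, r)}[op]

data = {"a": 1, "b": 2, "c": 3, "d": 10}
print(evaluate(chain_structure(), data))         # max((1 + 2) * 3, 10) = 10
print(evaluate(multi_branch_structure(), data))  # (1 + 2) * (3 + 4) = 21
```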
FIG. 6a illustrates an exemplary process flow of a processor for a chain operation structure in accordance with an embodiment of the present disclosure. In this example, assume that a fusion operation consisting of N operators (OPs) needs to be completed on one batch of data, and that the fusion operation has a chain operation structure expressed as result = ((IN0 op0 IN1) op1 IN2) … opN-1 INN, where IN0 … INN represent the input data, i.e., the source data required by the operators, and op0 … opN-1 represent the operators. Assume further that the i-th operator is allocated to, and executed on, the i-th arithmetic circuit CCi. In this example, the source data required by each operator is transmitted directly from the memory to the corresponding arithmetic circuit. The overall process flow is as follows:
In a first step S611, the controller sends a read/write control signal for reading IN0 and IN1 to the memory, and sends an operation control signal for op0 to the operator. According to the read/write control signal, the memory reads IN0 and IN1 and sends them to the operator. According to the operation control signal, the arithmetic circuit CC0 corresponding to op0 is activated; for example, if op0 is a multiplication, the multiplication circuit is activated. The arithmetic circuit CC0 receives IN0 and IN1 from the memory through the connection circuit and completes the op0 computation, obtaining tmp1 = IN0 op0 IN1.
In a second step S612, the controller sends a read/write control signal for reading IN2 to the memory, and sends an operation control signal for op1 to the operator. The memory reads IN2 and sends it to the operator; the arithmetic circuit CC1 corresponding to op1 is activated. The arithmetic circuit CC1 receives, via the connection circuit, IN2 from the memory and the output tmp1 from the previous arithmetic circuit CC0 (i.e., the multiplication circuit), and completes tmp2 = tmp1 op1 IN2.
And so on, until the N-th step S61N, in which the controller sends a read/write control signal for reading INN to the memory and an operation control signal for opN-1 to the operator. The memory reads INN and sends it to the operator; the arithmetic circuit CCN-1 corresponding to opN-1 is activated. It receives, via the connection circuit, INN from the memory and the output tmpN-1 from the previous arithmetic circuit CCN-2, and completes the opN-1 computation to obtain result = tmpN-1 opN-1 INN. The final operation result is written back to the memory through the connection circuit.
As can be seen from the above operation procedure, the operation circuit executing the intermediate operator in the chain operation structure receives the corresponding source data for the corresponding operator from the memory via the connection circuit, receives the intermediate operation result from the operation circuit executing the previous operator, then executes the operation of the corresponding operator on the corresponding source data and the intermediate operation result, and sends the operation result to the operation circuit executing the next operator.
The arithmetic circuit executing the first operator in the chain arithmetic structure receives the corresponding source data for the first operator from the memory only through the connecting circuit because the operands are all source data and do not comprise intermediate arithmetic results, executes the operation of the first operator on the source data, obtains the intermediate arithmetic results and sends the intermediate arithmetic results to the arithmetic circuit executing the next operator.
The arithmetic circuit executing the last operator in the chain arithmetic structure is similar to the arithmetic circuit executing the intermediate operator, receives the corresponding source data for the last operator from the memory via the connection circuit, receives the intermediate arithmetic result from the arithmetic circuit executing the previous operator, and then executes the operation of the last operator on the corresponding source data and the intermediate arithmetic result. The difference is that since the operation result at this time is the final operation result, the final operation result is written back into the memory via the connection circuit after the operation of the last operator is performed.
According to the above analysis, a conventional processor processing a fusion operation of N operators on one batch of data needs to perform 2N memory read operations (reading the two operands of each operator) and N memory write operations (writing back the operation result of each operator). In contrast, a processor supporting fusion operations according to embodiments of the present disclosure needs only (N+1) memory read operations (reading the source data of each operator) and 1 memory write operation (writing back the final operation result). Therefore, when the processor of the disclosed embodiments executes a fusion operation, the number of memory reads and writes can be greatly reduced, thereby reducing memory power consumption.
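For intuition only (an informal sketch under the assumptions above, not the patented implementation; run_chain and its counters are invented names), the following Python model walks a chain of operators as FIG. 6a describes: each circuit reads one fresh operand from memory, takes the intermediate result from the previous circuit, and only the final result is written back:

```python
import operator

def run_chain(source, ops):
    """Simulate the FIG. 6a flow: source = [IN0, ..., INN], ops = [op0, ..., opN-1]."""
    reads = writes = 0
    tmp = None
    for i, op in enumerate(ops):
        if i == 0:
            a, b = source[0], source[1]   # CC0 reads IN0 and IN1 from memory
            reads += 2
        else:
            a, b = tmp, source[i + 1]     # CCi takes tmp from CCi-1, reads INi+1
            reads += 1
        tmp = op(a, b)                    # intermediate result stays in the operator
    writes += 1                           # only the final result is written back
    return tmp, reads, writes

# result = ((2 + 3) * 4) max 10 -> 20, with N + 1 = 4 reads and 1 write
result, r, w = run_chain([2, 3, 4, 10], [operator.add, operator.mul, max])
print(result, r, w)
```

With N = 3 operators, the sketch reports N + 1 = 4 reads and a single write, matching the count given above.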
FIG. 6b illustrates another exemplary process flow of a processor for a chain operation structure in accordance with an embodiment of the present disclosure. This example differs from the example of fig. 6a only in the way the source data is transmitted; the other assumptions are the same. In this example, the source data required by the current operator is forwarded, via the connection circuit, by the arithmetic circuit that executed the previous operator, rather than being fetched directly from memory. The overall process flow is as follows:
In a first step S621, the controller sends a read/write control signal for reading all of the source data IN0 ~ INN to the memory, and sends an operation control signal for op0 to the operator. According to the read/write control signal, the memory reads IN0 ~ INN and sends them to the operator. According to the operation control signal, the arithmetic circuit CC0 corresponding to op0 is activated; for example, if op0 is a multiplication, the multiplication circuit is activated. The arithmetic circuit CC0 receives IN0 ~ INN from the memory through the connection circuit, uses IN0 and IN1 among them to complete the op0 computation, and obtains tmp1 = IN0 op0 IN1.
In a second step S622, the controller sends an operation control signal for op1 to the operator. The arithmetic circuit CC1 corresponding to op1 is activated. CC1 receives, via the connection circuit, data from the previous arithmetic circuit CC0 (i.e., the multiplication circuit), including the output tmp1 of CC0 and the as-yet-unused source data IN2 ~ INN, and uses the corresponding source data IN2 and the output tmp1 to complete tmp2 = tmp1 op1 IN2.
And so on, until the N-th step S62N, in which the controller sends an operation control signal for opN-1 to the operator. The arithmetic circuit CCN-1 corresponding to opN-1 is activated. It receives, via the connection circuit, the output tmpN-1 of the previous arithmetic circuit CCN-2 and the remaining unused source data INN, and completes the opN-1 computation to obtain result = tmpN-1 opN-1 INN. The final operation result is written back to the memory through the connection circuit.
As can be seen from the above operation procedure, an arithmetic circuit executing an intermediate operator in the chain operation structure receives, via the connection circuit and from the arithmetic circuit executing the previous operator, the intermediate operation result together with all the source data required by this operator and the subsequent operators. It then performs the operation of the corresponding operator on the corresponding source data and the intermediate operation result, and sends its operation result, together with all the source data required by the subsequent operators, to the arithmetic circuit executing the next operator.
The arithmetic circuit executing the first operator in the chain arithmetic structure receives the source data of all operators in the fusion operation from the memory only through the connecting circuit because the operands are all source data and do not comprise intermediate arithmetic results, executes the operation of the first operator on the source data required by the operator to obtain intermediate arithmetic results, and sends the intermediate arithmetic results and the rest of source data to the arithmetic circuit executing the next operator.
The arithmetic circuit executing the last operator in the chain arithmetic structure is similar to the arithmetic circuit executing the intermediate operator, receives the intermediate arithmetic result and the corresponding source data thereof from the arithmetic circuit executing the previous operator via the connection circuit, and then executes the operation of the last operator on the corresponding source data and the intermediate arithmetic result. The difference is that since the operation result at this time is the final operation result, the final operation result is written back into the memory via the connection circuit after the operation of the last operator is performed.
It can be seen that in this embodiment, in the case where the data amount is not too large, the number of times of reading of the memory can be further reduced by reading the source data at one time, thereby reducing the power consumption of the memory.
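Continuing the informal model introduced after FIG. 6a (again an assumption-laden sketch rather than the disclosed implementation), the FIG. 6b variant can be expressed as one bulk read of all source data up front, with the remaining source data forwarded from circuit to circuit alongside the intermediate result:

```python
import operator

def run_chain_forwarding(source, ops):
    """FIG. 6b style: one bulk read of all source data, then circuit-to-circuit forwarding."""
    memory_reads = 1              # one read transaction fetches IN0 ~ INN together
    carried = list(source)        # CC0 receives all source data via the connection circuit
    tmp = ops[0](carried.pop(0), carried.pop(0))
    for op in ops[1:]:
        # each subsequent circuit receives tmp plus the unused source data from its predecessor
        tmp = op(tmp, carried.pop(0))
    memory_writes = 1             # only the final result goes back to memory
    return tmp, memory_reads, memory_writes

print(run_chain_forwarding([2, 3, 4, 10], [operator.add, operator.mul, max]))  # (20, 1, 1)
```

Here the read count collapses to a single bulk transaction, which is the effect described in the preceding paragraph for data volumes that are not too large.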
It will be appreciated that, although in the above example the i-th operator is assumed to be allocated to and executed on the i-th arithmetic circuit CCi, depending on the type and number of the actual fused operators and arithmetic circuits, multiple operators of the same operation type may be allocated to the same arithmetic circuit (i.e., duplicates exist among the CCi), or may be distributed over multiple different arithmetic circuits of the same type; the processor of the disclosed embodiments may support any of these implementations. It will further be appreciated that when two successive operators are allocated to the same arithmetic circuit, the operation result of the former operator does not need to be transmitted through the connection circuit, because it is already present in the arithmetic circuit that executes the latter operator.
In the embodiment of the disclosure, the operation processing process can be further optimized for the specific operation structure and the distribution of operators to the operation circuits, so that the processing efficiency is improved. For example, in some scenarios, it may be necessary to perform fusion operation processing on multiple batches of data, and when processing multiple batches of data, pipeline control may be used for the entire processing procedure according to the allocation situation of the operation circuit, so as to fully utilize the computation power of the operation circuit.
FIG. 7 shows an exemplary process in which a processor controls a fusion operation in a pipelined manner, in accordance with an embodiment of the present disclosure. In this example, it is assumed that a fusion operation needs to be performed on T batches of data, that the fusion operation has the chain operation structure result = ((IN0 op0 IN1) op1 IN2) … opN-1 INN, and that it can be distributed over N arithmetic circuits according to the data dependency relationships, where T>1 and N>1. The controller controls the memory to send the corresponding source data from the T batches of data so as to support the N arithmetic circuits in executing their corresponding operators on data of different batches simultaneously. It will be appreciated that, here, the N arithmetic circuits do not overlap in use.
Specifically, as shown, the overall processing pipeline includes a memory 710 and a plurality of arithmetic circuits arranged according to the data dependency relationships. For simplicity, assume the fusion operation result = ((a op0 b) op1 c) op2 d is to be performed, which requires 3 arithmetic circuits CC0 ~ CC2, 711 to 713, where each later arithmetic circuit CCi requires the output of the preceding arithmetic circuit CCi-1 as an input. The left side of FIG. 7 shows a timeline, with the steps proceeding in time order.
In a first step S71, the memory sends a0 and b0 of the batch-1 data, and the arithmetic circuit CC0 executes tmp0_0 = a0 op0 b0.
In a second step S72, the memory sends c0 of the batch-1 data, and the arithmetic circuit CC1 executes tmp0_1 = tmp0_0 op1 c0; at the same time, the memory sends a1 and b1 of the batch-2 data, and the arithmetic circuit CC0 executes tmp1_0 = a1 op0 b1.
In a third step S73, the memory sends d0 of the batch-1 data, and the arithmetic circuit CC2 executes tmp0_2 = tmp0_1 op2 d0; at the same time, the memory sends c1 of the batch-2 data, and the arithmetic circuit CC1 executes tmp1_1 = tmp1_0 op1 c1; at the same time, the memory sends a2 and b2 of the batch-3 data, and the arithmetic circuit CC0 executes tmp2_0 = a2 op0 b2. Since the operator executed by the arithmetic circuit CC2 is the last operator in the chain operation structure, its operation result is the final operation result result = tmp0_2, which is written back to the memory.
Similarly, the whole process flow can be controlled to run in a pipelined manner until the data of T batches are processed.
As can be seen from the above procedure, in a given operation period the controller may control the i-th arithmetic circuit, which executes the i-th operator, to perform the operation of the i-th operator on the j-th batch of corresponding source data from the memory and on the intermediate operation result of the previous operation period from the (i-1)-th arithmetic circuit, where 1<i<N and 0<j<T+1; at the same time, the controller controls the (i+1)-th arithmetic circuit, which executes the (i+1)-th operator, to perform the operation of the (i+1)-th operator on the (j-1)-th batch of corresponding source data from the memory and on the intermediate operation result of the previous operation period from the i-th arithmetic circuit.
This pipelined execution can fully utilize the operator: the fusion operation on T batches of data can be completed in T+N-1 steps. By contrast, a conventional processor would need T×N steps, so the performance of the processor of the disclosed embodiments is improved by a factor of T×N/(T+N-1); when T is much greater than N, the improvement approaches a factor of N. That is, when the operator contains N different arithmetic circuits, the processor supporting fusion operations according to the embodiments of the present disclosure can achieve at most an N-fold performance improvement over a conventional processor.
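As a rough, purely illustrative check of the T+N-1 step count (this scheduling function is an assumption for illustration, not taken from the patent), the pipeline can be enumerated as follows: in step s, circuit CCi works on batch s-i whenever that batch index is valid:

```python
def pipeline_schedule(T, N):
    """Enumerate which batch each circuit CCi processes in each step (None = idle)."""
    steps = []
    for s in range(T + N - 1):                       # total steps = T + N - 1
        row = []
        for i in range(N):                           # circuit CCi runs operator op_i
            batch = s - i
            row.append(batch if 0 <= batch < T else None)
        steps.append(row)
    return steps

# T = 4 batches through N = 3 circuits finishes in 4 + 3 - 1 = 6 steps,
# versus T * N = 12 steps if each batch were processed serially.
for s, row in enumerate(pipeline_schedule(T=4, N=3), start=1):
    print(f"step {s}: " + ", ".join(
        f"CC{i}<-batch{b}" if b is not None else f"CC{i} idle"
        for i, b in enumerate(row)))
```

Running the sketch with T = 4 and N = 3 shows circuit CC0 starting batch 3 while CC2 is still working on batch 1, which is exactly the overlap described for steps S71-S73 above.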
It should be appreciated that in the example of fig. 7 the source data is sent directly from the memory to the corresponding arithmetic circuit. From the foregoing description, one skilled in the art can readily derive the case in which the source data, after being read out of the memory, is forwarded within the operator; that is, the source data required by the current operator is forwarded via the connection circuit by the arithmetic circuit that executed the previous operator, rather than being obtained directly from the memory. This case is not described again here.
It will be appreciated that the above described chained architecture performed in a pipelined manner may exist in a variety of scenarios.
In one example, the entire fusion operation is a chain operation structure, so the entire fusion operation can apply the above-described pipeline control.
In another example, the fusion operation includes a plurality of operation branches, at least one of which includes a chained operation structure, so that pipeline control may be applied to this operation branch.
In yet another example, although multiple operation branches are included in the fusion operation, the operation branches may be combined into one fusion sub-operator, so that the whole fusion operation may be converted into a chained operation structure, and on the basis, the pipeline control is applied.
Furthermore, the N arithmetic circuits in the pipeline do not overlap in use, so that the N arithmetic circuits can perform operations on different batches of data simultaneously. For example, the fusion operation result = max((a + b) × c, d) requires an addition circuit, a multiplication circuit, and a comparison circuit, which do not overlap with one another, so pipeline control can be applied when processing multiple batches of data. For another example, if a fusion operation involves two additions but only one addition circuit is provided in the operator, the arithmetic circuits used overlap, and pipeline control cannot be performed. If, however, there are two addition circuits in the operator, the additions can be distributed to them and pipeline control can be applied, so that the computing power of the operator is fully utilized, the processing time is shortened, and the efficiency is improved.
Alternatively or additionally, in some embodiments of the present disclosure, for fusion operations with multi-branch operation structures, the processing efficiency may also be improved in a parallel processing manner according to the characteristics of multiple branches.
Specifically, in one embodiment, assume that the fusion operation includes a plurality of operation branches and that the operation of each operation branch is allocated to one of M sets of arithmetic circuits, M>1. The controller may then control the memory to send the corresponding source data so as to support the M sets of arithmetic circuits in simultaneously executing the operators of their respective operation branches. For example, for the operation result = (a + b) × (c + d), the two addition branches may be allocated to two addition circuits and executed simultaneously. Such parallel operation can greatly improve operation efficiency.
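Purely as an illustration of this branch-level parallelism (the cycle labels and circuit names below are assumptions, not the patented scheduling mechanism), independent branches can be dispatched to separate circuits in the same cycle and then joined:

```python
# Hypothetical sketch: result = (a + b) * (c + d), with the two addition branches
# dispatched to two addition circuits in the same cycle, then joined by a multiplier.

def parallel_branches(a, b, c, d):
    cycle_1 = {
        "ADD0": a + b,   # branch 1 on the first addition circuit
        "ADD1": c + d,   # branch 2 on the second addition circuit, same cycle
    }
    cycle_2 = {"MUL0": cycle_1["ADD0"] * cycle_1["ADD1"]}  # join on the multiplier
    return cycle_2["MUL0"]

print(parallel_branches(1, 2, 3, 4))  # (1 + 2) * (3 + 4) = 21
```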
It can be understood that if a chain operation structure exists in the branches, when processing multi-batch data, parallel operation between the branches and pipeline control in the branches can be performed simultaneously, so that the operation efficiency is further improved.
Embodiments of the present disclosure also provide a chip that may include a processor of any of the embodiments described above in connection with the accompanying drawings. Further, the present disclosure also provides a board that may include the foregoing chip. The embodiment of the disclosure also provides a method for executing fusion operation by using the processor. Those skilled in the art will appreciate that the method steps of the processor performing the fusion operation correspond to the respective circuits and functions of the processor described above in connection with the drawings, and thus the features described above are equally applicable to the method steps and are not repeated here.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of actions described. Thus, one of ordinary skill in the art will appreciate in light of the present disclosure or teachings that certain steps thereof may be performed in other sequences or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be considered alternative embodiments, i.e., wherein the acts or modules involved are not necessarily required for the implementation of some or some aspects of this disclosure. In addition, the description of some embodiments of the present disclosure is also focused on, depending on the scenario. In view of this, those skilled in the art will appreciate that portions of one embodiment of the disclosure that are not described in detail may be referred to in connection with other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that several embodiments of the disclosure disclosed herein may also be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are split in consideration of the logic function, and there may be another splitting manner when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected, as needed, to achieve the objectives of the embodiments of the disclosure. Furthermore, in some scenarios, multiple units in embodiments of the disclosure may be integrated into one unit, or each unit may physically exist alone.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e. as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as central processing units, GPU, FPGA, DSP, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which may be, for example, variable resistance memory (Resistive Random Access Memory, RRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), high bandwidth memory (High Bandwidth Memory, HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM, RAM, etc.
The foregoing may be better understood in light of the following clauses:
Clause 1. A processor comprising a memory, a controller, and an operator, wherein:
the memory is used for storing source data and a final operation result required by operation;
the controller is used for decoding a fusion operation instruction and controlling the memory and the arithmetic unit to execute the fusion operation instruction, wherein the fusion operation instruction at least indicates to execute fusion operation comprising a plurality of operators on source data; and
the arithmetic unit is used for acquiring source data from the memory under the control of the controller to execute the fusion operation and writing a final operation result back to the memory, wherein in the fusion operation, operation data required by an operator comprise intermediate operation results from the arithmetic unit.
Clause 2. The processor of clause 1, wherein the operator comprises a plurality of operational circuits, data interaction between the plurality of operational circuits and the memory is performed by a connection circuit, and the operational circuits are configured to execute operators in the fusion operation.
Clause 3. The processor of clause 2, wherein the fusion operation comprises a chain operation structure of a plurality of operators, and in the operator,
The arithmetic circuit executing the intermediate operator in the chain arithmetic structure is used for: receiving corresponding source data for a corresponding operator via the connection circuit, receiving an intermediate operation result from an operation circuit executing a previous operator, executing operation of the corresponding operator on the corresponding source data and the intermediate operation result, and transmitting the operation result to an operation circuit executing a next operator.
Clause 4. The processor of clause 3, wherein, when the fusion operation is performed on T batches of data and the fusion operation is distributed to be performed on N operational circuits, T>1, N>1, the controller is further configured to:
and controlling the memory to send corresponding source data in the T batches of data so as to support the N operation circuits to execute corresponding operators on different batches of data simultaneously.
Clause 5, the processor of clause 4, wherein the controller is configured to, during the current operation cycle:
control the i-th operation circuit, which executes the i-th operator, to perform the operation of the i-th operator on the corresponding source data of the j-th batch and on the intermediate operation result of the previous operation cycle from the (i-1)-th operation circuit, wherein 1 < i < N and 0 < j < T+1; and
meanwhile control the (i+1)-th operation circuit, which executes the (i+1)-th operator, to perform the operation of the (i+1)-th operator on the corresponding source data of the (j-1)-th batch from the memory and on the intermediate operation result of the previous operation cycle from the i-th operation circuit.
Clause 6, the processor of any of clauses 2-5, wherein the operation data required by the current operator further comprises:
source data from the memory via the connection circuit; or
source data forwarded, via the connection circuit, by the operation circuit executing the previous operator.
Clause 7, the processor of clause 2, wherein the fusion operation comprises a plurality of operation branches, and at least one of the operation branches comprises a chained operation structure.
Clause 8, the processor of any of clauses 2-7, wherein the fusion operation comprises a plurality of operation branches, the operation of each operation branch is allocated to be performed on M sets of operation circuits, M > 1, and the controller is further configured to:
control the memory to send the corresponding source data, so as to support the M sets of operation circuits in simultaneously executing the operators of the corresponding operation branches.
Clause 9, the processor of any of clauses 2-8, wherein the connection circuit is a cross bus.
Clause 10, the processor of any of clauses 2-9, wherein the plurality of operation circuits comprises any one or more of the following:
a multiplication circuit, an addition circuit, a comparison circuit, a shift circuit, a revolution circuit, a transcendental function circuit, and a fusion operation circuit.
Clause 11, a chip comprising a processor according to any of clauses 1-10.
Clause 12, a board card comprising the chip of clause 11.
Clause 13, a method of performing a fusion operation using the processor of any of clauses 1-10.
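The pipelined, chained execution of clauses 3 to 5 can be illustrated with a small software model. The sketch below is a minimal aid to understanding, not the disclosed hardware: the function name run_fused_pipeline, the three-operator chain (scale, add bias, ReLU) and the concrete batch values are all illustrative assumptions. In each simulated operation cycle, the circuit modeled at position i consumes the source data of one batch together with the intermediate result that the preceding circuit produced in the previous cycle, so different circuits work on different batches at the same time.

# Minimal software model (a sketch, not the disclosed hardware) of the pipelined,
# chained fusion operation of clauses 3-5. All names, operators and data below are
# illustrative assumptions.
from typing import Callable, List, Optional

Operator = Callable[[float, Optional[float]], float]

def run_fused_pipeline(operators: List[Operator],
                       source: List[List[float]],
                       num_batches: int) -> List[float]:
    """Simulate N operation circuits executing a chain of N operators on T batches.

    In each operation cycle, the circuit at position i handles batch (cycle - i),
    consuming that batch's source data together with the intermediate result that
    the previous circuit produced in the previous cycle, so different circuits
    process different batches at the same time.
    """
    n = len(operators)
    prev_cycle: List[Optional[float]] = [None] * n   # intermediate results of the previous cycle
    final_results: List[float] = []
    for cycle in range(num_batches + n - 1):         # (n - 1) fill cycles + one cycle per batch
        this_cycle: List[Optional[float]] = [None] * n
        for i in range(n):
            j = cycle - i                            # batch handled by circuit i in this cycle
            if 0 <= j < num_batches:
                intermediate = prev_cycle[i - 1] if i > 0 else None
                this_cycle[i] = operators[i](source[i][j], intermediate)
        if this_cycle[n - 1] is not None:            # the last circuit's output is a final result
            final_results.append(this_cycle[n - 1])  # (written back to memory in hardware)
        prev_cycle = this_cycle
    return final_results

# Example: a three-operator chain (scale, add bias, ReLU) over four batches.
ops: List[Operator] = [
    lambda x, _: x * 2.0,                 # circuit 1: multiply by a weight
    lambda x, prev: prev + x,             # circuit 2: add a per-batch bias to the intermediate result
    lambda _x, prev: max(prev, 0.0),      # circuit 3: ReLU on the intermediate result
]
src = [[1.0, -2.0, 3.0, -4.0],            # source data seen by circuit 1
       [0.5, 0.5, 0.5, 0.5],              # bias values seen by circuit 2
       [0.0, 0.0, 0.0, 0.0]]              # placeholder source for circuit 3
print(run_fused_pipeline(ops, src, num_batches=4))   # -> [2.5, 0.0, 6.5, 0.0]

On the disclosed processor, the controller would issue these per-cycle controls and the connection circuit would carry the source data and intermediate results; the loop above only mirrors the schedule.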
The foregoing has described the embodiments of the present disclosure in detail, using specific examples to illustrate the principles and implementations of the present disclosure; the above examples are provided solely to assist in understanding the methods of the present disclosure and their core ideas. Moreover, those of ordinary skill in the art may, in light of the present disclosure, make changes to the detailed description and to the scope of application; accordingly, the content of this specification should not be construed as limiting the present disclosure.
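As a further illustration of clauses 7 and 8 above, the following sketch models a fusion operation with two operation branches dispatched to M = 2 sets of circuits. Thread-based concurrency here is only a software stand-in for the simultaneously operating circuit sets, and the branch operators, the merge step and the value of M are illustrative assumptions rather than details taken from the disclosure.

# Software stand-in for the branch-parallel fusion operation of clauses 7-8:
# two operation branches (each a short operator chain) run on M = 2 "circuit sets".
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def run_branch(chain: List[Callable[[float], float]], x: float) -> float:
    """Run one operation branch: each operator consumes the previous intermediate result."""
    result = x
    for op in chain:
        result = op(result)
    return result

branch_a = [lambda v: v * 2.0, lambda v: v + 1.0]       # branch 1: scale, then shift
branch_b = [lambda v: v - 3.0, lambda v: max(v, 0.0)]   # branch 2: shift, then clamp at zero

with ThreadPoolExecutor(max_workers=2) as pool:         # M = 2 parallel circuit sets
    fut_a = pool.submit(run_branch, branch_a, 5.0)
    fut_b = pool.submit(run_branch, branch_b, 5.0)
    # a further operator may merge the branch results, e.g. by addition
    print(fut_a.result() + fut_b.result())              # (5*2 + 1) + max(5-3, 0) = 13.0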

Claims (13)

1. A processor comprising a memory, a controller, and an operator, wherein:
the memory is configured to store source data required for operation and a final operation result;
the controller is configured to decode a fusion operation instruction and to control the memory and the operator to execute the fusion operation instruction, wherein the fusion operation instruction at least indicates that a fusion operation comprising a plurality of operators is to be performed on the source data; and
the operator is configured to acquire the source data from the memory under the control of the controller, execute the fusion operation, and write the final operation result back to the memory, wherein, in the fusion operation, the operation data required by an operator comprises intermediate operation results from within the operator.
2. The processor of claim 1, wherein the operator comprises a plurality of operation circuits, data interaction between the plurality of operation circuits and the memory is performed via a connection circuit, and the operation circuits are configured to execute the operators in the fusion operation.
3. The processor of claim 2, wherein the fusion operation comprises a chained operation structure of a plurality of operators, and in the operator:
the operation circuit executing an intermediate operator in the chained operation structure is configured to: receive the corresponding source data for the corresponding operator via the connection circuit, receive an intermediate operation result from the operation circuit executing the previous operator, perform the operation of the corresponding operator on the corresponding source data and the intermediate operation result, and transmit the operation result to the operation circuit executing the next operator.
4. The processor of claim 3, wherein, when the fusion operation is performed on T batches of data and the fusion operation is distributed to N operation circuits for execution, T > 1 and N > 1, the controller is further configured to:
control the memory to send the corresponding source data of the T batches of data, so as to support the N operation circuits in simultaneously executing the corresponding operators on different batches of data.
5. The processor of claim 4, wherein the controller is configured to, during the current operation cycle:
control the i-th operation circuit, which executes the i-th operator, to perform the operation of the i-th operator on the corresponding source data of the j-th batch and on the intermediate operation result of the previous operation cycle from the (i-1)-th operation circuit, wherein 1 < i < N and 0 < j < T+1; and
meanwhile control the (i+1)-th operation circuit, which executes the (i+1)-th operator, to perform the operation of the (i+1)-th operator on the corresponding source data of the (j-1)-th batch and on the intermediate operation result of the previous operation cycle from the i-th operation circuit.
6. The processor of any of claims 2-5, wherein the operation data required by the current operator further comprises:
source data from the memory via the connection circuit; or
source data forwarded, via the connection circuit, by the operation circuit executing the previous operator.
7. The processor of claim 2, wherein the fusion operation comprises a plurality of operation branches, and at least one of the operation branches comprises a chained operation structure.
8. The processor of any of claims 2-7, wherein the fusion operation comprises a plurality of operation branches, the operation of each operation branch is allocated to be performed on M sets of operation circuits, M > 1, and the controller is further configured to:
control the memory to send the corresponding source data, so as to support the M sets of operation circuits in simultaneously executing the operators of the corresponding operation branches.
9. The processor of any of claims 2-8, wherein the connection circuit is a cross bus.
10. The processor of any one of claims 2-9, wherein the plurality of operation circuits includes any one or more of the following:
a multiplication circuit, an addition circuit, a comparison circuit, a shift circuit, a revolution circuit, a transcendental function circuit, and a fusion operation circuit.
11. A chip comprising a processor according to any of claims 1-10.
12. A board card comprising the chip of claim 11.
13. A method of performing a fusion operation using the processor of any of claims 1-10.
CN202210513914.2A 2022-05-11 2022-05-11 Processor, chip, board card and method Pending CN117093263A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210513914.2A CN117093263A (en) 2022-05-11 2022-05-11 Processor, chip, board card and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210513914.2A CN117093263A (en) 2022-05-11 2022-05-11 Processor, chip, board card and method

Publications (1)

Publication Number Publication Date
CN117093263A true CN117093263A (en) 2023-11-21

Family

ID=88772145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210513914.2A Pending CN117093263A (en) 2022-05-11 2022-05-11 Processor, chip, board card and method

Country Status (1)

Country Link
CN (1) CN117093263A (en)

Similar Documents

Publication Publication Date Title
CN109240746B (en) Apparatus and method for performing matrix multiplication operation
CN111291880B (en) Computing device and computing method
CN109543832B (en) Computing device and board card
CN109522052B (en) Computing device and board card
WO2022161318A1 (en) Data processing device and method, and related products
CN107315718B (en) Device and method for executing vector inner product operation
CN107315716B (en) Device and method for executing vector outer product operation
CN107315717B (en) Device and method for executing vector four-rule operation
CN112633505B (en) RISC-V based artificial intelligence reasoning method and system
CN111176608A (en) Apparatus and method for performing vector compare operations
CN114035916A (en) Method for compiling and scheduling calculation graph and related product
CN109711540B (en) Computing device and board card
CN112766475B (en) Processing component and artificial intelligence processor
CN112801276B (en) Data processing method, processor and electronic equipment
CN111368967A (en) Neural network computing device and method
CN117093263A (en) Processor, chip, board card and method
CN113746471B (en) Arithmetic circuit, chip and board card
CN117520254A (en) Processor, chip, board card and method
WO2022001438A1 (en) Computing apparatus, integrated circuit chip, board card, device and computing method
CN114692847B (en) Data processing circuit, data processing method and related products
CN111338694B (en) Operation method, device, computer equipment and storage medium
CN117667212A (en) Instruction control device, method, processor, chip and board card
CN114565075A (en) Apparatus, method and readable storage medium for supporting multiple access modes
CN117667209A (en) Data processing method and related product
CN115438778A (en) Integrated circuit device for executing Winograd convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination